Big Data Analytics with Java
Table of Contents
Big Data Analytics with Java
Credits
About the Author
About the Reviewers
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Big Data Analytics with Java
Why data analytics on big data?
Big data for analytics
Big data – a bigger pay package for Java developers
Basics of Hadoop – a Java sub-project
Distributed computing on Hadoop
HDFS concepts
Design and architecture of HDFS
Main components of HDFS
HDFS simple commands
Apache Spark
Concepts
Transformations
Actions
Spark Java API
Spark samples using Java 8
Loading data
Data operations – cleansing and munging
Analyzing data – count, projection, grouping, aggregation, and max/min
Actions on RDDs
Paired RDDs
Transformations on paired RDDs
Saving data
Collecting and printing results
Executing Spark programs on Hadoop
Apache Spark sub-projects
Spark machine learning modules
MLlib Java API
Other machine learning libraries
Mahout – a popular Java ML library
Deeplearning4j – a deep learning library
Compressing data
Avro and Parquet
Summary
2. First Steps in Data Analysis
Datasets
Data cleaning and munging
Basic analysis of data with Spark SQL
Building SparkConf and context
DataFrames and datasets
Load and parse data
Analyzing data – the Spark SQL way
Spark SQL for data exploration and analytics
Market basket analysis – Apriori algorithm
Full Apriori algorithm
Implementation of the Apriori algorithm in Apache Spark
Efficient market basket analysis using FP-Growth algorithm
Running FP-Growth on Apache Spark
Summary
3. Data Visualization
Data visualization with Java JFreeChart
Using charts in big data analytics
Time Series chart
All India seasonal and annual average temperature series dataset
Simple single Time Series chart
Multiple Time Series on a single chart window
Bar charts
Histograms
When would you use a histogram?
How to make histograms using JFreeChart?
Line charts
Scatter plots
Box plots
Advanced visualization techniques
Prefuse
IVTK Graph toolkit
Other libraries
Summary
4. Basics of Machine Learning
What is machine learning?
Real-life examples of machine learning
Types of machine learning
A small sample case study of supervised and unsupervised learning
Steps for machine learning problems
Choosing the machine learning model
What are the feature types that can be extracted from the datasets?
How do you select the best features to train your models?
How do you run machine learning analytics on big data?
Getting and preparing data in Hadoop
Preparing the data
Formatting the data
Storing the data
Training and storing models on big data
Apache Spark machine learning API
The new Spark ML API
Summary
5. Regression on Big Data
Linear regression
What is simple linear regression?
Where is linear regression used?
Predicting house prices using linear regression
Dataset
Data cleaning and munging
Exploring the dataset
Running and testing the linear regression model
Logistic regression
Which mathematical functions does logistic regression use?
Where is logistic regression used?
Predicting heart disease using logistic regression
Dataset
Data cleaning and munging
Data exploration
Running and testing the logistic regression model
Summary
6. Naive Bayes and Sentiment Analysis
Conditional probability
Bayes' theorem
Naive Bayes algorithm
Advantages of Naive Bayes
Disadvantages of Naive Bayes
Sentiment analysis
Concepts for sentiment analysis
Tokenization
Stop words removal
Stemming
N-grams
Term presence and Term Frequency
TF-IDF
Bag of words
Dataset
Data exploration of text data
Sentiment analysis on this dataset
SVM or Support Vector Machine
Summary
7. Decision Trees
What is a decision tree?
Building a decision tree
Choosing the best features for splitting the datasets
Advantages of using decision trees
Disadvantages of using decision trees
Dataset
Data exploration
Cleaning and munging the data
Training and testing the model
Summary
8. Ensembling on Big Data
Ensembling
Types of ensembling
Bagging
Boosting
Advantages and disadvantages of ensembling
Random forests
Gradient boosted trees (GBTs)
Classification problem and dataset used
Data exploration
Training and testing our random forest model
Training and testing our gradient boosted tree model
Summary
9. Recommendation Systems
Recommendation systems and their types
Content-based recommendation systems
Dataset
Content-based recommender on MovieLens dataset
Collaborative recommendation systems
Advantages
Disadvantages
Alternating least squares – collaborative filtering
Summary
10. Clustering and Customer Segmentation on Big Data
Clustering
Types of clustering
Hierarchical clustering
K-means clustering
Bisecting k-means clustering
Customer segmentation
Dataset
Data exploration
Clustering for customer segmentation
Changing the clustering algorithm
Summary
11. Massive Graphs on Big Data
Refresher on graphs
Representing graphs
Common terminology on graphs
Common algorithms on graphs
Plotting graphs
Massive graphs on big data
Graph analytics
GraphFrames
Building a graph using GraphFrames
Graph analytics on airports and their flights
Datasets
Graph analytics on flights data
Summary
12. Real-Time Analytics on Big Data
Real-time analytics
Big data stack for real-time analytics
Real-time SQL queries on big data
Real-time data ingestion and storage
Real-time data processing
Real-time SQL queries using Impala
Flight delay analysis using Impala
Apache Kafka
Spark Streaming
Typical uses of Spark Streaming
Base project setup
Trending videos
Sentiment analysis in real time
Summary
13. Deep Learning Using Big Data
Introduction to neural networks
Perceptron
Problems with perceptrons
Sigmoid neuron
Multi-layer perceptrons
Accuracy of mult...