Big Data Analytics
eBook - ePub

Venkat Ankam

326 pages
English
ePUB (mobile friendly)
Available on iOS & Android

About This Book

A handy reference guide for data analysts and data scientists that helps them obtain value from big data analytics using Spark on Hadoop clusters.

• This book is based on version 2.0 of Apache Spark and version 2.7 of Hadoop, integrated with the most commonly used tools.
• Learn all the Spark stack components, including recent topics such as DataFrames, Datasets, GraphFrames, Structured Streaming, DataFrame-based ML Pipelines, and SparkR.
• Covers integrations with frameworks such as HDFS and YARN, and with tools such as Jupyter, Zeppelin, NiFi, Mahout, the HBase Spark Connector, GraphFrames, H2O, and Hivemall.

Who This Book Is For

Though this book is primarily aimed at data analysts and data scientists, it will also help architects, programmers, and practitioners. Knowledge of either Spark or Hadoop would be beneficial. It is assumed that you have a basic programming background in Scala, Python, SQL, or R, along with basic Linux experience. Working experience within big data environments is not mandatory.

What You Will Learn

• Discover and implement the tools and techniques of big data analytics using Spark on Hadoop clusters, along with the wide variety of tools used with Spark and Hadoop
• Understand all the Hadoop and Spark ecosystem components
• Get to know all the Spark components: Spark Core, Spark SQL, DataFrames, Datasets, Conventional and Structured Streaming, MLlib, ML Pipelines, and GraphX
• See batch and real-time data analytics using Spark Core, Spark SQL, and Conventional and Structured Streaming
• Get to grips with data science and machine learning using MLlib, ML Pipelines, H2O, Hivemall, GraphX, and SparkR

In Detail

This book aims to provide the fundamentals of Apache Spark and Hadoop. All the Spark components (Spark Core, Spark SQL, DataFrames, Datasets, Conventional Streaming, Structured Streaming, MLlib, and GraphX) and the Hadoop core components (HDFS, MapReduce, and YARN) are explored in depth with implementation examples on Spark + Hadoop clusters.

Big data processing is moving away from MapReduce toward Spark, so the advantages of Spark over MapReduce are explained in depth to help you reap the benefits of in-memory speeds. The DataFrames API, the Data Sources API, and the new Dataset API are explained for building big data analytical applications. Real-time data analytics using Spark Streaming with Apache Kafka and HBase is covered to help you build streaming applications. The new Structured Streaming concept is explained with an IoT (Internet of Things) use case. Machine learning techniques are covered using MLlib, ML Pipelines, and SparkR, and graph analytics are covered with the GraphX and GraphFrames components of Spark.

Readers will also get an opportunity to get started with web-based notebooks such as Jupyter and Apache Zeppelin, and with the data flow tool Apache NiFi, to analyze and visualize data.

Style and approach

This step-by-step, pragmatic guide is intended to be accessible whatever your level of experience. You will dive deep into Apache Spark on Hadoop clusters through ample real-life examples. The practical tutorial explains data science in simple terms to help programmers and data analysts get started with data science.

Information

Year: 2016
ISBN: 9781785889707

Table of Contents

Big Data Analytics
Credits
About the Author
Acknowledgement
About the Reviewers
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Big Data Analytics at a 10,000-Foot View
Big Data analytics and the role of Hadoop and Spark
A typical Big Data analytics project life cycle
Identifying the problem and outcomes
Identifying the necessary data
Data collection
Preprocessing data and ETL
Performing analytics
Visualizing data
The role of Hadoop and Spark
Big Data science and the role of Hadoop and Spark
A fundamental shift from data analytics to data science
Data scientists versus software engineers
Data scientists versus data analysts
Data scientists versus business analysts
A typical data science project life cycle
Hypothesis and modeling
Measuring the effectiveness
Making improvements
Communicating the results
The role of Hadoop and Spark
Tools and techniques
Real-life use cases
Summary
2. Getting Started with Apache Hadoop and Apache Spark
Introducing Apache Hadoop
Hadoop Distributed File System
Features of HDFS
MapReduce
MapReduce features
MapReduce v1 versus MapReduce v2
MapReduce v1 challenges
YARN
Storage options on Hadoop
File formats
Sequence file
Protocol buffers and thrift
Avro
Parquet
RCFile and ORCFile
Compression formats
Standard compression formats
Introducing Apache Spark
Spark history
What is Apache Spark?
What Apache Spark is not
MapReduce issues
Spark's stack
Why Hadoop plus Spark?
Hadoop features
Spark features
Frequently asked questions about Spark
Installing Hadoop plus Spark clusters
Summary
3. Deep Dive into Apache Spark
Starting Spark daemons
Working with CDH
Working with HDP, MapR, and Spark pre-built packages
Learning Spark core concepts
Ways to work with Spark
Spark Shell
Exploring the Spark Scala shell
Spark applications
Connecting to the Kerberos Security Enabled Spark Cluster
Resilient Distributed Dataset
Method 1 – parallelizing a collection
Method 2 – reading from a file
Reading files from HDFS
Reading files from HDFS with HA enabled
Spark context
Transformations and actions
Parallelism in RDDs
Lazy evaluation
Lineage Graph
Serialization
Leveraging Hadoop file formats in Spark
Data locality
Shared variables
Pair RDDs
Lifecycle of Spark program
Pipelining
Spark execution summary
Spark applications
Spark Shell versus Spark applications
Creating a Spark context
SparkConf
SparkSubmit
Spark Conf precedence order
Important application configurations
Persistence and caching
Storage levels
What level to choose?
Spark resource managers – Standalone, YARN, and Mesos
Local versus cluster mode
Cluster resource managers
Standalone
YARN
Dynamic resource allocation
Client mode versus cluster mode
Mesos
Which resource manager to use?
Summary
4. Big Data Analytics with Spark SQL, DataFrames, and Datasets
History of Spark SQL
Architecture of Spark SQL
Introducing SQL, Datasources, DataFrame, and Dataset APIs
Evolution of DataFrames and Datasets
What's wrong with RDDs?
RDD Transformations versus Dataset and DataFrames Transformations
Why Datasets and DataFrames?
Optimization
Speed
Automatic Schema Discovery
Multiple sources, multiple languages
Interoperability between RDDs and others
Select and read necessary data only
When to use RDDs, Datasets, and DataFrames?
Analytics with DataFrames
Creating SparkSession
Creating DataFrames
Creating DataFrames from structured data files
Creating DataFrames from RDDs
Creating DataFrames from tables in Hive
Creating DataFrames from external databases
Converting DataFrames to RDDs
Common Dataset/DataFrame operations
Input and Output Operations
Basic Dataset/DataFrame functions
DSL functions
Built-in functions, aggregate functions, and window functions
Actions
RDD operations
Caching data
Performance optimizations
Analytics with the Dataset API
Creating Datasets
Converting a DataFrame to a Dataset
Converting a Dataset to a DataFrame
Accessing metadata using Catalog
Data Sources API
Read and write functions
Built-in sources
Working with text files
Working with JSON
Working with Parquet
Working with ORC
Working with JDBC
Working with CSV
External sources
Working with AVRO
Working with XML
Working with Pandas
DataFrame based Spark-on-HBase connector
Spark SQL as a distribu...
