eBook - ePub

Big Data Analytics

Name: Big Data Analytics
Author: Venkat Ankam

326 pages
English
ePUB (mobile friendly)
Available on iOS & Android

About This Book

A handy reference guide for data analysts and data scientists who want to obtain value from big data analytics using Spark on Hadoop clusters.

• This book is based on Apache Spark 2.0 and Hadoop 2.7, integrated with the most commonly used tools.
• Learn all the Spark stack components, including newer topics such as DataFrames, Datasets, GraphFrames, Structured Streaming, DataFrame-based ML Pipelines and SparkR.
• Integrations with frameworks such as HDFS and YARN, and with tools such as Jupyter, Zeppelin, NiFi, Mahout, the HBase Spark Connector, GraphFrames, H2O and Hivemall.

Who This Book Is For

Though this book is primarily aimed at data analysts and data scientists, it will also help architects, programmers, and practitioners. Knowledge of either Spark or Hadoop would be beneficial. It is assumed that you have a basic programming background in Scala, Python, SQL, or R, along with basic Linux experience. Working experience within big data environments is not mandatory.

What You Will Learn

• Find out about and implement the tools and techniques of big data analytics using Spark on Hadoop clusters, together with the wide variety of tools used with Spark and Hadoop
• Understand all the Hadoop and Spark ecosystem components
• Get to know all the Spark components: Spark Core, Spark SQL, DataFrames, Datasets, Conventional and Structured Streaming, MLlib, ML Pipelines and GraphX
• See batch and real-time data analytics using Spark Core, Spark SQL, and Conventional and Structured Streaming
• Get to grips with data science and machine learning using MLlib, ML Pipelines, H2O, Hivemall, GraphX and SparkR

In Detail

Big Data Analytics aims to provide the fundamentals of Apache Spark and Hadoop. All the Spark components (Spark Core, Spark SQL, DataFrames, Datasets, Conventional Streaming, Structured Streaming, MLlib and GraphX) and the Hadoop core components (HDFS, MapReduce and YARN) are explored in depth, with implementation examples on Spark + Hadoop clusters.

As the field moves away from MapReduce to Spark, the advantages of Spark over MapReduce are explained in depth so that readers can reap the benefits of in-memory speeds. The DataFrames API, the Data Sources API and the new Dataset API are explained for building Big Data analytical applications. Real-time data analytics using Spark Streaming with Apache Kafka and HBase is covered to help with building streaming applications. The new Structured Streaming concept is explained with an IoT (Internet of Things) use case. Machine learning techniques are covered using MLlib, ML Pipelines and SparkR, and graph analytics is covered with the GraphX and GraphFrames components of Spark.

Readers will also get an opportunity to get started with web-based notebooks such as Jupyter and Apache Zeppelin, and with the data flow tool Apache NiFi, to analyze and visualize data.

Style and approach

This step-by-step, pragmatic guide will make life easy no matter what your level of experience. You will deep dive into Apache Spark on Hadoop clusters through ample real-life examples. This practical tutorial explains data science in simple terms to help programmers and data analysts get started with data science.

Information

Publisher: Packt Publishing
Year: 2016
ISBN: 9781785889707
Edition:
Topic: Computer Science
Subtopic: Data Warehousing

Big Data Analytics

Credits

About the Author

Acknowledgement

About the Reviewers

www.PacktPub.com

eBooks, discount offers, and more

Why subscribe?

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

1. Big Data Analytics at a 10,000-Foot View

Big Data analytics and the role of Hadoop and Spark

A typical Big Data analytics project life cycle

Identifying the problem and outcomes

Identifying the necessary data

Data collection

Preprocessing data and ETL

Performing analytics

Visualizing data

The role of Hadoop and Spark

Big Data science and the role of Hadoop and Spark

A fundamental shift from data analytics to data science

Data scientists versus software engineers

Data scientists versus data analysts

Data scientists versus business analysts

A typical data science project life cycle

Hypothesis and modeling

Measuring the effectiveness

Making improvements

Communicating the results

The role of Hadoop and Spark

Tools and techniques

Real-life use cases

Summary

2. Getting Started with Apache Hadoop and Apache Spark

Introducing Apache Hadoop

Hadoop Distributed File System

Features of HDFS

MapReduce

MapReduce features

MapReduce v1 versus MapReduce v2

MapReduce v1 challenges

YARN

Storage options on Hadoop

File formats

Sequence file

Protocol buffers and thrift

Avro

Parquet

RCFile and ORCFile

Compression formats

Standard compression formats

Introducing Apache Spark

Spark history

What is Apache Spark?

What Apache Spark is not

MapReduce issues

Spark's stack

Why Hadoop plus Spark?

Hadoop features

Spark features

Frequently asked questions about Spark

Installing Hadoop plus Spark clusters

Summary

3. Deep Dive into Apache Spark

Starting Spark daemons

Working with CDH

Working with HDP, MapR, and Spark pre-built packages

Learning Spark core concepts

Ways to work with Spark

Spark Shell

Exploring the Spark Scala shell

Spark applications

Connecting to the Kerberos Security Enabled Spark Cluster

Resilient Distributed Dataset

Method 1 – parallelizing a collection

Method 2 – reading from a file

Reading files from HDFS

Reading files from HDFS with HA enabled

Spark context

Transformations and actions

Parallelism in RDDs

Lazy evaluation

Lineage Graph

Serialization

Leveraging Hadoop file formats in Spark

Data locality

Shared variables

Pair RDDs

Lifecycle of Spark program

Pipelining

Spark execution summary

Spark applications

Spark Shell versus Spark applications

Creating a Spark context

SparkConf

SparkSubmit

Spark Conf precedence order

Important application configurations

Persistence and caching

Storage levels

What level to choose?

Spark resource managers – Standalone, YARN, and Mesos

Local versus cluster mode

Cluster resource managers

Standalone

YARN

Dynamic resource allocation

Client mode versus cluster mode

Mesos

Which resource manager to use?

Summary

4. Big Data Analytics with Spark SQL, DataFrames, and Datasets

History of Spark SQL

Architecture of Spark SQL

Introducing SQL, Datasources, DataFrame, and Dataset APIs

Evolution of DataFrames and Datasets

What's wrong with RDDs?

RDD Transformations versus Dataset and DataFrames Transformations

Why Datasets and DataFrames?

Optimization

Speed

Automatic Schema Discovery

Multiple sources, multiple languages

Interoperability between RDDs and others

Select and read necessary data only

When to use RDDs, Datasets, and DataFrames?

Analytics with DataFrames

Creating SparkSession

Creating DataFrames

Creating DataFrames from structured data files

Creating DataFrames from RDDs

Creating DataFrames from tables in Hive

Creating DataFrames from external databases

Converting DataFrames to RDDs

Common Dataset/DataFrame operations

Input and Output Operations

Basic Dataset/DataFrame functions

DSL functions

Built-in functions, aggregate functions, and window functions

Actions

RDD operations

Caching data

Performance optimizations

Analytics with the Dataset API

Creating Datasets

Converting a DataFrame to a Dataset

Converting a Dataset to a DataFrame

Accessing metadata using Catalog

Data Sources API

Read and write functions

Built-in sources

Working with text files

Working with JSON

Working with Parquet

Working with ORC

Working with JDBC

Working with CSV

External sources

Working with AVRO

Working with XML

Working with Pandas

DataFrame based Spark-on-HBase connector

Spark SQL as a distribu...
