eBook - ePub

Big Data Analytics

Name: Big Data Analytics
Author: Venkat Ankam


326 pages

English

ePUB (mobile friendly)

Available on iOS and Android

About the book

A handy reference guide for data analysts and data scientists to help obtain value from big data analytics using Spark on Hadoop clusters.

About This Book

• This book is based on Apache Spark 2.0 and Hadoop 2.7, integrated with the most commonly used tools.
• Learn all Spark stack components, including recent topics such as DataFrames, Datasets, GraphFrames, Structured Streaming, DataFrame-based ML Pipelines, and SparkR.
• Covers integrations with frameworks such as HDFS and YARN, and with tools such as Jupyter, Zeppelin, NiFi, Mahout, HBase Spark Connector, GraphFrames, H2O, and Hivemall.

Who This Book Is For

Though this book is primarily aimed at data analysts and data scientists, it will also help architects, programmers, and practitioners. Knowledge of either Spark or Hadoop would be beneficial. It is assumed that you have a basic programming background in Scala, Python, SQL, or R, along with basic Linux experience. Working experience within big data environments is not mandatory.

What You Will Learn

• Find out about and implement the tools and techniques of big data analytics using Spark on Hadoop clusters, with a wide variety of tools used alongside Spark and Hadoop
• Understand all the Hadoop and Spark ecosystem components
• Get to know all the Spark components: Spark Core, Spark SQL, DataFrames, Datasets, Conventional and Structured Streaming, MLlib, ML Pipelines, and GraphX
• See batch and real-time data analytics using Spark Core, Spark SQL, and Conventional and Structured Streaming
• Get to grips with data science and machine learning using MLlib, ML Pipelines, H2O, Hivemall, GraphX, and SparkR

In Detail

This book aims to provide the fundamentals of Apache Spark and Hadoop. All Spark components (Spark Core, Spark SQL, DataFrames, Datasets, Conventional Streaming, Structured Streaming, MLlib, and GraphX) and the Hadoop core components (HDFS, MapReduce, and YARN) are explored in depth, with implementation examples on Spark + Hadoop clusters.

The industry is moving away from MapReduce to Spark, so the advantages of Spark over MapReduce are explained in depth to help you reap the benefits of in-memory speeds. The DataFrames API, the Data Sources API, and the new Dataset API are explained for building Big Data analytical applications. Real-time data analytics using Spark Streaming with Apache Kafka and HBase is covered to help you build streaming applications. The new Structured Streaming concept is explained with an IoT (Internet of Things) use case. Machine learning techniques are covered using MLlib, ML Pipelines, and SparkR, and graph analytics is covered with the GraphX and GraphFrames components of Spark.

Readers will also get an opportunity to get started with web-based notebooks such as Jupyter and Apache Zeppelin, and with the data flow tool Apache NiFi, to analyze and visualize data.

Style and approach

This step-by-step, pragmatic guide is suitable for readers at any level of experience. You will dive deep into Apache Spark on Hadoop clusters through ample real-life examples. The practical tutorial explains data science in simple terms to help programmers and data analysts get started with data science.

Information

Publisher: Packt Publishing

Year: 2016

ISBN: 9781785889707

Categories: Computer Science, Data Storage

Table of Contents

Big Data Analytics

Credits

About the Author

Acknowledgement

About the Reviewers

www.PacktPub.com

eBooks, discount offers, and more

Why subscribe?

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

1. Big Data Analytics at a 10,000-Foot View

Big Data analytics and the role of Hadoop and Spark

A typical Big Data analytics project life cycle

Identifying the problem and outcomes

Identifying the necessary data

Data collection

Preprocessing data and ETL

Performing analytics

Visualizing data

The role of Hadoop and Spark

Big Data science and the role of Hadoop and Spark

A fundamental shift from data analytics to data science

Data scientists versus software engineers

Data scientists versus data analysts

Data scientists versus business analysts

A typical data science project life cycle

Hypothesis and modeling

Measuring the effectiveness

Making improvements

Communicating the results

The role of Hadoop and Spark

Tools and techniques

Real-life use cases

Summary

2. Getting Started with Apache Hadoop and Apache Spark

Introducing Apache Hadoop

Hadoop Distributed File System

Features of HDFS

MapReduce

MapReduce features

MapReduce v1 versus MapReduce v2

MapReduce v1 challenges

YARN

Storage options on Hadoop

File formats

Sequence file

Protocol buffers and thrift

Avro

Parquet

RCFile and ORCFile

Compression formats

Standard compression formats

Introducing Apache Spark

Spark history

What is Apache Spark?

What Apache Spark is not

MapReduce issues

Spark's stack

Why Hadoop plus Spark?

Hadoop features

Spark features

Frequently asked questions about Spark

Installing Hadoop plus Spark clusters

Summary

3. Deep Dive into Apache Spark

Starting Spark daemons

Working with CDH

Working with HDP, MapR, and Spark pre-built packages

Learning Spark core concepts

Ways to work with Spark

Spark Shell

Exploring the Spark Scala shell

Spark applications

Connecting to the Kerberos Security Enabled Spark Cluster

Resilient Distributed Dataset

Method 1 – parallelizing a collection

Method 2 – reading from a file

Reading files from HDFS

Reading files from HDFS with HA enabled

Spark context

Transformations and actions

Parallelism in RDDs

Lazy evaluation

Lineage Graph

Serialization

Leveraging Hadoop file formats in Spark

Data locality

Shared variables

Pair RDDs

Lifecycle of Spark program

Pipelining

Spark execution summary

Spark applications

Spark Shell versus Spark applications

Creating a Spark context

SparkConf

SparkSubmit

Spark Conf precedence order

Important application configurations

Persistence and caching

Storage levels

What level to choose?

Spark resource managers – Standalone, YARN, and Mesos

Local versus cluster mode

Cluster resource managers

Standalone

YARN

Dynamic resource allocation

Client mode versus cluster mode

Mesos

Which resource manager to use?

Summary

4. Big Data Analytics with Spark SQL, DataFrames, and Datasets

History of Spark SQL

Architecture of Spark SQL

Introducing SQL, Datasources, DataFrame, and Dataset APIs

Evolution of DataFrames and Datasets

What's wrong with RDDs?

RDD Transformations versus Dataset and DataFrames Transformations

Why Datasets and DataFrames?

Optimization

Speed

Automatic Schema Discovery

Multiple sources, multiple languages

Interoperability between RDDs and others

Select and read necessary data only

When to use RDDs, Datasets, and DataFrames?

Analytics with DataFrames

Creating SparkSession

Creating DataFrames

Creating DataFrames from structured data files

Creating DataFrames from RDDs

Creating DataFrames from tables in Hive

Creating DataFrames from external databases

Converting DataFrames to RDDs

Common Dataset/DataFrame operations

Input and Output Operations

Basic Dataset/DataFrame functions

DSL functions

Built-in functions, aggregate functions, and window functions

Actions

RDD operations

Caching data

Performance optimizations

Analytics with the Dataset API

Creating Datasets

Converting a DataFrame to a Dataset

Converting a Dataset to a DataFrame

Accessing metadata using Catalog

Data Sources API

Read and write functions

Built-in sources

Working with text files

Working with JSON

Working with Parquet

Working with ORC

Working with JDBC

Working with CSV

External sources

Working with AVRO

Working with XML

Working with Pandas

DataFrame based Spark-on-HBase connector

Spark SQL as a distribu...
