Big Data Analytics
eBook - ePub

Venkat Ankam

  1. 326 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS and Android

Book information

A handy reference guide for data analysts and data scientists to help obtain value from big data analytics using Spark on Hadoop clusters.

About This Book
• This book is based on Apache Spark 2.0 and Hadoop 2.7, integrated with the most commonly used tools.
• Learn all Spark stack components, including the latest topics such as DataFrames, Datasets, GraphFrames, Structured Streaming, DataFrame-based ML Pipelines, and SparkR.
• Integrations with frameworks such as HDFS and YARN, and with tools such as Jupyter, Zeppelin, NiFi, Mahout, the HBase Spark Connector, GraphFrames, H2O, and Hivemall.

Who This Book Is For
Though this book is primarily aimed at data analysts and data scientists, it will also help architects, programmers, and practitioners. Knowledge of either Spark or Hadoop would be beneficial. It is assumed that you have a basic programming background in Scala, Python, SQL, or R, along with basic Linux experience. Working experience in big data environments is not mandatory.

What You Will Learn
• Find out about and implement the tools and techniques of big data analytics using Spark on Hadoop clusters, with a wide variety of tools used with Spark and Hadoop
• Understand all the Hadoop and Spark ecosystem components
• Get to know all the Spark components: Spark Core, Spark SQL, DataFrames, Datasets, Conventional and Structured Streaming, MLlib, ML Pipelines, and GraphX
• See batch and real-time data analytics using Spark Core, Spark SQL, and Conventional and Structured Streaming
• Get to grips with data science and machine learning using MLlib, ML Pipelines, H2O, Hivemall, GraphX, and SparkR

In Detail
Big Data Analytics aims to provide the fundamentals of Apache Spark and Hadoop. All Spark components (Spark Core, Spark SQL, DataFrames, Datasets, Conventional Streaming, Structured Streaming, MLlib, and GraphX) and the Hadoop core components (HDFS, MapReduce, and YARN) are explored in depth with implementation examples on Spark + Hadoop clusters.

The industry is moving away from MapReduce to Spark, so the advantages of Spark over MapReduce are explained in depth to help you reap the benefits of in-memory speeds. The DataFrames API, the Data Sources API, and the new Dataset API are explained for building Big Data analytical applications. Real-time data analytics using Spark Streaming with Apache Kafka and HBase is covered to help you build streaming applications. The new Structured Streaming concept is explained with an IoT (Internet of Things) use case. Machine learning techniques are covered using MLlib, ML Pipelines, and SparkR, and graph analytics is covered with the GraphX and GraphFrames components of Spark.

Readers will also get an opportunity to get started with web-based notebooks such as Jupyter and Apache Zeppelin, and with the data flow tool Apache NiFi, to analyze and visualize data.

Style and approach
This step-by-step, pragmatic guide is suitable for readers at any level of experience. You will dive deep into Apache Spark on Hadoop clusters through ample real-life examples. The practical tutorial explains data science in simple terms to help programmers and data analysts get started with data science.

Information

Year: 2016
ISBN: 9781785889707
Edition: 1
Category: Computer Science

Table of Contents

Big Data Analytics
Credits
About the Author
Acknowledgement
About the Reviewers
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Big Data Analytics at a 10,000-Foot View
Big Data analytics and the role of Hadoop and Spark
A typical Big Data analytics project life cycle
Identifying the problem and outcomes
Identifying the necessary data
Data collection
Preprocessing data and ETL
Performing analytics
Visualizing data
The role of Hadoop and Spark
Big Data science and the role of Hadoop and Spark
A fundamental shift from data analytics to data science
Data scientists versus software engineers
Data scientists versus data analysts
Data scientists versus business analysts
A typical data science project life cycle
Hypothesis and modeling
Measuring the effectiveness
Making improvements
Communicating the results
The role of Hadoop and Spark
Tools and techniques
Real-life use cases
Summary
2. Getting Started with Apache Hadoop and Apache Spark
Introducing Apache Hadoop
Hadoop Distributed File System
Features of HDFS
MapReduce
MapReduce features
MapReduce v1 versus MapReduce v2
MapReduce v1 challenges
YARN
Storage options on Hadoop
File formats
Sequence file
Protocol buffers and thrift
Avro
Parquet
RCFile and ORCFile
Compression formats
Standard compression formats
Introducing Apache Spark
Spark history
What is Apache Spark?
What Apache Spark is not
MapReduce issues
Spark's stack
Why Hadoop plus Spark?
Hadoop features
Spark features
Frequently asked questions about Spark
Installing Hadoop plus Spark clusters
Summary
3. Deep Dive into Apache Spark
Starting Spark daemons
Working with CDH
Working with HDP, MapR, and Spark pre-built packages
Learning Spark core concepts
Ways to work with Spark
Spark Shell
Exploring the Spark Scala shell
Spark applications
Connecting to the Kerberos Security Enabled Spark Cluster
Resilient Distributed Dataset
Method 1 – parallelizing a collection
Method 2 – reading from a file
Reading files from HDFS
Reading files from HDFS with HA enabled
Spark context
Transformations and actions
Parallelism in RDDs
Lazy evaluation
Lineage Graph
Serialization
Leveraging Hadoop file formats in Spark
Data locality
Shared variables
Pair RDDs
Lifecycle of Spark program
Pipelining
Spark execution summary
Spark applications
Spark Shell versus Spark applications
Creating a Spark context
SparkConf
SparkSubmit
Spark Conf precedence order
Important application configurations
Persistence and caching
Storage levels
What level to choose?
Spark resource managers – Standalone, YARN, and Mesos
Local versus cluster mode
Cluster resource managers
Standalone
YARN
Dynamic resource allocation
Client mode versus cluster mode
Mesos
Which resource manager to use?
Summary
4. Big Data Analytics with Spark SQL, DataFrames, and Datasets
History of Spark SQL
Architecture of Spark SQL
Introducing SQL, Datasources, DataFrame, and Dataset APIs
Evolution of DataFrames and Datasets
What's wrong with RDDs?
RDD Transformations versus Dataset and DataFrames Transformations
Why Datasets and DataFrames?
Optimization
Speed
Automatic Schema Discovery
Multiple sources, multiple languages
Interoperability between RDDs and others
Select and read necessary data only
When to use RDDs, Datasets, and DataFrames?
Analytics with DataFrames
Creating SparkSession
Creating DataFrames
Creating DataFrames from structured data files
Creating DataFrames from RDDs
Creating DataFrames from tables in Hive
Creating DataFrames from external databases
Converting DataFrames to RDDs
Common Dataset/DataFrame operations
Input and Output Operations
Basic Dataset/DataFrame functions
DSL functions
Built-in functions, aggregate functions, and window functions
Actions
RDD operations
Caching data
Performance optimizations
Analytics with the Dataset API
Creating Datasets
Converting a DataFrame to a Dataset
Converting a Dataset to a DataFrame
Accessing metadata using Catalog
Data Sources API
Read and write functions
Built-in sources
Working with text files
Working with JSON
Working with Parquet
Working with ORC
Working with JDBC
Working with CSV
External sources
Working with AVRO
Working with XML
Working with Pandas
DataFrame based Spark-on-HBase connector
Spark SQL as a distribu...
