Learning Hadoop 2
eBook - ePub
Garry Turkington, Gabriele Modena

  1. 382 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android


Information

Year
2015
ISBN
9781783285518
Category
Informatique

Learning Hadoop 2


Table of Contents

Learning Hadoop 2
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Introduction
A note on versioning
The background of Hadoop
Components of Hadoop
Common building blocks
Storage
Computation
Better together
Hadoop 2 – what's the big deal?
Storage in Hadoop 2
Computation in Hadoop 2
Distributions of Apache Hadoop
A dual approach
AWS – infrastructure on demand from Amazon
Simple Storage Service (S3)
Elastic MapReduce (EMR)
Getting started
Cloudera QuickStart VM
Amazon EMR
Creating an AWS account
Signing up for the necessary services
Using Elastic MapReduce
Getting Hadoop up and running
How to use EMR
AWS credentials
The AWS command-line interface
Running the examples
Data processing with Hadoop
Why Twitter?
Building our first dataset
One service, multiple APIs
Anatomy of a Tweet
Twitter credentials
Programmatic access with Python
Summary
2. Storage
The inner workings of HDFS
Cluster startup
NameNode startup
DataNode startup
Block replication
Command-line access to the HDFS filesystem
Exploring the HDFS filesystem
Protecting the filesystem metadata
Secondary NameNode not to the rescue
Hadoop 2 NameNode HA
Keeping the HA NameNodes in sync
Client configuration
How a failover works
Apache ZooKeeper – a different type of filesystem
Implementing a distributed lock with sequential ZNodes
Implementing group membership and leader election using ephemeral ZNodes
Java API
Building blocks
Further reading
Automatic NameNode failover
HDFS snapshots
Hadoop filesystems
Hadoop interfaces
Java FileSystem API
Libhdfs
Thrift
Managing and serializing data
The Writable interface
Introducing the wrapper classes
Array wrapper classes
The Comparable and WritableComparable interfaces
Storing data
Serialization and Containers
Compression
General-purpose file formats
Column-oriented data formats
RCFile
ORC
Parquet
Avro
Using the Java API
Summary
3. Processing – MapReduce and Beyond
MapReduce
Java API to MapReduce
The Mapper class
The Reducer class
The Driver class
Combiner
Partitioning
The optional partition function
Hadoop-provided mapper and reducer implementations
Sharing reference data
Writing MapReduce programs
Getting started
Running the examples
Local cluster
Elastic MapReduce
WordCount, the Hello World of MapReduce
Word co-occurrences
Trending topics
The Top N pattern
Sentiment of hashtags
Text cleanup using chain mapper
Walking through a run of a MapReduce job
Startup
Splitting the input
Task assignment
Task startup
Ongoing JobTracker monitoring
Mapper input
Mapper execution
Mapper output and reducer input
Reducer input
Reducer execution
Reducer output
Shutdown
Input/Output
InputFormat and RecordReader
Hadoop-provided InputFormat
Hadoop-provided RecordReader
OutputFormat and RecordWriter
Hadoop-provided OutputFormat
Sequence files
YARN
YARN architecture
The components of YARN
Anatomy of a YARN application
Life cycle of a YARN application
Fault tolerance and monitoring
Thinking in layers
Execution models
YARN in the real world – Computation beyond MapReduce
The problem with MapReduce
Tez
Hive-on-Tez
Apache Spark
Apache Samza
YARN-independent frameworks
YARN today and beyond
Summary
4. Real-time Computation with Samza
Stream processing with Samza
How Samza works
Samza high-level architecture
Samza's best friend – Apache Kafka
YARN integration
An independent model
Hello Samza!
Building a tweet parsing job
The configuration file
Getting Twitter data into Kafka
Running a Samza job
Samza and HDFS
Windowing functions
Multijob workflows
Tweet sentiment analysis
Bootstrap streams
Stateful tasks
Summary
5. Iterative Computation with Spark
Apache Spark
Cluster computing with working sets
Resilient Distributed Datasets (RDDs)
Actions
Deployment
Spark on YARN
Spark on EC2
Getting started with Spark
Writing and running standalone applications
Scala API
Java API
WordCount in Java
Python API
The Spark ecosystem
Spark Streaming
GraphX
MLlib
Spark SQL
Processing data with Apache Spark
Building and running the examples
Running the examples on YARN
Finding popular topics
Assigning a sentiment to topics
Data processing on streams
State management
Data analysis with Spark SQL
SQL on data streams
Comparing Samza and Spark Streaming
Summary
6. Data Analysis with Apache Pig
An overview of Pig
Getting started
Running Pig
Grunt – the Pig interactive shell
Elastic MapReduce
Fundamentals of Apache Pig
Programming Pig
Pig data types
Pig functions
Load/store
Eval
The tuple, bag, and map functions
The math, string, and datetime functions
Dynamic invokers
Macros
Working with data
Filtering
Aggregation
Foreach
Join
Extending Pig (UDFs)
Contributed UDFs
Piggybank
Elephant Bird
Apache DataFu
Analyzing the Twitter stream
Prerequisites
Dataset exploration
Tweet metadata
Data preparation
Top n statistics
Datetime manipulation
Sessions
Capturing user interactions
Link analysis
Influential users
Summary
7. Hadoop and SQL
Why SQL on Hadoop
Other SQL-on-Hadoop solutions
Prerequisites
Overview of Hive
The nature of Hive tables
Hive architecture
Data types
DDL statements
File formats and storage
JSON
Avro
Columnar stores
Queries
Structuring Hive tables for given workloads
Partitioning a table
Overwriting and updating data
Bucketing and sorting
Sampling data
Writing scripts
Hive and Amazon Web Services
Hive and S3
Hive on Elastic MapReduce
Extending HiveQL
Programmatic interfaces
JDBC
Thrift
Stinger initiative
Impala
The architecture of Impala
Co-existing with Hive
A different philosophy
Drill, Tajo, and beyond
Summary
8. Data Lifecycle Management
What data lifecycle management is
Importance of data lifecycle management
Tools to help
Building a tweet analysis capability
Getting the tweet data
Introducing Oozie
A note on HDFS file permissions
Making development a little easier
Extracting data and ingesting into Hive
A note on workflow directory structure
Introducing HCatalog
Using HCatalog
The Oozie sharelib
HCatalog and partitioned tables
Producing derived data
Performing multiple actions in parallel
Calling a subworkflow
Adding global settings
Challenges of external data
Data validation
Validation actions
Handling format changes
Handling schema evolution with Avro
Final thoughts on using Avro schema evolution
Only make additive changes
Manage schema versions explicitly
Think about schema distribution
Collecting additional data
Scheduling work...