Learning Hadoop 2
eBook - ePub

Garry Turkington, Gabriele Modena

382 pages
English

Information

Year: 2015
ISBN: 9781783285518

Learning Hadoop 2


Table of Contents

Learning Hadoop 2
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Introduction
A note on versioning
The background of Hadoop
Components of Hadoop
Common building blocks
Storage
Computation
Better together
Hadoop 2 – what's the big deal?
Storage in Hadoop 2
Computation in Hadoop 2
Distributions of Apache Hadoop
A dual approach
AWS – infrastructure on demand from Amazon
Simple Storage Service (S3)
Elastic MapReduce (EMR)
Getting started
Cloudera QuickStart VM
Amazon EMR
Creating an AWS account
Signing up for the necessary services
Using Elastic MapReduce
Getting Hadoop up and running
How to use EMR
AWS credentials
The AWS command-line interface
Running the examples
Data processing with Hadoop
Why Twitter?
Building our first dataset
One service, multiple APIs
Anatomy of a Tweet
Twitter credentials
Programmatic access with Python
Summary
2. Storage
The inner workings of HDFS
Cluster startup
NameNode startup
DataNode startup
Block replication
Command-line access to the HDFS filesystem
Exploring the HDFS filesystem
Protecting the filesystem metadata
Secondary NameNode not to the rescue
Hadoop 2 NameNode HA
Keeping the HA NameNodes in sync
Client configuration
How a failover works
Apache ZooKeeper – a different type of filesystem
Implementing a distributed lock with sequential ZNodes
Implementing group membership and leader election using ephemeral ZNodes
Java API
Building blocks
Further reading
Automatic NameNode failover
HDFS snapshots
Hadoop filesystems
Hadoop interfaces
Java FileSystem API
Libhdfs
Thrift
Managing and serializing data
The Writable interface
Introducing the wrapper classes
Array wrapper classes
The Comparable and WritableComparable interfaces
Storing data
Serialization and containers
Compression
General-purpose file formats
Column-oriented data formats
RCFile
ORC
Parquet
Avro
Using the Java API
Summary
3. Processing – MapReduce and Beyond
MapReduce
Java API to MapReduce
The Mapper class
The Reducer class
The Driver class
Combiner
Partitioning
The optional partition function
Hadoop-provided mapper and reducer implementations
Sharing reference data
Writing MapReduce programs
Getting started
Running the examples
Local cluster
Elastic MapReduce
WordCount, the Hello World of MapReduce
Word co-occurrences
Trending topics
The Top N pattern
Sentiment of hashtags
Text cleanup using chain mapper
Walking through a run of a MapReduce job
Startup
Splitting the input
Task assignment
Task startup
Ongoing JobTracker monitoring
Mapper input
Mapper execution
Mapper output and reducer input
Reducer input
Reducer execution
Reducer output
Shutdown
Input/Output
InputFormat and RecordReader
Hadoop-provided InputFormat
Hadoop-provided RecordReader
OutputFormat and RecordWriter
Hadoop-provided OutputFormat
Sequence files
YARN
YARN architecture
The components of YARN
Anatomy of a YARN application
Life cycle of a YARN application
Fault tolerance and monitoring
Thinking in layers
Execution models
YARN in the real world – Computation beyond MapReduce
The problem with MapReduce
Tez
Hive-on-Tez
Apache Spark
Apache Samza
YARN-independent frameworks
YARN today and beyond
Summary
4. Real-time Computation with Samza
Stream processing with Samza
How Samza works
Samza high-level architecture
Samza's best friend – Apache Kafka
YARN integration
An independent model
Hello Samza!
Building a tweet parsing job
The configuration file
Getting Twitter data into Kafka
Running a Samza job
Samza and HDFS
Windowing functions
Multijob workflows
Tweet sentiment analysis
Bootstrap streams
Stateful tasks
Summary
5. Iterative Computation with Spark
Apache Spark
Cluster computing with working sets
Resilient Distributed Datasets (RDDs)
Actions
Deployment
Spark on YARN
Spark on EC2
Getting started with Spark
Writing and running standalone applications
Scala API
Java API
WordCount in Java
Python API
The Spark ecosystem
Spark Streaming
GraphX
MLlib
Spark SQL
Processing data with Apache Spark
Building and running the examples
Running the examples on YARN
Finding popular topics
Assigning a sentiment to topics
Data processing on streams
State management
Data analysis with Spark SQL
SQL on data streams
Comparing Samza and Spark Streaming
Summary
6. Data Analysis with Apache Pig
An overview of Pig
Getting started
Running Pig
Grunt – the Pig interactive shell
Elastic MapReduce
Fundamentals of Apache Pig
Programming Pig
Pig data types
Pig functions
Load/store
Eval
The tuple, bag, and map functions
The math, string, and datetime functions
Dynamic invokers
Macros
Working with data
Filtering
Aggregation
Foreach
Join
Extending Pig (UDFs)
Contributed UDFs
Piggybank
Elephant Bird
Apache DataFu
Analyzing the Twitter stream
Prerequisites
Dataset exploration
Tweet metadata
Data preparation
Top n statistics
Datetime manipulation
Sessions
Capturing user interactions
Link analysis
Influential users
Summary
7. Hadoop and SQL
Why SQL on Hadoop
Other SQL-on-Hadoop solutions
Prerequisites
Overview of Hive
The nature of Hive tables
Hive architecture
Data types
DDL statements
File formats and storage
JSON
Avro
Columnar stores
Queries
Structuring Hive tables for given workloads
Partitioning a table
Overwriting and updating data
Bucketing and sorting
Sampling data
Writing scripts
Hive and Amazon Web Services
Hive and S3
Hive on Elastic MapReduce
Extending HiveQL
Programmatic interfaces
JDBC
Thrift
Stinger initiative
Impala
The architecture of Impala
Co-existing with Hive
A different philosophy
Drill, Tajo, and beyond
Summary
8. Data Lifecycle Management
What data lifecycle management is
Importance of data lifecycle management
Tools to help
Building a tweet analysis capability
Getting the tweet data
Introducing Oozie
A note on HDFS file permissions
Making development a little easier
Extracting data and ingesting into Hive
A note on workflow directory structure
Introducing HCatalog
Using HCatalog
The Oozie sharelib
HCatalog and partitioned tables
Producing derived data
Performing multiple actions in parallel
Calling a subworkflow
Adding global settings
Challenges of external data
Data validation
Validation actions
Handling format changes
Handling schema evolution with Avro
Final thoughts on using Avro schema evolution
Only make additive changes
Manage schema versions explicitly
Think about schema distribution
Collecting additional data
Scheduling work...
