Learning Hadoop 2
eBook - ePub

Learning Hadoop 2

Garry Turkington, Gabriele Modena

Partager le livre
  1. 382 pages
  2. English
  3. ePUB (adapté aux mobiles)
  4. Disponible sur iOS et Android
eBook - ePub

Learning Hadoop 2

Garry Turkington, Gabriele Modena

DĂ©tails du livre
Aperçu du livre
Table des matiĂšres
Citations

Foire aux questions

Comment puis-je résilier mon abonnement ?
Il vous suffit de vous rendre dans la section compte dans paramĂštres et de cliquer sur « RĂ©silier l’abonnement ». C’est aussi simple que cela ! Une fois que vous aurez rĂ©siliĂ© votre abonnement, il restera actif pour le reste de la pĂ©riode pour laquelle vous avez payĂ©. DĂ©couvrez-en plus ici.
Puis-je / comment puis-je télécharger des livres ?
Pour le moment, tous nos livres en format ePub adaptĂ©s aux mobiles peuvent ĂȘtre tĂ©lĂ©chargĂ©s via l’application. La plupart de nos PDF sont Ă©galement disponibles en tĂ©lĂ©chargement et les autres seront tĂ©lĂ©chargeables trĂšs prochainement. DĂ©couvrez-en plus ici.
Quelle est la différence entre les formules tarifaires ?
Les deux abonnements vous donnent un accĂšs complet Ă  la bibliothĂšque et Ă  toutes les fonctionnalitĂ©s de Perlego. Les seules diffĂ©rences sont les tarifs ainsi que la pĂ©riode d’abonnement : avec l’abonnement annuel, vous Ă©conomiserez environ 30 % par rapport Ă  12 mois d’abonnement mensuel.
Qu’est-ce que Perlego ?
Nous sommes un service d’abonnement Ă  des ouvrages universitaires en ligne, oĂč vous pouvez accĂ©der Ă  toute une bibliothĂšque pour un prix infĂ©rieur Ă  celui d’un seul livre par mois. Avec plus d’un million de livres sur plus de 1 000 sujets, nous avons ce qu’il vous faut ! DĂ©couvrez-en plus ici.
Prenez-vous en charge la synthÚse vocale ?
Recherchez le symbole Écouter sur votre prochain livre pour voir si vous pouvez l’écouter. L’outil Écouter lit le texte Ă  haute voix pour vous, en surlignant le passage qui est en cours de lecture. Vous pouvez le mettre sur pause, l’accĂ©lĂ©rer ou le ralentir. DĂ©couvrez-en plus ici.
Est-ce que Learning Hadoop 2 est un PDF/ePUB en ligne ?
Oui, vous pouvez accĂ©der Ă  Learning Hadoop 2 par Garry Turkington, Gabriele Modena en format PDF et/ou ePUB ainsi qu’à d’autres livres populaires dans Informatique et Bases de donnĂ©es. Nous disposons de plus d’un million d’ouvrages Ă  dĂ©couvrir dans notre catalogue.

Informations

Année
2015
ISBN
9781783285518

Learning Hadoop 2


Table of Contents

Learning Hadoop 2
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Introduction
A note on versioning
The background of Hadoop
Components of Hadoop
Common building blocks
Storage
Computation
Better together
Hadoop 2 – what's the big deal?
Storage in Hadoop 2
Computation in Hadoop 2
Distributions of Apache Hadoop
A dual approach
AWS – infrastructure on demand from Amazon
Simple Storage Service (S3)
Elastic MapReduce (EMR)
Getting started
Cloudera QuickStart VM
Amazon EMR
Creating an AWS account
Signing up for the necessary services
Using Elastic MapReduce
Getting Hadoop up and running
How to use EMR
AWS credentials
The AWS command-line interface
Running the examples
Data processing with Hadoop
Why Twitter?
Building our first dataset
One service, multiple APIs
Anatomy of a Tweet
Twitter credentials
Programmatic access with Python
Summary
2. Storage
The inner workings of HDFS
Cluster startup
NameNode startup
DataNode startup
Block replication
Command-line access to the HDFS filesystem
Exploring the HDFS filesystem
Protecting the filesystem metadata
Secondary NameNode not to the rescue
Hadoop 2 NameNode HA
Keeping the HA NameNodes in sync
Client configuration
How a failover works
Apache ZooKeeper – a different type of filesystem
Implementing a distributed lock with sequential ZNodes
Implementing group membership and leader election using ephemeral ZNodes
Java API
Building blocks
Further reading
Automatic NameNode failover
HDFS snapshots
Hadoop filesystems
Hadoop interfaces
Java FileSystem API
Libhdfs
Thrift
Managing and serializing data
The Writable interface
Introducing the wrapper classes
Array wrapper classes
The Comparable and WritableComparable interfaces
Storing data
Serialization and Containers
Compression
General-purpose file formats
Column-oriented data formats
RCFile
ORC
Parquet
Avro
Using the Java API
Summary
3. Processing – MapReduce and Beyond
MapReduce
Java API to MapReduce
The Mapper class
The Reducer class
The Driver class
Combiner
Partitioning
The optional partition function
Hadoop-provided mapper and reducer implementations
Sharing reference data
Writing MapReduce programs
Getting started
Running the examples
Local cluster
Elastic MapReduce
WordCount, the Hello World of MapReduce
Word co-occurrences
Trending topics
The Top N pattern
Sentiment of hashtags
Text cleanup using chain mapper
Walking through a run of a MapReduce job
Startup
Splitting the input
Task assignment
Task startup
Ongoing JobTracker monitoring
Mapper input
Mapper execution
Mapper output and reducer input
Reducer input
Reducer execution
Reducer output
Shutdown
Input/Output
InputFormat and RecordReader
Hadoop-provided InputFormat
Hadoop-provided RecordReader
OutputFormat and RecordWriter
Hadoop-provided OutputFormat
Sequence files
YARN
YARN architecture
The components of YARN
Anatomy of a YARN application
Life cycle of a YARN application
Fault tolerance and monitoring
Thinking in layers
Execution models
YARN in the real world – Computation beyond MapReduce
The problem with MapReduce
Tez
Hive-on-tez
Apache Spark
Apache Samza
YARN-independent frameworks
YARN today and beyond
Summary
4. Real-time Computation with Samza
Stream processing with Samza
How Samza works
Samza high-level architecture
Samza's best friend – Apache Kafka
YARN integration
An independent model
Hello Samza!
Building a tweet parsing job
The configuration file
Getting Twitter data into Kafka
Running a Samza job
Samza and HDFS
Windowing functions
Multijob workflows
Tweet sentiment analysis
Bootstrap streams
Stateful tasks
Summary
5. Iterative Computation with Spark
Apache Spark
Cluster computing with working sets
Resilient Distributed Datasets (RDDs)
Actions
Deployment
Spark on YARN
Spark on EC2
Getting started with Spark
Writing and running standalone applications
Scala API
Java API
WordCount in Java
Python API
The Spark ecosystem
Spark Streaming
GraphX
MLlib
Spark SQL
Processing data with Apache Spark
Building and running the examples
Running the examples on YARN
Finding popular topics
Assigning a sentiment to topics
Data processing on streams
State management
Data analysis with Spark SQL
SQL on data streams
Comparing Samza and Spark Streaming
Summary
6. Data Analysis with Apache Pig
An overview of Pig
Getting started
Running Pig
Grunt – the Pig interactive shell
Elastic MapReduce
Fundamentals of Apache Pig
Programming Pig
Pig data types
Pig functions
Load/store
Eval
The tuple, bag, and map functions
The math, string, and datetime functions
Dynamic invokers
Macros
Working with data
Filtering
Aggregation
Foreach
Join
Extending Pig (UDFs)
Contributed UDFs
Piggybank
Elephant Bird
Apache DataFu
Analyzing the Twitter stream
Prerequisites
Dataset exploration
Tweet metadata
Data preparation
Top n statistics
Datetime manipulation
Sessions
Capturing user interactions
Link analysis
Influential users
Summary
7. Hadoop and SQL
Why SQL on Hadoop
Other SQL-on-Hadoop solutions
Prerequisites
Overview of Hive
The nature of Hive tables
Hive architecture
Data types
DDL statements
File formats and storage
JSON
Avro
Columnar stores
Queries
Structuring Hive tables for given workloads
Partitioning a table
Overwriting and updating data
Bucketing and sorting
Sampling data
Writing scripts
Hive and Amazon Web Services
Hive and S3
Hive on Elastic MapReduce
Extending HiveQL
Programmatic interfaces
JDBC
Thrift
Stinger initiative
Impala
The architecture of Impala
Co-existing with Hive
A different philosophy
Drill, Tajo, and beyond
Summary
8. Data Lifecycle Management
What data lifecycle management is
Importance of data lifecycle management
Tools to help
Building a tweet analysis capability
Getting the tweet data
Introducing Oozie
A note on HDFS file permissions
Making development a little easier
Extracting data and ingesting into Hive
A note on workflow directory structure
Introducing HCatalog
Using HCatalog
The Oozie sharelib
HCatalog and partitioned tables
Producing derived data
Performing multiple actions in parallel
Calling a subworkflow
Adding global settings
Challenges of external data
Data validation
Validation actions
Handling format changes
Handling schema evolution with Avro
Final thoughts on using Avro schema evolution
Only make additive changes
Manage schema versions explicitly
Think about schema distribution
Collecting additional data
Scheduling work...

Table des matiĂšres