Learning Hadoop 2

Garry Turkington, Gabriele Modena


Information

Year: 2015
ISBN: 9781783285518
Pages: 382
Language: English
Format: ePUB

Table of Contents

Learning Hadoop 2
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Introduction
A note on versioning
The background of Hadoop
Components of Hadoop
Common building blocks
Storage
Computation
Better together
Hadoop 2 – what's the big deal?
Storage in Hadoop 2
Computation in Hadoop 2
Distributions of Apache Hadoop
A dual approach
AWS – infrastructure on demand from Amazon
Simple Storage Service (S3)
Elastic MapReduce (EMR)
Getting started
Cloudera QuickStart VM
Amazon EMR
Creating an AWS account
Signing up for the necessary services
Using Elastic MapReduce
Getting Hadoop up and running
How to use EMR
AWS credentials
The AWS command-line interface
Running the examples
Data processing with Hadoop
Why Twitter?
Building our first dataset
One service, multiple APIs
Anatomy of a Tweet
Twitter credentials
Programmatic access with Python
Summary
2. Storage
The inner workings of HDFS
Cluster startup
NameNode startup
DataNode startup
Block replication
Command-line access to the HDFS filesystem
Exploring the HDFS filesystem
Protecting the filesystem metadata
Secondary NameNode not to the rescue
Hadoop 2 NameNode HA
Keeping the HA NameNodes in sync
Client configuration
How a failover works
Apache ZooKeeper – a different type of filesystem
Implementing a distributed lock with sequential ZNodes
Implementing group membership and leader election using ephemeral ZNodes
Java API
Building blocks
Further reading
Automatic NameNode failover
HDFS snapshots
Hadoop filesystems
Hadoop interfaces
Java FileSystem API
Libhdfs
Thrift
Managing and serializing data
The Writable interface
Introducing the wrapper classes
Array wrapper classes
The Comparable and WritableComparable interfaces
Storing data
Serialization and containers
Compression
General-purpose file formats
Column-oriented data formats
RCFile
ORC
Parquet
Avro
Using the Java API
Summary
3. Processing – MapReduce and Beyond
MapReduce
Java API to MapReduce
The Mapper class
The Reducer class
The Driver class
Combiner
Partitioning
The optional partition function
Hadoop-provided mapper and reducer implementations
Sharing reference data
Writing MapReduce programs
Getting started
Running the examples
Local cluster
Elastic MapReduce
WordCount, the Hello World of MapReduce
Word co-occurrences
Trending topics
The Top N pattern
Sentiment of hashtags
Text cleanup using chain mapper
Walking through a run of a MapReduce job
Startup
Splitting the input
Task assignment
Task startup
Ongoing JobTracker monitoring
Mapper input
Mapper execution
Mapper output and reducer input
Reducer input
Reducer execution
Reducer output
Shutdown
Input/Output
InputFormat and RecordReader
Hadoop-provided InputFormat
Hadoop-provided RecordReader
OutputFormat and RecordWriter
Hadoop-provided OutputFormat
Sequence files
YARN
YARN architecture
The components of YARN
Anatomy of a YARN application
Life cycle of a YARN application
Fault tolerance and monitoring
Thinking in layers
Execution models
YARN in the real world – Computation beyond MapReduce
The problem with MapReduce
Tez
Hive-on-Tez
Apache Spark
Apache Samza
YARN-independent frameworks
YARN today and beyond
Summary
4. Real-time Computation with Samza
Stream processing with Samza
How Samza works
Samza high-level architecture
Samza's best friend – Apache Kafka
YARN integration
An independent model
Hello Samza!
Building a tweet parsing job
The configuration file
Getting Twitter data into Kafka
Running a Samza job
Samza and HDFS
Windowing functions
Multijob workflows
Tweet sentiment analysis
Bootstrap streams
Stateful tasks
Summary
5. Iterative Computation with Spark
Apache Spark
Cluster computing with working sets
Resilient Distributed Datasets (RDDs)
Actions
Deployment
Spark on YARN
Spark on EC2
Getting started with Spark
Writing and running standalone applications
Scala API
Java API
WordCount in Java
Python API
The Spark ecosystem
Spark Streaming
GraphX
MLlib
Spark SQL
Processing data with Apache Spark
Building and running the examples
Running the examples on YARN
Finding popular topics
Assigning a sentiment to topics
Data processing on streams
State management
Data analysis with Spark SQL
SQL on data streams
Comparing Samza and Spark Streaming
Summary
6. Data Analysis with Apache Pig
An overview of Pig
Getting started
Running Pig
Grunt – the Pig interactive shell
Elastic MapReduce
Fundamentals of Apache Pig
Programming Pig
Pig data types
Pig functions
Load/store
Eval
The tuple, bag, and map functions
The math, string, and datetime functions
Dynamic invokers
Macros
Working with data
Filtering
Aggregation
Foreach
Join
Extending Pig (UDFs)
Contributed UDFs
Piggybank
Elephant Bird
Apache DataFu
Analyzing the Twitter stream
Prerequisites
Dataset exploration
Tweet metadata
Data preparation
Top N statistics
Datetime manipulation
Sessions
Capturing user interactions
Link analysis
Influential users
Summary
7. Hadoop and SQL
Why SQL on Hadoop
Other SQL-on-Hadoop solutions
Prerequisites
Overview of Hive
The nature of Hive tables
Hive architecture
Data types
DDL statements
File formats and storage
JSON
Avro
Columnar stores
Queries
Structuring Hive tables for given workloads
Partitioning a table
Overwriting and updating data
Bucketing and sorting
Sampling data
Writing scripts
Hive and Amazon Web Services
Hive and S3
Hive on Elastic MapReduce
Extending HiveQL
Programmatic interfaces
JDBC
Thrift
Stinger initiative
Impala
The architecture of Impala
Co-existing with Hive
A different philosophy
Drill, Tajo, and beyond
Summary
8. Data Lifecycle Management
What data lifecycle management is
Importance of data lifecycle management
Tools to help
Building a tweet analysis capability
Getting the tweet data
Introducing Oozie
A note on HDFS file permissions
Making development a little easier
Extracting data and ingesting into Hive
A note on workflow directory structure
Introducing HCatalog
Using HCatalog
The Oozie sharelib
HCatalog and partitioned tables
Producing derived data
Performing multiple actions in parallel
Calling a subworkflow
Adding global settings
Challenges of external data
Data validation
Validation actions
Handling format changes
Handling schema evolution with Avro
Final thoughts on using Avro schema evolution
Only make additive changes
Manage schema versions explicitly
Think about schema distribution
Collecting additional data
Scheduling work...
