Table of Contents
Hadoop Beginner's Guide
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers and more
Why Subscribe?
Free Access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Time for action – heading
What just happened?
Pop quiz – heading
Have a go hero – heading
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. What It's All About
Big data processing
The value of data
Historically for the few and not the many
Classic data processing systems
Scale-up
Early approaches to scale-out
Limiting factors
A different approach
All roads lead to scale-out
Share nothing
Expect failure
Smart software, dumb hardware
Move processing, not data
Build applications, not infrastructure
Hadoop
Thanks, Google
Thanks, Doug
Thanks, Yahoo
Parts of Hadoop
Common building blocks
HDFS
MapReduce
Better together
Common architecture
What it is and isn't good for
Cloud computing with Amazon Web Services
Too many clouds
A third way
Different types of costs
AWS – infrastructure on demand from Amazon
Elastic Compute Cloud (EC2)
Simple Storage Service (S3)
Elastic MapReduce (EMR)
What this book covers
A dual approach
Summary
2. Getting Hadoop Up and Running
Hadoop on a local Ubuntu host
Other operating systems
Time for action – checking the prerequisites
What just happened?
Setting up Hadoop
A note on versions
Time for action – downloading Hadoop
What just happened?
Time for action – setting up SSH
What just happened?
Configuring and running Hadoop
Time for action – using Hadoop to calculate Pi
What just happened?
Three modes
Time for action – configuring the pseudo-distributed mode
What just happened?
Configuring the base directory and formatting the filesystem
Time for action – changing the base HDFS directory
What just happened?
Time for action – formatting the NameNode
What just happened?
Starting and using Hadoop
Time for action – starting Hadoop
What just happened?
Time for action – using HDFS
What just happened?
Time for action – WordCount, the Hello World of MapReduce
What just happened?
Have a go hero – WordCount on a larger body of text
Monitoring Hadoop from the browser
The HDFS web UI
The MapReduce web UI
Using Elastic MapReduce
Setting up an account in Amazon Web Services
Creating an AWS account
Signing up for the necessary services
Time for action – WordCount on EMR using the management console
What just happened?
Have a go hero – other EMR sample applications
Other ways of using EMR
AWS credentials
The EMR command-line tools
The AWS ecosystem
Comparison of local versus EMR Hadoop
Summary
3. Understanding MapReduce
Key/value pairs
What it means
Why key/value data?
Some real-world examples
MapReduce as a series of key/value transformations
Pop quiz – key/value pairs
The Hadoop Java API for MapReduce
The 0.20 MapReduce Java API
The Mapper class
The Reducer class
The Driver class
Writing MapReduce programs
Time for action – setting up the classpath
What just happened?
Time for action – implementing WordCount
What just happened?
Time for action – building a JAR file
What just happened?
Time for action – running WordCount on a local Hadoop cluster
What just happened?
Time for action – running WordCount on EMR
What just happened?
The pre-0.20 Java MapReduce API
Hadoop-provided mapper and reducer implementations
Time for action – WordCount the easy way
What just happened?
Walking through a run of WordCount
Startup
Splitting the input
Task assignment
Task startup
Ongoing JobTracker monitoring
Mapper input
Mapper execution
Mapper output and reduce input
Partitioning
The optional partition function
Reducer input
Reducer execution
Reducer output
Shutdown
That's all there is to it!
Apart from the combiner…maybe
Why have a combiner?
Time for action – WordCount with a combiner
What just happened?
When you can use the reducer as the combiner
Time for action – fixing WordCount to work with a combiner
What just happened?
Reuse is your friend
Pop quiz – MapReduce mechanics
Hadoop-specific data types
The Writable and WritableComparable interfaces
Introducing the wrapper classes
Primitive wrapper classes
Array wrapper classes
Map wrapper classes
Time for action – using the Writable wrapper classes
What just happened?
Other wrapper classes
Have a go hero – playing with Writables
Making your own
Input/output
Files, splits, and records
InputFormat and RecordReader
Hadoop-provided InputFormat
Hadoop-provided RecordReader
OutputFormat and RecordWriter
Hadoop-provided OutputFormat
Don't forget Sequence files
Summary
4. Developing MapReduce Programs
Using languages other than Java with Hadoop
How Hadoop Streaming works
Why use Hadoop Streaming
Time for action – implementing WordCount using Streaming
What just happened?
Differences in jobs when using Streaming
Analyzing a large dataset
Getting the UFO sighting dataset
Getting a feel for the dataset
Time for action – summarizing the UFO data
What just happened?
Examining UFO shapes
Time for action – summarizing the shape data
What just happened?
Time for action – correlating sighting duration to UFO shape
What just happened?
Using Streaming scripts outside Hadoop
Time for action – performing the shape/time analysis from the command line
What just happened?
Java shape and location analysis
Time for action – using ChainMapper for field validation/analysis
What just happened?
Have a go hero
Too many abbreviations
Using the Distributed Cache
Time for action – using the Distributed Cache to improve location output
What just happened?
Counters, status, and other output
Time for action – creating counters, task states, and writing log output
What just happened?
Too much information!
Summary
5. Advanced MapReduce Techniques
Simple, adva...