eBook - ePub

Hadoop Beginner's Guide

Name: Hadoop Beginner's Guide
ISBN: 9781849517300

Garry Turkington

398 pages
English

About this book

In Detail

Data is arriving faster than you can process it, and the overall volumes keep growing at a rate that keeps you awake at night. Hadoop can help you tame the data beast. Effective use of Hadoop, however, requires a mixture of programming, design, and system administration skills.

"Hadoop Beginner's Guide" removes the mystery from Hadoop, presenting Hadoop and related technologies with a focus on building working systems and getting the job done, using cloud services to do so when it makes sense. From basic concepts and initial setup through developing applications and keeping the system running as the data grows, the book gives the understanding needed to use Hadoop effectively to solve real-world problems.

Starting with the basics of installing and configuring Hadoop, the book explains how to develop applications, maintain the system, and use additional products to integrate with other systems.

Alongside different ways to develop applications that run on Hadoop, the book covers tools such as Hive, Sqoop, and Flume, which show how Hadoop can be integrated with relational databases and log collection.

In addition to examples on Hadoop clusters running on Ubuntu, the book covers the use of cloud services such as Amazon EC2 and Elastic MapReduce.

Approach

As a Packt Beginner's Guide, the book is packed with clear step-by-step instructions for performing the most useful tasks, getting you up and running quickly, and learning by doing.

Who this book is for

This book assumes no existing experience with Hadoop or cloud services. It assumes you have familiarity with a programming language such as Java or Ruby but gives you the needed background on the other topics.

Publisher: Packt Publishing
Year: 2013
eBook ISBN: 9781849517300
Topic: Computer Science
Subtopic: CAD-CAM

Table of contents

Credits

About the Author

About the Reviewers

www.PacktPub.com

Support files, eBooks, discount offers and more

Why Subscribe?

Free Access for Packt account holders

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Time for action – heading

What just happened?

Pop quiz – heading

Have a go hero – heading

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

1. What It's All About

Big data processing

The value of data

Historically for the few and not the many

Classic data processing systems

Scale-up

Early approaches to scale-out

Limiting factors

A different approach

All roads lead to scale-out

Share nothing

Expect failure

Smart software, dumb hardware

Move processing, not data

Build applications, not infrastructure

Hadoop

Thanks, Google

Thanks, Doug

Thanks, Yahoo

Parts of Hadoop

Common building blocks

HDFS

MapReduce

Better together

Common architecture

What it is and isn't good for

Cloud computing with Amazon Web Services

Too many clouds

A third way

Different types of costs

AWS – infrastructure on demand from Amazon

Elastic Compute Cloud (EC2)

Simple Storage Service (S3)

Elastic MapReduce (EMR)

What this book covers

A dual approach

Summary

2. Getting Hadoop Up and Running

Hadoop on a local Ubuntu host

Other operating systems

Time for action – checking the prerequisites

What just happened?

Setting up Hadoop

A note on versions

Time for action – downloading Hadoop

What just happened?

Time for action – setting up SSH

What just happened?

Configuring and running Hadoop

Time for action – using Hadoop to calculate Pi

What just happened?

Three modes

Time for action – configuring the pseudo-distributed mode

What just happened?

Configuring the base directory and formatting the filesystem

Time for action – changing the base HDFS directory

What just happened?

Time for action – formatting the NameNode

What just happened?

Starting and using Hadoop

Time for action – starting Hadoop

What just happened?

Time for action – using HDFS

What just happened?

Time for action – WordCount, the Hello World of MapReduce

What just happened?

Have a go hero – WordCount on a larger body of text

Monitoring Hadoop from the browser

The HDFS web UI

The MapReduce web UI

Using Elastic MapReduce

Setting up an account in Amazon Web Services

Creating an AWS account

Signing up for the necessary services

Time for action – WordCount on EMR using the management console

What just happened?

Have a go hero – other EMR sample applications

Other ways of using EMR

AWS credentials

The EMR command-line tools

The AWS ecosystem

Comparison of local versus EMR Hadoop

Summary

3. Understanding MapReduce

Key/value pairs

What it means

Why key/value data?

Some real-world examples

MapReduce as a series of key/value transformations

Pop quiz – key/value pairs

The Hadoop Java API for MapReduce

The 0.20 MapReduce Java API

The Mapper class

The Reducer class

The Driver class

Writing MapReduce programs

Time for action – setting up the classpath

What just happened?

Time for action – implementing WordCount

What just happened?

Time for action – building a JAR file

What just happened?

Time for action – running WordCount on a local Hadoop cluster

What just happened?

Time for action – running WordCount on EMR

What just happened?

The pre-0.20 Java MapReduce API

Hadoop-provided mapper and reducer implementations

Time for action – WordCount the easy way

What just happened?

Walking through a run of WordCount

Startup

Splitting the input

Task assignment

Task startup

Ongoing JobTracker monitoring

Mapper input

Mapper execution

Mapper output and reduce input

Partitioning

The optional partition function

Reducer input

Reducer execution

Reducer output

Shutdown

That's all there is to it!

Apart from the combiner…maybe

Why have a combiner?

Time for action – WordCount with a combiner

What just happened?

When you can use the reducer as the combiner

Time for action – fixing WordCount to work with a combiner

What just happened?

Reuse is your friend

Pop quiz – MapReduce mechanics

Hadoop-specific data types

The Writable and WritableComparable interfaces

Introducing the wrapper classes

Primitive wrapper classes

Array wrapper classes

Map wrapper classes

Time for action – using the Writable wrapper classes

What just happened?

Other wrapper classes

Have a go hero – playing with Writables

Making your own

Input/output

Files, splits, and records

InputFormat and RecordReader

Hadoop-provided InputFormat

Hadoop-provided RecordReader

OutputFormat and RecordWriter

Hadoop-provided OutputFormat

Don't forget Sequence files

Summary

4. Developing MapReduce Programs

Using languages other than Java with Hadoop

How Hadoop Streaming works

Why to use Hadoop Streaming

Time for action – implementing WordCount using Streaming

What just happened?

Differences in jobs when using Streaming

Analyzing a large dataset

Getting the UFO sighting dataset

Getting a feel for the dataset

Time for action – summarizing the UFO data

What just happened?

Examining UFO shapes

Time for action – summarizing the shape data

What just happened?

Time for action – correlating sighting duration to UFO shape

What just happened?

Using Streaming scripts outside Hadoop

Time for action – performing the shape/time analysis from the command line

What just happened?

Java shape and location analysis

Time for action – using ChainMapper for field validation/analysis

What just happened?

Have a go hero

Too many abbreviations

Using the Distributed Cache

Time for action – using the Distributed Cache to improve location output

What just happened?

Counters, status, and other output

Time for action – creating counters, task states, and writing log output

What just happened?

Too much information!

Summary

5. Advanced MapReduce Techniques

Simple, adva...
