Mastering Hadoop
eBook - ePub

  • 374 pages
  • English
  • ePUB (mobile friendly)
  • Available on iOS & Android

About this book

Hadoop is synonymous with Big Data processing. Its simple programming model, "code once and deploy at any scale" paradigm, and ever-growing ecosystem make Hadoop an all-encompassing platform for programmers with different levels of expertise.

This book explores the industry guidelines for optimizing MapReduce jobs and higher-level abstractions such as Pig and Hive in Hadoop 2.0. It then dives deep into Hadoop 2.0-specific features such as YARN and HDFS Federation.

This book is a step-by-step guide that focuses on advanced Hadoop concepts and aims to take your Hadoop knowledge and skill set to the next level. The data processing flow dictates the order of the concepts in each chapter, and each chapter is illustrated with code fragments or schematic diagrams.

Table of Contents

Mastering Hadoop
Credits
About the Author
Acknowledgments
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book?
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Hadoop 2.X
The inception of Hadoop
The evolution of Hadoop
Hadoop's genealogy
Hadoop-0.20-append
Hadoop-0.20-security
Hadoop's timeline
Hadoop 2.X
Yet Another Resource Negotiator (YARN)
Architecture overview
Storage layer enhancements
High availability
HDFS Federation
HDFS snapshots
Other enhancements
Support enhancements
Hadoop distributions
Which Hadoop distribution?
Performance
Scalability
Reliability
Manageability
Available distributions
Cloudera Distribution of Hadoop (CDH)
Hortonworks Data Platform (HDP)
MapR
Pivotal HD
Summary
2. Advanced MapReduce
MapReduce input
The InputFormat class
The InputSplit class
The RecordReader class
Hadoop's "small files" problem
Filtering inputs
The Map task
The dfs.blocksize attribute
Sort and spill of intermediate outputs
Node-local Reducers or Combiners
Fetching intermediate outputs – Map-side
The Reduce task
Fetching intermediate outputs – Reduce-side
Merge and spill of intermediate outputs
MapReduce output
Speculative execution of tasks
MapReduce job counters
Handling data joins
Reduce-side joins
Map-side joins
Summary
3. Advanced Pig
Pig versus SQL
Different modes of execution
Complex data types in Pig
Compiling Pig scripts
The logical plan
The physical plan
The MapReduce plan
Development and debugging aids
The DESCRIBE command
The EXPLAIN command
The ILLUSTRATE command
The advanced Pig operators
The advanced FOREACH operator
The FLATTEN operator
The nested FOREACH operator
The COGROUP operator
The UNION operator
The CROSS operator
Specialized joins in Pig
The Replicated join
Skewed joins
The Merge join
User-defined functions
The evaluation functions
The aggregate functions
The Algebraic interface
The Accumulator interface
The filter functions
The load functions
The store functions
Pig performance optimizations
The optimization rules
Measurement of Pig script performance
Combiners in Pig
Memory for the Bag data type
Number of reducers in Pig
The multiquery mode in Pig
Best practices
The explicit usage of types
Early and frequent projection
Early and frequent filtering
The usage of the LIMIT operator
The usage of the DISTINCT operator
The reduction of operations
The usage of Algebraic UDFs
The usage of Accumulator UDFs
Eliminating nulls in the data
The usage of specialized joins
Compressing intermediate results
Combining smaller files
Summary
4. Advanced Hive
The Hive architecture
The Hive metastore
The Hive compiler
The Hive execution engine
The supporting components of Hive
Data types
File formats
Compressed files
ORC files
The Parquet files
The data model
Dynamic partitions
Semantics for dynamic partitioning
Indexes on Hive tables
Hive query optimizers
Advanced DML
The GROUP BY operation
ORDER BY versus SORT BY clauses
The JOIN operator and its types
Map-side joins
Advanced aggregation support
Other advanced clauses
UDF, UDAF, and UDTF
Summary
5. Serialization and Hadoop I/O
Data serialization in Hadoop
Writable and WritableComparable
Hadoop versus Java serialization
Avro serialization
Avro and MapReduce
Avro and Pig
Avro and Hive
Comparison – Avro versus Protocol Buffers / Thrift
File formats
The Sequence file format
Reading and writing Sequence files
The MapFile format
Other data structures
Compression
Splits and compressions
Scope for compression
Summary
6. YARN – Bringing Other Paradigms to Hadoop
The YARN architecture
Resource Manager (RM)
Application Master (AM)
Node Manager (NM)
YARN clients
Developing YARN applications
Writing YARN clients
Writing the Application Master entity
Monitoring YARN
Job scheduling in YARN
CapacityScheduler
FairScheduler
YARN commands
User commands
Administration commands
Summary
7. Storm on YARN – Low Latency Processing in Hadoop
Batch processing versus streaming
Apache Storm
Architecture of an Apache Storm cluster
Computation and data modeling in Apache Storm
Use cases for Apache Storm
Developing with Apache Storm
Apache Storm 0.9.1
Storm on YARN
Installing Apache Storm-on-YARN
Prerequisites
Installation procedure
Summary
8. Hadoop on the Cloud
Cloud computing characteristics
Hadoop on the cloud
Amazon Elastic MapReduce (EMR)
Provisioning a Hadoop cluster on EMR
Summary
9. HDFS Replacements
HDFS – advantages and drawbacks
Amazon AWS S3
Hadoop support for S3
Implementing a filesystem in Hadoop
Implementing an S3 native filesystem in Hadoop
Summary
10. HDFS Federation
Limitations of the older HDFS architecture
Architecture of HDFS Federation
Benefits of HDFS Federation
Deploying federated NameNodes
HDFS high availability
Secondary NameNode, Checkpoint Node, and Backup Node
High availability – edits sharing
Useful HDFS tools
Three-layer versus four-layer network topology
HDFS block placement
Pluggable block placement policy
Summary
11. Hadoop Security
The security pillars
Authentication in Hadoop
Kerberos authentication
The Kerberos architecture and workflow
Kerberos authentication and Hadoop
Authentication via HTTP interfaces
Authorization in Hadoop
Authorization in HDFS
Identity of an HDFS user
Group listings for an HDFS user
HDFS APIs and shell commands
Specifying the HDFS superuser
Turning off HDFS authorization
Limiting HDFS usage
Name quotas in HDFS
Space quotas in HDFS
Service-level authorization in Hadoop
Data confidentiality in Hadoop
HTTPS and encrypted shuffle
SSL configuration changes
Configuring the keystore and truststore
Audit logging in Hadoop
Summary
12. Analytics Using Hadoop
Data analytics workflow
Machine learning
Apache Mahout
Document analysis using Hadoop and Mahout
Term frequency
Document frequency
Term frequency – inverse document frequency
Tf-Idf in Pig
Cosine similarity distance measures
Clustering using k-means
K-means clustering using Apache Mahout
RHadoop
Summary
A. Hadoop for Microsoft Windows
Deploying Hadoop on Microsoft Windows
Prerequisites
Building Hadoop
Configuring Hadoop
Deploying Hadoop
Summary
Index

Mastering Hadoop

Copyright © 2014 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publis...

Mastering Hadoop by Sandeep Karanth is available in PDF and/or ePUB format, along with other books in Computer Science & Data Processing.