Apache Hadoop 3 Quick Start Guide
eBook - ePub

Apache Hadoop 3 Quick Start Guide

Learn about big data processing and analytics

  • 220 pages
  • English
  • ePUB (mobile friendly)
  • Available on iOS & Android

About this book

A fast-paced guide that will help you learn about Apache Hadoop 3 and its ecosystem

Key Features

  • Set up, configure, and get started with Hadoop to get useful insights from large data sets
  • Work with the different components of Hadoop, such as MapReduce, HDFS, and YARN
  • Learn about the new features introduced in Hadoop 3

Book Description

Apache Hadoop is a widely used distributed data platform. It enables large datasets to be processed efficiently across a cluster of machines, rather than storing and processing the data on one large computer. This book will get you started with the Hadoop ecosystem and introduce you to the main technical topics, including MapReduce, YARN, and HDFS.

The book begins with an overview of big data and Apache Hadoop. Then, you will set up a pseudo-distributed Hadoop development environment and a multi-node enterprise Hadoop cluster. You will see how parallel programming paradigms such as MapReduce can solve many complex data processing problems.

The book also covers the important aspects of the big data software development lifecycle, including quality assurance and control, performance, administration, and monitoring.

You will then learn about the Hadoop ecosystem and tools such as Kafka, Sqoop, Flume, Pig, Hive, and HBase. Finally, you will look at advanced topics, including real-time streaming using Apache Storm and data analytics using Apache Spark.

By the end of the book, you will be well-versed in the different configurations of a Hadoop 3 cluster.

What you will learn

  • Store and analyze data at scale using HDFS, MapReduce, and YARN
  • Install and configure Hadoop 3 in different modes
  • Use YARN effectively to run different applications on a Hadoop-based platform
  • Understand how a Hadoop cluster is managed and monitored
  • Consume streaming data using Storm, and then analyze it using Spark
  • Explore Apache Hadoop ecosystem components, such as Flume, Sqoop, HBase, Hive, and Kafka

Who this book is for

Aspiring big data professionals who want to learn the essentials of Hadoop 3 will find this book useful. Existing Hadoop users who want to get up to speed with the new features introduced in Hadoop 3 will also benefit from it. Knowledge of Java programming will be an added advantage.


Developing MapReduce Applications

"Programs must be written for people to read, and only incidentally for machines to execute."
– Harold Abelson, Structure and Interpretation of Computer Programs, 1984
When Apache Hadoop was designed, it was intended for large-scale processing of huge datasets to which traditional programming techniques could not be applied. At that time, MapReduce was considered an inseparable part of Apache Hadoop, and it was the only programming option available; with newer Hadoop releases, it was re-implemented on top of YARN. The YARN-based implementation is called MRv2, and the older MapReduce is usually referred to as MRv1. In the previous chapter, we saw how HDFS can be configured and used for various applications. In this chapter, we will take a deep dive into MapReduce programming and learn how you can use it effectively to solve various complex problems.
This chapter assumes that you are well-versed in Java programming, as most MapReduce programs are written in Java. All of the examples in this chapter use Hadoop version 3.1 with Java 8.
We will cover the following topics:
  • How MapReduce works
  • Configuring a MapReduce environment
  • Understanding Hadoop APIs and packages
  • Setting up a MapReduce project
  • Deep diving into MapReduce APIs
  • Compiling and running MapReduce jobs
  • Streaming in MapReduce programming

Technical requirements

You will need the Eclipse development environment and Java 8 installed on a system where you can run and tweak these examples. If you prefer to use Maven, you will need it installed to compile the code. To run the examples, you also need an Apache Hadoop 3.1 setup on a Linux system. Finally, to use the Git repository of this book, you need to install Git.
The code files of this chapter can be found on GitHub:
https://github.com/PacktPublishing/Apache-Hadoop-3-Quick-Start-Guide/tree/master/Chapter4
Check out the following video to see the code in action:
http://bit.ly/2znViEb

How MapReduce works

MapReduce is a programming methodology for writing programs that run on Apache Hadoop across a large, scalable cluster of servers. MapReduce was inspired by functional programming (https://en.wikipedia.org/wiki/Functional_programming). Functional programming (FP) offers unique features compared to today's popular programming paradigms, such as object-oriented (Java and JavaScript), declarative (SQL and CSS), or procedural (C, PHP, and Python) programming. While we see a lot of interest in functional programming in academia, we rarely see equivalent enthusiasm from the developer community, and many developers and mentors claim that MapReduce is not actually a functional programming paradigm. Higher-order functions in FP are functions that can take a function as a parameter or return a function (https://en.wikipedia.org/wiki/Higher-order_function); map and reduce are among the most widely used higher-order functions of functional programming. In this section, we will try to understand how MapReduce works in Hadoop.
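To make this connection concrete, map and reduce exist as higher-order functions right in Java 8's streams library. The following standalone snippet (my own illustration, not from the book's code repository) passes a function to map and a combining function to reduce:

import java.util.Arrays;
import java.util.List;

public class HigherOrderDemo {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("hadoop", "yarn", "hdfs", "mapreduce");

        // map is a higher-order function: it takes String::length as an argument
        int totalLength = words.stream()
                .map(String::length)      // transform each element to its length
                .reduce(0, Integer::sum); // fold the lengths into a single sum

        System.out.println("Total length: " + totalLength); // prints 23
    }
}

Hadoop's MapReduce applies these same two ideas, except that each function runs in parallel across many machines and the data flows between them as <key, value> pairs.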

What is MapReduce?

MapReduce programming provides a simple framework for writing complex processing applications that run on a cluster. Although the programming model itself is simple, converting an arbitrary standard program into it is not always straightforward. Any job in MapReduce is seen as a combination of a map function and a reduce function, and all of the activities are broken into these two phases. Each phase communicates with the other through standard input and output, comprising keys and their values. The following data flow diagram shows how MapReduce programming resolves different problems with its methodology. The color denotes similar entities, the circles denote the processing units (either map or reduce), and the square boxes denote the data elements or data chunks:
In the Map phase, the map function reads data in the form of <key, value> pairs from HDFS and converts it into another set of <key, value> pairs, whereas in the Reduce phase, the <key, value> pairs generated by the Map phase are passed as input to the reduce function, which eventually produces another set of <key, value> pairs as output. By default, this output is stored in HDFS.
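To see this <key, value> flow in real code, here is a minimal sketch of the canonical word-count job written against the org.apache.hadoop.mapreduce API. The class and variable names are mine rather than from the book's repository, and a real job would be tuned to its input:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each input <offset, line> pair becomes a set of <word, 1> pairs
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: all counts for one word arrive together and are summed
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a JAR, a job like this is typically launched with hadoop jar wordcount.jar WordCount <input path> <output path>, with both paths residing in HDFS.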

An example of MapReduce

Let's understand the MapReduce concept with a simple example:
  • Problem: There is an e-commerce company that offers different products for purchase through online sales. The task is to find out the items that are sold in each of the cities. The following is the available information:
  • Solution: As you can see, we need to perform a right outer join across these tables to get the city-wise item sale report. I am sure the database experts reading this book could simply write a SQL query to do this join, and that works well in general. For high-volume data processing, however, the join can instead be performed using MapReduce with massively parallel processing (a code sketch of this approach follows the list below). The overall processing happens in two phases:
    • Map phase: In this phase, the Mapper's job is relatively simple: it cleanses all of the input and creates key-value pairs for further processing. The user records are supplied to the Map Task in <key, value> form, so the Map Task only picks the attributes relevant for further processing, such as UserName and City.
    • Reduce phase: This is the second stage, where the processed <key, value> pairs are reduced to a smaller set. The Reducer receives its input directly from the Map Tasks. As you can see in the following screenshot, the reduce task performs the majority of the operations; in this case, it reads the tuples and creates intermediate files. Once the ...
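Because the input tables are not reproduced here, the following is only an illustrative sketch of a reduce-side join for this problem, assuming two hypothetical comma-separated record layouts: user records as UserID,UserName,City and sale records as UserID,Item. The mapper tags each record with its source table; since all values for one UserID meet in a single reduce call, the reducer can pair the user's city with each of their items. The driver would be configured just like the WordCount job shown earlier:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CityItemJoin {

    // Mapper: key every record by UserID and tag it with its source table
    public static class TaggingMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length == 3) {
                // user record: UserID,UserName,City -> (UserID, "U:City")
                context.write(new Text(fields[0]), new Text("U:" + fields[2]));
            } else if (fields.length == 2) {
                // sale record: UserID,Item -> (UserID, "S:Item")
                context.write(new Text(fields[0]), new Text("S:" + fields[1]));
            }
        }
    }

    // Reducer: all tagged records for one UserID arrive together; emit (City, Item)
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String city = null;
            List<String> items = new ArrayList<>();
            for (Text val : values) {
                String v = val.toString();
                if (v.startsWith("U:")) {
                    city = v.substring(2);
                } else {
                    items.add(v.substring(2));
                }
            }
            // Right outer join: keep every sale, even if no user record matched
            for (String item : items) {
                context.write(new Text(city == null ? "UNKNOWN" : city), new Text(item));
            }
        }
    }
}

Emitting a sale with an UNKNOWN city even when no matching user record exists is what makes this a right outer join rather than an inner join.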

Table of contents

  1. Title Page
  2. Copyright and Credits
  3. Dedication
  4. Packt Upsell
  5. Contributors
  6. Preface
  7. Hadoop 3.0 - Background and Introduction
  8. Planning and Setting Up Hadoop Clusters
  9. Deep Dive into the Hadoop Distributed File System
  10. Developing MapReduce Applications
  11. Building Rich YARN Applications
  12. Monitoring and Administration of a Hadoop Cluster
  13. Demystifying Hadoop Ecosystem Components
  14. Advanced Topics in Apache Hadoop
  15. Other Books You May Enjoy