Mastering Hadoop 3
eBook - ePub

Big data processing at scale to unlock unique business insights

Chanchal Singh, Manish Kumar

544 pages | English

About This Book

A comprehensive guide to mastering the most advanced Hadoop 3 concepts

Key Features

  • Get to grips with the newly introduced features and capabilities of Hadoop 3
  • Crunch and process data using MapReduce, YARN, and a host of tools within the Hadoop ecosystem
  • Sharpen your Hadoop skills with real-world case studies and code

Book Description

Apache Hadoop is one of the most popular big data solutions for distributed storage and for processing large volumes of data. With Hadoop 3, Apache promises to provide a high-performance, more fault-tolerant big data processing platform, with a focus on improved scalability and increased efficiency.

With this guide, you'll understand advanced concepts of the Hadoop ecosystem tools. You'll learn how Hadoop works internally, study advanced concepts of different ecosystem tools, discover solutions to real-world use cases, and understand how to secure your cluster. The book then walks you through HDFS, YARN, MapReduce, and Hadoop 3 concepts. You'll be able to address common challenges such as using Kafka efficiently, designing low-latency, reliable Kafka-based message delivery systems, and handling high data volumes. As you advance, you'll discover how to address major challenges when building an enterprise-grade messaging system, and how to use different stream processing systems along with Kafka to fulfil your enterprise goals.

By the end of this book, you'll have a complete understanding of how components in the Hadoop ecosystem are effectively integrated to implement a fast and reliable data pipeline, and you'll be equipped to tackle a range of real-world problems in data pipelines.

What you will learn

  • Gain an in-depth understanding of distributed computing using Hadoop 3
  • Develop enterprise-grade applications using Apache Spark, Flink, and more
  • Build scalable and high-performance Hadoop data pipelines with security, monitoring, and data governance
  • Explore batch data processing patterns and how to model data in Hadoop
  • Master best practices for enterprises using, or planning to use, Hadoop 3 as a data platform
  • Understand security aspects of Hadoop, including authorization and authentication

Who this book is for

If you want to become a big data professional by mastering the advanced concepts of Hadoop, this book is for you. You'll also find this book useful if you're a Hadoop professional looking to strengthen your knowledge of the Hadoop ecosystem. Fundamental knowledge of the Java programming language and of the basics of Hadoop is necessary to get started with this book.


Information

Year: 2019
ISBN: 9781788628327

Section 1: Introduction to Hadoop 3

This section will help you understand Hadoop 3's features and provides a detailed explanation of the Hadoop Distributed File System (HDFS), YARN, and MapReduce jobs.
This section consists of the following chapters: 
  • Chapter 1, Journey to Hadoop 3
  • Chapter 2, Deep Dive into the Hadoop Distributed File System 
  • Chapter 3, YARN Resource Management in Hadoop
  • Chapter 4, Internals of MapReduce

Journey to Hadoop 3

Hadoop has come a long way since its inception. Powered by a community of open source enthusiasts, it has seen three major version releases. Version 1 saw the light of day six years after the first release of Hadoop. With this release, the Hadoop platform gained the full capability to run MapReduce distributed computing on Hadoop Distributed File System (HDFS) distributed storage. It brought major performance improvements, along with full support for security. This release also included many improvements with respect to HBase.
The version 2 release made significant leaps compared to version 1 of Hadoop. It introduced YARN, a sophisticated general-purpose resource manager and job scheduling component. HDFS high availability, HDFS federation, and HDFS snapshots were some of the other prominent features introduced in the version 2 releases.
The latest major release of Hadoop is version 3. This version introduced significant features such as HDFS erasure coding, a new YARN Timeline Service (with a new architecture), YARN opportunistic containers and distributed scheduling, support for more than two NameNodes, and the intra-DataNode balancer. Apart from major feature additions, version 3 includes performance improvements and bug fixes. As this book is about mastering Hadoop 3, we'll mostly talk about this version.
In this chapter, we will take a look at Hadoop's history and how the Hadoop evolution timeline looks. We will look at the features of Hadoop 3 and get a logical view of the Hadoop ecosystem along with different Hadoop distributions.
In particular, we will cover the following topics:
  • Hadoop origins
  • Hadoop Timelines
  • Hadoop logical view
  • Moving towards Hadoop 3
  • Hadoop distributions

Hadoop origins and Timelines

Hadoop is changing the way people think about data. We need to know what led to the origin of this magical innovation. Who developed Hadoop and why? What problems existed before Hadoop? How has it solved these problems? What challenges were encountered during development? How has Hadoop transformed from version 1 to version 3? Let's walk through the origins of Hadoop and its journey to version 3.

Origins

In 1997, Doug Cutting, a co-founder of Hadoop, started working on the Lucene project, a full-text search library written entirely in Java. Lucene analyzes text and builds an index on it. An index is just a mapping of text to locations, so it can quickly return all the locations that match a particular search pattern. A few years later, Doug made the Lucene project open source; it got a tremendous response from the community and later became an Apache Software Foundation project.
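To make the idea of an index concrete, here is a toy Java sketch (illustrative only, not Lucene code) that builds an inverted index mapping each term to the IDs of the documents in which it occurs:

```java
// Toy inverted index: map each term to the list of document IDs where it occurs,
// so a lookup for a term returns all matching locations without rescanning the text.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InvertedIndexSketch {
  public static void main(String[] args) {
    String[] docs = {"hadoop stores data", "hadoop processes data"};
    Map<String, List<Integer>> index = new HashMap<>();

    for (int docId = 0; docId < docs.length; docId++) {
      for (String term : docs[docId].split("\\s+")) {
        index.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
      }
    }

    // "data" -> [0, 1]: both documents contain the term.
    System.out.println(index.get("data"));
  }
}
```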
Once Doug realized that enough people could look after Lucene, he started focusing on indexing web pages. Mike Cafarella joined him to develop a product that could index web pages, and they named this project Apache Nutch. Apache Nutch was also known as a subproject of Apache Lucene, as Nutch uses the Lucene library to index the content of web pages. With hard work, they made good progress and deployed Nutch on a single machine that was able to index around 100 pages per second.
Scalability is something that people often don't consider while developing the initial versions of applications. This was also true of Doug and Mike, and the number of web pages that could be indexed was limited to 100 million. In order to index more pages, they increased the number of machines. However, adding nodes resulted in operational problems because they did not have an underlying cluster manager to perform operational tasks. They wanted to focus on optimizing and developing a robust Nutch application without worrying about scalability issues.
Doug and Mike wanted a system that had the following features:
  • Fault tolerance: The system should handle the failure of machines automatically and in an isolated manner. This means the failure of one machine should not affect the entire application.
  • Load balancing: If one machine fails, then its work should be distributed automatically to the working machines in a fair manner.
  • No data loss: Once data is written to disk, it should never be lost, even if one or two machines fail.
They spent a few months working on a system that could fulfill the aforementioned requirements. However, at around the same time, Google published its Google File System (GFS) paper. When they read it, they found that it offered solutions to problems similar to the ones they were trying to solve. They decided to build an implementation based on this research paper and started developing the Nutch Distributed File System (NDFS), which they completed in 2004.
With the help of the Google File System design, they solved the scalability and fault tolerance problems that we discussed previously, using the concepts of blocks and replication. Blocks are created by splitting each file into 64 MB chunks (the size is configurable), and each block is replicated three times by default so that, if the machine holding one copy fails, the data can be served from another machine. This implementation helped them solve all the operational problems they had been facing with Apache Nutch. The next section explains the origin of MapReduce.
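Block size and replication remain per-file settings in HDFS today. The following is a minimal, hypothetical Java sketch (not from the book) that writes a file with an explicit replication factor and block size through the Hadoop FileSystem API; the path and values are illustrative, and it assumes a reachable cluster whose settings are picked up from the configuration files on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // reads core-site.xml/hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/block-size-demo.txt");  // hypothetical path
        short replication = 3;                             // default replication factor
        long blockSize = 128L * 1024 * 1024;               // 128 MB (64 MB in early Hadoop versions)

        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out = fs.create(file, true, 4096, replication, blockSize)) {
            out.writeUTF("Each file is split into blocks of the configured size, ");
            out.writeUTF("and every block is replicated across DataNodes.");
        }
    }
}
```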

MapReduce origin

Doug and Mike started working on an algorithm that could process data stored on NDFS. They wanted a system whose performance could be doubled by just doubling the number of machines running the program. At the same time, Google published MapReduce: Simplified Data Processing on Large Clusters (https://research.google.com/archive/mapreduce.html).
The core idea behind the MapReduce model was to provide parallelism, fault tolerance, and data locality. Data locality means that a program is executed where the data is stored, instead of bringing the data to the program. MapReduce was integrated into Nutch in 2005. In 2006, Doug created a new incubating project consisting of the Hadoop Distributed File System (HDFS), which grew out of NDFS, along with MapReduce and Hadoop Common.
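To make the model concrete, here is a minimal word-count sketch using the Hadoop MapReduce Java API (illustrative, not code from this chapter): map tasks run in parallel on the nodes that hold the input blocks and emit (word, 1) pairs, and reduce tasks sum the counts per word:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input split assigned to this task.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The job would typically be packaged into a JAR and submitted with the hadoop jar command, passing an input and an output directory as arguments.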
At that time, Yahoo! was struggling with the performance of its backend search. Engineers at Yahoo! already knew the benefits of the Google File System and MapReduce as implemented at Google. Yahoo! decided to adopt Hadoop and employed Doug to help its engineering team do so. In 2007, a few more companies started contributing to Hadoop, and Yahoo! reported that it was running 1,000-node Hadoop clusters.
NameNodes and DataNodes have specific roles in managing the overall cluster. The NameNode is responsible for maintaining metadata information. The MapReduce engine has a JobTracker and TaskTrackers, whose scalability is limited (to roughly 4,000 nodes and 40,000 concurrent tasks) because the overall work of scheduling and tracking is handled by the single JobTracker. YARN was introduced in Hadoop version 2 to overcome these scalability issues and to separate resource management from job scheduling. It gave Hadoop a new lease of life, and Hadoop became a more robust, faster, and more scalable system.

Timelines

We will talk about MapReduce and HDFS in detail later. Let's go through the evolution of Hadoop, which looks as follows:
2003
  • Research paper for the Google File System released
2004
  • Research paper for MapReduce released
2006
  • JIRA, mailing list, and other documents created for Hadoop
  • Hadoop Nutch created
  • Hadoop created by moving NDFS and MapReduce out of Nutch
  • Doug Cutting names the project Hadoop, after his son's yellow toy elephant
  • Release of Hadoop 0.1.0
  • 1.8 TB of data sorted on 188 nodes in 47.9 hours
  • Three hundred machines deployed at Yahoo! for the Hadoop cluster
  • Cluster size at Yahoo! increases to 600
2007
  • Two clusters of 1,000 machines run by Yahoo!
  • Hadoop released with HBase
  • Apache Pig created at Yahoo!
2008
  • JIRA for YARN opened
  • Twenty companies listed on the Powered by Hadoop page
  • Web index at Yahoo! moved to Hadoop
  • 10,000-core Hadoop cluster used to generate Yahoo!'s production search index
  • First ever Hadoop Summit held
  • World record for the fastest sort of one terabyte of data (209 seconds) set using a 910-node Hadoop cluster
  • Hadoop bags the record for the terabyte sort benchmark
  • The Yahoo! cluster now has 10 TB loaded into Hadoop every day
  • Cloudera founded as a Hadoop distributi...
