Big Data Systems

A 360-degree Approach

eBook - ePub

  • 320 pages
  • English
  • ePUB (mobile friendly)
  • Available on iOS & Android

About this book

Big data systems face massive challenges related to data diversity, storage mechanisms, and the need for enormous computational power. Further, the capabilities of big data systems vary with the type of problem; for instance, distributed memory systems are not recommended for iterative algorithms. Similarly, big data systems also differ in their consistency and fault-tolerance characteristics. The purpose of this book is to provide a detailed explanation of big data systems. The book covers a variety of topics, including Networking, Security, Privacy, Storage, Computation, Cloud Computing, NoSQL and NewSQL systems, High Performance Computing, and Deep Learning. An illustrative and practical approach has been adopted in which theoretical topics are supported by well-explained programming and illustrative examples.

Key Features:

  • Introduces concepts and evolution of Big Data technology.
  • Illustrates examples for thorough understanding.
  • Contains programming examples for hands-on development.
  • Explains a variety of topics including NoSQL Systems, NewSQL systems, Security, Privacy, Networking, Cloud, High Performance Computing, and Deep Learning.
  • Exemplifies widely used big data technologies such as Hadoop and Spark.
  • Includes discussion on case studies and open issues.
  • Provides end of chapter questions for enhanced learning.

Big Data Systems: A 360-degree Approach by Jawwad Ahmed Shamsi and Muhammad Ali Khojaye is available in PDF and ePUB formats.

II

Storage and Processing for Big Data

CHAPTER 4

HADOOP: An Efficient Platform for Storing and Processing Big Data

CONTENTS
  • 4.1 Requirements for Processing and Storing Big Data
  • 4.2 Hadoop – The Big Picture
  • 4.3 Hadoop Distributed File System
    • 4.3.1 Benefits of Using HDFS
    • 4.3.2 Scalability of HDFS
    • 4.3.3 Size of Block
    • 4.3.4 Cluster Management
    • 4.3.5 Read and Write Operations
    • 4.3.6 Checkpointing and Failure Recovery
    • 4.3.7 HDFS Examples
  • 4.4 MapReduce
    • 4.4.1 MapReduce Operation
    • 4.4.2 Input Output
    • 4.4.3 The Partitioner Function
    • 4.4.4 Sorting by Keys
    • 4.4.5 The Combiner Function
    • 4.4.6 Counting Items
    • 4.4.7 Secondary Sorting
    • 4.4.8 Inverted Indexing
    • 4.4.9 Computing Inlinks and Outlinks
    • 4.4.10 Join Operations Using MapReduce
    • 4.4.11 MapReduce for Iterative Jobs
  • 4.5 HBase
    • 4.5.1 HBase and Hadoop
    • 4.5.2 HBase Architecture
    • 4.5.3 Installing HBase
    • 4.5.4 HBase and Relational Databases
  • 4.6 Concluding Remarks
  • 4.7 Further Reading
  • 4.8 Exercise Questions
In this chapter, we will study efficient platforms for storing and processing big data. Our main focus will be on MapReduce/Hadoop – a widely used platform for batch processing on big data. Finally, we will learn about HBase – a distributed, scalable, big data store built on top of HDFS.

4.1 REQUIREMENTS FOR PROCESSING AND STORING BIG DATA

Big data systems have mainly been used for batch processing with a focus on analytics. Since the amount of data is quite large, special considerations are needed:
  1. Scalability: A scalable solution is needed that can meet the growing demands of big data systems.
  2. Low Network Cost: The cost (time) to transfer data should be low. Predominantly, the time spent on computation should exceed the time spent transferring the data.
  3. Efficient Computation: Owing to the large amount of data, computation should be performed in parallel to reduce computation time.
  4. Fast and Rapid Retrieval: As big data systems are based on the principle of ‘write once, read many’, retrieval of data should be fast.
  5. Fault Tolerance: Network and hardware failures are inevitable. A big data storage and computation system should be fault tolerant.
The above-mentioned requirements are specific to big data systems. In addition, a big data system should provide a flexible and adaptable platform for programming and development.
Over the past two decades, big data systems have evolved on these principles. With time, their capacity and performance have also increased.
We will now discuss Hadoop, one of the most widely used big data platforms, which addresses these challenges.
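To make the low-network-cost requirement concrete, the toy calculation below compares the time to ship one block of data over the network with the time to scan it locally. All figures (128 MB block, 1 Gb/s network, 500 MB/s local scan rate) are illustrative assumptions, not measurements from the text:

```python
# Toy comparison of data-transfer time vs. local computation time.
# All rates below are illustrative assumptions.

def transfer_time(data_bytes, bandwidth_bytes_per_s):
    """Time to move the data across the network."""
    return data_bytes / bandwidth_bytes_per_s

def compute_time(data_bytes, processing_bytes_per_s):
    """Time to scan/process the data locally."""
    return data_bytes / processing_bytes_per_s

block = 128 * 1024 * 1024            # one 128 MB block (a common Hadoop default)
net   = 1024 * 1024 * 1024 // 8      # assumed 1 Gb/s network, i.e. 128 MB/s
disk  = 500 * 1024 * 1024            # assumed 500 MB/s local scan rate

t_net = transfer_time(block, net)    # time to ship the block to another node
t_cpu = compute_time(block, disk)    # time to scan it where it already sits

# Shipping the block costs noticeably more than scanning it in place,
# which is why big data systems move computation to the data.
print(f"transfer: {t_net:.2f} s, compute: {t_cpu:.2f} s")
```

Under these assumptions the transfer alone takes about four times as long as the local scan, which motivates the data-locality design discussed in Section 4.2.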

4.2 HADOOP – THE BIG PICTURE

Apache Hadoop is a collection of open-source software and platforms for big data. The strength of Hadoop stems from the fact that the framework can scale to hundreds or thousands of nodes built from commodity hardware. The Hadoop framework is able to recover from issues such as node failure and disk failure. It can provide consistency, replication, and monitoring at the application layer. Hadoop has several components that are responsible for storing, monitoring, and processing at different layers. Table 4.1 lists a few major components of the framework.
TABLE 4.1 Components in Hadoop v1
  • HDFS (Hadoop Distributed File System): Distributed file system that supports high-throughput access to data.
  • MapReduce: Programming platform that utilizes data stored in HDFS.
  • Hive: An infrastructure for querying and data warehousing solutions for Hadoop.
  • HBase: A distributed, scalable, NoSQL column-family storage system built on top of HDFS.
  • Pig: Scripting language for accessing data stored on HDFS.
  • Mahout: Machine learning library built on top of MapReduce.
  • Oozie: Scheduler and workflow engine for creating MapReduce jobs.
  • ZooKeeper: Tool to manage and synchronize configuration.
  • Ambari: Tool for monitoring the Hadoop cluster.
  • Impala: Massively parallel distributed database engine that can utilize Hadoop worker nodes for query processing.
Figure 4.1 illustrates a layered model of the Hadoop ecosystem. Each of these components has a distinct role. For instance, HDFS provides reliable storage for many Hadoop jobs, whereas MapReduce serves as the standard programming model. MapReduce performs computation jobs on the data stored by HDFS. MapReduce and HDFS are combined to promote data locality, a concept in which computation is performed locally on each node where data is stored. Data locality reduces network cost. On top of MapReduce sit the Pig and Hive components of Hadoop. These are extensions of MapReduce that provide scripting-based access for querying and processing data. Mahout is a MapReduce library that implements machine learning algorithms. HBase is a NoSQL column-oriented database built on top of HDFS, whereas Oozie is a workflow engine for MapReduce jobs. The purpose of ZooKeeper is to manage and synchronize configuration.
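The MapReduce programming model mentioned above can be sketched in miniature. The single-process word-count below uses plain Python purely for illustration; a real job would be written against Hadoop's API and run distributed across the cluster, but the map, shuffle, and reduce phases mirror what the framework does:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data systems", "big data platforms store big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"], counts["data"])   # 3 3
```

In a real deployment each mapper processes one HDFS block on the node that stores it, and the shuffle moves only the compact intermediate pairs across the network.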
In Chapter 5, we will study these components in detail.
FIGURE 4.1 Hadoop ecosystem

4.3 HADOOP DISTRIBUTED FILE SYSTEM

Figure 4.2 shows the architecture of the Hadoop Distributed File System (HDFS), which provides storage for Hadoop through a cluster of nodes. Data in HDFS is stored in the form of blocks, which are distributed and replicated across the cluster. The namenode stores metadata, i.e., information about blocks, whereas the data itself is stored on datanodes. Since ther...
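As a small illustration of how a file maps onto HDFS blocks, the sketch below assumes a 128 MB block size and a replication factor of 3 (common HDFS defaults; both are configurable per cluster and per file):

```python
import math

# Sketch: how a file is split into fixed-size blocks and replicated.
# The block size and replication factor are assumed defaults, not
# values taken from the text.

BLOCK_SIZE  = 128 * 1024 * 1024   # assumed 128 MB block size
REPLICATION = 3                   # assumed replication factor

def hdfs_blocks(file_size_bytes):
    """Number of blocks a file occupies; the last block may be partial."""
    return math.ceil(file_size_bytes / BLOCK_SIZE)

def total_replicas(file_size_bytes):
    """Total block replicas stored across the cluster's datanodes."""
    return hdfs_blocks(file_size_bytes) * REPLICATION

one_gb = 1024 * 1024 * 1024
print(hdfs_blocks(one_gb), total_replicas(one_gb))   # 8 24
```

The namenode tracks which datanodes hold each of those replicas; losing one datanode therefore costs at most one copy of any block.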

Table of contents

  1. Cover
  2. Half Title
  3. Title Page
  4. Copyright Page
  5. Dedication
  6. Contents
  7. Preface
  8. Author Bios
  9. Acknowledgments
  10. List of Examples
  11. List of Figures
  12. List of Tables
  13. SECTION I Introduction
  14. SECTION II Storage and Processing for Big Data
  15. SECTION III Networking, Security, and Privacy for Big Data
  16. SECTION IV Computation for Big Data
  17. SECTION V Case Studies and Future Trends
  18. Bibliography
  19. Index