Big Data Systems

A 360-degree Approach

eBook - ePub

  • 320 pages
  • English
  • ePUB (mobile friendly)
  • Available on iOS & Android

About this book

Big data systems face massive challenges related to data diversity, storage mechanisms, and the need for enormous computational power. Further, the capabilities of big data systems vary with the type of problem; for instance, distributed memory systems are not recommended for iterative algorithms. Similarly, big data systems also differ in their consistency and fault-tolerance characteristics. The purpose of this book is to provide a detailed explanation of big data systems. The book covers a variety of topics, including Networking, Security, Privacy, Storage, Computation, Cloud Computing, NoSQL and NewSQL systems, High Performance Computing, and Deep Learning. An illustrative and practical approach has been adopted in which theoretical topics are supported by well-explained programming and illustrative examples.

Key Features:

  • Introduces concepts and evolution of Big Data technology.
  • Illustrates examples for thorough understanding.
  • Contains programming examples for hands-on development.
  • Explains a variety of topics including NoSQL Systems, NewSQL systems, Security, Privacy, Networking, Cloud, High Performance Computing, and Deep Learning.
  • Exemplifies widely used big data technologies such as Hadoop and Spark.
  • Includes discussion on case studies and open issues.
  • Provides end of chapter questions for enhanced learning.

Big Data Systems: A 360-degree Approach by Jawwad Ahmed Shamsi and Muhammad Ali Khojaye is available in PDF and ePUB formats.

II

Storage and Processing for Big Data

CHAPTER 4

HADOOP: An Efficient Platform for Storing and Processing Big Data

CONTENTS
  • 4.1 Requirements for Processing and Storing Big Data
  • 4.2 Hadoop – The Big Picture
  • 4.3 Hadoop Distributed File System
    • 4.3.1 Benefits of Using HDFS
    • 4.3.2 Scalability of HDFS
    • 4.3.3 Size of Block
    • 4.3.4 Cluster Management
    • 4.3.5 Read and Write Operations
    • 4.3.6 Checkpointing and Failure Recovery
    • 4.3.7 HDFS Examples
  • 4.4 MapReduce
    • 4.4.1 MapReduce Operation
    • 4.4.2 Input Output
    • 4.4.3 The Partitioner Function
    • 4.4.4 Sorting by Keys
    • 4.4.5 The Combiner Function
    • 4.4.6 Counting Items
    • 4.4.7 Secondary Sorting
    • 4.4.8 Inverted Indexing
    • 4.4.9 Computing Inlinks and Outlinks
    • 4.4.10 Join Operations Using MapReduce
    • 4.4.11 MapReduce for Iterative Jobs
  • 4.5 HBase
    • 4.5.1 HBase and Hadoop
    • 4.5.2 HBase Architecture
    • 4.5.3 Installing HBase
    • 4.5.4 HBase and Relational Databases
  • 4.6 Concluding Remarks
  • 4.7 Further Reading
  • 4.8 Exercise Questions
In this chapter, we will study efficient platforms for storing and processing big data. Our main focus will be on MapReduce/Hadoop – a widely used platform for batch processing on big data. Finally, we will learn about HBase – a distributed, scalable, big data store built on top of HDFS.

4.1 REQUIREMENTS FOR PROCESSING AND STORING BIG DATA

Big data systems have mainly been used for batch processing with a focus on analytics. Since the amount of data is quite large, special considerations are needed:
  1. Scalability: A scalable solution is needed that can meet the growing demands of big data systems.
  2. Low Network Cost: The cost (time) to transfer data should be low. Predominantly, the time spent on computation should exceed the time spent transferring the data.
  3. Efficient Computation: Owing to the large amount of data, computation should be performed in parallel to reduce computation time.
  4. Fast and Rapid Retrieval: As big data systems are based on the principle of ‘write once, read many’, retrieval of data should be fast.
  5. Fault Tolerance: Network and hardware failures are inevitable. A big data storage and computation system should be fault tolerant.
The above-mentioned requirements are specific to big data systems. In addition, a big data system should provide a flexible and adaptable platform for programming and development.
Over the past two decades, big data systems have evolved on these principles. With time, their capacity and performance have also increased.
We will now discuss Hadoop, one of the most widely used big data platforms, which addresses these challenges.
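To make the low-network-cost requirement concrete, the toy calculation below compares the time to ship one block of data over the network with the time to scan it locally. All figures (128 MB block, 1 Gb/s network, 500 MB/s local scan rate) are illustrative assumptions, not measurements from the text:

```python
# Toy comparison of data-transfer time vs. local computation time.
# All rates below are illustrative assumptions.

def transfer_time(data_bytes, bandwidth_bytes_per_s):
    """Time to move the data across the network."""
    return data_bytes / bandwidth_bytes_per_s

def compute_time(data_bytes, processing_bytes_per_s):
    """Time to scan/process the data locally."""
    return data_bytes / processing_bytes_per_s

block = 128 * 1024 * 1024            # one 128 MB block (a common Hadoop default)
net   = 1024 * 1024 * 1024 // 8      # assumed 1 Gb/s network, i.e. 128 MB/s
disk  = 500 * 1024 * 1024            # assumed 500 MB/s local scan rate

t_net = transfer_time(block, net)    # time to ship the block to another node
t_cpu = compute_time(block, disk)    # time to scan it where it already sits

# Shipping the block costs noticeably more than scanning it in place,
# which is why big data systems move computation to the data.
print(f"transfer: {t_net:.2f} s, compute: {t_cpu:.2f} s")
```

Under these assumptions the transfer alone takes about four times as long as the local scan, which motivates the data-locality design discussed in Section 4.2.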

4.2 HADOOP – THE BIG PICTURE

Apache Hadoop is a collection of open-source software and platforms for big data. The strength of Hadoop stems from the fact that the framework can scale to hundreds or thousands of nodes built from commodity hardware. The Hadoop framework is able to recover from issues such as node failure and disk failure. It can provide consistency, replication, and monitoring at the application layer. Hadoop has several components that are responsible for storing, monitoring, and processing at different layers. Table 4.1 lists a few major components of the framework.
TABLE 4.1 Components in Hadoop v1
  • HDFS (Hadoop Distributed File System): Distributed file system that supports high-throughput access to data.
  • MapReduce: Programming platform that utilizes data stored in HDFS.
  • Hive: An infrastructure for querying and data warehousing solutions for Hadoop.
  • HBase: A distributed, scalable, NoSQL column-family storage system built on top of HDFS.
  • Pig: Scripting language for accessing data stored on HDFS.
  • Mahout: Machine learning library built on top of MapReduce.
  • Oozie: Scheduler and workflow engine for creating MapReduce jobs.
  • ZooKeeper: Tool to manage and synchronize configuration.
  • Ambari: Tool for monitoring the Hadoop cluster.
  • Impala: Massively parallel distributed database engine that can utilize Hadoop worker nodes for query processing.
Figure 4.1 illustrates a layered model of the Hadoop ecosystem. Each of these components has a distinct role. For instance, HDFS provides reliable storage for many Hadoop jobs, whereas MapReduce serves as the standard programming model. MapReduce performs computation jobs on the data stored by HDFS. MapReduce and HDFS are combined to promote data locality, a concept in which computation is performed locally on each node where data is stored. Data locality reduces network cost. On top of MapReduce sit the Pig and Hive components of Hadoop. These are extensions of MapReduce that provide scripting-based access for querying and processing data. Mahout is a MapReduce library that implements machine learning algorithms. HBase is a NoSQL column-oriented database built on top of HDFS, whereas Oozie is a workflow engine for MapReduce jobs. The purpose of ZooKeeper is to manage and synchronize configuration.
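The MapReduce programming model mentioned above can be sketched in miniature. The single-process word-count below uses plain Python purely for illustration; a real job would be written against Hadoop's API and run distributed across the cluster, but the map, shuffle, and reduce phases mirror what the framework does:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data systems", "big data platforms store big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"], counts["data"])   # 3 3
```

In a real deployment each mapper processes one HDFS block on the node that stores it, and the shuffle moves only the compact intermediate pairs across the network.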
In Chapter 5, we will study these components in detail.
FIGURE 4.1 Hadoop ecosystem

4.3 HADOOP DISTRIBUTED FILE SYSTEM

Figure 4.2 shows the architecture of the Hadoop Distributed File System (HDFS), which provides storage for Hadoop through a cluster of nodes. Data in HDFS is stored in the form of blocks, which are distributed and replicated across the cluster. The namenode stores metadata, i.e., information about blocks, whereas the data itself is stored on datanodes. Since ther...
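As a small illustration of how a file maps onto HDFS blocks, the sketch below assumes a 128 MB block size and a replication factor of 3 (common HDFS defaults; both are configurable per cluster and per file):

```python
import math

# Sketch: how a file is split into fixed-size blocks and replicated.
# The block size and replication factor are assumed defaults, not
# values taken from the text.

BLOCK_SIZE  = 128 * 1024 * 1024   # assumed 128 MB block size
REPLICATION = 3                   # assumed replication factor

def hdfs_blocks(file_size_bytes):
    """Number of blocks a file occupies; the last block may be partial."""
    return math.ceil(file_size_bytes / BLOCK_SIZE)

def total_replicas(file_size_bytes):
    """Total block replicas stored across the cluster's datanodes."""
    return hdfs_blocks(file_size_bytes) * REPLICATION

one_gb = 1024 * 1024 * 1024
print(hdfs_blocks(one_gb), total_replicas(one_gb))   # 8 24
```

The namenode tracks which datanodes hold each of those replicas; losing one datanode therefore costs at most one copy of any block.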

Table of contents

  1. Cover
  2. Half Title
  3. Title Page
  4. Copyright Page
  5. Dedication
  6. Contents
  7. Preface
  8. Author Bios
  9. Acknowledgments
  10. List of Examples
  11. List of Figures
  12. List of Tables
  13. SECTION I Introduction
  14. SECTION II Storage and Processing for Big Data
  15. SECTION III Networking, Security, and Privacy for Big Data
  16. SECTION IV Computation for Big Data
  17. SECTION V Case Studies and Future Trends
  18. Bibliography
  19. Index