Big Data Systems
A 360-degree Approach
Jawwad Ahmed Shamsi, Muhammad Ali Khojaye
- 320 pages
- English
- ePUB (mobile friendly)
- Available on iOS and Android
Book Information
Big data systems involve major challenges related to data diversity, storage mechanisms, and the requirement of massive computational power. Further, the capabilities of big data systems vary with the type of problem being solved. For instance, distributed memory systems are not recommended for iterative algorithms. Similarly, big data systems differ in how they handle consistency and fault tolerance. The purpose of this book is to provide a detailed explanation of big data systems. The book covers a variety of topics, including Networking, Security, Privacy, Storage, Computation, Cloud Computing, NoSQL and NewSQL systems, High Performance Computing, and Deep Learning. An illustrative and practical approach has been adopted, in which theoretical topics are supported by well-explained programming and illustrative examples.
Key Features:
- Introduces the concepts and evolution of Big Data technology.
- Provides illustrative examples for thorough understanding.
- Contains programming examples for hands-on development.
- Explains a variety of topics including NoSQL Systems, NewSQL systems, Security, Privacy, Networking, Cloud, High Performance Computing, and Deep Learning.
- Exemplifies widely used big data technologies such as Hadoop and Spark.
- Includes discussion on case studies and open issues.
- Provides end-of-chapter questions for enhanced learning.
PART II
Storage and Processing for Big Data
CHAPTER 4
HADOOP: An Efficient Platform for Storing and Processing Big Data
- 4.1 Requirements for Processing and Storing Big Data
- 4.2 Hadoop – The Big Picture
- 4.3 Hadoop Distributed File System
- 4.3.1 Benefits of Using HDFS
- 4.3.2 Scalability of HDFS
- 4.3.3 Size of Block
- 4.3.4 Cluster Management
- 4.3.5 Read and Write Operations
- 4.3.6 Checkpointing and Failure Recovery
- 4.3.7 HDFS Examples
- 4.4 MapReduce
- 4.4.1 MapReduce Operation
- 4.4.2 Input Output
- 4.4.3 The Partitioner Function
- 4.4.4 Sorting by Keys
- 4.4.5 The Combiner Function
- 4.4.6 Counting Items
- 4.4.7 Secondary Sorting
- 4.4.8 Inverted Indexing
- 4.4.9 Computing Inlinks and Outlinks
- 4.4.10 Join Operations Using MapReduce
- 4.4.11 MapReduce for Iterative Jobs
- 4.5 HBase
- 4.5.1 HBase and Hadoop
- 4.5.2 HBase Architecture
- 4.5.3 Installing HBase
- 4.5.4 HBase and Relational Databases
- 4.6 Concluding Remarks
- 4.7 Further Reading
- 4.8 Exercise Questions
4.1 REQUIREMENTS FOR PROCESSING AND STORING BIG DATA
- Scalability: A scalable solution is needed that can meet the growing demands of big data systems.
- Low Network Cost: The cost (in time) of transferring data should be low; time spent on computation should dominate time spent moving data across the network.
- Efficient Computation: Owing to the large volume of data, computation should be performed in parallel in order to reduce computation time.
- Fast and Rapid Retrieval: As big data systems are based on the principle of ‘write once, read many’, retrieval of data should be fast (see the HDFS read sketch after this list).
- Fault Tolerance: Network and hardware failures are inevitable. A big data storage and computation system should be fault tolerant.
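The ‘write once, read many’ access pattern is reflected directly in the client interface of HDFS, which is introduced in Section 4.3. As a first illustration, the following is a minimal sketch, assuming a cluster whose fs.defaultFS configuration points to a NameNode, that reads an existing HDFS file line by line through the Java FileSystem API; the path /data/input.txt is a hypothetical example.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    // Connect to the default file system; on a Hadoop cluster this is HDFS
    // when fs.defaultFS points to the NameNode.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical path: a file that was written once and is now read many times.
    Path file = new Path("/data/input.txt");

    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(fs.open(file)))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```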
4.2 HADOOP – THE BIG PICTURE
| Component | Purpose |
|---|---|
| HDFS (Hadoop Distributed File System) | Distributed file system that supports high-throughput access to data. |
| MapReduce | Programming platform that processes data stored in HDFS. |
| Hive | Infrastructure for querying and data warehousing on top of Hadoop. |
| HBase | A distributed, scalable, NoSQL column-family storage system built on top of HDFS. |
| Pig | Scripting language for accessing data stored in HDFS. |
| Mahout | Machine learning library built on top of MapReduce. |
| Oozie | Scheduler and workflow engine for creating MapReduce jobs. |
| ZooKeeper | Tool for managing and synchronizing configuration. |
| Ambari | Tool for monitoring the Hadoop cluster. |
| Impala | Massively parallel distributed database engine that can utilize Hadoop worker nodes for query processing. |
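To make the roles of HDFS and MapReduce in the table above concrete, the sketch below shows the canonical word-count job written against Hadoop's Java MapReduce API: the mapper emits (word, 1) pairs for text read from HDFS, and the reducer sums the counts for each word. The input and output paths supplied on the command line are assumptions for illustration; MapReduce itself is covered in detail in Section 4.4.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in an input line read from HDFS.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A job like this is typically packaged as a JAR and submitted with `hadoop jar wordcount.jar WordCount /input /output`, reading its input from and writing its results to HDFS directories.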