Big Data Systems
A 360-degree Approach
Jawwad Ahmed Shamsi, Muhammad Ali Khojaye
- 320 pages
- English
- ePUB (mobile friendly)
- Available on iOS and Android
Book Information
Big data systems involve major challenges related to data diversity, storage mechanisms, and the requirement of massive computational power. Further, the capabilities of big data systems vary with the type of problem being solved. For instance, distributed memory systems are not recommended for iterative algorithms. Similarly, big data systems differ in how they handle consistency and fault tolerance. The purpose of this book is to provide a detailed explanation of big data systems. The book covers a variety of topics, including Networking, Security, Privacy, Storage, Computation, Cloud Computing, NoSQL and NewSQL systems, High Performance Computing, and Deep Learning. An illustrative and practical approach has been adopted, in which theoretical topics are supported by well-explained programming and illustrative examples.
Key Features:
- Introduces the concepts and evolution of Big Data technology.
- Provides illustrative examples for thorough understanding.
- Contains programming examples for hands-on development.
- Explains a variety of topics including NoSQL Systems, NewSQL systems, Security, Privacy, Networking, Cloud, High Performance Computing, and Deep Learning.
- Exemplifies widely used big data technologies such as Hadoop and Spark.
- Includes discussion on case studies and open issues.
- Provides end-of-chapter questions for enhanced learning.
PART II
Storage and Processing for Big Data
CHAPTER 4
HADOOP: An Efficient Platform for Storing and Processing Big Data
- 4.1 Requirements for Processing and Storing Big Data
- 4.2 Hadoop – The Big Picture
- 4.3 Hadoop Distributed File System
- 4.3.1 Benefits of Using HDFS
- 4.3.2 Scalability of HDFS
- 4.3.3 Size of Block
- 4.3.4 Cluster Management
- 4.3.5 Read and Write Operations
- 4.3.6 Checkpointing and Failure Recovery
- 4.3.7 HDFS Examples
- 4.4 MapReduce
- 4.4.1 MapReduce Operation
- 4.4.2 Input Output
- 4.4.3 The Partitioner Function
- 4.4.4 Sorting by Keys
- 4.4.5 The Combiner Function
- 4.4.6 Counting Items
- 4.4.7 Secondary Sorting
- 4.4.8 Inverted Indexing
- 4.4.9 Computing Inlinks and Outlinks
- 4.4.10 Join Operations Using MapReduce
- 4.4.11 MapReduce for Iterative Jobs
- 4.5 HBase
- 4.5.1 HBase and Hadoop
- 4.5.2 HBase Architecture
- 4.5.3 Installing HBase
- 4.5.4 HBase and Relational Databases
- 4.6 Concluding Remarks
- 4.7 Further Reading
- 4.8 Exercise Questions
4.1 REQUIREMENTS FOR PROCESSING AND STORING BIG DATA
- Scalability: A scalable solution is needed that can meet the growing demands of big data systems.
- Low Network Cost: The cost (in time) of transferring data should be low; time spent on computation should dominate time spent moving data across the network.
- Efficient Computation: Owing to the large volume of data, computation should be performed in parallel in order to reduce computation time.
- Fast and Rapid Retrieval: As big data systems are based on the principle of ‘write once, read many’, retrieval of data should be fast (see the HDFS read sketch after this list).
- Fault Tolerance: Network and hardware failures are inevitable. A big data storage and computation system should be fault tolerant.
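The ‘write once, read many’ access pattern is reflected directly in the client interface of HDFS, which is introduced in Section 4.3. As a first illustration, the following is a minimal sketch, assuming a cluster whose fs.defaultFS configuration points to a NameNode, that reads an existing HDFS file line by line through the Java FileSystem API; the path /data/input.txt is a hypothetical example.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    // Connect to the default file system; on a Hadoop cluster this is HDFS
    // when fs.defaultFS points to the NameNode.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical path: a file that was written once and is now read many times.
    Path file = new Path("/data/input.txt");

    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(fs.open(file)))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```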
4.2 HADOOP – THE BIG PICTURE
| Component | Purpose |
|---|---|
| HDFS (Hadoop Distributed File System) | Distributed file system that supports high-throughput access to data. |
| MapReduce | Programming platform that processes data stored in HDFS. |
| Hive | Infrastructure for querying and data warehousing on top of Hadoop. |
| HBase | A distributed, scalable, NoSQL column-family storage system built on top of HDFS. |
| Pig | Scripting language for accessing data stored in HDFS. |
| Mahout | Machine learning library built on top of MapReduce. |
| Oozie | Scheduler and workflow engine for creating MapReduce jobs. |
| ZooKeeper | Tool for managing and synchronizing configuration. |
| Ambari | Tool for monitoring the Hadoop cluster. |
| Impala | Massively parallel distributed database engine that can utilize Hadoop worker nodes for query processing. |
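To make the roles of HDFS and MapReduce in the table above concrete, the sketch below shows the canonical word-count job written against Hadoop's Java MapReduce API: the mapper emits (word, 1) pairs for text read from HDFS, and the reducer sums the counts for each word. The input and output paths supplied on the command line are assumptions for illustration; MapReduce itself is covered in detail in Section 4.4.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in an input line read from HDFS.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A job like this is typically packaged as a JAR and submitted with `hadoop jar wordcount.jar WordCount /input /output`, reading its input from and writing its results to HDFS directories.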