Computer Science

Hadoop

Hadoop is an open-source framework designed for distributed storage and processing of large datasets across clusters of computers. It provides a reliable, scalable platform for big data analytics and supports the processing of structured and unstructured data. Hadoop's key components include the Hadoop Distributed File System (HDFS) for storage and MapReduce for data processing.

Written by Perlego with AI-assistance

11 Key excerpts on "Hadoop"

  • Big Data Analytics
    eBook - PDF

    Big Data Analytics

    A Practical Guide for Managers

    Hadoop is the primary standard for distributed computing for at least two reasons: (1) it has the power and the tools to manage distributed nodes and clusters, and (2) it is available free from the Apache Foundation. MapReduce and the Hadoop Distributed File System (HDFS) are the two main parts of Hadoop. Apache lists the following projects as related to Hadoop:
    • Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop
    • Avro: A data serialization system
    • Cassandra: A scalable multimaster database with no single points of failure and excellent performance
    • Chukwa: A data collection system for managing large distributed systems
    • HBase: A scalable, distributed database that supports structured data storage for large tables
    • Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying
    • Mahout: A scalable machine learning and data mining library
    • Pig: A high-level dataflow language and execution framework for parallel computation
    • ZooKeeper: A high-performance coordination service for distributed applications
    Advantages:
    • We can distribute data and computation.
    • Tasks are independent of one another.
    • We can more easily handle partial failure (entire nodes can fail and restart).
    • We avoid the pitfalls of failure-tolerant synchronous distributed systems.
    • We can use speculative execution to work around “laggards.”
    • It has a simple programming model: the end-user programmer only writes MapReduce tasks.
    • The system has relatively flat scalability (adding more nodes results in real improvements).
    The name Hadoop has gained ubiquity since Doug Cutting affixed the name of his son’s toy elephant to the application he created.
  • Data Analytics: Principles, Tools, and Practices
    eBook - ePub

    Data Analytics: Principles, Tools, and Practices

    A Complete Guide for Advanced Data Analytics Using the Latest Trends, Tools, and Technologies (English Edition)

    • Dr. Gaurav Aroraa, Chitra Lele, Dr. Munish Jindal(Authors)
    • 2022(Publication Date)
    • BPB Publications
      (Publisher)
    Because Hadoop is meant for big data analytics, its primary users are corporations with huge chunks of data to be analyzed and interpreted across multiple geographical locations. The International Data Corporation’s report shows that at least 32% of organizations already have Hadoop in place, and another 36% are preparing to use it within the next year. Another report by Gartner, Inc. forecasted that 30% of enterprises have already invested in Hadoop. Because Hadoop is flexible, additional data can be inserted, edited, or even deleted as per business process requirements. It offers cost-effective and practical solutions, and adding more storage units is hassle-free; readily available storage from IT vendors can be easily procured. Off-the-shelf software systems are rigid and complex to customize, whereas Hadoop, being open source, provides sufficient flexibility for organizations to tailor it the way they deem fit. Also, the easy availability of commercial versions in the market simplifies the installation process of the Hadoop framework. All these reasons are strong enough to make Hadoop popular amongst organizations. The Apache Hadoop framework comprises the following main modules:
    • Hadoop Common
    • Hadoop Distributed File System (HDFS)
    • Hadoop YARN
    • Hadoop MapReduce
    Figure 6.8: Hadoop Framework
    Hadoop Common
    It refers to the collection of libraries and other common tools that support the other Hadoop modules. Hadoop Common forms an integral part of the Apache Hadoop framework, along with the other modules: the Hadoop Distributed File System (HDFS), Hadoop YARN, and Hadoop MapReduce. It works with the assumption that hardware failures are pretty common and that these failures should be handled by default by the Hadoop framework software. It contains the basic and essential Java archive (JAR) files and scripts that are required to start Hadoop
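    As a minimal sketch of how these modules fit together from a client's point of view (not drawn from the book), the Configuration and FileSystem classes shipped with Hadoop Common can be used to talk to HDFS. The NameNode address and directory below are hypothetical placeholders.
    ```java
    // Minimal sketch: using the Hadoop Common client libraries to list a directory in HDFS.
    // The cluster address and path are placeholders, not values from the text.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsListing {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();            // loads core-site.xml/hdfs-site.xml if present on the classpath
            conf.set("fs.defaultFS", "hdfs://namenode:8020");     // hypothetical NameNode address
            FileSystem fs = FileSystem.get(conf);
            for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
                System.out.printf("%s\t%d bytes\treplication=%d%n",
                        status.getPath(), status.getLen(), status.getReplication());
            }
            fs.close();
        }
    }
    ```
    The same Configuration object is what HDFS, YARN, and MapReduce code all build on, which is why Hadoop Common sits underneath the other three modules.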
  • Big Data and Hadoop - 2nd Edition
    eBook - ePub

    Big Data and Hadoop - 2nd Edition

    Fundamentals, tools, and techniques for data-driven success (English Edition)

    The objective of this chapter is to address the challenges of processing and analyzing massive data efficiently. Hadoop achieves scalability, empowering businesses to handle large datasets. Fault tolerance ensures uninterrupted processing by redirecting tasks during node failures. Distributed computing enables parallel processing across nodes, optimizing resource utilization. Data locality minimizes network transfers, improving performance and keeping costs low, because moving computation is cheaper than moving data. Hadoop’s flexibility supports various processing models, enabling batch, interactive, and real-time processing. By focusing on scalability, fault tolerance, distributed computing, data locality, and flexibility, Hadoop revolutionizes data-intensive workloads. It empowers organizations to store, process, and analyze vast amounts of data, offering a solid foundation for big data processing in today’s data-driven world.
    Data distribution
    In a Hadoop cluster, data is distributed to all of the nodes of the cluster as it is being loaded in. The Hadoop Distributed File System (HDFS) splits massive data files into chunks, as shown in Figure 4.1, which are managed by different nodes in the cluster. Additionally, every chunk is replicated across several machines (the number of copies is called the replication factor), so a single machine failure does not result in any data being unavailable. An active monitoring system then re-replicates the data in response to system failures that could otherwise result in partial storage. Although the file chunks are replicated and distributed across many machines, they form a single namespace; therefore, their contents are universally accessible.
    Figure 4.1: Distributing data
    Hadoop limits the amount of communication between processes, as each record is processed in isolation by a task written by the programmer. Although this seems like a significant limitation at first, it makes the whole framework much more reliable.
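    A minimal sketch of how this chunking and replication is visible to client code, assuming a hypothetical file path: the FileSystem API reports each block's offset, length, and the DataNodes holding its replicas, and the replication factor can be adjusted per file.
    ```java
    // Sketch: inspecting how HDFS has split and replicated a file.
    // The file path is a placeholder; block size and default replication come from the cluster's configuration.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockReport {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/logs/2024-01-01.log");    // hypothetical file
            FileStatus status = fs.getFileStatus(file);

            // Each BlockLocation is one chunk of the file plus the hosts storing its replicas.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
            }

            // The replication factor can also be changed per file after the fact.
            fs.setReplication(file, (short) 3);
            fs.close();
        }
    }
    ```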
  • Big Data at Work
    eBook - PDF

    Big Data at Work

    Dispelling the Myths, Uncovering the Opportunities

    However, a few paragraphs here on big data structuring tools are worthwhile, simply because you as a manager will have to make decisions about whether or not to implement them within your organization. What’s new about big data technologies is primarily that the data can’t be handled well with traditional database software or with single servers. Traditional relational databases assume data in the form of neat rows and columns of numbers, and big data comes in a variety of diverse formats. Therefore, a new generation of data processing software has emerged to handle it. You’ll hear people talking often about Hadoop, an open-source software tool set and framework for dividing up data across multiple computers; it is a unified storage and processing environment that is highly scalable to large and complex data volumes. Hadoop is sometimes called Apache Hadoop, because the most common version of it is supported by The Apache Software Foundation. However, as tends to happen with open-source projects, many commercial vendors have created their own versions of Hadoop as well. There are Cloudera Hadoop, Hortonworks Hadoop, EMC Hadoop, Intel Hadoop, Microsoft Hadoop, and many more. One of the reasons Hadoop is necessary is that the volume of the big data means that it can’t be processed quickly on a single server, no matter how powerful. Splitting a computing task (say, an algorithm that compares many different photos to a specified photo to try to find a match) across multiple servers can reduce processing time by a hundredfold or more. Fortunately, the rise of big data coincides with the rise of inexpensive commodity servers with many (sometimes thousands of) computer processors. Another commonly used tool is MapReduce, a Google-developed framework for dividing big data processing across a group of linked computer nodes. Hadoop contains a version of MapReduce. These new technologies are by no means the only ones that organizations need to investigate.
  • Practical Big Data Analytics
    • Nataraj Dasgupta, Giancarlo Zaccone, Patrick Hannah(Authors)
    • 2018(Publication Date)
    • Packt Publishing
      (Publisher)

    Big Data With Hadoop

    Hadoop has become the de facto standard in the world of big data, especially over the past three to four years. Hadoop started as a subproject of Apache Nutch in 2006 and introduced two key features related to distributed filesystems and distributed computing, also known as MapReduce, that caught on very rapidly among the open source community. Today, there are thousands of new products that have been developed leveraging the core features of Hadoop, and it has evolved into a vast ecosystem consisting of more than 150 related major products. Arguably, Hadoop was one of the primary catalysts that started the big data and analytics industry.
    In this chapter, we will discuss the background and core concepts of Hadoop, the components of the Hadoop platform, and delve deeper into the major products in the Hadoop ecosystem. We will learn about the core concepts of distributed filesystems and distributed processing and optimizations to improve the performance of Hadoop deployments. We'll conclude with real-world hands-on exercises using the Cloudera Distribution of Hadoop (CDH). The topics we will cover are:
    • The basics of Hadoop
    • The core components of Hadoop
    • Hadoop 1 and Hadoop 2
    • The Hadoop Distributed File System
    • Distributed computing principles with MapReduce
    • The Hadoop ecosystem
    • Overview of the Hadoop ecosystem
    • Hive, HBase, and more
    • Hadoop Enterprise deployments
    • In-house deployments
    • Cloud deployments
    • Hands-on with Cloudera Hadoop
    • Using HDFS
    • Using Hive
    • MapReduce with WordCount

    The fundamentals of Hadoop

    In 2006, Doug Cutting, the creator of Hadoop, was working at Yahoo!. He was actively engaged in an open source project called Nutch that involved the development of a large-scale web crawler. A web crawler at a high level is essentially software that can browse and index web pages, generally in an automatic manner, on the internet. Intuitively, this involves efficient management and computation across large volumes of data. In late January of 2006, Doug formally announced the start of Hadoop. The first line of the request, still available on the internet at https://issues.apache.org/jira/browse/INFRA-700, was: "The Lucene PMC has voted to split part of Nutch into a new subproject named Hadoop"
  • Big Data with Hadoop MapReduce
    eBook - ePub

    Big Data with Hadoop MapReduce

    A Classroom Approach

    • Rathinaraja Jeyaraj, Ganeshkumar Pugalendhi, Anand Paul(Authors)
    • 2020(Publication Date)
    CHAPTER 2

    Hadoop Framework

    The pessimist sees difficulty in every opportunity.
    The optimist sees opportunity in every difficulty.
    —Winston Churchill

    INTRODUCTION

    To write a MapReduce job, we need to understand its execution flow and how data is framed at each step of that flow. Therefore, this chapter explains the Hadoop distributed file system and MapReduce in detail and demonstrates them with simple examples. Finally, we outline the shortcomings of MapReduce and some possible solutions to overcome them.
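    As a minimal sketch of the kind of simple example the chapter refers to, the canonical WordCount job below is written against the org.apache.hadoop.mapreduce API; the input and output paths are supplied on the command line and are not taken from the book.
    ```java
    // WordCount: the map phase emits (word, 1) pairs; the reduce phase sums the counts per word.
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: tokenize each input line and emit (word, 1).
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
    ```
    A typical invocation, assuming the class is packaged into a JAR, would be something like: hadoop jar wordcount.jar WordCount /input /output.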

    2.1 TERMINOLOGY

    It is essential to take a look at frequently used distributed computing terminology before exploring Hadoop's inner workings in more detail.
    File system – software that controls how data is stored on and retrieved from disks and other storage devices. It manages storage with different data access patterns, such as file-based (NTFS, ext4), object-based (Swift, S3), and block-based (Cinder, databases).
    Distributed system – A group of networked heterogeneous/homogeneous computers that work together as a single unit to accomplish a task. In simple words, a group of computers provides a single-computer view. Example: four computers, each with a dual-core processor, 4 GB of memory, and 1 TB of storage, together form a distributed system with 8 cores, 16 GB of memory, and 4 TB of storage. The term “distributed” means that more than one computer is involved, and data/programs can be moved from one computer to another.
    Distributed computing – managing a set of processes across a cluster of machines/processors that communicate and coordinate with each other by exchanging messages to finish a task.
    Distributed storage – A collection of storage devices from different computers in a cluster that provides a single-disk view.
    Distributed File System (DFS)
  • Machine Learning and Data Science
    eBook - PDF

    Machine Learning and Data Science

    Fundamentals and Applications

    • Prateek Agrawal, Charu Gupta, Anand Sharma, Vishu Madaan, Nisheeth Joshi(Authors)
    • 2022(Publication Date)
    • Wiley-Scrivener
      (Publisher)
    These data sets are structured, semi-structured, and unstructured. As Data Science has gained momentum, different platforms and tools are required to support it. The decision to opt for a particular tool or platform lies in the need of the organization and the type of data being generated. Different platforms have their advantages and limitations. The right combination of tools and infrastructure for processing Big Data can bring manyfold advancements in an organization and provide an edge in the industry. However, the current scale of data makes it technically difficult to be processed on a single system. Distributed processing of the data becomes a viable solution where data is shared among multiple computing nodes to be processed in parallel [14]. One of the highly preferred choices for distributed processing of large-scale data is Hadoop. It has emerged as a platform of choice for empowering Data Science. Hadoop serves as a scalable platform and an engine for performing computations. It is capable of providing abstraction while performing the distributed processing with ease.
    Table 9.3 Sources of Big Data: airplane black boxes, social media, online transactions, research data, stock exchanges, power grids, transport data, and search engine data.
    9.2.1 Anatomy of the Hadoop Ecosystem
    Components of Hadoop are divided into major and supporting components. Figure 9.3 shows the basic architecture of Hadoop. Without the major components, job processing cannot be accomplished, while the other components in Hadoop’s ecosystem provide add-on services to facilitate job processing. The major components are:
    • Yarn: Yarn is the managerial module of Hadoop [15]. It takes care of the scheduling process, resource allocation, and utilization. In a distributed environment, the key challenge is to manage the resources. Yarn efficiently manages these resources in Hadoop. The resource manager negotiates resources with the nodes to meet job demands.
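    A minimal sketch of where those negotiated resource limits come from: YARN reads them from the cluster configuration (yarn-site.xml). The property keys below are standard YARN settings; the fallback values passed to getInt() are illustrative only.
    ```java
    // Sketch: reading a few of the resource settings the ResourceManager and NodeManagers
    // use when negotiating resources for jobs.
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class YarnSettings {
        public static void main(String[] args) {
            YarnConfiguration conf = new YarnConfiguration();   // loads yarn-site.xml if present on the classpath

            int nodeMemoryMb    = conf.getInt("yarn.nodemanager.resource.memory-mb", 8192);
            int maxAllocationMb = conf.getInt("yarn.scheduler.maximum-allocation-mb", 8192);
            int nodeVcores      = conf.getInt("yarn.nodemanager.resource.cpu-vcores", 8);

            System.out.println("Memory offered per NodeManager (MB): " + nodeMemoryMb);
            System.out.println("Largest single container (MB): " + maxAllocationMb);
            System.out.println("vcores offered per NodeManager: " + nodeVcores);
        }
    }
    ```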
  • Big Data
    eBook - ePub

    Big Data

    Concepts, Technology, and Architecture

    • Balamurugan Balusamy, Nandhini Abirami R, Seifedine Kadry, Amir H. Gandomi(Authors)
    • 2021(Publication Date)
    • Wiley
      (Publisher)
    MapReduce is the programming paradigm of Hadoop. It can be used to write applications to process the massive data stored in Hadoop.
    Figure 2.1: Big data storage architecture.
    2.1 Cluster Computing
    Cluster computing is a distributed or parallel computing system comprising multiple stand-alone PCs connected together, working as a single, integrated, highly available resource. Multiple computing resources are connected together in a cluster to constitute a single larger and more powerful virtual computer, with each computing resource running an instance of the OS. The cluster components are connected together through local area networks (LANs). Cluster computing technology is used for high availability as well as load balancing, with better system performance and reliability. The benefits of massively parallel processors and cluster computers are high availability, scalable performance, fault tolerance, and the use of cost-effective commodity hardware. Scalability is achieved by removing nodes or adding additional nodes as per demand without hindering system operation. A cluster of systems connects together a group of systems to share critical computational tasks. The servers in a cluster are called nodes. Cluster computing can follow a client-server architecture or a peer-to-peer model. It provides high-speed computational power for processing data-intensive applications related to big data technologies. Cluster computing with a distributed computation infrastructure provides fast and reliable data processing power to gigantic-sized big data solutions with integrated and geographically separated autonomous resources. Clusters make a cost-effective solution for big data as they allow multiple applications to share computing resources. They are flexible to add more computing resources as required by the big data technology
  • Data-Intensive Text Processing with MapReduce
    • Jimmy Lin, Chris Dyer(Authors)
    • 2022(Publication Date)
    • Springer
      (Publisher)
    Chapter 1: Introduction
    MapReduce [45] is a programming model for expressing distributed computations on massive amounts of data and an execution framework for large-scale data processing on clusters of commodity servers. It was originally developed by Google and built on well-known principles in parallel and distributed processing dating back several decades. MapReduce has since enjoyed widespread adoption via an open-source implementation called Hadoop, whose development was led by Yahoo (now an Apache project). Today, a vibrant software ecosystem has sprung up around Hadoop, with significant activity in both industry and academia. This book is about scalable approaches to processing large amounts of text with MapReduce. Given this focus, it makes sense to start with the most basic question: Why? There are many answers to this question, but we focus on two. First, “big data” is a fact of the world, and therefore an issue that real-world systems must grapple with. Second, across a wide range of text processing applications, more data translates into more effective algorithms, and thus it makes sense to take advantage of the plentiful amounts of data that surround us. Modern information societies are defined by vast repositories of data, both public and private. Therefore, any practical application must be able to scale up to datasets of interest. For many, this means scaling up to the web, or at least a non-trivial fraction thereof. Any organization built around gathering, analyzing, monitoring, filtering, searching, or organizing web content must tackle large-data problems: “web-scale” processing is practically synonymous with data-intensive processing. This observation applies not only to well-established internet companies, but also to countless startups and niche players as well.
  • R for Programmers
    eBook - PDF

    R for Programmers

    Mastering the Tools

    • Dan Zhang(Author)
    • 2016(Publication Date)
    • CRC Press
      (Publisher)
    Chapter 7: RHadoop
    This chapter mainly introduces how to use R to access a Hadoop cluster through the RHadoop tools, helping readers manage HDFS with R, develop MapReduce programs, and access HBase. The chapter uses R to implement MapReduce program cases based on a collaborative filtering algorithm; this is more concise than Java.
    7.1 R Has Injected Statistical Elements into Hadoop
    Question: Why should we combine R with Hadoop? (http://blog.fens.me/r-Hadoop-intro/)
    R and Hadoop belong to two different disciplines. They have different user groups, they are based on two different knowledge systems, and they do different things. But data, as their intersection, have made the combination of R and Hadoop an interdisciplinary choice, a tool to mine the value of data.
    7.1.1 Introduction to Hadoop
    For people in the IT world, Hadoop is a rather familiar technology. Hadoop is a distributed system infrastructure managed by the Apache Foundation. Users can develop distributed programs even if they know little about the underlying details of the distributed architecture, and make full use of the high-speed computation and storage of clusters. Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault tolerant and designed to be deployed on low-cost hardware. In addition, it provides high-throughput access to application data, which makes it suitable for the statistical analysis of data warehouses. Hadoop has many family members, including Hive, HBase, ZooKeeper, Avro, Pig, Ambari, Sqoop, Mahout, Chukwa, and so forth. I’ll provide a brief introduction to these.
    ◾ Hive is a data warehouse tool based on Hadoop. It can map structured data files to database tables and quickly produce simple MapReduce statistics through SQL-like statements.
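    As a minimal sketch of that SQL-like interface, a Hive query can be issued from client code through the HiveServer2 JDBC driver; the host, port, credentials, and table below are hypothetical placeholders, not details from the passage.
    ```java
    // Sketch: running a SQL-like Hive query over JDBC (HiveServer2).
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://hiveserver:10000/default", "demo", "");   // placeholder host/credentials
                 Statement stmt = conn.createStatement();
                 // Hypothetical table mapped over files stored in HDFS.
                 ResultSet rs = stmt.executeQuery(
                         "SELECT word, COUNT(*) AS freq FROM word_counts GROUP BY word LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString("word") + "\t" + rs.getLong("freq"));
                }
            }
        }
    }
    ```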
  • Mining of Massive Datasets
    2 Large-Scale File Systems and Map-Reduce
    Modern Internet applications have created a need to manage immense amounts of data quickly. In many of these applications, the data is extremely regular, and there is ample opportunity to exploit parallelism. Important examples are: (1) the ranking of Web pages by importance, which involves an iterated matrix-vector multiplication where the dimension is in the tens of billions, and (2) searches in “friends” networks at social-networking sites, which involve graphs with hundreds of millions of nodes and many billions of edges. To deal with applications such as these, a new software stack has developed. It begins with a new form of file system, which features much larger units than the disk blocks in a conventional operating system and also provides replication of data to protect against the frequent media failures that occur when data is distributed over thousands of disks. On top of these file systems, we find higher-level programming systems developing. Central to many of these is a programming system called map-reduce. Implementations of map-reduce enable many of the most common calculations on large-scale data to be performed on large collections of computers, efficiently and in a way that is tolerant of hardware failures during the computation. Map-reduce systems are evolving and extending rapidly. We include in this chapter a discussion of generalizations of map-reduce, first to acyclic workflows and then to recursive algorithms. We conclude with a discussion of communication cost and what it tells us about the most efficient algorithms in this modern computing environment.
    2.1 Distributed File Systems
    Most computing is done on a single processor, with its main memory, cache, and local disk (a compute node). In the past, applications that called for parallel processing, such as large scientific calculations, were done on special-purpose parallel computers with many processors and specialized hardware.
Index pages curate the most relevant extracts from our library of academic textbooks. They’ve been created using an in-house natural language model (NLM), each adding context and meaning to key research topics.