Simplify Big Data Analytics with Amazon EMR

eBook - ePub

  1. 430 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About this book

Design scalable big data solutions using Hadoop, Spark, and AWS cloud-native services.

Key Features
  • Build data pipelines that require distributed processing capabilities on a large volume of data
  • Discover the security features of EMR, such as data protection and granular permission management
  • Explore best practices and optimization techniques for building data analytics solutions in Amazon EMR

Book Description
Amazon EMR, formerly Amazon Elastic MapReduce, provides a managed Hadoop cluster in Amazon Web Services (AWS) that you can use to implement batch or streaming data pipelines. By gaining expertise in Amazon EMR, you can design and implement data analytics pipelines with persistent or transient EMR clusters in AWS.
This book is a practical guide to building data pipelines with Amazon EMR. You'll start by understanding the Amazon EMR architecture, cluster nodes, features, and deployment options, along with their pricing. Next, the book covers the various big data applications that EMR supports. You'll then focus on the advanced configuration of EMR applications, hardware, networking, security, troubleshooting, logging, and the different SDKs and APIs it provides. Later chapters show you how to implement common Amazon EMR use cases, including batch ETL with Spark, real-time streaming with Spark Streaming, and handling UPSERT in an S3 data lake with Apache Hudi. Finally, you'll orchestrate your EMR jobs and strategize the migration of an on-premises Hadoop cluster to EMR. Along the way, you'll explore best practices and cost-optimization techniques for implementing your data analytics pipeline in EMR.
By the end of this book, you'll be able to build and deploy Hadoop- or Spark-based apps on Amazon EMR and also migrate your existing on-premises Hadoop workloads to AWS.

What You Will Learn
  • Explore Amazon EMR features, architecture, Hadoop interfaces, and EMR Studio
  • Configure, deploy, and orchestrate Hadoop or Spark jobs in production
  • Implement the security, data governance, and monitoring capabilities of EMR
  • Build applications for batch and real-time streaming data analytics solutions
  • Perform interactive development with a persistent EMR cluster and Notebook
  • Orchestrate an EMR Spark job using AWS Step Functions and Apache Airflow

Who This Book Is For
This book is for data engineers, data analysts, data scientists, and solution architects who are interested in building data analytics solutions with the Hadoop ecosystem services and Amazon EMR. Prior experience in Python, Scala, or Java and a basic understanding of Hadoop and AWS will help you make the most of this book.


Section 1: Overview, Architecture, Big Data Applications, and Common Use Cases of Amazon EMR

This section will provide an overview of Amazon EMR, along with its architecture, cluster nodes, features, benefits, deployment options, and pricing. It will then provide an overview of the different big data applications that EMR supports and showcase common architecture patterns seen with Amazon EMR.
This section comprises the following chapters:
  • Chapter 1, An Overview of Amazon EMR
  • Chapter 2, Exploring the Architecture and Deployment Options
  • Chapter 3, Common Use Cases and Architecture Patterns
  • Chapter 4, Big Data Applications and Notebooks Available in Amazon EMR

Chapter 1: An Overview of Amazon EMR

This chapter will provide an overview of Amazon Elastic MapReduce (EMR), its benefits related to big data processing, and how its cluster is designed compared to on-premises Hadoop clusters. It will then explain how Amazon EMR integrates with other Amazon Web Services (AWS) services and how you can build a Lake House architecture in AWS.
You will then learn the differences between the Amazon EMR, AWS Glue, and AWS Glue DataBrew services. Understanding these differences will make you aware of the options available when deploying Hadoop or Spark workloads in AWS.
Before you start this chapter, it is assumed that you are familiar with Hadoop-based big data processing workloads, have had exposure to basic AWS concepts, and are looking for an overview of the Amazon EMR service so that you can use it for your big data processing workloads.
The following topics will be covered in this chapter:
  • What is Amazon EMR?
  • Overview of Amazon EMR
  • Decoupling compute and storage
  • Integration with other AWS services
  • EMR release history
  • Comparing Amazon EMR with AWS Glue and AWS Glue DataBrew

What is Amazon EMR?

Amazon EMR is an AWS service that provides a distributed cluster for big data processing. Before diving deep into EMR, let's first understand what big data represents, the problem to which EMR is a solution.

What is big data?

Enormous datasets date back to the 1970s, when the world of data was just getting started with data centers and the development of relational databases, even though the concept of big data was still far off. Over the next several decades, successive technology revolutions brought personal desktop computers, then laptops, and then mobile devices. As people gained access to these devices, the data being generated started growing exponentially.
Around 2005, people started to realize just how much data users generate. Social platforms such as Facebook, Twitter, and YouTube were generating data faster than ever as users gained access to smart products and internet-based services.
Put simply, big data refers to large, complex datasets, particularly those derived from new data sources. These datasets are so large that traditional data processing software can't store and process them efficiently. But these massive volumes of data are of great use when we need to derive insights from them and address business problems that we could not tackle before. For example, an organization can analyze its users' or customers' interactions with its social pages or website to identify their sentiment toward its business and products.
Big data is often described by the five Vs. It started with three Vs (volume, velocity, and variety), but as the field evolved, the accuracy and worth of data also became major considerations, and veracity and value were added to make it five. These five Vs are explained as follows:
  • Volume: This represents the amount of data you have for analysis, and it varies greatly from organization to organization. It can range from terabytes to petabytes in scale.
  • Velocity: This represents the speed at which data is collected or processed for analysis. It can be a daily data feed you receive from a vendor or a real-time streaming use case where data arrives every second or every minute.
  • Variety: This refers to the different forms or types of data you receive for processing or analysis (see the short example after this list). In general, they are broadly categorized into the following three:
    • Structured: Organized data with a fixed schema. It can come from relational databases or from CSV and other delimited files.
    • Semi-structured: Partially organized data that does not have a fixed schema, for example, XML or JSON files.
    • Unstructured: Datasets without a schema to follow, typically represented by media files, for example, audio or video files.
  • Veracity: This represents how reliable or truthful your data is. When you plan to analyze big data and derive insights from it, the accuracy and quality of the data matter.
  • Value: This refers to the worth of the data you have collected, as it is meant to yield insights that can help the business drive growth.
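To make the variety dimension concrete, the following is a minimal PySpark sketch (Spark being one of the engines EMR runs) that reads structured CSV data and semi-structured JSON data. The S3 bucket and file names are hypothetical placeholders, not examples from this book:

    # Minimal sketch: reading structured vs. semi-structured data with PySpark.
    # The S3 paths below are hypothetical placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("variety-example").getOrCreate()

    # Structured: delimited data with a fixed schema (header row names the columns).
    orders = spark.read.option("header", "true").csv("s3://my-bucket/raw/orders.csv")

    # Semi-structured: JSON records may each carry a different set of fields;
    # Spark infers a schema by merging the fields it observes.
    events = spark.read.json("s3://my-bucket/raw/events.json")

    orders.printSchema()
    events.printSchema()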
With the evolution of big data, the primary challenge became how to process such huge volumes of data, because typical single-system processing frameworks could not handle them. What was needed was a distributed computing framework capable of parallel processing.
Now that we have covered what big data represents, let's look at how the Hadoop framework helped solve this big data processing problem and why it became so popular.

Hadoop – a processing framework to handle big data

Though various technologies and frameworks emerged to handle big data, the one that gained the most traction is Hadoop, an open source framework designed specifically for storing and analyzing big datasets. It allows multiple computers to be combined into a cluster that performs parallel, distributed processing on gigabyte- to petabyte-scale data.
The following data flow model explains how input data is collected and stored in the Hadoop Distributed File System (HDFS), then processed with big data frameworks such as Hive, Pig, or Spark, and how the transformed output becomes available for consumption or is transferred to downstream systems or external vendors. At a high level, input data is collected and stored as raw data, processed as needed for analysis, and then made available for consumption:
Figure 1.1 – Data flow in a Hadoop cluster
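To ground this flow, the following is a minimal PySpark sketch of the same three stages: collect, process, and consume. The paths, column names, and aggregation are illustrative assumptions rather than an example taken from a real pipeline:

    # Minimal sketch of the Figure 1.1 flow: raw input is read from
    # distributed storage, transformed, and written back for consumption.
    # Paths and column names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("hadoop-data-flow").getOrCreate()

    # 1. Collect/store: raw input has already landed in HDFS (or S3 on EMR).
    raw = spark.read.json("hdfs:///data/raw/clickstream/")

    # 2. Process: filter and aggregate the raw records for analysis.
    page_views = (
        raw.filter(F.col("event_type") == "page_view")
           .groupBy("page_id")
           .count()
    )

    # 3. Consume: write the transformed output for downstream systems.
    page_views.write.mode("overwrite").parquet("hdfs:///data/processed/page_views/")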
The following are the main basic components of Hadoop:
  • HDFS: A distributed filesystem that runs on commodity hardware and provides improved data throughput compared to traditional filesystems, along with higher reliability through a built-in fault tolerance mechanism.
  • Yet Another Resource Negotiator (YARN): When multiple compute nodes are involved in parallel processing, YARN manages and monitors compute CPU and memory resources and also handles the scheduling of jobs and tasks.
  • MapReduce: A distributed framework with two basic modules, map and reduce. A map task reads data from HDFS or another distributed storage layer and converts it into key-value pairs, which then become the input to the reduce tasks, which aggregate the map output into the final result, as the sketch below illustrates.
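To illustrate the two modules, here is the classic word-count job written for Hadoop Streaming, which lets you implement the map and reduce tasks as plain Python scripts that read from stdin and write tab-separated key-value pairs to stdout. This is a minimal sketch of the general technique rather than code from this book:

    # mapper.py: emit (word, 1) for every word in the input split.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py: Hadoop sorts the mapper output by key, so all counts for a
    # given word arrive consecutively and can be summed with one running total.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)

    if current_word is not None:
        print(f"{current_word}\t{current_count}")

The two scripts would typically be submitted with the Hadoop Streaming JAR, along the lines of hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out, with the exact JAR path depending on your distribution.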

Table of contents

  1. Simplify Big Data Analytics with Amazon EMR
  2. Contributors
  3. Preface
  4. Section 1: Overview, Architecture, Big Data Applications, and Common Use Cases of Amazon EMR
  5. Chapter 1: An Overview of Amazon EMR
  6. Chapter 2: Exploring the Architecture and Deployment Options
  7. Chapter 3: Common Use Cases and Architecture Patterns
  8. Chapter 4: Big Data Applications and Notebooks Available in Amazon EMR
  9. Section 2: Configuration, Scaling, Data Security, and Governance
  10. Chapter 5: Setting Up and Configuring EMR Clusters
  11. Chapter 6: Monitoring, Scaling, and High Availability
  12. Chapter 7: Understanding Security in Amazon EMR
  13. Chapter 8: Understanding Data Governance in Amazon EMR
  14. Section 3: Implementing Common Use Cases and Best Practices
  15. Chapter 9: Implementing Batch ETL Pipeline with Amazon EMR and Apache Spark
  16. Chapter 10: Implementing Real-Time Streaming with Amazon EMR and Spark Streaming
  17. Chapter 11: Implementing UPSERT on S3 Data Lake with Apache Spark and Apache Hudi
  18. Chapter 12: Orchestrating Amazon EMR Jobs with AWS Step Functions and Apache Airflow/MWAA
  19. Chapter 13: Migrating On-Premises Hadoop Workloads to Amazon EMR
  20. Chapter 14: Best Practices and Cost-Optimization Techniques
  21. Other Books You May Enjoy