
- 430 pages
- English
- ePUB (mobile friendly)
- Available on iOS & Android
Simplify Big Data Analytics with Amazon EMR
About this book
Design scalable big data solutions using Hadoop, Spark, and AWS cloud-native services.

Key Features
- Build data pipelines that require distributed processing capabilities on a large volume of data
- Discover the security features of EMR, such as data protection and granular permission management
- Explore best practices and optimization techniques for building data analytics solutions in Amazon EMR

Book Description
Amazon EMR, formerly Amazon Elastic MapReduce, provides a managed Hadoop cluster in Amazon Web Services (AWS) that you can use to implement batch or streaming data pipelines. By gaining expertise in Amazon EMR, you can design and implement data analytics pipelines with persistent or transient EMR clusters in AWS.

This book is a practical guide to Amazon EMR for building data pipelines. You'll start by understanding the Amazon EMR architecture, cluster nodes, features, and deployment options, along with their pricing. Next, the book covers the various big data applications that EMR supports. You'll then focus on the advanced configuration of EMR applications, hardware, networking, security, troubleshooting, logging, and the different SDKs and APIs it provides. Later chapters will show you how to implement common Amazon EMR use cases, including batch ETL with Spark, real-time streaming with Spark Streaming, and handling UPSERT in an S3 data lake with Apache Hudi. Finally, you'll orchestrate your EMR jobs and strategize on-premises Hadoop cluster migration to EMR. In addition, you'll explore best practices and cost-optimization techniques while implementing your data analytics pipeline in EMR.

By the end of this book, you'll be able to build and deploy Hadoop- or Spark-based apps on Amazon EMR and also migrate your existing on-premises Hadoop workloads to AWS.

What you will learn
- Explore Amazon EMR features, architecture, Hadoop interfaces, and EMR Studio
- Configure, deploy, and orchestrate Hadoop or Spark jobs in production
- Implement the security, data governance, and monitoring capabilities of EMR
- Build applications for batch and real-time streaming data analytics solutions
- Perform interactive development with a persistent EMR cluster and Notebook
- Orchestrate an EMR Spark job using AWS Step Functions and Apache Airflow

Who this book is for
This book is for data engineers, data analysts, data scientists, and solution architects who are interested in building data analytics solutions with the Hadoop ecosystem services and Amazon EMR. Prior experience in Python, Scala, or Java programming and a basic understanding of Hadoop and AWS will help you make the most of this book.
Section 1: Overview, Architecture, Big Data Applications, and Common Use Cases of Amazon EMR
- Chapter 1, An Overview of Amazon EMR
- Chapter 2, Exploring the Architecture and Deployment Options
- Chapter 3, Common Use Cases and Architecture Patterns
- Chapter 4, Big Data Applications and Notebooks Available in Amazon EMR
Chapter 1: An Overview of Amazon EMR
- What is Amazon EMR?
- Overview of Amazon EMR
- Decoupling compute and storage
- Integration with other AWS services
- EMR release history
- Comparing Amazon EMR with AWS Glue and AWS Glue DataBrew
What is Amazon EMR?
What is big data?
- Volume: This represents the amount of data you have for analysis; it varies from organization to organization and can range from terabytes to petabytes in scale.
- Velocity: This represents the speed at which data is being collected or processed for analysis. This can be a daily data feed you receive from your vendor or a real-time streaming use case, where you receive data every second to every minute.
- Variety: This refers to the different forms or types of data you receive for processing or analysis. In general, they are broadly categorized into the following three:
- Structured: Organized data format with a fixed schema. It can be from relational databases or CSVs or delimited files.
- Semi-structured: Partially organized data that does not have a fixed schema, for example, XML or JSON files.
- Unstructured: Datasets that don't follow any schema, typically represented by media files, for example, audio or video files.
- Veracity: This represents how reliable or truthful your data is. When you plan to analyze big data and derive insights out of it, the accuracy or quality of the data matters.
- Value: This is often referred to as the worth of the data you have collected as it is meant to give insights that can help the business drive growth.
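To make the variety categories above concrete, here is a minimal Python sketch (illustrative only; the sample records are hypothetical) contrasting structured CSV data, where every row follows a fixed schema, with semi-structured JSON, where records may carry different fields:

```python
import csv
import io
import json

# Structured: fixed schema, every row has the same columns.
csv_data = io.StringIO("id,name,amount\n1,alice,9.5\n2,bob,3.2\n")
rows = list(csv.DictReader(csv_data))

# Semi-structured: no fixed schema; each record may carry different fields.
json_records = [
    json.loads('{"id": 1, "name": "alice", "tags": ["vip"]}'),
    json.loads('{"id": 2, "amount": 3.2}'),  # different fields, still valid
]

print(rows[0]["name"])              # every CSV row is guaranteed a "name"
print(json_records[1].get("name"))  # None: the field is simply absent
```

Unstructured data, such as an audio file, would have no such field-level structure at all; you would process its raw bytes with a format-specific library instead.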
Hadoop – a processing framework to handle big data

- HDFS: A distributed filesystem that runs on commodity hardware and provides improved data throughput as compared to traditional filesystems and higher reliability with an in-built fault tolerance mechanism.
- Yet Another Resource Negotiator (YARN): When multiple compute nodes are involved with parallel processing capability, YARN helps to manage and monitor compute CPU and memory resources and also helps in scheduling jobs and tasks.
- MapReduce: This is a distributed processing framework with two basic modules, map and reduce. A map task reads data from HDFS or another distributed storage layer and converts it into key-value pairs, which then become the input to the reduce tasks, which aggregate the map output to produce the final result.
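The map and reduce phases described above can be sketched in plain Python as a classic word count. This is a single-process illustration of the programming model only, not the Hadoop API; in a real cluster, the framework distributes the map tasks, performs the shuffle, and runs the reduce tasks across nodes:

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) key-value pair for each word in the input."""
    for line in lines:
        for word in line.split():
            yield (word, 1)


def shuffle(pairs):
    """Shuffle: group values by key (handled by the framework in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: aggregate the list of values for each key."""
    return {key: sum(values) for key, values in groups.items()}


lines = ["big data big insights", "data pipelines"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'insights': 1, 'pipelines': 1}
```

The same map/shuffle/reduce structure underlies Hadoop MapReduce jobs and, conceptually, many Spark transformations covered later in the book.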
Table of contents
- Simplify Big Data Analytics with Amazon EMR
- Contributors
- Preface
- Section 1: Overview, Architecture, Big Data Applications, and Common Use Cases of Amazon EMR
- Chapter 1: An Overview of Amazon EMR
- Chapter 2: Exploring the Architecture and Deployment Options
- Chapter 3: Common Use Cases and Architecture Patterns
- Chapter 4: Big Data Applications and Notebooks Available in Amazon EMR
- Section 2: Configuration, Scaling, Data Security, and Governance
- Chapter 5: Setting Up and Configuring EMR Clusters
- Chapter 6: Monitoring, Scaling, and High Availability
- Chapter 7: Understanding Security in Amazon EMR
- Chapter 8: Understanding Data Governance in Amazon EMR
- Section 3: Implementing Common Use Cases and Best Practices
- Chapter 9: Implementing Batch ETL Pipeline with Amazon EMR and Apache Spark
- Chapter 10: Implementing Real-Time Streaming with Amazon EMR and Spark Streaming
- Chapter 11: Implementing UPSERT on S3 Data Lake with Apache Spark and Apache Hudi
- Chapter 12: Orchestrating Amazon EMR Jobs with AWS Step Functions and Apache Airflow/MWAA
- Chapter 13: Migrating On-Premises Hadoop Workloads to Amazon EMR
- Chapter 14: Best Practices and Cost-Optimization Techniques
- Other Books You May Enjoy