Building Big Data Pipelines with Apache Beam

Jan Lukavsky

About the book

Implement, run, operate, and test data processing pipelines using Apache Beam.

Key Features
  • Understand how to improve usability and productivity when implementing Beam pipelines
  • Learn how to use stateful processing to implement complex use cases using Apache Beam
  • Implement, test, and run Apache Beam pipelines with the help of expert tips and techniques

Book Description
Apache Beam is an open source, unified programming model for implementing and executing data processing pipelines, including Extract, Transform, and Load (ETL), batch, and stream processing.
This book will help you to confidently build data processing pipelines with Apache Beam. You'll start with an overview of Apache Beam and understand how to use it to implement basic pipelines. You'll also learn how to test and run the pipelines efficiently. As you progress, you'll explore how to structure your code for reusability and also use various Domain Specific Languages (DSLs). Later chapters will show you how to use schemas and query your data using (streaming) SQL. Finally, you'll understand advanced Apache Beam concepts, such as implementing your own I/O connectors.
By the end of this book, you'll have gained a deep understanding of the Apache Beam model and be able to apply it to solve problems.

What you will learn
  • Understand the core concepts and architecture of Apache Beam
  • Implement stateless and stateful data processing pipelines
  • Use state and timers to process real-time events
  • Structure your code for reusability
  • Use streaming SQL to process real-time data, increasing productivity and data accessibility
  • Run a pipeline using a portable runner and implement data processing using the Apache Beam Python SDK
  • Implement Apache Beam I/O connectors using the Splittable DoFn API

Who this book is for
This book is for data engineers, data scientists, and data analysts who want to learn how Apache Beam works. Intermediate-level knowledge of the Java programming language is assumed.

Information

Year: 2022
ISBN: 9781800566569
Edition: 1
Pages: 342
Language: English
Subject: Computer Science

Section 1 Apache Beam: Essentials

This section provides a general introduction to how most streaming data processing systems work, what the general properties of data streams are, and what problems need to be solved for computational correctness and for balancing throughput and latency in the context of Apache Beam. This section also covers how pipelines are implemented, tested, and run.
This section comprises the following chapters:
  • Chapter 1, Introduction to Data Processing with Apache Beam
  • Chapter 2, Implementing, Testing, and Deploying Basic Pipelines
  • Chapter 3, Implementing Pipelines Using Stateful Processing

Chapter 1: Introduction to Data Processing with Apache Beam

Data. Big data. Real-time data. Data streams. Many buzzwords to describe many things, and yet they share many common properties. Mind-blowing applications can be built from the successful application of (theoretically) simple logic – take data and produce knowledge. However, this simple-sounding task turns out to be difficult when the amount of data needed to produce knowledge is huge (and still growing). Given the vast volumes of data produced by humanity every day, which tools should we choose to turn our simple logic into scalable solutions – that is, solutions that protect our investment in creating the data extraction logic, even as new requirements arise or change on a daily basis and new data processing technologies are created? This book focuses on why Apache Beam might be a good solution to these challenges, and it will guide you through the Beam learning process.
In this chapter, we will cover the following topics:
  • Why Apache Beam?
  • Writing your first pipeline
  • Running a pipeline against streaming data
  • Exploring the key properties of unbounded data
  • Measuring the event time progress inside data streams
  • Assigning data to windows
  • Unifying batch and streaming data processing

Technical requirements

In this chapter, we will introduce some elementary pipelines written using Beam's Java Software Development Kit (SDK).
We will use the code located in the GitHub repository for this book: https://github.com/PacktPublishing/Building-Big-Data-Pipelines-with-Apache-Beam.
We will also need the following tools to be installed:
  • Java Development Kit (JDK) 11 (possibly OpenJDK 11), with JAVA_HOME set appropriately
  • Git
  • Bash
    Important note
    Although it is possible to run many tools in this book using the Windows shell, we will focus on using Bash scripting only. We hope Windows users will be able to run Bash using virtualization or Windows Subsystem for Linux (or any similar technology).
First of all, we need to clone the repository:
  1. To do this, we create a suitable directory, and then we run the following command:
    $ git clone https://github.com/PacktPublishing/Building-Big-Data-Pipelines-with-Apache-Beam.git
  2. This will result in a directory, Building-Big-Data-Pipelines-with-Apache-Beam, being created in the working directory. We then run the following command in this newly created directory:
    $ ./mvnw clean install
Throughout this book, the $ character will denote a Bash shell prompt. Therefore, $ ./mvnw clean install means running the ./mvnw command in the top-level directory of the git clone (that is, Building-Big-Data-Pipelines-with-Apache-Beam), while chapter1$ ../mvnw clean install means running the specified command in the subdirectory called chapter1.

Why Apache Beam?

There are two basic questions we might ask when considering a new technology to learn and apply in practice:
  • What problem am I struggling with that the new technology can help me solve?
  • What would the costs associated with the technology be?
Every sound technology has a well-defined selling point – that is, something that justifies its existence in the presence of competing technologies. In the case of Beam, this selling point could be reduced to a single word: portability. Beam is portable on several layers:
  • Beam's pipelines are portable between multiple runners (that is, a technology that executes the distributed computation described by a pipeline's author).
  • Beam's data processing model is portable between various programming languages.
  • Beam's data processing logic is portable between bounded and unbounded data.
Each of these points deserves a few words of explanation. By runner portability, we mean the possibility to run existing pipelines written in one of the supported programming languages (for instance, Java, Python, Go, Scala, or even SQL) against a data processing engine that can be chosen at runtime. A typical example of a runner would be Apache Flink, Apache Spark, or Google Cloud Dataflow. However, Beam is by no means limited to these; new runners are created as new technologies arise, and it's very likely that many more will be developed.
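To make this point concrete, the following is a minimal sketch of runner selection at runtime. It is an illustration written for this chapter, not code from the book's repository; the class name is made up, but PipelineOptionsFactory and Pipeline are the standard entry points of the Beam Java SDK:

  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.options.PipelineOptions;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;

  public class RunnerChoiceExample {
    public static void main(String[] args) {
      // The runner is not hard-coded in the pipeline; it is picked up from the
      // command line, for example --runner=FlinkRunner or --runner=DataflowRunner
      // (the matching runner artifact must be on the classpath).
      PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
      Pipeline pipeline = Pipeline.create(options);
      // ... transforms added here run unchanged on whichever runner was chosen ...
      pipeline.run().waitUntilFinish();
    }
  }

Switching from local testing to a production cluster then becomes a matter of changing a command-line flag rather than rewriting the pipeline.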
When we say Beam's data processing model is portable between various programming languages, we mean that it supports multiple SDKs, regardless of the language or technology used by the runner. This way, we can code Beam pipelines in the Go language and then run them against the Apache Flink Runner, written in Java.
Last but not least, the core of Apache Beam's model is designed so that it is portable between bounded and unbounded data. Bounded data is what was historically called batch processing, while unbounded data refers to real-time processing (that is, an application crunching live data as it arrives in the system and producing a low-latency output).
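As an illustration of this last point (again a sketch written for this chapter, not code from the book's repository, and BoundedUnboundedExample is a made-up name), Beam's built-in GenerateSequence source can act as either a bounded or an unbounded input, and the same transforms apply to both kinds of collections:

  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.io.GenerateSequence;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;
  import org.apache.beam.sdk.transforms.Count;
  import org.apache.beam.sdk.values.PCollection;

  public class BoundedUnboundedExample {
    public static void main(String[] args) {
      Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
      // Bounded input: the sequence has an end, so this behaves like a batch job.
      PCollection<Long> bounded = p.apply("Bounded", GenerateSequence.from(0).to(1000));
      // Unbounded input: no end is given, so this is conceptually a never-ending
      // stream (aggregating it globally would require windowing, which is
      // covered later in this chapter).
      PCollection<Long> unbounded = p.apply("Unbounded", GenerateSequence.from(0));
      // The same counting transform works on the bounded collection as-is:
      bounded.apply(Count.globally());
      // Note: with the unbounded source in the graph, the pipeline keeps
      // running until it is cancelled.
      p.run().waitUntilFinish();
    }
  }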
Putting these pieces together, we can describe Beam as a tool that lets you deal with your big data architecture with the following vision:
Choose your preferred language, write your data processing pipeline, run this pipeline using a runner of your choice, and do all of this for both batch and real-time data at the same time.
Because everything comes at a price, you should expect to pay for this flexibility – the price is a somewhat higher overhead in terms of CPU and/or memory usage. The Beam community works...
