Apache Flink
Apache Flink is an open-source, distributed computing system that processes large amounts of data in real time. It is designed to support batch processing, stream processing, and graph processing. Flink provides a unified programming model for all of these processing modes and can run on various platforms, including Hadoop YARN, Kubernetes, and standalone clusters.
Written by Perlego with AI assistance
8 Key excerpts on "Apache Flink"
- Tomcy John, Pankaj Misra (Authors)
- 2017 (Publication Date)
- Packt Publishing (Publisher)
Figure 03: Technology mapping for the Data Ingestion Layer

In line with our use case of SCV, the data from the messaging layer is taken in by this layer, then enriched and transformed accordingly, and passed on to the Lambda Layer. We might also pass this data to the Data Storage Layer for persisting. In this layer there might be other technologies, such as Kafka Consumer, Flume, and so on, to take care of certain aspects in the real working example of SCV. Part 3 will bring these technologies together so that a clear SCV is derived for enterprise use.
What is Apache Flink?
Apache Flink is an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications. - flink.apache.org

Apache Flink is a community-driven open source framework for distributed big data analytics, like Hadoop and Spark. The core of Apache Flink is a distributed streaming dataflow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner. Flink's pipelined runtime system enables the execution of bulk/batch and stream processing programs. - Wikipedia

Apache's definition of Flink is fairly easy to understand, while the second part of Wikipedia's definition is quite hard to follow. For the time being, just understand that Flink brings a unified programming model for handling stream and batch data using one technology. This chapter in no way covers the workings of Apache Flink comprehensively; Apache Flink is a topic that could span an entire book by itself. However, without going into too much detail, it tries to cover many aspects of this awesome tool. We will skim through some of the core aspects and also give you enough information to actually use Flink in your Data Lake implementation.
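To make the idea of a unified programming model slightly more concrete, here is a minimal streaming word count; this is a sketch, not code from any of the excerpts. It assumes Flink's Scala DataStream API is available, and the socket host, port, and job name are placeholders:

```scala
import org.apache.flink.streaming.api.scala._

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // unbounded text stream from a local socket (for example, started with `nc -lk 9999`)
    val text = env.socketTextStream("localhost", 9999)

    val counts = text
      .flatMap(_.toLowerCase.split("\\W+")) // split lines into words
      .filter(_.nonEmpty)
      .map((_, 1))
      .keyBy(_._1)                          // group by word
      .sum(1)                               // rolling count per word

    counts.print()
    env.execute("Streaming WordCount")
  }
}
```

Flink's batch (DataSet) API exposes the same flatMap/groupBy/sum style for bounded data, which is what "one technology for stream and batch" means in practice.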
Big Data Analytics with Hadoop 3
Build highly effective analytics solutions to gain valuable insight into your big data
- Sridhar Alla (Author)
- 2018 (Publication Date)
- Packt Publishing (Publisher)
Stream Processing with Apache Flink
In this chapter, we will look at stream processing using Apache Flink and how the framework can be used to process data as soon as it arrives, so that we can build exciting real-time applications. We will start with the DataStream API and look at the various operations that can be performed (a short sketch pulling several of these topics together follows the list). We will be looking at the following:
- Data processing using the DataStream API
- Transformations
- Aggregations
- Window
- Physical partitioning
- Rescaling
- Data sinks
- Event time and watermarks
- Kafka connector
- Twitter connector
- Elasticsearch connector
- Cassandra connector
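As a small preview of several of these topics together (the DataStream API, basic transformations, and the Kafka connector used as both source and sink), here is a hedged sketch; it assumes the flink-connector-kafka dependency is on the classpath (older Flink releases use version-suffixed classes such as FlinkKafkaConsumer010), and the broker address, consumer group, and topic names are placeholders:

```scala
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer, FlinkKafkaProducer}

object KafkaPipeline {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092") // placeholder broker address
    props.setProperty("group.id", "flink-demo")              // placeholder consumer group

    // Kafka connector as the source; topic names are illustrative
    val raw: DataStream[String] =
      env.addSource(new FlinkKafkaConsumer[String]("raw-events", new SimpleStringSchema(), props))

    // simple transformations: normalise and drop empty records
    val cleaned = raw.map(_.trim.toLowerCase).filter(_.nonEmpty)

    // Kafka connector as the data sink
    cleaned.addSink(new FlinkKafkaProducer[String]("clean-events", new SimpleStringSchema(), props))

    env.execute("Kafka ingest and clean")
  }
}
```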
Introduction to the streaming execution model
Flink is an open source framework for distributed stream processing that:
- Provides results that are accurate, even in the case of out-of-order or late-arriving data
- Is stateful and fault tolerant, and can seamlessly recover from failures while maintaining an exactly-once application state
- Performs on a large scale, running on thousands of nodes with very good throughput and latency characteristics
Many of Flink's features (state management, handling out-of-order data, flexible windowing) are essential for computing accurate results on unbounded datasets and are enabled by Flink's streaming execution model:
- Flink guarantees exactly-once semantics for stateful computations. Stateful means that applications can maintain an aggregation or summary of data that has been processed over time, and Flink's checkpointing mechanism ensures exactly-once semantics for an application's state in the event of a failure.
- Flink supports stream processing and windowing with event-time semantics. Event time makes it easy to compute accurate results over streams where events arrive out of order and where events may arrive delayed.
- Flink supports flexible windowing based on time, count, or sessions, in addition to data-driven windows. Windows can be customized with flexible triggering conditions to support sophisticated streaming patterns. Flink's windowing makes it possible to model the reality of the environment in which data is created (see the event-time windowing sketch after this list).
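As an illustration of the last two points, the following sketch (not from the excerpt) keys a stream of hypothetical sensor readings and aggregates them in ten-second tumbling event-time windows; for simplicity it assumes timestamps arrive in order, so the ascending-timestamp assigner stands in for a full out-of-orderness watermark strategy:

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

object SensorWindows {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // hypothetical (sensorId, eventTimestampMillis, reading) tuples
    val readings: DataStream[(String, Long, Double)] = env.fromElements(
      ("sensor-1", 1000L, 35.2),
      ("sensor-1", 2500L, 36.0),
      ("sensor-2", 1200L, 20.1))

    val maxPerWindow = readings
      .assignAscendingTimestamps(_._2)                        // event time taken from the second field
      .keyBy(_._1)                                            // one window per sensor
      .window(TumblingEventTimeWindows.of(Time.seconds(10)))  // flexible, time-based windowing
      .max(2)                                                 // maximum reading per sensor per window

    maxPerWindow.print()
    env.execute("Sensor windows")
  }
}
```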
Mastering Hadoop 3
Big data processing at scale to unlock unique business insights
- Chanchal Singh, Manish Kumar (Authors)
- 2019 (Publication Date)
- Packt Publishing (Publisher)
We have a huge number of data processing tools available in the market. Most of them are open source and a few of them are commercial. The question is, how many processing tools or engines do we need? Can't we have just one processing framework that can fulfill the processing requirements of each and every use case, whatever its processing pattern? Apache Spark was built for the purpose of solving these problems and came up with a unified system architecture where use cases ranging from batch, near-real-time, machine learning models, and so on can be solved using the rich Spark API. Apache Spark was not suitable for real-time processing use cases where event-by-event processing is needed. Apache Flink came up with a few new design models to solve problems similar to those Spark was trying to solve, in addition to its real-time processing capability.

Apache Flink is an open source distributed processing framework for stream and batch processing. The dataflow engine is the core of Flink and provides capabilities such as distribution, communication, and fault tolerance. Flink is very similar to Spark but has an API for custom memory management, real-time data processing, and so on, which makes it a little different from Spark, which works on micro batches instead of real time.
Flink architecture
Like other distributed processing engines, Apache Flink also follows the master-slave architecture. The Job manager is the master and the Task managers are the worker processes. The following diagram shows the Apache Flink architecture:
- Job manager: The Job manager is the master process of the Flink cluster and works as a coordinator. The responsibility of the Job manager is not only to manage the entire life cycle of a dataflow but also to track the progress and state of each stream and operator. It also coordinates dataflow execution in distributed environments. The Job manager maintains the checkpoint metadata in fault-tolerant storage systems so that if an active Job manager goes down, the standby Job manager
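The checkpoint metadata mentioned in the excerpt is produced by Flink's checkpointing mechanism, which is switched on from the application code. A minimal sketch, assuming the pre-1.13 style FsStateBackend and an HDFS directory for fault-tolerant storage; the checkpoint interval and path are illustrative:

```scala
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala._

object CheckpointedJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // snapshot operator state every 10 seconds with exactly-once guarantees
    env.enableCheckpointing(10000L, CheckpointingMode.EXACTLY_ONCE)

    // keep checkpoints in fault-tolerant storage so a standby Job manager can resume the job
    env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"))

    // trivial pipeline, just to have something to checkpoint
    env.socketTextStream("localhost", 9999)
      .map(_.length)
      .print()

    env.execute("Checkpointed job")
  }
}
```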
Parallel Computing Architectures and APIs
IoT Big Data Stream Processing
- Vivek Kale (Author)
- 2019 (Publication Date)
- Chapman and Hall/CRC (Publisher)
However, maintaining a Lambda system was a hassle, to the extent that it entailed building, provisioning, and maintaining two independent versions of the processing pipelines that had to be reconciled at the end. This begged for a unified solution: a functioning stream processing system with the ability to effectively provide "batch processing" functionality on demand. Apache Flink is a noteworthy example (see Subsection 21.1.2.3 'Stream Processing Platforms/Engine').
20.1.2 Data Stream Processing
Due to the different challenges and requirements posed by different application domains, several open-source platforms have emerged for real-time data stream processing. Although different platforms share the concept of handling data as continuous unbounded streams, and process it immediately as the data is collected, they follow different architectural models and offer different capabilities and features. Each platform offers very specific special features that make its architecture unique, and some features make a stream processing platform more applicable than others for different scenarios. A taxonomy is derived after studying different open-source platforms, including data stream management systems (DSMSs), complex event processing (CEP) systems, and stream processing systems such as Storm.
20.1.2.1 Data Stream Processing Systems
The aspects underlying stream processing platforms or engines consist of distributed computing, parallel computing, and message passing. Dataflow has always been the core element of stream processing systems. "Data streams" or "streams" in the stream processing context refer to the infinite dataflow within the system. Stream processing platforms are designed to run on top of distributed and parallel computing technologies, such as clusters, to process real-time streams of data. Cluster computing technology allows a collection of connected computers to work together, providing adequate computing power for the processing of large data sets. Such a feature is so important for stream processing platforms that clusters have become one of their essential modules for processing high-velocity big data.
- Romeo Kienzler (Author)
- 2017 (Publication Date)
- Packt Publishing (Publisher)
Apache Spark Streaming
The Apache Spark Streaming module is a stream-processing module within Apache Spark. It uses the Spark cluster to offer the ability to scale to a high degree. Being based on Spark, it is also highly fault tolerant, having the ability to rerun failed tasks by checkpointing the data stream that is being processed. The following topics will be covered in this chapter after an introductory section, which will provide a practical overview of how Apache Spark processes stream-based data:
- Error recovery and checkpointing
- TCP-based stream processing
- File streams
- Kafka stream source
Overview
The following diagram shows potential data sources for Apache Spark Streaming, such as Kafka, Flume, and HDFS. These feed into the Spark Streaming module and are processed as Discrete Streams. The diagram also shows that other Spark module functionality, such as machine learning, can be used to process stream-based data. The fully processed data can then be output to HDFS, databases, or dashboards. This diagram is based on the one at the Spark Streaming website, but we wanted to extend it to express the Spark module functionality. When discussing Spark Discrete Streams, the previous figure, taken from the Spark website at http://spark.apache.org/ , is the diagram that we would like to use.

The green boxes in the previous figure show the continuous data stream sent to Spark being broken down into a Discrete Stream (DStream). A DStream is nothing other than an ordered set of RDDs. Therefore, Apache Spark Streaming is not real streaming, but micro-batching. The size of the RDDs backing the DStream determines the batch size. This way, DStreams can make use of all the functionality provided by RDDs, including fault tolerance and the capability of being spillable to disk. The size of each element in the stream is then based on a batch time, which might be two seconds.
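A minimal DStream word count along those lines, assuming a local Spark installation and a text source on a TCP socket; the host, port, checkpoint directory, and two-second batch interval (which determines the size of each RDD in the DStream) are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("SocketWordCount")

    // each two-second micro-batch becomes one RDD in the DStream
    val ssc = new StreamingContext(conf, Seconds(2))
    ssc.checkpoint("/tmp/spark-checkpoints") // enables error recovery via checkpointing

    val lines = ssc.socketTextStream("localhost", 9999) // TCP-based stream source
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)

    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```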
Apache Spark 2: Data Processing and Real-Time Analytics
Master complex big data processing, stream analytics, and machine learning with Apache Spark
- Romeo Kienzler, Md. Rezaul Karim, Sridhar Alla, Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, Shuen Mei (Authors)
- 2018 (Publication Date)
- Packt Publishing (Publisher)
Apache Spark Streaming
The Apache Spark Streaming module is a stream-processing module within Apache Spark. It uses the Spark cluster to offer the ability to scale to a high degree. Being based on Spark, it is also highly fault tolerant, having the ability to rerun failed tasks by checkpointing the data stream that is being processed. The following topics will be covered in this chapter after an introductory section, which will provide a practical overview of how Apache Spark processes stream-based data:
- Error recovery and checkpointing
- TCP-based stream processing
- File streams
- Kafka stream source
Overview
The following diagram shows potential data sources for Apache Spark Streaming, such as Kafka, Flume, and HDFS. These feed into the Spark Streaming module and are processed as Discrete Streams. The diagram also shows that other Spark module functionality, such as machine learning, can be used to process stream-based data. The fully processed data can then be output to HDFS, databases, or dashboards. This diagram is based on the one at the Spark Streaming website, but we wanted to extend it to express the Spark module functionality. When discussing Spark Discrete Streams, the previous figure, taken from the Spark website at http://spark.apache.org/ , is the diagram that we would like to use.

The green boxes in the previous figure show the continuous data stream sent to Spark being broken down into a Discrete Stream (DStream). A DStream is nothing other than an ordered set of RDDs. Therefore, Apache Spark Streaming is not real streaming, but micro-batching. The size of the RDDs backing the DStream determines the batch size. This way, DStreams can make use of all the functionality provided by RDDs, including fault tolerance and the capability of being spillable to disk. The size of each element in the stream is then based on a batch time, which might be two seconds.
Designing Big Data Platforms
How to Use, Deploy, and Maintain Big Data Systems
- Yusuf Aytas (Author)
- 2021 (Publication Date)
- Wiley (Publisher)
Figure 6.13: Flink architecture.

The Flink architecture allows running computations over bounded and unbounded streams. Unbounded streams have no defined end. Order plays a significant role in unbounded stream processing, since data still flows through the engine. Flink supports precise control over time and state for unbounded streams. On the flip side, bounded streams have a defined end. Order is not an issue for a bounded stream, since it can be sorted. Flink uses data structures and algorithms specifically designed for bounded streams to get the best performance.

Flink processes both bounded and unbounded streams through a basic unit called a task. Tasks execute a parallel instance of an operator. Thus, an operator's life cycle also defines the life cycle of a task. An operator goes through several phases. In the initialization phase, Flink prepares the operator to run and sets its state. In the process-element phase, the operator applies the transformation to the received element. In the process-watermark phase, the operator receives a watermark to advance the internal timestamp. In the checkpoint phase, the operator writes the current state to external state storage asynchronously. Tasks contain multiple chained operators. In the initialization phase, tasks provide the state of all operators in the chain. This task-wide state helps with recovering from failures. After setting state, the task runs the chain of operators until it comes to an end for bounded streams, or until it is cancelled for unbounded streams.

Flink can run tasks in many different environments. Flink integrates with all common clustering technologies, such as YARN, Mesos, or Kubernetes. It can also run as a standalone cluster or in a local JVM for development purposes. Flink provides resource manager-specific deployment options. Flink automatically defines the required resources for processing streaming applications. It then negotiates resources with the resource manager. Flink replaces failed containers with new ones.
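To relate those phases to code, here is a hedged sketch of a keyed operator (not from the book): open corresponds to the initialization phase, processElement to the process-element phase, and the keyed ValueState it updates is what the checkpoint phase snapshots to state storage. The RunningCount name and tuple layout are illustrative:

```scala
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

// Counts events per key; the state it keeps is included in Flink's checkpoints.
class RunningCount extends KeyedProcessFunction[String, (String, Long), (String, Long)] {

  @transient private var count: ValueState[java.lang.Long] = _

  // initialization phase: create or restore the operator's state handle
  override def open(parameters: Configuration): Unit = {
    count = getRuntimeContext.getState(
      new ValueStateDescriptor[java.lang.Long]("count", classOf[java.lang.Long]))
  }

  // process-element phase: apply the transformation to each received element
  override def processElement(
      value: (String, Long),
      ctx: KeyedProcessFunction[String, (String, Long), (String, Long)]#Context,
      out: Collector[(String, Long)]): Unit = {
    val previous = if (count.value() == null) 0L else count.value().longValue()
    count.update(previous + 1)
    out.collect((ctx.getCurrentKey, previous + 1))
  }
}
```

It would be attached to a keyed stream with something like stream.keyBy(_._1).process(new RunningCount).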
Machine Learning with Apache Spark Quick Start Guide
Uncover patterns, derive actionable insights, and learn from big data using MLlib
- Jillur Quddus (Author)
- 2018 (Publication Date)
- Packt Publishing (Publisher)
stream processing engines available to allow us to do this, including, but not limited to, the following:
- Apache Spark: https://spark.apache.org/
- Apache Storm: http://storm.apache.org/
- Apache Flink: https://flink.apache.org/
- Apache Samza: http://samza.apache.org/
- Apache Kafka (via its Streams API): https://kafka.apache.org/documentation/
- KSQL: https://www.confluent.io/product/ksql/
Though a detailed comparison of the available stream processing engines is beyond the scope of this book, you are encouraged to explore the preceding links and study the differing architectures available. For the purposes of this chapter, we will be using Apache Spark's Structured Streaming engine as our stream processing engine of choice.
Streaming using Apache Spark
At the time of writing, there are two stream processing APIs available in Spark:
- Spark Streaming (DStreams): https://spark.apache.org/docs/latest/streaming-programming-guide.html
- Structured Streaming: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Spark Streaming (DStreams)
Spark Streaming (DStreams) extends the core Spark API and works by dividing real-time data streams into input batches that are then processed by Spark's core API, resulting in a final stream of processed batches, as illustrated in Figure 8.2. A sequence of RDDs forms what is known as a discretized stream (or DStream), which represents the continuous stream of data.
Figure 8.2: Spark Streaming (DStreams)
Structured Streaming
Structured Streaming, on the other hand, is a newer and highly optimized stream processing engine built on the Spark SQL engine, in which streaming data can be stored and processed using Spark's Dataset/DataFrame API (see Chapter 1, The Big Data Ecosystem). As of Spark 2.3, Structured Streaming offers the ability to process data streams using both micro-batch processing, with latencies as low as 100 milliseconds, and continuous processing, with latencies as low as 1 millisecond (thereby providing true
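A minimal Structured Streaming word count in that style, assuming a local Spark session and a socket source; the host, port, and application name are placeholders. It runs in the default micro-batch mode; continuous processing would instead be requested with a Trigger.Continuous trigger on queries that support it:

```scala
import org.apache.spark.sql.SparkSession

object StructuredWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("StructuredWordCount")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // unbounded DataFrame backed by a socket source
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // treat each line as a string, split into words, and count per word
    val counts = lines.as[String]
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```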
Index pages curate the most relevant extracts from our library of academic textbooks. They’ve been created using an in-house natural language model (NLM), each adding context and meaning to key research topics.