Apache Flink
Apache Flink is an open-source, distributed computing system that processes large amounts of data in real time. It is designed to support batch processing, stream processing, and graph processing. Flink provides a unified programming model for all of these processing modes and can run on various platforms, including Hadoop YARN, Kubernetes, and standalone clusters.
Written by Perlego with AI assistance
8 Key excerpts on "Apache Flink"
- Tomcy John, Pankaj Misra (Authors)
- 2017 (Publication Date)
- Packt Publishing (Publisher)
Figure 03: Technology mapping for the Data Ingestion Layer

In line with our use case of SCV, the data from the messaging layer is taken in by this layer, then enriched and transformed accordingly, and passed on to the Lambda Layer. We might also pass this data to the Data Storage Layer for persisting. In this layer there might be other technologies, such as Kafka Consumer, Flume, and so on, to take care of certain aspects in the real working example of SCV. Part 3 will bring these technologies together so that a clear SCV is derived for enterprise use.
What is Apache Flink?
Apache Flink is an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications. - flink.apache.org

Apache Flink is a community-driven open source framework for distributed big data analytics, like Hadoop and Spark. The core of Apache Flink is a distributed streaming dataflow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner. Flink's pipelined runtime system enables the execution of bulk/batch and stream processing programs. - Wikipedia

Apache's definition of Flink is fairly easy to understand, while the second part of Wikipedia's definition is quite hard to follow. For the time being, just understand that Flink brings a unified programming model for handling stream and batch data using one technology. This chapter in no way covers the workings of Apache Flink comprehensively; Apache Flink is a topic that could span an entire book by itself. However, without going into too much detail, it tries to cover many aspects of this awesome tool. We will skim through some of the core aspects and also give you enough information to actually use Flink in your Data Lake implementation.
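To make the idea of a unified programming model slightly more concrete, here is a minimal streaming word count; this is a sketch, not code from any of the excerpts. It assumes Flink's Scala DataStream API is available, and the socket host, port, and job name are placeholders:

```scala
import org.apache.flink.streaming.api.scala._

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // unbounded text stream from a local socket (for example, started with `nc -lk 9999`)
    val text = env.socketTextStream("localhost", 9999)

    val counts = text
      .flatMap(_.toLowerCase.split("\\W+")) // split lines into words
      .filter(_.nonEmpty)
      .map((_, 1))
      .keyBy(_._1)                          // group by word
      .sum(1)                               // rolling count per word

    counts.print()
    env.execute("Streaming WordCount")
  }
}
```

Flink's batch (DataSet) API exposes the same flatMap/groupBy/sum style for bounded data, which is what "one technology for stream and batch" means in practice.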
Big Data Analytics with Hadoop 3
Build highly effective analytics solutions to gain valuable insight into your big data
- Sridhar Alla (Author)
- 2018 (Publication Date)
- Packt Publishing (Publisher)
Stream Processing with Apache Flink
In this chapter, we will look at stream processing using Apache Flink and how the framework can be used to process data as soon as it arrives, so that we can build exciting real-time applications. We will start with the DataStream API and look at the various operations that can be performed (a short sketch pulling several of these topics together follows the list). We will be looking at the following:
- Data processing using the DataStream API
- Transformations
- Aggregations
- Window
- Physical partitioning
- Rescaling
- Data sinks
- Event time and watermarks
- Kafka connector
- Twitter connector
- Elasticsearch connector
- Cassandra connector
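As a small preview of several of these topics together (the DataStream API, basic transformations, and the Kafka connector used as both source and sink), here is a hedged sketch; it assumes the flink-connector-kafka dependency is on the classpath (older Flink releases use version-suffixed classes such as FlinkKafkaConsumer010), and the broker address, consumer group, and topic names are placeholders:

```scala
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer, FlinkKafkaProducer}

object KafkaPipeline {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092") // placeholder broker address
    props.setProperty("group.id", "flink-demo")              // placeholder consumer group

    // Kafka connector as the source; topic names are illustrative
    val raw: DataStream[String] =
      env.addSource(new FlinkKafkaConsumer[String]("raw-events", new SimpleStringSchema(), props))

    // simple transformations: normalise and drop empty records
    val cleaned = raw.map(_.trim.toLowerCase).filter(_.nonEmpty)

    // Kafka connector as the data sink
    cleaned.addSink(new FlinkKafkaProducer[String]("clean-events", new SimpleStringSchema(), props))

    env.execute("Kafka ingest and clean")
  }
}
```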
Introduction to the streaming execution model
Flink is an open source framework for distributed stream processing that:
- Provides results that are accurate, even in the case of out-of-order or late-arriving data
- Is stateful and fault tolerant, and can seamlessly recover from failures while maintaining an exactly-once application state
- Performs on a large scale, running on thousands of nodes with very good throughput and latency characteristics
Many of Flink's features (state management, handling out-of-order data, flexible windowing) are essential for computing accurate results on unbounded datasets and are enabled by Flink's streaming execution model:
- Flink guarantees exactly-once semantics for stateful computations. Stateful means that applications can maintain an aggregation or summary of data that has been processed over time, and Flink's checkpointing mechanism ensures exactly-once semantics for an application's state in the event of a failure.
- Flink supports stream processing and windowing with event-time semantics. Event time makes it easy to compute accurate results over streams where events arrive out of order and where events may arrive delayed.
- Flink supports flexible windowing based on time, count, or sessions, in addition to data-driven windows. Windows can be customized with flexible triggering conditions to support sophisticated streaming patterns. Flink's windowing makes it possible to model the reality of the environment in which data is created (see the event-time windowing sketch after this list).
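As an illustration of the last two points, the following sketch (not from the excerpt) keys a stream of hypothetical sensor readings and aggregates them in ten-second tumbling event-time windows; for simplicity it assumes timestamps arrive in order, so the ascending-timestamp assigner stands in for a full out-of-orderness watermark strategy:

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

object SensorWindows {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // hypothetical (sensorId, eventTimestampMillis, reading) tuples
    val readings: DataStream[(String, Long, Double)] = env.fromElements(
      ("sensor-1", 1000L, 35.2),
      ("sensor-1", 2500L, 36.0),
      ("sensor-2", 1200L, 20.1))

    val maxPerWindow = readings
      .assignAscendingTimestamps(_._2)                        // event time taken from the second field
      .keyBy(_._1)                                            // one window per sensor
      .window(TumblingEventTimeWindows.of(Time.seconds(10)))  // flexible, time-based windowing
      .max(2)                                                 // maximum reading per sensor per window

    maxPerWindow.print()
    env.execute("Sensor windows")
  }
}
```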
Mastering Hadoop 3
Big data processing at scale to unlock unique business insights
- Chanchal Singh, Manish Kumar (Authors)
- 2019 (Publication Date)
- Packt Publishing (Publisher)
We have a huge number of data processing tools available in the market. Most of them are open source and a few of them are commercial. The question is, how many processing tools or engines do we need? Can't we have just one processing framework that can fulfill the processing requirements of each and every use case, whatever its processing pattern? Apache Spark was built for the purpose of solving these problems and came up with a unified system architecture where use cases ranging from batch, near-real-time, machine learning models, and so on can be solved using the rich Spark API. Apache Spark was not suitable for real-time processing use cases where event-by-event processing is needed. Apache Flink came up with a few new design models to solve problems similar to those Spark was trying to solve, in addition to its real-time processing capability.

Apache Flink is an open source distributed processing framework for stream and batch processing. The dataflow engine is the core of Flink and provides capabilities such as distribution, communication, and fault tolerance. Flink is very similar to Spark but has an API for custom memory management, real-time data processing, and so on, which makes it a little different from Spark, which works on micro batches instead of real time.
Flink architecture
Like other distributed processing engines, Apache Flink also follows the master-slave architecture. The Job manager is the master and the Task managers are the worker processes. The following diagram shows the Apache Flink architecture:
- Job manager: The Job manager is the master process of the Flink cluster and works as a coordinator. The responsibility of the Job manager is not only to manage the entire life cycle of a dataflow but also to track the progress and state of each stream and operator. It also coordinates dataflow execution in distributed environments. The Job manager maintains the checkpoint metadata in fault-tolerant storage systems so that if an active Job manager goes down, the standby Job manager
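The checkpoint metadata mentioned in the excerpt is produced by Flink's checkpointing mechanism, which is switched on from the application code. A minimal sketch, assuming the pre-1.13 style FsStateBackend and an HDFS directory for fault-tolerant storage; the checkpoint interval and path are illustrative:

```scala
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala._

object CheckpointedJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // snapshot operator state every 10 seconds with exactly-once guarantees
    env.enableCheckpointing(10000L, CheckpointingMode.EXACTLY_ONCE)

    // keep checkpoints in fault-tolerant storage so a standby Job manager can resume the job
    env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"))

    // trivial pipeline, just to have something to checkpoint
    env.socketTextStream("localhost", 9999)
      .map(_.length)
      .print()

    env.execute("Checkpointed job")
  }
}
```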
Parallel Computing Architectures and APIs
IoT Big Data Stream Processing
- Vivek Kale (Author)
- 2019 (Publication Date)
- Chapman and Hall/CRC (Publisher)
However, maintaining a Lambda system was a hassle, to the extent that it entailed building, provisioning, and maintaining two independent versions of the processing pipelines that had to be reconciled at the end. This begged for a unified solution: a functioning stream processing system with the ability to effectively provide "batch processing" functionality on demand. Apache Flink is a noteworthy example (see Subsection 21.1.2.3 'Stream Processing Platforms/Engine').
20.1.2 Data Stream Processing
Due to the different challenges and requirements posed by different application domains, several open-source platforms have emerged for real-time data stream processing. Although different platforms share the concept of handling data as continuous unbounded streams, and process it immediately as the data is collected, they follow different architectural models and offer different capabilities and features. Each platform offers very specific special features that make its architecture unique, and some features make a stream processing platform more applicable than others for different scenarios. A taxonomy is derived after studying different open-source platforms, including data stream management systems (DSMSs), complex event processing (CEP) systems, and stream processing systems such as Storm.
20.1.2.1 Data Stream Processing Systems
The aspects underlying stream processing platforms or engines consist of distributed computing, parallel computing, and message passing. Dataflow has always been the core element of stream processing systems. "Data streams" or "streams" in the stream processing context refer to the infinite dataflow within the system. Stream processing platforms are designed to run on top of distributed and parallel computing technologies, such as clusters, to process real-time streams of data. Cluster computing technology allows a collection of connected computers to work together, providing adequate computing power for the processing of large data sets. Such a feature is so important for stream processing platforms that clusters have become one of their essential modules for processing high-velocity big data.
- Romeo Kienzler (Author)
- 2017 (Publication Date)
- Packt Publishing (Publisher)
Apache Spark Streaming
The Apache Spark Streaming module is a stream-processing module within Apache Spark. It uses the Spark cluster to offer the ability to scale to a high degree. Being based on Spark, it is also highly fault tolerant, having the ability to rerun failed tasks by checkpointing the data stream that is being processed. The following topics will be covered in this chapter after an introductory section, which will provide a practical overview of how Apache Spark processes stream-based data:
- Error recovery and checkpointing
- TCP-based stream processing
- File streams
- Kafka stream source
Overview
The following diagram shows potential data sources for Apache Spark Streaming, such as Kafka, Flume, and HDFS. These feed into the Spark Streaming module and are processed as Discrete Streams. The diagram also shows that other Spark module functionality, such as machine learning, can be used to process stream-based data. The fully processed data can then be output to HDFS, databases, or dashboards. This diagram is based on the one at the Spark Streaming website, but we wanted to extend it to express the Spark module functionality. When discussing Spark Discrete Streams, the previous figure, taken from the Spark website at http://spark.apache.org/ , is the diagram that we would like to use.

The green boxes in the previous figure show the continuous data stream sent to Spark being broken down into a Discrete Stream (DStream). A DStream is nothing other than an ordered set of RDDs. Therefore, Apache Spark Streaming is not real streaming, but micro-batching. The size of the RDDs backing the DStream determines the batch size. This way, DStreams can make use of all the functionality provided by RDDs, including fault tolerance and the capability of being spillable to disk. The size of each element in the stream is then based on a batch time, which might be two seconds.
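A minimal DStream word count along those lines, assuming a local Spark installation and a text source on a TCP socket; the host, port, checkpoint directory, and two-second batch interval (which determines the size of each RDD in the DStream) are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("SocketWordCount")

    // each two-second micro-batch becomes one RDD in the DStream
    val ssc = new StreamingContext(conf, Seconds(2))
    ssc.checkpoint("/tmp/spark-checkpoints") // enables error recovery via checkpointing

    val lines = ssc.socketTextStream("localhost", 9999) // TCP-based stream source
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)

    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```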
Apache Spark 2: Data Processing and Real-Time Analytics
Master complex big data processing, stream analytics, and machine learning with Apache Spark
- Romeo Kienzler, Md. Rezaul Karim, Sridhar Alla, Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, Shuen Mei (Authors)
- 2018 (Publication Date)
- Packt Publishing (Publisher)
Apache Spark Streaming
The Apache Spark Streaming module is a stream-processing module within Apache Spark. It uses the Spark cluster to offer the ability to scale to a high degree. Being based on Spark, it is also highly fault tolerant, having the ability to rerun failed tasks by checkpointing the data stream that is being processed. The following topics will be covered in this chapter after an introductory section, which will provide a practical overview of how Apache Spark processes stream-based data:
- Error recovery and checkpointing
- TCP-based stream processing
- File streams
- Kafka stream source
Overview
The following diagram shows potential data sources for Apache Spark Streaming, such as Kafka, Flume, and HDFS. These feed into the Spark Streaming module and are processed as Discrete Streams. The diagram also shows that other Spark module functionality, such as machine learning, can be used to process stream-based data. The fully processed data can then be output to HDFS, databases, or dashboards. This diagram is based on the one at the Spark Streaming website, but we wanted to extend it to express the Spark module functionality. When discussing Spark Discrete Streams, the previous figure, taken from the Spark website at http://spark.apache.org/ , is the diagram that we would like to use.

The green boxes in the previous figure show the continuous data stream sent to Spark being broken down into a Discrete Stream (DStream). A DStream is nothing other than an ordered set of RDDs. Therefore, Apache Spark Streaming is not real streaming, but micro-batching. The size of the RDDs backing the DStream determines the batch size. This way, DStreams can make use of all the functionality provided by RDDs, including fault tolerance and the capability of being spillable to disk. The size of each element in the stream is then based on a batch time, which might be two seconds.
Designing Big Data Platforms
How to Use, Deploy, and Maintain Big Data Systems
- Yusuf Aytas (Author)
- 2021 (Publication Date)
- Wiley (Publisher)
Figure 6.13: Flink architecture.

The Flink architecture allows running computations over bounded and unbounded streams. Unbounded streams have no defined end. Order plays a significant role in unbounded stream processing, since data still flows through the engine. Flink supports precise control over time and state for unbounded streams. On the flip side, bounded streams have a defined end. Order is not an issue for a bounded stream, since it can be sorted. Flink uses data structures and algorithms specifically designed for bounded streams to get the best performance.

Flink processes both bounded and unbounded streams through a basic unit called a task. Tasks execute a parallel instance of an operator. Thus, an operator's life cycle also defines the life cycle of a task. An operator goes through several phases. In the initialization phase, Flink prepares the operator to run and sets its state. In the process-element phase, the operator applies the transformation to the received element. In the process-watermark phase, the operator receives a watermark to advance the internal timestamp. In the checkpoint phase, the operator writes the current state to external state storage asynchronously. Tasks contain multiple chained operators. In the initialization phase, tasks provide the state of all operators in the chain. This task-wide state helps with recovering from failures. After setting state, the task runs the chain of operators until it comes to an end for bounded streams, or until it is cancelled for unbounded streams.

Flink can run tasks in many different environments. Flink integrates with all common clustering technologies, such as YARN, Mesos, or Kubernetes. It can also run as a standalone cluster or in a local JVM for development purposes. Flink provides resource manager-specific deployment options. Flink automatically defines the required resources for processing streaming applications. It then negotiates resources with the resource manager. Flink replaces failed containers with new ones.
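To relate those phases to code, here is a hedged sketch of a keyed operator (not from the book): open corresponds to the initialization phase, processElement to the process-element phase, and the keyed ValueState it updates is what the checkpoint phase snapshots to state storage. The RunningCount name and tuple layout are illustrative:

```scala
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

// Counts events per key; the state it keeps is included in Flink's checkpoints.
class RunningCount extends KeyedProcessFunction[String, (String, Long), (String, Long)] {

  @transient private var count: ValueState[java.lang.Long] = _

  // initialization phase: create or restore the operator's state handle
  override def open(parameters: Configuration): Unit = {
    count = getRuntimeContext.getState(
      new ValueStateDescriptor[java.lang.Long]("count", classOf[java.lang.Long]))
  }

  // process-element phase: apply the transformation to each received element
  override def processElement(
      value: (String, Long),
      ctx: KeyedProcessFunction[String, (String, Long), (String, Long)]#Context,
      out: Collector[(String, Long)]): Unit = {
    val previous = if (count.value() == null) 0L else count.value().longValue()
    count.update(previous + 1)
    out.collect((ctx.getCurrentKey, previous + 1))
  }
}
```

It would be attached to a keyed stream with something like stream.keyBy(_._1).process(new RunningCount).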
Machine Learning with Apache Spark Quick Start Guide
Uncover patterns, derive actionable insights, and learn from big data using MLlib
- Jillur Quddus (Author)
- 2018 (Publication Date)
- Packt Publishing (Publisher)
stream processing engines available to allow us to do this, including, but not limited to, the following:
- Apache Spark: https://spark.apache.org/
- Apache Storm: http://storm.apache.org/
- Apache Flink: https://flink.apache.org/
- Apache Samza: http://samza.apache.org/
- Apache Kafka (via its Streams API): https://kafka.apache.org/documentation/
- KSQL: https://www.confluent.io/product/ksql/
Though a detailed comparison of the available stream processing engines is beyond the scope of this book, you are encouraged to explore the preceding links and study the differing architectures available. For the purposes of this chapter, we will be using Apache Spark's Structured Streaming engine as our stream processing engine of choice.
Streaming using Apache Spark
At the time of writing, there are two stream processing APIs available in Spark:
- Spark Streaming (DStreams): https://spark.apache.org/docs/latest/streaming-programming-guide.html
- Structured Streaming: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Spark Streaming (DStreams)
Spark Streaming (DStreams) extends the core Spark API and works by dividing real-time data streams into input batches that are then processed by Spark's core API, resulting in a final stream of processed batches, as illustrated in Figure 8.2. A sequence of RDDs forms what is known as a discretized stream (or DStream), which represents the continuous stream of data.
Figure 8.2: Spark Streaming (DStreams)
Structured Streaming
Structured Streaming, on the other hand, is a newer and highly optimized stream processing engine built on the Spark SQL engine, in which streaming data can be stored and processed using Spark's Dataset/DataFrame API (see Chapter 1, The Big Data Ecosystem). As of Spark 2.3, Structured Streaming offers the ability to process data streams using both micro-batch processing, with latencies as low as 100 milliseconds, and continuous processing, with latencies as low as 1 millisecond (thereby providing true
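A minimal Structured Streaming word count in that style, assuming a local Spark session and a socket source; the host, port, and application name are placeholders. It runs in the default micro-batch mode; continuous processing would instead be requested with a Trigger.Continuous trigger on queries that support it:

```scala
import org.apache.spark.sql.SparkSession

object StructuredWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("StructuredWordCount")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // unbounded DataFrame backed by a socket source
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // treat each line as a string, split into words, and count per word
    val counts = lines.as[String]
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```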
Index pages curate the most relevant extracts from our library of academic textbooks. They’ve been created using an in-house natural language model (NLM), each adding context and meaning to key research topics.