
Stream Processing

Stream processing is a method of data processing in which data is processed continuously, in real time, as it is generated. It relies on streaming data pipelines that can handle large volumes of data and process them in parallel. Stream processing is commonly used in applications such as financial trading, social media analytics, and IoT data processing.

Written by Perlego with AI assistance

9 Key excerpts on "Stream Processing"

  • Internet of Things

    Principles and Paradigms

    • Rajkumar Buyya, Amir Vahid Dastjerdi (Authors)
    • 2016 (Publication Date)
    • Morgan Kaufmann (Publisher)
    Normally, timestamps can be implemented in two forms: (1) as a string of absolute time-values, which consumes more resources to be processed, but makes it easier for developers to devise joint algorithms on separate streams; or (2) as a sequence of positive real-time intervals that only record the relative order of data elements in the same stream. The latter form alleviates the stress of the network by reducing the size of the timestamp, but it is harder to reorder the sequence of events across different streams with only in-stream intervals.
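    As an illustration of the trade-off between the two timestamp forms, here is a minimal Java sketch; both event types are assumptions made up for the example, not taken from the book:

        import java.time.Instant;

        // Form 1: absolute timestamps are bulkier on the wire, but elements
        // from separate streams can be ordered by comparing timestamps directly.
        record AbsoluteEvent(Instant timestamp, String payload) {}

        // Form 2: a relative interval only records the gap since the previous
        // element of the same stream, so it is compact, but cross-stream
        // reordering cannot be done from the interval alone.
        record RelativeEvent(long millisSincePrev, String payload) {}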

    8.2.2. Stream Processing

    Stream Processing is a one-pass data-processing paradigm that always keeps the data in motion to achieve low processing latency. As a higher abstraction of messaging systems, Stream Processing supports not only message aggregation and delivery but is also capable of performing real-time asynchronous computation while passing along the information. The most important feature of the streaming paradigm is that it does not have access to all data. Instead, it normally adopts the one-at-a-time processing model, which applies standing queries or established rules to data streams in order to get immediate results upon their arrival.
    All of the computation in this paradigm is handled by the continuously dedicated logic-processing system, which is a scalable, highly available, and fault-tolerant architecture that provides system-level integration of a continuous data stream. As a consequence of the timeliness requirement, computations for analytics and pattern recognition should be relatively simple and generally independent, and it is common to utilize distributed-commodity machines to achieve high throughput with only sub-second latency.
    However, there is another subclass of stream-processing systems that follows the microbatch model. Compared to the aforementioned one-at-a-time model, in which it is difficult to maintain processing state and guarantee high-level fault tolerance efficiently, the microbatch model excels in controllability as a hybrid approach, combining a one-pass streaming pipeline with data batches of very small size. It greatly eases the implementation of windowing and stateful computation, but at the cost of higher processing latency. Although such a model is called microbatch, we still consider it to be a derived form of Stream Processing, as long as the target data remains constantly on the move while it is being processed. In order to better illustrate the basic idea of Stream Processing, we compare it to the well-known batch paradigm in Table 8.1.
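    A minimal sketch of the one-at-a-time model described above (all names are illustrative, not from the book): a standing rule is evaluated on each element the instant it arrives, with no access to the dataset as a whole.

        import java.util.function.Consumer;
        import java.util.function.Predicate;

        // One-at-a-time model: a standing query is applied to each element
        // the moment it arrives; the element is then let go.
        class StandingQuery<T> {
            private final Predicate<T> rule;
            private final Consumer<T> onMatch;

            StandingQuery(Predicate<T> rule, Consumer<T> onMatch) {
                this.rule = rule;
                this.onMatch = onMatch;
            }

            void onElement(T element) {
                if (rule.test(element)) {
                    onMatch.accept(element);  // immediate result upon arrival
                }
                // The element is not retained: the data stays in motion.
            }
        }

        public class OneAtATimeDemo {
            public static void main(String[] args) {
                StandingQuery<Double> highReading =
                    new StandingQuery<>(v -> v > 99.0, v -> System.out.println("alert: " + v));
                for (double v : new double[] {97.2, 99.5, 98.1, 101.3}) {
                    highReading.onElement(v);  // prints alerts for 99.5 and 101.3
                }
            }
        }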
  • Hands-On Data Analysis with Scala

    Perform data collection, processing, manipulation, and visualization with Scala

    Although the preceding example is simple, it highlights the advantages of incremental and continuous data processing. The amount of computation that needs to be redone as a result of new data is generally significantly lower compared to reprocessing the entire dataset. For real-time or near real-time applications, fast response time is a necessity for success. Stream-oriented processing helps us achieve that response-time requirement. Stream Processing does introduce some degree of complexity, because some state information needs to be preserved, and the processing algorithm needs to be refactored and adapted to allow for continuous updates upon the arrival of new data.
    The following diagram illustrates the general model of Stream Processing, where a stream processor is acting upon one or more observations at a time:
    The following facts should be noted regarding this generalized model:
    • We can think of a stream as unbounded information written on tape. Each observation is recorded on the tape as it happens.
    • Observation t1 arrives before Observation t2, and so on. Observation t1 is written first, t2 next, and so on.
    • The stream processor sees Observation t1 before t2, and so on. It computes the results based on t1 first, t2 next, and so on.
    We can also imagine that streams of information are passing through the processor, and it is computing the results as and when this happens.
    In reality, most streaming solutions use microbatches, which accumulate a certain amount of information before handing it off to the processor. This is done to make the processing more efficient.
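    A minimal sketch of the incremental idea above (illustrative, not from the book): a running average is updated in constant time per new observation, so nothing is reprocessed when new data arrives.

        // Incremental (one-pass) running average: constant work per new
        // observation, instead of recomputing over the entire dataset.
        class RunningAverage {
            private long count = 0;
            private double mean = 0.0;

            void observe(double value) {
                count++;
                mean += (value - mean) / count;  // incremental mean update
            }

            double mean() { return mean; }
        }

        public class IncrementalDemo {
            public static void main(String[] args) {
                RunningAverage avg = new RunningAverage();
                for (double v : new double[] {10, 20, 30}) {
                    avg.observe(v);                  // O(1) per observation
                    System.out.println(avg.mean()); // 10.0, 15.0, 20.0
                }
            }
        }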

    Spark Streaming overview

    Spark Streaming is an extension of the core Spark API that enables scalable, fault-tolerant, stream-oriented processing of data. Spark provides the ability to stream data from multiple sources, with a number of key sources being the following:
  • Designing Big Data Platforms

    How to Use, Deploy, and Maintain Big Data Systems

    • Yusuf Aytas (Author)
    • 2021 (Publication Date)
    • Wiley (Publisher)
    Stream Big Data processing is a vital part of a modern Big Data platform, and it can help such a platform from many different perspectives. Streaming gives the platform the ability to cleanse, filter, categorize, and analyze the data while it is in motion. Thus, we don't have to store irrelevant and fruitless data to disk. With stream Big Data processing, we get a chance to respond to user interactions or events swiftly rather than waiting for longer periods. Having fast loops of discovery and action can give businesses a competitive advantage. Streaming solutions bring additional agility with added risk: we can change the processing pipeline and see the results very quickly, but this poses the threat of losing data or processing it incorrectly.
    We can employ a new set of solutions with stream Big Data processing. Some of these solutions are as follows:
    • Fraud detection
    • Anomaly detection
    • Alerting
    • Real‐time monitoring
    • Instant machine learning updates
    • Ad hoc analysis of real‐time data.

    6.2 Defining Stream Data Processing

    In Chapter 5, I defined offline Big Data processing as free of commitment to the user or events with respect to time. On the other hand, stream Big Data processing has to respond promptly. The delay between requests and responses can range from seconds to minutes. Nonetheless, there is still a commitment to respond within a specified amount of time. A stream never ends. Hence, there is a commitment to responding to continuous data when processing streaming Big Data.
    Data streams are unbounded. Yet, most analytical functions, like sum or average, require bounded sets. Stream Processing has the concept of windowing to get bounded sets of data. Windows enable computations that would otherwise not be possible, since a stream has no end. For example, we can't find the average page views per visitor to a website, since there will be more visitors and more page views every second. The solution, called windowing, is to split the data into defined chunks so that we can reason about it in a bounded fashion. There are several types of windows, but the most notable ones are time and count windows, where we divide data into finite chunks by time or count, respectively. In addition to windowing, some Stream Processing frameworks support watermarking, which helps a Stream Processing engine manage the arrival of late events. Fundamentally, a watermark defines a waiting period during which the engine still considers late events. If an event arrives within the watermark, the engine recomputes the query. If the event arrives later than the watermark, the engine drops the event. When dealing with the average page views problem, a better approach would be finding the average page views per visitor in the last five minutes, as shown in Figure 6.1. We can also use a count window to calculate averages after receiving x
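    A minimal plain-Java sketch of the five-minute tumbling window with a watermark for the page-view example above (all names are illustrative; production engines such as Flink or Spark Structured Streaming provide windows and watermarks as built-ins). Events arriving later than the allowed lateness are dropped rather than triggering recomputation, matching the watermark behavior just described.

        import java.util.HashMap;
        import java.util.Map;

        // Tumbling five-minute windows of page views per visitor, with a simple
        // watermark: events later than the allowed lateness are dropped.
        class WindowedPageViews {
            static final long WINDOW_MS = 5 * 60 * 1000;
            static final long ALLOWED_LATENESS_MS = 60 * 1000;  // watermark allowance

            // window start time -> (visitor -> page-view count)
            private final Map<Long, Map<String, Long>> windows = new HashMap<>();
            private long maxEventTime = Long.MIN_VALUE;

            void onPageView(String visitor, long eventTimeMs) {
                maxEventTime = Math.max(maxEventTime, eventTimeMs);
                // Later than the watermark: drop the event instead of recomputing.
                if (eventTimeMs < maxEventTime - ALLOWED_LATENESS_MS) {
                    return;
                }
                long windowStart = (eventTimeMs / WINDOW_MS) * WINDOW_MS;
                windows.computeIfAbsent(windowStart, w -> new HashMap<>())
                       .merge(visitor, 1L, Long::sum);
            }

            // The window makes the unbounded stream bounded: averages are taken
            // over one five-minute chunk at a time.
            double averageViewsPerVisitor(long windowStartMs) {
                Map<String, Long> counts = windows.getOrDefault(windowStartMs, Map.of());
                if (counts.isEmpty()) {
                    return 0.0;
                }
                long total = counts.values().stream().mapToLong(Long::longValue).sum();
                return (double) total / counts.size();
            }
        }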
  • Knowledge Discovery in Big Data from Astronomy and Earth Observation
    • Petr Skoda, Fathalrahman Adam (Authors)
    • 2020 (Publication Date)
    • Elsevier (Publisher)
    Chapter 9

    Real-Time Stream Processing in Astronomy

    Veljko Vujčić; Darko Jevremović, Dr     

    Abstract

    Event processing is an umbrella term for technologies that are conceptually centered around the ingestion, manipulation, and dissemination of events. An event could be any discrete occurrence within a defined domain which is interesting enough to mark down. In most scenarios, events are organized in streams, which can arrive from various sources and have a heterogeneous structure. Event processing concepts and technologies are proven in industries that demand high throughput of data and quick decision making, and they can be tailored for astronomical data stream analytics. They enable real-time pattern application, suitable for Big Data systems where automated reaction is crucial. Upcoming big astronomical surveys such as LSST (2023) and ZTF (already in operation), along with various simulated data streams, can be analyzed by applying event processing mechanisms such as filtering, aggregation, pattern matching, etc.

    Keywords

    Event processing; Pattern matching; Astronomical data streams; Complex event processing; Transient astronomy; Variable astronomical objects

    Acknowledgement

    The gear icon for Fig. 9.1 was made by https://www.flaticon.com/authors/gregor-cresnar .
    Fig. 9.1 Comparison of query paradigms in event processing and common transactional applications.
    Fig. 9.2 A proposed event processing network for an astronomical data stream.
    Fig. 9.3 Main categories of variable objects, based on Eyer and Mowlavi (2008) .
    Fig. 9.4 Four basic flavors of event processing “windows.”

    9.1 Introduction

    One of the obstacles in “Big Data”-related use cases arises when large quantities of data have to be analyzed as they happen. Systems for algorithmic trading, social network analytics, numerous kinds of monitoring applications, and intelligent data transfer and integration architectures all successfully employ stream analysis in distributed systems on real-time Big Data. Even probabilistic and predictive scenarios using complex or uncertain data can be tackled with event processing technologies. Although classification algorithms and knowledge extraction are not new to astronomical science, they were not commonly employed “online”
  • Mining of Massive Datasets
    This technique generalizes to approximating various kinds of sums.

    4.1 The Stream Data Model

    Let us begin by discussing the elements of streams and Stream Processing. We explain the difference between streams and databases and the special problems that arise when dealing with streams. Some typical applications where the stream model applies will be examined.

    4.1.1 A Data-Stream-Management System

    In analogy to a database-management system, we can view a stream processor as a kind of data-management system, the high-level organization of which is suggested in Fig. 4.1.

    Figure 4.1 A data-stream-management system: input streams enter a stream processor that answers standing and ad-hoc queries, backed by a limited working storage and an archival storage, and emits output streams.

    Any number of streams can enter the system. Each stream can provide elements at its own schedule; they need not have the same data rates or data types, and the time between elements of one stream need not be uniform. The fact that the rate of arrival of stream elements is not under the control of the system distinguishes Stream Processing from the processing of data that goes on within a database-management system. The latter system controls the rate at which data is read from the disk, and therefore never has to worry about data getting lost as it attempts to execute queries.

    Streams may be archived in a large archival store, but we assume it is not possible to answer queries from the archival store. It could be examined only under special circumstances using time-consuming retrieval processes. There is also a working store, into which summaries or parts of streams may be placed, and which can be used for answering queries. The working store might be disk, or it might be main memory, depending on how fast we need to process queries.
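    A hedged sketch of the organization in Figure 4.1 (every type and name here is illustrative, not from the book): arriving elements go to an archival store that queries cannot read, a bounded working store keeps recent elements for ad-hoc queries, and standing queries are evaluated on every arrival.

        import java.util.ArrayDeque;
        import java.util.ArrayList;
        import java.util.Deque;
        import java.util.List;
        import java.util.function.IntConsumer;

        // Skeleton of a data-stream-management system in the spirit of Fig. 4.1.
        class StreamManager {
            // Working store: a bounded summary of recent elements, usable by queries.
            private final Deque<Integer> workingStore = new ArrayDeque<>();
            private final int workingCapacity;
            // Archival store: everything is kept, but queries may not read it.
            private final List<Integer> archivalStore = new ArrayList<>();
            // Standing queries run against every arriving element.
            private final List<IntConsumer> standingQueries = new ArrayList<>();

            StreamManager(int workingCapacity) {
                this.workingCapacity = workingCapacity;
            }

            void registerStandingQuery(IntConsumer query) {
                standingQueries.add(query);
            }

            // Arrival is driven by the stream, not by the system.
            void onArrival(int element) {
                archivalStore.add(element);
                if (workingStore.size() == workingCapacity) {
                    workingStore.removeFirst();   // working store stays bounded
                }
                workingStore.addLast(element);
                standingQueries.forEach(q -> q.accept(element));
            }

            // Ad-hoc queries are answered from the working store only.
            double adHocAverage() {
                return workingStore.stream().mapToInt(Integer::intValue).average().orElse(0.0);
            }
        }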
  • Knowledge Discovery from Data Streams
    Chapter 2 Introduction to Data Streams

    Nowadays, we are in the presence of sources of data produced continuously at high speed. Examples include TCP/IP traffic, GPS data, mobile calls, emails, sensor networks, customer click streams, etc. These data sources continuously generate huge amounts of data from nonstationary distributions. Storage, maintenance, and querying of data streams brought new challenges to the database and data mining communities. The database community has developed Data Stream Management Systems (DSMS) for continuous querying, compact data structures (sketches and summaries), and sub-linear algorithms for massive dataset analysis. In this chapter, we discuss relevant issues and illustrative techniques developed in Stream Processing that might be relevant for data stream mining.

    2.1 Data Stream Models

    Data streams can be seen as stochastic processes in which events occur continuously and independently of each other. Querying data streams is quite different from querying in the conventional relational model. A key idea is that operating on the data stream model does not preclude the use of data in conventional stored relations: data might be transient. What makes processing data streams different from the conventional relational model? The main differences are summarized in Table 2.1. Some relevant differences include:

    1. The data elements in the stream arrive on-line.
    2. The system has no control over the order in which data elements arrive, either within a data stream or across data streams.
    3. Data streams are potentially unbounded in size.
    4. Once an element from a data stream has been processed, it is discarded or archived. It cannot be retrieved easily unless it is explicitly stored in memory, which is small relative to the size of the data streams.

    In the streaming model (Muthukrishnan, 2005), the input elements f_1, f_2, ..., f_j, ... arrive sequentially, item by item, and describe an underlying
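    A minimal sketch consistent with points 3 and 4 of the list above (the code is illustrative, not from the book): reservoir sampling examines each element exactly once, keeps a uniform random sample of fixed size k in memory, and discards everything else.

        import java.util.ArrayList;
        import java.util.List;
        import java.util.Random;

        // Algorithm R: a uniform sample of size k from a stream of unknown,
        // potentially unbounded length, using O(k) memory and one pass.
        class Reservoir<T> {
            private final List<T> sample = new ArrayList<>();
            private final int k;
            private long seen = 0;
            private final Random rng = new Random();

            Reservoir(int k) {
                this.k = k;
            }

            void onElement(T element) {
                seen++;
                if (sample.size() < k) {
                    sample.add(element);                        // fill the reservoir first
                } else {
                    long j = (long) (rng.nextDouble() * seen);  // uniform in [0, seen)
                    if (j < k) {
                        sample.set((int) j, element);           // keep with probability k/seen
                    }
                }
                // The element itself is now discarded; only the fixed-size sample persists.
            }

            List<T> currentSample() {
                return sample;
            }
        }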
  • Big Java

    Early Objects

    • Cay S. Horstmann (Author)
    • 2019 (Publication Date)
    • Wiley (Publisher)
    Chapter 19 Stream Processing

    Chapter Goals: to be able to develop filter/map/reduce strategies for solving data processing problems; to convert between collections and streams; to use function objects for transformations and predicates; to work with the Optional type; to be able to express common algorithms as stream operations.

    Streams let you process data by specifying what you want to have done, leaving the details to the stream library. The library can execute operations lazily, skipping those that are not needed for computing the result, or distribute work over multiple processors. This is particularly useful when working with very large data sets. Moreover, stream computations are often easier to understand because they express the intent of the programmer more clearly than explicit loops. In this chapter, you will learn how to use Java streams for solving common “big data” processing problems.

    19.1 The Stream Concept

    When analyzing data, you often write a loop that examines each item and collects the results. For example, suppose you have a list of words and want to know how many words have more than ten letters.
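    As a hedged illustration of the loop-versus-stream contrast the chapter opens with (the word list is made up for the example), counting words with more than ten letters both ways:

        import java.util.List;

        public class LongWords {
            public static void main(String[] args) {
                List<String> words = List.of(
                    "stream", "parallelizable", "latency", "microbatching", "observability");

                // Explicit loop: spell out how each item is examined.
                long loopCount = 0;
                for (String w : words) {
                    if (w.length() > 10) {
                        loopCount++;
                    }
                }

                // Stream pipeline: state what should be done, not how.
                long streamCount = words.stream()
                    .filter(w -> w.length() > 10)
                    .count();

                System.out.println(loopCount + " " + streamCount);  // 3 3
            }
        }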
  • Mastering Hadoop 3

    Big data processing at scale to unlock unique business insights

    • Chanchal Singh, Manish Kumar (Authors)
    • 2019 (Publication Date)
    • Packt Publishing (Publisher)
    Streaming is about doing data processing not on bounded data, but on unbounded data. Typical datasets are bounded, meaning they are complete. At the very least, you will process data as if it were complete. Realistically, we know that there will always be new data, but as far as data processing is concerned, we treat the dataset as if it were complete. In the case of bounded data, data processing is done in phases; until one phase is complete, the next phase of data processing does not start. Another way to think about bounded data processing is that we will be done analyzing the data before new data comes in. Bounded datasets are finite in size. The following diagram represents how bounded data is processed using a typical MapReduce batch processing engine:
    On the other hand, if you have an unbounded dataset (also known as an infinite dataset), it is never complete; there is always new data coming in, and typically, data is coming in even as you are analyzing it. So, we tend to think of analysis on unbounded datasets as a temporary thing, carried out many times and valid only at a particular point in time. Streaming, then, is essentially data processing on unbounded data. Bounded data is data at rest; Stream Processing is how you deal with data that is not at rest, that is, with unbounded data. More broadly, though, people often talk about streaming as an execution engine. An important feature of stream data processing is that event times are highly unordered, which means you need some kind of time-based shuffling in your pipeline to analyze the data in the context of when the events actually occurred. The following diagram represents unbounded, infinite datasets:

    Stream data ingestion

    Data ingestion is a mechanism by which data is moved from a specific type of source to destination storage, where it can be used for advanced analytics. With very large data volumes, data is generally streamed to the destination storage, on the condition that the source and destination systems are capable of handling continuous streams of data. Stream data ingestion can be of one of two types: one is event-based, and the other uses message queues.
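    As a hedged illustration of message-queue-based ingestion, here is a minimal Kafka consumer in Java; the broker address, group id, and topic name are assumptions made for the example:

        import java.time.Duration;
        import java.util.List;
        import java.util.Properties;
        import org.apache.kafka.clients.consumer.ConsumerRecord;
        import org.apache.kafka.clients.consumer.ConsumerRecords;
        import org.apache.kafka.clients.consumer.KafkaConsumer;

        public class IngestFromQueue {
            public static void main(String[] args) {
                Properties props = new Properties();
                props.put("bootstrap.servers", "localhost:9092");   // assumed broker
                props.put("group.id", "ingestion-demo");            // assumed group id
                props.put("key.deserializer",
                          "org.apache.kafka.common.serialization.StringDeserializer");
                props.put("value.deserializer",
                          "org.apache.kafka.common.serialization.StringDeserializer");

                try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                    consumer.subscribe(List.of("events"));          // assumed topic
                    // Continuous ingestion loop: records flow from source to sink.
                    while (true) {
                        ConsumerRecords<String, String> records =
                            consumer.poll(Duration.ofMillis(500));
                        for (ConsumerRecord<String, String> record : records) {
                            // Hand off to destination storage / further processing here.
                            System.out.printf("offset=%d value=%s%n",
                                record.offset(), record.value());
                        }
                    }
                }
            }
        }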
  • Essential PySpark for Scalable Data Analytics
    Examples of near real-time analytics use cases are presented, in detail, in the Real-time analytics industry use cases section. Apache Spark was designed to handle near real-time analytics use cases that require maximum throughput for large volumes of data with scalability. Now that you have an understanding of streaming data sources, data sinks, and the kind of real-time use cases that Spark's Structured Streaming is best suited to solve, let's take a deeper dive into the actual streaming engines.

    Stream Processing engines

    A Stream Processing engine is the most critical component of any real-time data analytics system. Its role is to continuously process events from a streaming data source and ingest them into a streaming data sink. The Stream Processing engine can process events as they arrive, in a true real-time fashion, or group a subset of events into a small batch and process one micro-batch at a time, in a near real-time manner. The choice of engine greatly depends on the type of use case and the processing-latency requirements. Some examples of modern streaming engines include Apache Storm, Apache Spark, Apache Flink, and Kafka Streams. Apache Spark comes with a Stream Processing engine called Structured Streaming, which is based on Spark's SQL engine and DataFrame APIs. Structured Streaming uses the micro-batch style of processing and treats each incoming micro-batch as a small Spark DataFrame, applying DataFrame operations to it just like any other Spark DataFrame. The programming model of Structured Streaming treats the input data stream as an unbounded table and processes incoming events as a continuous stream of micro-batches.
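    A minimal sketch of Structured Streaming's micro-batch model in Java, adapted from the widely used socket word-count pattern; the host, port, and console sink are assumptions for the example:

        import org.apache.spark.sql.Dataset;
        import org.apache.spark.sql.Row;
        import org.apache.spark.sql.SparkSession;
        import org.apache.spark.sql.streaming.StreamingQuery;

        public class MicroBatchDemo {
            public static void main(String[] args) throws Exception {
                SparkSession spark = SparkSession.builder()
                    .appName("StructuredStreamingSketch")
                    .getOrCreate();

                // The stream is treated as an unbounded table to which new rows append.
                Dataset<Row> lines = spark.readStream()
                    .format("socket")                 // assumed source for the sketch
                    .option("host", "localhost")
                    .option("port", 9999)
                    .load();

                // Ordinary DataFrame operations are applied to each micro-batch.
                Dataset<Row> counts = lines.groupBy("value").count();

                // Each micro-batch updates the result table written to the sink.
                StreamingQuery query = counts.writeStream()
                    .outputMode("complete")
                    .format("console")
                    .start();

                query.awaitTermination();
            }
        }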
Index pages curate the most relevant extracts from our library of academic textbooks. They’ve been created using an in-house natural language model (NLM), each adding context and meaning to key research topics.