Section 1: Introduction to Data Streaming and Amazon Kinesis
In this section, you will be introduced to the concept of data streams and how they are used to create scalable data solutions.
This section comprises the following chapters:
- Chapter 1, What Are Data Streams?
- Chapter 2, Messaging and Data Streaming in AWS
- Chapter 3, The SmartCity Bike-Sharing Service
Chapter 1: What Are Data Streams?
A data stream is a system where data continuously flows from multiple sources, just like water flows through a stream. The data is often produced and collected simultaneously in a continuous flow of many small files or records. Data streams are utilized by a wide range of business, medical, government, social media, and mobile applications. These applications include financial applications for the stock market and e-commerce ordering systems that collect orders and cover fulfillment of delivery.
In the entertainment space, live data is produced by sensing devices embedded in player equipment, video game players generate large amounts of data at a massive scale, and there are new social media posts thousands of times per second. Governments also leverage streaming data and geospatial services to monitor land, wildlife, and other activities.
Data volume and velocity are increasing at faster rates, creating new challenges in data processing and analytics. This book will detail these challenges and demonstrate how Amazon Kinesis can be used to address them. We will begin by discussing key concepts related to messaging in a technology-agnostic form to provide a solid foundation for building your Kinesis knowledge.
Incorporating data streams into your application architecture will allow you to deliver high-performance solutions that are secure, scalable, and fast. In this chapter, we will cover core streaming concepts so that you will have a detailed understanding of their application to distributed systems. You will learn what a data stream is, how to leverage data streams to scale, and examine a number of high-level use cases.
This chapter covers the following topics:
- Introducing data streams
- Challenges associated with distributed systems
- Overview of messaging concepts
- Examples of data streaming
Introducing data streams
Data streams are a way of storing a sequence of messages. They enable us to design systems where we think about state as a series of events instead of only entities and values, or rows and columns in a database. This shift in mindset and technology enables real-time analytics to extract the value from data by acting on it before it is stale. They also enable organizations to design and develop resilient software based on microservice architectures by helping them to decouple systems. We will begin with an overview of streaming data sources, why real-time data analysis is valuable, and how they can be used architecturally to decouple systems. We will then review the core challenges associated with distributed systems, and conclude with an overview of key messaging concepts and some high-level examples. Messages can contain a wide variety of information and come from different sources, so let's look at the primary sources and data formats.
Sources of data
The proliferation of data steadily increases from sources such as social media, IoT devices, web clickstreams, application logs, and video cameras. This data poses challenges to most systems, since it is typically high-velocity, intermittent, and bursty, making it difficult to adequately provision and design downstream systems. Payloads are generally small, except when containing audio or video data, and come in a variety of formats.
In this book, we will be focusing on three data formats. These formats include the following:
- JavaScript Object Notation (JSON)
- Log files
- Time-encoded binary files such as video
JSON streams
JSON has become the dominant format for message serialization over the past 10 years. It is a lightweight data interchange format that is easy for humans to read and write and is based on the JavaScript object syntax. It has two data structures – hash tables and lists. A hash table consists of key-value pairs, {"key":"value"}, where the keys must be unique. A list is a set of values in a specific order, ["value 1", "value 2"]. The following code sample shows a sample IoT JSON message:
{
"deviceid" : "device001",
"eventTime": -192778200,
"temp" : 68.4,
"humidity" : 77.3,
"coords" : {
"latitude" : 32.779039,
"longitude" : -96.808660
}
}
Log file streams
Log files come in a variety of formats. Common ones include Apache Commons Logging, Apache Combined Log, Apache Error Log, and RFC3164 Syslog. They are plain text, and usually each line, delineated by a newline ('\n') character, is a separate log entry. In the following sample log, we see an HTTP GET request where the IP address is 10.13.37.01, the datetime of the request, the HTTP verb, the URL fragment, the HTTP version, the response code, and the size of the result.
The sample log line in Apache Commons Logging format is as follows:
10.13.37.01 - - [03/Sep/2017:12:00:01 +0830] "GET /mailman/listinfo/test HTTP/1.1" 200 2457
Time-encoded binary streams
Time-encoded binary streams consist of a time series of records where each record is related to the adjacent records (prior and subsequent records). These can be used for a wide variety of sensor data, from audio streams and RADAR signals to video streams. Throughout this book, the primary focus will be video streams and their applications.
Figure 1.1 – Time-encoded video data
As shown in Figure 1.1, video streams are composed of fragments, where each fragment is a self-contained sequence of media frames. There are no dependencies between fragments. We will discuss video streams in more detail in Chapter 7, Kinesis Video Streams. Now that we've covered the types of data that we'll be processing, let's take a step back to understand the value of real-time data in analytics.
The value of real-time data in analytics
Analysis is done to support decision making by individuals, organizations, or computer programs. Traditionally, data analysis has been done on batches of data, usually in long-running jobs that occur overnight and that happen periodically at predetermined times: nightly, weekly, quarterly, and so on. This not only limits the scope of actions available to decisions makers, but it is also only providing them with a representation of the past environment. Information is now available seconds after it is produced, so we need to design systems that provide decision makers with the freshest data available to make timely decisions.
The OODA – Observe, Orient, Decide, Act – loop is a decision-making, conceptual framework that describes how decisions are made when reacting to an event. By breaking it down into these four components, we can optimize each to reduce the overall cycle time. The key idea is that if we make better decisions quicker than our opponent, we can outmaneuver them and win. By moving from batch to real-time analytics, we are reducing the observed portion of this cycle.
John Boyd
John Boyd was a USAF colonel and military strategist. He developed the OODA loop to better understand pilot combat operations. It has since been expanded and is used at a more strategic level by the military, sports teams, and businesses.
By reducing the OODA loop cycle time, new actions become available. They can be taken while events are unfolding and not merely responding to them after the event has occurred. These time-critica...