Kafka Up and Running for Network DevOps

Set Your Network Data in Motion

Eric Chou
About This Book

Today's network is about agility, automation, and continuous improvement. In Kafka Up and Running for Network DevOps, we will be on a journey to learn and set up the hugely popular Apache Kafka data messaging system. Kafka is unique in its principle of treating network data as a continuous flow of information that can adapt to ever-changing business requirements. Whether you need a system to aggregate log messages, collect metrics, or something else, Kafka can be the reliable, highly redundant system you want.

We will begin by learning the core concepts of Kafka, followed by detailed steps for setting up a Kafka system in a lab environment. For the production environment, we will take advantage of the various public cloud provider offerings. Next, we will set up our Kafka cluster in Amazon's managed Kafka service to host our Kafka cluster in the AWS cloud. We will also learn about AWS Kinesis, Azure Event Hub, and Google Cloud Pub/Sub. Finally, the book will illustrate several use cases for integrating Kafka with our network, from data enhancement and monitoring to an event-driven architecture.

The Network DevOps Series is a series of books targeted at the next generation of Network Engineers who want to take advantage of the powerful tools and projects in modern software development and the open-source communities.


Information

Publisher: PublishDrive
Year: 2021
ISBN: 9781957046013

Chapter 1. Kafka Introduction

As mentioned in the introduction section, Apache Kafka is a high-throughput, low-latency platform for handling real-time data feeds.
At first glance, 'low-latency, high-throughput for real-time data feeds' might not sound like much. After all, every open-source project and commercial vendor (and their brother) claims to be low-latency and high-throughput. But once you consider the type of companies using Kafka in their products and services, such as Uber, Netflix, and LinkedIn, you quickly realize how significant that claim is. When we click the like button on a LinkedIn post, it needs to appear on the post right away. That is low latency. If we consider how many Netflix movies are streaming every second, that is high throughput. Of course, the customers of these companies expect all of these operations to take place in real time.
According to Netflix's Kafka Inside Keystone Pipeline post, "700 billion messages are ingested on an average day" by their 400+ Kafka brokers. Did they say they process 700 billion messages a day in real time? Let's also consider Uber's use case, Real-Time Exactly-Once Ad Event Processing, for its two-way UberEats marketplace. There, the messages need to be fast and reliable, but Uber also needs to ensure the events are processed only once, with no overcount or undercount. The events need to be processed exactly once across all consumers, full stop.
Kafka is excellent at achieving its goals for these demanding projects. But how did this fantastic tool come about? First, let's look into the history of Kafka.

History of Kafka

Kafka was originally developed at LinkedIn by Jay Kreps, Neha Narkhede, and Jun Rao (Wikipedia). As the story goes, Jay Kreps named the project Kafka because he liked the author Franz Kafka's work, and because Apache Kafka is "a system optimized for writing," a writer's name seemed fitting.
The project was released as an open-source project with the Apache Software Foundation in early 2011 and went from incubation to a top-level Apache project on October 23, 2012. It is written in Java and Scala with significant community backing.
The three original developers left LinkedIn and founded the company Confluent in 2014. The company aims to Set Data in Motion, with (surprise!) Kafka at the center of that idea. As a result, many of the Kafka-related projects, documentation, products, and initiatives are actively developed and sponsored by Confluent.

Kafka Use Cases

At the center of the Kafka architecture is the idea of event streaming. Software systems drive our world. These systems are interconnected, always-on, and automated. Kafka provides the centralized middle ground for these systems to exchange information, or events, in the form of topics (or categories). Producer systems can send events to a particular topic, while consumer systems can receive these events via subscription.
We will use the term events and messages interchangeably in this book to refer to the data being exchanged by producers, consumers, and Kafka.
In the words of the Kafka documentation, event streaming is analogous to the central nervous system of the human body, which connects tissues in different parts of the body.
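As a concrete illustration, here is a minimal sketch using the kafka-python library; the broker address and the topic name network-events are hypothetical, assuming a single broker running locally:

from kafka import KafkaProducer, KafkaConsumer

# Producer: write an event to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("network-events", b"interface GigabitEthernet0/1 down")
producer.flush()

# Consumer: subscribe to the same topic and read events as they arrive.
consumer = KafkaConsumer(
    "network-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",   # start from the oldest retained event
    consumer_timeout_ms=5000,       # stop iterating after 5s of inactivity
)
for event in consumer:
    print(event.value.decode())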
In terms of network engineering, in my opinion, we can use Kafka event streaming in a few different scenarios:
  • We can use Kafka to process transactions in real-time, such as device provisioning from warehouse shipment to fully functional in a data center.
  • We can use Kafka to implement an event-driven architecture. Kafka can be used to track and analyze changes in network events, such as BGP neighbor relationships or interface flapping.
  • We can use Kafka to capture and analyze IoT and wireless sensor data continuously. This process can be done in a distributed fashion, with Kafka servers across different regions.
  • We can use Kafka to connect, store, and make available data produced by a single source to multiple destinations. An example would be to store a single set of network SNMP data in a Kafka topic, which multiple monitoring systems can consume. This allows us to poll the network device only once, reducing CPU and network overhead (see the sketch after this list).
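To make the last use case concrete, here is a minimal kafka-python sketch (the broker address, the topic snmp-data, and the group names are hypothetical) in which two monitoring systems consume the same SNMP data independently by joining different consumer groups, while the device itself is polled only once:

from kafka import KafkaConsumer

# Each monitoring system joins its own consumer group, so each group
# receives a full copy of the topic without re-polling the device.
def monitoring_consumer(group_id):
    return KafkaConsumer(
        "snmp-data",
        bootstrap_servers="localhost:9092",
        group_id=group_id,
        auto_offset_reset="earliest",
        consumer_timeout_ms=5000,
    )

dashboard = monitoring_consumer("dashboard-system")
alerting = monitoring_consumer("alerting-system")

for record in dashboard:
    print("dashboard saw:", record.value)
for record in alerting:
    print("alerting saw:", record.value)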
If we combine the above use cases, Kafka allows us to:
  • Continuously capture events
  • Connect different parts of the system
  • Immediately react to a change in system state
  • Minimize the impact on the network devices
We will look at some of the disadvantages of Kafka in the next section.

Disadvantages of Kafka

If Kafka is so great, why doesn't everybody use it? Of course, no system can be perfect. Like many, if not all, system design approaches, the design of Kafka is a story of tradeoffs. What are some of the disadvantages of Kafka? Let's take a look at a few of them:
  • Kafka clusters can be complex and hard to set up.
  • Managing a Kafka cluster comes with a high learning curve.
  • By design, Kafka does not contain some standard features found in other storage solutions. For example, Kafka does not by default have message validation for producers.
  • Kafka has a fast-evolving ecosystem that sometimes makes keeping systems up to date a challenge.
Even with the aforementioned disadvantages, in my opinion, the benefits of Kafka still outweigh them. Let us take a look at some of the key concepts in Kafka.

Kafka Concepts

Kafka was developed with the newer data pipeline in mind, which treats data as a continuous stream. As a result, there are several parts and concepts related to the Kafka data streaming system:
  1. In a distributed system, we need a way to build, manage, scale, and maintain the group of distributed servers. The Kafka system uses Zookeeper, another open-source project, to manage the servers within the cluster. The Kafka servers containing the data themselves (topics and events) are called Brokers.
  2. The system allows for producing (writing) and subscribing to (reading) messages continuously. Hence, the two sides are appropriately named Producers and Consumers. Producers and consumers generally take the form of SDKs or APIs sitting on servers communicating with the Kafka cluster. In this book, we will use the Python SDK and shell scripts as producers and consumers.
  3. The system needs to store the events for some time. Storage generally takes the form of Topics consisting of Partitions. Within the partitions, each event is labeled with a number called an offset. This is the identification of the message, which we use to keep track of what the consumers have consumed (see the sketch after this list).
  4. The system needs to process the streams of events as they occur and react to any unforeseen circumstances, such as backing up partitions and reallocating them when a Broker unexpectedly goes down. The components responsible for these processes are Zookeeper and the Brokers.
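To make offsets (point 3) tangible, here is a minimal kafka-python sketch (hypothetical broker, topic, and group names) that prints the partition and offset of each message it reads:

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "network-events",
    bootstrap_servers="localhost:9092",
    group_id="demo-group",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
)
# Every record carries the partition it was stored in and its offset
# within that partition -- the bookkeeping described in point 3 above.
for record in consumer:
    print(f"partition={record.partition} offset={record.offset} value={record.value}")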
Here is a generalized overview of the Kafka cluster:
Figure 1.1 Kafka Overview (Source: https://upload.wikimedia.org/wikipedia/commons/6/64/Overview_of_Apache_Kafka.svg)
We will go over the components in more detail. Let’s start with Zookeepers.

Zookeepers

Apache Zookeeper is itself a popular open-source project under the Apache Software Foundation. Its primary function is to provide reliable distributed coordination between applications. Why is the project named Zookeeper, you ask? The project received its funny name because it started as a sub-project of Hadoop. Since many of the projects in Hadoop are named after zoo animals, Zookeeper received its name for its management function. What started as a Hadoop sub-project is now a top-level Apache project (as of 2019) in its own right.
There can be multiple Zookeeper nodes in a Kafka cluster, and the recommended number is three to five in a production Kafka cluster. The number should be odd to maintain a quorum for leader election. However, it should also be kept as low as possible to minimize overhead, as the quick calculation below shows.
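As a quick back-of-the-envelope check of that quorum arithmetic, a small Python sketch:

# A Zookeeper ensemble needs a majority (quorum) of its n nodes alive,
# so it tolerates (n - 1) // 2 failures -- hence the advice of an odd
# number, usually 3 or 5, in production.
for n in (1, 3, 5, 7):
    print(f"{n} nodes: quorum = {n // 2 + 1}, tolerated failures = {(n - 1) // 2}")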
For more information on Zookeeper, please see Apache Zookeeper.
It is important to realize that, for our purposes, Zookeeper is a separate service with its own configuration file and runtime service. It is also important to note that Kafka brokers require a functioning Zookeeper before they can be put into service. Zookeeper keeps the state of the cluster, such as Brokers, Topics, users, and more.

Brokers

The Kafka Brokers are the workhorses of the Kafka cluster. Generally, a single Kafka server is one broker. We will see how we can run multiple brokers on a single machine later in the book, but that is more of a hack than a setup we would use in production. There has to be at least one Broker per Kafka cluster. Each broker has a broker ID that it uses to register with Zookeeper.
The Kafka broker is where the producers and consumers communicate with the cluster when they need to write or read messages from a topic. Brokers handle most of the requests from clients. The broker receives messages from producers, assigns offsets to them, and commits them to storage on disk. At that point, the broker sends a confirmation to the producer to signal the successful commit of the message. The broker also services consumers, responding to their message pull requests.
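We can observe that confirmation from the producer side. In kafka-python, send() is asynchronous and returns a future; blocking on it yields the metadata the broker sends back after committing the message (broker address and topic name are hypothetical):

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
# .get() blocks until the broker confirms the commit and returns the
# partition and offset it assigned to the message.
metadata = producer.send("network-events", b"snmp poll result").get(timeout=10)
print(f"committed to partition {metadata.partition} at offset {metadata.offset}")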
Depending on the hardware, one broker can handle thousands of requests. We will have at least one broker per cluster, but having more than one broker allows redundancy and additional performance gains. Kafka brokers are designed to be operated as part of a cluster. Within a cluster, one broker is elected as the controller. The controller is responsible for assigning partitions to brokers and monitoring for broker failures.
As we will see in the next section on Topics and Partitions, when we have multiple brokers, the same topic can be distributed into different partitions. A leader is elected for each partition to service messages. The partitions can also be assigned to multiple brokers, which serves as replication for redundancy. Clients can have concurrent connections to multiple brokers for scalability.
Don't worry too much about leaders and controllers among Kafka brokers at this point. For now, it is enough to know they exist and what their general functions are; the leadership election happens automatically within the cluster.
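For the curious, kafka-python's admin client can show the brokers and the current controller. A minimal sketch (hypothetical broker address; the exact shape of the returned metadata may vary by library version):

from kafka.admin import KafkaAdminClient

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
# describe_cluster() returns cluster metadata, including the list of
# brokers and the id of the broker currently acting as controller.
print(admin.describe_cluster())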
We have mentioned that Kafka messages can be retained on the Kafka cluster for some time. Once committed by the broker, a message is by default kept on disk for seven days or until the topic reaches a certain size, 1 GB by default. Both of these parameters are configurable options on the broker, as sketched below. With messages retained on the broker for a while, the consumers can be down for some time before the messages are deleted.
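The retention knobs can also be overridden per topic. A minimal sketch with kafka-python's admin client (hypothetical broker and topic names; retention.ms and retention.bytes are standard topic-level configuration keys):

from kafka.admin import KafkaAdminClient, ConfigResource, ConfigResourceType

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
# Keep messages for 7 days or until the partition log reaches 1 GB,
# mirroring the defaults the text describes.
topic_config = ConfigResource(
    ConfigResourceType.TOPIC,
    "network-events",
    configs={
        "retention.ms": str(7 * 24 * 60 * 60 * 1000),
        "retention.bytes": str(1024 ** 3),
    },
)
admin.alter_configs([topic_config])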

Topics, Partitions, and Offsets

A topic is simply a category or the name of a feed. We can configure our cluster to allow automatic topic creation when a sender feeds our cluster a topic that does not exist. A good analogy for a topic is a file folder on your computer: just as we group related files into a folder, we group related messages into a topic.
Kafka's topics are divided into several partitions. Multiple partitions per topic allow data to be split across multiple brokers. Having the messages across numerous brokers allows parallel processing. When we want to increase read-write performance, one of the options is to increase the number of brokers and the number of partitions for our topics, as sketched below.
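Creating such a topic explicitly is straightforward with kafka-python's admin client; a minimal sketch (hypothetical broker address, mirroring the two-partition, replication-factor-2 layout in the figure below):

from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
# Two partitions let the topic span brokers for parallelism;
# replication_factor=2 keeps a second copy of each partition.
admin.create_topics([
    NewTopic(name="topic-a", num_partitions=2, replication_factor=2)
])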
In the figure below, we can see Topic A divided into two partitions, each with a replication factor of 2 for redundancy. The placement of the partitions is intelligently managed by Zookeeper among the three brokers:
Figure 1.2 Kafka Topics and Partitions
Each of the partitions contains the actual messages in an orderly fashion. The messages are immutable, meaning they cannot be changed once written to the partition. The messages are written to a partition in an append-only manner. Once a message is written to a partition, the broker commits the message with a commit log. Please note that since each topic will likely have multiple partitions, the ordering of messages across the topic is not guaranteed. However, if we give a message a key, Kafka will put messages with the same key in the same partition, and message ordering within that partition is guaranteed (see the sketch below). We will see this in an example in the next chapter.
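Sketching the keyed-ordering behavior with kafka-python (hypothetical broker, topic, and key names): messages that share a key land in the same partition, so this device's state changes are read back in order.

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
# All three events share the same key, so they hash to the same
# partition and their relative order is preserved for consumers.
for state in (b"up", b"down", b"up"):
    producer.send("network-events", key=b"router1-Gi0/1", value=state)
producer.flush()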
Each of the messages in the partition is assigned a number called an offset:
Figure 1.3 Message Offsets
The concept of the offset is essential; this offset number gives a point of reference within the messages. It allows Zookeeper to know, when a producer sends a new message to an existing topic, where the new message should be appended. The offset also allows the Kafka cluster...
