Computer Science

Apache Kafka

Apache Kafka is an open-source distributed event streaming platform used to handle real-time data feeds. It is designed for high volumes of data and provides a scalable, fault-tolerant architecture for processing and storing data. Kafka is widely used in big data and real-time analytics applications.

Written by Perlego with AI-assistance

9 Key excerpts on "Apache Kafka"

  • Book cover image for: Data Lake for Enterprises
    • Tomcy John, Pankaj Misra (Authors)
    • 2017 (Publication Date)
    • Packt Publishing (Publisher)
    The real-time data from business applications that we are going to handle is customers' behavioral data coming from user interaction with the enterprise's website. For example, data such as page visits, link clicks, location details, browser details, and so on will flow into Flume. Using the publish-subscribe capability of Kafka, it is then streamed to the Data Ingestion Layer. The Data Ingestion Layer will handle multi-target ingestion, where one path goes to the Data Storage Layer (HDFS) and the other goes to the Data Ingestion Layer for required processing as needed.
    What is Apache Kafka? Apache Kafka is an open-source stream processing platform developed by the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is essentially a "massively scalable pub/sub message queue architected as a distributed transaction log," making it highly valuable for enterprise infrastructures to process streaming data. - Wikipedia
    The next sections of this chapter will give you more detail on what Kafka is. In one sentence, Kafka gives a level of indirection by which it disconnects the source from the consumer, and it also provides the capabilities that a messaging layer should, as detailed in the previous section.
    Why Apache Kafka We are using Apache Kafka as the stream data platform (MOM: message-oriented middleware). The core reasons for choosing Kafka are its high reliability and its ability to deal with data at very low latency. Message-oriented middleware (MOM) is software or hardware infrastructure supporting the sending and receiving of messages between distributed systems. - Wikipedia
    Apache Kafka has some key attributes that make it an ideal choice for achieving the capability we are looking to implement in the Data Lake. They are bulleted below:
    • Scalability: Capable of handling high-velocity and high-volume data
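    To make the publish-subscribe indirection described in this excerpt concrete, here is a minimal sketch of the publishing side using the standard Java Kafka client. The topic name, bootstrap address, and event payload are illustrative assumptions, not values taken from the book:

        import org.apache.kafka.clients.producer.KafkaProducer;
        import org.apache.kafka.clients.producer.ProducerRecord;
        import org.apache.kafka.common.serialization.StringSerializer;
        import java.util.Properties;

        public class BehavioralEventProducer {
            public static void main(String[] args) {
                Properties props = new Properties();
                // Hypothetical broker address; replace with your cluster's bootstrap servers.
                props.put("bootstrap.servers", "localhost:9092");
                props.put("key.serializer", StringSerializer.class.getName());
                props.put("value.serializer", StringSerializer.class.getName());

                try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                    // A page-visit event keyed by a (hypothetical) session id, so related events stay together.
                    ProducerRecord<String, String> record =
                        new ProducerRecord<>("behavioral-events", "session-42",
                            "{\"event\":\"page_visit\",\"url\":\"/home\",\"browser\":\"Firefox\"}");
                    producer.send(record);   // asynchronous publish; the consuming side is fully decoupled
                    producer.flush();
                }
            }
        }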
  • Book cover image for: Machine Learning

    Machine Learning

    Hands-On for Developers and Technical Professionals

    • Jason Bell (Author)
    • 2020 (Publication Date)
    • Wiley (Publisher)
    One name that kept being repeated was Apache Kafka, and that’s what we’ll concentrate on in this chapter.
    What Is Kafka? Kafka is a stream processing platform. It was originally developed by LinkedIn and then open sourced in 2011. Kafka provides a high-performance, fault-tolerant streaming data service. It acts on a publisher/subscriber (pub/sub) message queue, and if you want real-time data feeds, then Kafka is an option that should be seriously considered.
    How Does It Work? The true power behind Kafka is that it’s scalable and can be run on one or multiple servers, known as brokers. Messages are sent to topics, producers send messages to the broker, and consumers take messages from the broker. To the producers and consumers subscribed to the system, it would appear as a stand-alone processing engine, but production systems are built on many machines. (Figure 12.2: Relationship of producers to the Kafka cluster and consumers.) It can handle throughput of millions of messages, dependent on the physical disk and RAM on the machines, and is fault tolerant. Messages are sent to topics, which are written sequentially in an immutable log. Kafka can support many topics and can be replicated and partitioned (see Figure 12.1). Once the records are appended to the topic log, they can’t be deleted or amended. It’s a simple data structure where each message is byte encoded. Producers and consumers can serialize and deserialize data to various formats. Messages are sent to the Kafka cluster by producers, and these messages are stored by the broker in topics. Consumers subscribe to topics and poll the topic for messages to read. The broker nodes are dumb; it’s the producers and consumers that are doing the smart work (see Figure 12.2). Topics can grow in size, and they can be split into partitions (see Figure 12.3).
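    As a companion to this excerpt's description of consumers that subscribe to topics and poll for messages, here is a minimal consumer sketch using the Java Kafka client; the broker address, group id, and topic name are hypothetical:

        import org.apache.kafka.clients.consumer.ConsumerRecord;
        import org.apache.kafka.clients.consumer.ConsumerRecords;
        import org.apache.kafka.clients.consumer.KafkaConsumer;
        import org.apache.kafka.common.serialization.StringDeserializer;
        import java.time.Duration;
        import java.util.Collections;
        import java.util.Properties;

        public class TopicPollingConsumer {
            public static void main(String[] args) {
                Properties props = new Properties();
                props.put("bootstrap.servers", "localhost:9092");    // assumed broker address
                props.put("group.id", "example-consumer-group");     // hypothetical consumer group
                props.put("key.deserializer", StringDeserializer.class.getName());
                props.put("value.deserializer", StringDeserializer.class.getName());

                try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                    consumer.subscribe(Collections.singletonList("example-topic"));
                    while (true) {
                        // Poll the broker for new records appended to the topic log.
                        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                        for (ConsumerRecord<String, String> record : records) {
                            System.out.printf("offset=%d key=%s value=%s%n",
                                record.offset(), record.key(), record.value());
                        }
                    }
                }
            }
        }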
  • Book cover image for: Microservices with Go
    In this section, we are going to introduce you to Apache Kafka, a popular message broker system that we are going to use to establish asynchronous communication between our microservices. You will learn the basics of Kafka, how to publish messages to it, and how to consume such messages from the microservices we created in the previous chapters.

    Apache Kafka basics

    Apache Kafka is an open source message broker system that provides the ability to publish and subscribe to messages containing arbitrary data. Originally developed at LinkedIn, Kafka has become perhaps the most popular open source message broker software and is used by thousands of companies around the world.
    In the Kafka model, a component that publishes messages is called a producer. Messages are published in sequential order to objects called topics. Each message in a topic has a unique numerical offset. Kafka provides APIs for consuming messages from the existing topics (the component that consumes messages is called a consumer). Topics can also be partitioned to allow multiple consumers to consume from them (for example, for parallel data processing).
    We can illustrate the Kafka data model in the following diagram:
    Figure 6.2 – The Apache Kafka data model
    Having such a seemingly simple data model, Kafka is a powerful system that offers lots of benefits to its users:
    • High write and read throughput : Kafka is optimized for highly performant write and read operations. It achieves this by doing as many sequential writes and reads as possible, allowing it to efficiently make use of hardware such as hard disk drives, as well as sequentially sending large amounts of data over the network.
    • Scalability : Developers can leverage topic partitioning provided by Kafka to achieve more performant parallel processing of their data.
    • Flexible durability
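    To illustrate the topic, partition, and offset model described in this excerpt, here is a minimal sketch (in Java, although the book itself uses Go) that assigns a single partition of a hypothetical topic and reads from an explicit offset; the broker address, topic name, and starting offset are illustrative assumptions:

        import org.apache.kafka.clients.consumer.ConsumerRecord;
        import org.apache.kafka.clients.consumer.KafkaConsumer;
        import org.apache.kafka.common.TopicPartition;
        import org.apache.kafka.common.serialization.StringDeserializer;
        import java.time.Duration;
        import java.util.Collections;
        import java.util.Properties;

        public class OffsetAwareReader {
            public static void main(String[] args) {
                Properties props = new Properties();
                props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
                props.put("key.deserializer", StringDeserializer.class.getName());
                props.put("value.deserializer", StringDeserializer.class.getName());

                try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                    // Read partition 0 of a hypothetical topic directly, starting from offset 10.
                    TopicPartition partition = new TopicPartition("example-topic", 0);
                    consumer.assign(Collections.singletonList(partition));
                    consumer.seek(partition, 10L);
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                    }
                }
            }
        }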
  • Book cover image for: Apache Kafka 1.0 Cookbook
    • Raúl Estrada, Sandeep Khurana, Brian Gatt, Alexey Zinoviev (Authors)
    • 2017 (Publication Date)
    • Packt Publishing (Publisher)
    message broker. Kafka is a software solution for routing messages among consumers quickly.
  • The message broker has two directives: the first is to not block the producers, and the second is to isolate producers and consumers (the producers should not know who their consumers are).
  • Kafka is two things: a real-time, publish-subscribe solution, and a messaging system. Moreover, it is a solution: open source, distributed, partitioned, replicated, commit-log based, with a publish-subscribe schema.
  • Before we begin it is relevant to mention some concepts and nomenclature in Kafka:
    • Broker : A server process
    • Cluster : A set of brokers
    • Topic : A queue (that has log partitions )
    • Offset : A message identifier
    • Partition : An ordered and immutable sequence of records continually appended to a structured commit log
    • Producer : Those who publish data to topics
    • Consumer : Those who process the feed
    • ZooKeeper : The coordinator
    • Retention period : The time to keep messages available for consumption
    In Kafka, there are three types of clusters:
    • Single node: single broker
    • Single node: multiple brokers
    • Multiple nodes: multiple brokers
    There are three ways to deliver messages (illustrated in the producer configuration sketch after the compaction list below):
    • Never redelivered : The messages may be lost
    • May be redelivered : The messages are never lost
    • Delivered once : The message is delivered exactly once
    There are two types of log compaction:
    • Coarse grained : By time
    • Finer grained : By message
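    The following sketch (in Java, using configuration keys from the standard Kafka producer client) maps the delivery and compaction terms above onto concrete settings; the broker address and the specific values chosen are illustrative assumptions:

        import java.util.Properties;

        public class DeliverySemanticsConfig {
            public static Properties producerProps() {
                Properties props = new Properties();
                props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
                // "Never redelivered" (at most once): acks=0 and no retries -- messages may be lost.
                // "May be redelivered" (at least once): acks=all with retries -- nothing is lost,
                // but duplicates are possible without idempotence.
                props.put("acks", "all");
                props.put("retries", "3");
                // "Delivered once": enabling idempotence removes duplicates caused by producer retries.
                props.put("enable.idempotence", "true");
                return props;
            }
            // Log compaction is a topic-level setting: cleanup.policy=compact keeps the latest
            // record per key (finer grained, by message), while retention.ms expires records
            // by age (coarse grained, by time).
        }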
    The next six recipes contain the necessary steps to make a full Kafka test from zero.

    Installing Kafka

    This is the first step. This recipe shows how to install Apache Kafka.

    Getting ready

    Ensure that you have at least 4 GB of RAM on your machine; the installation directory will be /usr/local/kafka/ for Mac users and /opt/kafka/ for Linux users. Create these directories.

    How to do it...

    Go to the Apache Kafka home page at http://kafka.apache.org/downloads, as in Figure 1-1, Apache Kafka download page:
    Figure 1-1. Apache Kafka download page
  • Book cover image for: Hands-On Software Architecture with Golang

    Hands-On Software Architecture with Golang

    Design and architect highly scalable and robust applications using Go

    Apache Kafka is a streaming-messaging platform that was first built at LinkedIn but is now a first-class Apache project. It offers seamless durable distribution of messages over a cluster of brokers, and the distribution can scale with load. It is increasingly used in place of traditional message brokers, such as AMQP, because of its higher throughput, simpler architecture, load-balancing semantics, and integration options.

    Concepts

    In Kafka, a topic is the formal name for a queue where messages are published to and consumed from. Topics in Kafka offer the virtual topic queuing model described previously: there are multiple logical subscribers, and each gets a copy of the message, but a logical subscriber can have multiple instances, and each instance of the subscriber will get a different message.
    A topic is modeled as a partitioned log, as shown in the figure here (source: http://kafka.apache.org/documentation.html#introduction).
    New messages are appended to a partition of a log. The log partition is an ordered, immutable list of messages. Each message in a topic partition is identified by an offset within the list. The partitions serve several purposes:
    • A log (topic) can scale beyond the size of a single machine (node). Individual partitions need to fit on a single machine, but the overall topic can be spread across several machines.
    • Topic partitions allow parallelism and scalability in consumers.
    Kafka only guarantees the order of messages within a topic partition, and not across different partitions for the same topic. This is a key point to remember when designing applications.
    Whenever a new message is produced, it is durably persisted on a set of broker instances designated for that topic partition—called In-Sync Replicas (ISRs
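    One way to see partitions, replication, and in-sync replicas together is to create a topic programmatically. The following sketch uses the Java AdminClient; the topic name, partition count, replication factor, and broker address are illustrative assumptions rather than values from the book:

        import org.apache.kafka.clients.admin.AdminClient;
        import org.apache.kafka.clients.admin.NewTopic;
        import java.util.Collections;
        import java.util.Map;
        import java.util.Properties;

        public class CreatePartitionedTopic {
            public static void main(String[] args) throws Exception {
                Properties props = new Properties();
                props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
                try (AdminClient admin = AdminClient.create(props)) {
                    // A hypothetical topic spread over 3 partitions, each replicated to 2 brokers.
                    NewTopic topic = new NewTopic("orders", 3, (short) 2)
                        // With acks=all, a write must reach 2 in-sync replicas before it is acknowledged.
                        .configs(Map.of("min.insync.replicas", "2"));
                    admin.createTopics(Collections.singletonList(topic)).all().get();
                }
            }
        }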
  • Book cover image for: Implementing Event-Driven Microservices Architecture in .NET 7

    Implementing Event-Driven Microservices Architecture in .NET 7

    Develop event-based distributed apps that can scale with ever-changing business demands using C# 11 and .NET 7

    • Joshua Garverick, Omar Dean McIver (Authors)
    • 2023 (Publication Date)
    • Packt Publishing (Publisher)
    streams built in. In essence, streams are exactly what they sound like—a stream of information. Kafka provides a stream API that allows you to transform records being written into input topics and place those transformed records into output topics. A stream can be programmatically created based on a specific topic.
    On the other hand, Tables are constructs that use data made available by streams and present that data in a specific and intentional way. Using common techniques such as mapreduce alongside Kafka-centric operations, you can create transformations that enrich or compact records with the intention of using another stream to write those records to output topics.
    While these constructs are important and valuable as utilities within Kafka, you might find their use within domains as an augmentation to event handlers. Kafka does offer a connect API that allows you to set up connections with a variety of different destination systems and data stores. If the plan is to only consider and use Kafka as a messaging platform, using streams and tables along with the connect API could make a lot of sense. For our example application, while Kafka is what we are using, there are abstractions in place that allow us to use different platforms for messaging and persistence, leaving the logic to the event handlers within the domain.
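    The excerpt above describes streams and tables conceptually. As a hedged illustration, the following sketch uses the Java Kafka Streams API (the book's own .NET stack will differ) to read an input topic as a stream, aggregate it into a table of counts per key, and write the table's changelog to an output topic; all topic names and the application id are hypothetical:

        import org.apache.kafka.common.serialization.Serdes;
        import org.apache.kafka.streams.KafkaStreams;
        import org.apache.kafka.streams.StreamsBuilder;
        import org.apache.kafka.streams.StreamsConfig;
        import org.apache.kafka.streams.kstream.KStream;
        import org.apache.kafka.streams.kstream.KTable;
        import org.apache.kafka.streams.kstream.Produced;
        import java.util.Properties;

        public class StreamToTableExample {
            public static void main(String[] args) {
                Properties props = new Properties();
                props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stream-table-example"); // hypothetical app id
                props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // assumed brokers
                props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
                props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

                StreamsBuilder builder = new StreamsBuilder();
                // Read events from an input topic, aggregate them into a table of counts per key,
                // and stream the table's changelog back out to an output topic.
                KStream<String, String> events = builder.stream("input-events");
                KTable<String, Long> countsPerKey = events.groupByKey().count();
                countsPerKey.toStream().to("event-counts",
                    Produced.with(Serdes.String(), Serdes.Long()));

                KafkaStreams streams = new KafkaStreams(builder.build(), props);
                streams.start();
                Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
            }
        }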

    Aggregate storage

    In the domain context, we have already identified several different possible aggregates that would need to be tracked. A common method for tracking changes to an aggregate when using a platform such as Kafka would be to have a topic for each aggregate. Events related to each aggregate can be handled by specific event handlers. Those handlers can then update the data stores as needed. Figure 2.2
  • Book cover image for: Scalable Data Architecture with Java
    https://github.com/PacktPublishing/Scalable-Data-Architecture-with-Java/tree/main/Chapter06/SQL.
    Reference notes
    If you are new to Kafka, I recommend learning the basics by reading the official Kafka documentation: https://kafka.apache.org/documentation/#gettingStarted. Alternatively, you can refer to the book Kafka: The Definitive Guide, by Neha Narkhede, Gwen Shapira, and Todd Palino.
    In this section, we set up the Kafka streaming platform and the credit record database. In the next section, we will learn how to implement the Kafka streaming application to process the application event that reaches landingTopic1 in real time.

    Developing the Kafka streaming application

    Before we implement the solution, let’s explore and understand a few basic concepts about Kafka Streams. Kafka Streams provides a client library for processing and analyzing data on the fly and sending the processed result into a sink (preferably an output topic).
    A stream is an abstraction that represents unbounded, continuously updating data in Kafka Streams. A stream processing application is a program written using the Kafka Streams library to process data that is present in the stream. It defines processing logic using a topology. A Kafka Streams topology is a graph that consists of stream processors as nodes and streams as edges. The following diagram shows an example topology for Kafka Streams:
    Figure 6.2 – Sample Kafka Streams topology
    As you can see, a topology consists of Stream Processors
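    To make the idea of a topology as a graph of stream processors (nodes) and streams (edges) concrete, here is a minimal Java Kafka Streams sketch. It reads from landingTopic1, as mentioned in the excerpt, while the output topic name and the filter/map logic are illustrative assumptions; printing topology.describe() shows the resulting processor graph:

        import org.apache.kafka.streams.StreamsBuilder;
        import org.apache.kafka.streams.Topology;
        import org.apache.kafka.streams.kstream.KStream;

        public class TopologySketch {
            public static void main(String[] args) {
                StreamsBuilder builder = new StreamsBuilder();
                // Source node: read application events from the landing topic.
                KStream<String, String> applications = builder.stream("landingTopic1");
                // Processor nodes: filter and transform the records on the fly.
                applications
                    .filter((key, value) -> value != null && !value.isEmpty())
                    .mapValues(value -> value.toUpperCase())
                    // Sink node: write the processed result to a hypothetical output topic.
                    .to("processedTopic1");

                Topology topology = builder.build();
                // Print the graph of stream processors (nodes) and streams (edges).
                System.out.println(topology.describe());
            }
        }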
  • Book cover image for: Building Enterprise IoT Applications
    • Chandrasekar Vuppalapati (Author)
    • 2019 (Publication Date)
    • CRC Press (Publisher)

    Middleware

    This Chapter Covers:
    • Apache Kafka
    • Middleware
    • Apache Spark
    • Installation of Kafka, Spark, and Zookeeper
    Real-time processing deals with streams of data that are captured in real time and processed with minimal latency. The processing of stream data is split by distance from the source: closer to the source at the edge level, and at traditional cloud or high-compute levels. Nevertheless, both processing architectures need a message ingestion store to act as a buffer for messages and to support scale-out processing, reliable delivery, and other message queuing semantics.
    For IoT events, IoT data sources are ingested into cloud systems for processing at the cloud level. The data is ingested either into topic-based middleware, such as Kafka, or inserted into a database for processing (see Figure 1).
    In this chapter, we will go through the creation of a simple Kafka-based message system that ingests data, which is then processed by Spark or Scala-based systems.
    Figure 1: Data Ingestion

    Message Architectures

    Message event architectures use central messaging stream hubs, such as the IoT Hub, to ingest data in real time. The events ingested into the IoT Hub are relayed to stream servers such as Kafka (see Figure 2) to be queued in the event architecture [1].
    [1] Azure Reference Architecture - https://azure.microsoft.com/en-us/services/hdinsight/apache-kafka/
    Figure 2: Message Event Architectures
    The data from the Kafka topics is consumed by backend processors such as Spark, which route it to an analytics engine and thereafter to a database or to insights delivery for the client.
    Streaming Patterns
    Streaming patterns rely on high-throughput ingestion and complex event processors. Events are ingested into stream processors at a high frequency, and the events are processed through complex event processors with very low latency. The subsequent operations result in event processing and analytics (see Figure 3
  • Book cover image for: Software Mistakes and Tradeoffs

    Software Mistakes and Tradeoffs

    How to make good programming decisions

    • Tomasz Lelek, Jon Skeet (Authors)
    • 2022 (Publication Date)
    • Manning (Publisher)
    circuit breaker. Therefore, our architecture will still be operational. Before we start to understand delivery semantics in such an event-driven architecture, let’s start by understanding the basics of Apache Kafka.

    11.2 Producer and consumer applications based on Apache Kafka

    Before we start analyzing delivery guarantees from the consumer and producer side, let’s understand Apache Kafka architecture’s basics. The main construct used by both producer and consumer sides is a topic. The topic is a distributed, append-only data structure. The distribution is achieved via the topic’s partitioning. A topic can be split into N partitions; the more partitions it has, the more distributed processing it will have. Let’s assume we have a topic with topicName and four partitions (figure 11.5). Partitions are numbered from 0 upward.
    Figure 11.5 Topic structure as a distributed, append-only log
    Each partition has its own offsets, and an offset identifies precisely one record in the append-only structure. When a producer sends a new record to a topic, it first calculates the partition to which the record should be routed. Each record consists of a key-value pair.
    The key determines the partitioning for a given record. It can, for example, contain only the user_id. When partitioned by user_id, Kafka guarantees that all events for a single user are sent to the same partition. Because of that, the ordering of events for a specific user_id will be kept. In real pub-sub systems, we can have a lot of topics. One topic can have account data, another can have information about payment, and so forth.
    When the producer writes its message, it appends it to the end of the given partition. For example, if the partitioning algorithm determines that the event should be sent to partition 0, it will be appended to the end of this log. The offset of the new record will be equal to 13. It’s worth noting that we may end up in a situation where one partition processes too much data in the case of partition skew
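    A short sketch of key-based partitioning, using the Java Kafka client: two records that share the same key are routed to the same partition, so their relative order is preserved. The topic name topicName comes from the excerpt; the key, payloads, and broker address are illustrative assumptions:

        import org.apache.kafka.clients.producer.KafkaProducer;
        import org.apache.kafka.clients.producer.ProducerRecord;
        import org.apache.kafka.common.serialization.StringSerializer;
        import java.util.Properties;

        public class KeyedPartitioningExample {
            public static void main(String[] args) {
                Properties props = new Properties();
                props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
                props.put("key.serializer", StringSerializer.class.getName());
                props.put("value.serializer", StringSerializer.class.getName());

                try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                    // Two events for the same user id: the default partitioner hashes the key,
                    // so both land on the same partition and keep their relative order.
                    for (String event : new String[] {"login", "add_to_cart"}) {
                        ProducerRecord<String, String> record =
                            new ProducerRecord<>("topicName", "user-123", event);
                        producer.send(record, (metadata, exception) -> {
                            if (exception == null) {
                                System.out.printf("key=user-123 partition=%d offset=%d%n",
                                    metadata.partition(), metadata.offset());
                            }
                        });
                    }
                    producer.flush();
                }
            }
        }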
  • Index pages curate the most relevant extracts from our library of academic textbooks. They’ve been created using an in-house natural language model (NLM), each adding context and meaning to key research topics.