Apache Kafka® is the leading distributed streaming and queuing technology for large-scale, always-on applications. Kafka has built-in features of horizontal scalability, high-throughput and low-latency. It is highly reliable, has high-availability, and allows geographically distributed data streams and stream processing applications.
Donated to the Apache Foundation by LinkedIn in 2011, Kafka has garnered lot of interest and is now being broadly used by many organisations across the globe including Netflix, Twitter, Spotify and Uber. Kafka has grown to a strong and vibrant open community and it is compatible with a wide range of complementary technology.
Kafka has a similar shared-nothing, replicated architecture to Cassandra that allows it to operate with similar extreme levels of scalability, reliability and availability. In any big- data application, Kafka really has three core functions:
Enabling transportation of data between various publisher and subscriber endpoints.
Aggregating a number of various data streams for use by distributed processing applications.
Storing data streams as a cache in a replicated, fault-tolerant storage environment.
Use of Messaging Queuing Technology
Messaging queuing technology has been in use for some time now – enabling different applications and endpoints to communicate with each other. These applications can be on the same or different physical platforms. With queuing technology, the producer and consumer do not need to interact with the message queue at the same time. Messages placed onto the queue are stored until consumer retrieves them.
Kafka is part of general family of technologies that provide queuing, messaging and/or streaming – sometimes known as a streaming engine. Fundamentally it is a system that takes streams of messages from applications known as producers, stores them reliably on a central cluster (known as the brokers) and allows those streams to be received by applications that process the messages (applications called consumers). Other examples of this broad technology family include technologies such as RabbitMQ, IBM MQ, and Microsoft Message Queue.
A message queuing technology is deployed in an application stack for the following reasons:
Messaging buffering. To provide a buffering mechanism in front of a processing (i.e. to deal with temporary incoming message rates that are greater than what processing app can deal with)
Guarantee of message delivery. Allows producers to publish messages with some assurance that the message will eventually be delivered, if if the consuming application is unable to receive the message when it is published.
Providing abstraction. This type of technology provides an architectural separation between the consumers of messages and the applications that are producing the messages.
Enabling scale. Provides a flexible and highly configurable architecture that enables many producers delivering messages to multiple consumers.
Apache Kafka Advantages
Most Java Message Service (JMS) brokers either don’t persist messages at all (i.e. non-persistent) or only store them until they are consumed and acknowledged (persistent). When comparing the traditional system v/s Kafka, Kafka has a completely different model where it stores all messages before and even after they are successfully received by subscribing applications.
Kafka comes with the following advantages:
Highly reliable due to built-in replication, supporting true always-on operation
Enables stream processing applications to utilise geographically distributed data streams
Can handle high-velocity and high-volume data
Achieves high throughput and high availability using a distributed cluster of servers, therefore supporting message throughput of thousands of messages per second
Low latency for most of the new age applications, using load balancing and data replication to allow failure or planned maintenance of individual nodes with no downtime
Distributed technology and JVM codebase similar to Cassandra
Built-in optimization such as compression and message batching
Powers some of the biggest organisations across the globe.
Some Key Kafka Concepts
Producer: Producers are applications that publish a stream of records to one or more Kafka topics.
Consumer: Consumers are the applications that read data from Kafka topics.
Kafka Brokers:Brokers are the Kafka “servers”. They store data provided by the producer and make it available to the consumer. Kafka brokers require Apache Zookeeper deployment to store configuration data, topic offsets, consumer groups, and other information. Kafka replicates its logs over multiple servers for fault-tolerance. Each Kafka Broker has a unique ID (number). Kafka Brokers contain topic log partitions.
Kafka Topic and Partition: Topic is a stream of data, and is composed of individual records, basically just a sharded write-ahead log. Producer append records to these logs and consumer subscribe to changes. Kafka topics are divided into a number of partitions that allows one to parallelize a topic by splitting the data in a particular topic across multiple brokers. Each partition can be placed on a separate machine to allow for multiple consumers to read from a topic in parallel and also allowing a topic to hold more data that can fit on any one machine.
Kafka Cluster: Apache Kafka is made up of a number of brokers that run on individual servers coordinated Apache Zookeeper. You can start by creating a single broker and add more as you scale your data collection architecture. A Kafka cluster can have, 10, 100, or 1,000 brokers in a cluster, if needed. Apache Kafka uses Apache Zookeeper to maintain and coordinate the Apache Kafka brokers.
Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It is an API and ecosystem of 3rd party connectors that enables Kafka to be easily integrated with other systems without developers having to write any extra code.
Paul Brebner, Technology Evangelist at Instaclustr wrote a series of blogs around Apache Kafka. Paul’s blog around Apache Kafka Connect Architecture covers Source and Sink Connectors, Connectors, Plugins, Task and Worker, Clusters and Convertors.Paul wrote a series of blogs on Instaclustr Kongo IoT Application that looks at implementing Kafka to build an IoT logistic applications.
Two of his blog in the series are devoted to understanding Kafka Connect API. The first blog focuses on use cases extending the Kongo IoT application to stream events from Kafka to Apache Cassandra using a Kafka Connect Cassandra Sink. Part 2 of the blog, covers useful Kafka Connect resources. Interesting read for an in-depth perspective on Kafka Connect.
In the most simple of ways, producers send records to the Kafka clusters which stores these records and then passes them in a controlled manner to consuming applications.
Kafka’s main architectural components include Producers, Topics, Consumers, Consumer Groups, Clusters, Brokers, Partitions, Replicas, Leaders, and Followers. A simplified UML diagram describes the ways these components relate to one another and helps you develop an understanding of Kafka Architecture.
Important things to note:
Kafka clusters can have one or more brokers.
Brokers can host multiple replicas.
Topics can have one or more partitions.
A broker can host zero or one replica per partition.
A partition has one leader replica and zero or more follower replicas.
Each replica for a partition needs to be on a separate broker.
Every partition replica needs to fit on a broker, and a partition can’t be divided over multiple brokers.
Every broker can have one or more leaders, covering different partitions and topics.
Kafka enables developers to meet important requirements for applications such as:
Safely dealing with spikes in workloads by temporarily storing data until it’s ready for processing;
Allowing data from one source to be processed by many downstream applications;
Allowing data from many sources to be gathered in a single sport for processing by downstream applications; and
Facilitating stream processing where calculations can be performed on the streams of data such as calculating average values over a time window (as well as much more complex things).
The part 2 and part 3 of Instaclustr Kongo IoT applications blog re-engineers the Kongo application to introduce Apache Kafka. The series explores designing and developing and example IoT application with Apache Kafka to illustrate the typical design and implementation considerations and patterns.
Kafka Streams is a framework for stream data processing. It is a client library for processing and analysing data store in Kafka. Kafka Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, transforming the input streams to output streams. It is the easiest way to write mission-critical real-time applications and microservices that power your core business.
Kafka Streams are highly scalable, distributed, fault-tolerant, elastic applications. With Kafka Streams, we can process the stream data within Kafka. No separate cluster is required just for processing.
Kafka can be used for the variety of use cases such as generating matrix, log aggregation, messaging, audit trail, stream processing, website activity tracking, monitoring and more.
For example in the Event sourcing, you consider the sequence of changes made (as opposed to the result of those changes) to be the source of truth for the state of your application. Kafka is really useful for this use case, as it is designed for reliably storing a series of events and can provide an ideal data store for this purpose. Kafka’s support for compaction can also assist in this case.
Apache Kafka is the leading stream processing engine for scale and reliability, Apache Cassandra , a well-known database for powering the most scalable, reliable architectures available and Apache Spark is the state-of-the-art advanced and scalable analytics engine.
Deployed together, these three technologies give developers the building blocks needed to build reliable, scalable and intelligent applications that adapt based on the data they collect.
Our Pick’n’Mix blog is an interesting read if you are interested to gain understanding on Instaclustr managed solutions that can be used together.
How Real-time Streaming Data Helps
Kafka supports write and read scalability at the same time, which essentially means developers can stream enormous amounts of data to Kafka and carry out a real-time processing of the messages, including sending messages to other systems, for multiple different purposes concurrently. The use is only limited by your imagination.
Kafka has evolved from a messaging queue to a versatile streaming platform. If you are using Kafka, you would need to be aware of a few rules that would help you manage Kafka pipelines.
Download our white paper on 10 golden rules for managing Kafka to understand various components and gain in-depth knowledge on rules for log, hardware requirements, Zookeeper, replication and redundancy and more
Deploying Kafka on AWS & Other Providers
Instaclustr offers Managed Apache Kafka on AWS, Microsoft Azure, Google Cloud Platform and IBM cloud . You can choose between different developer node sizes and professional node sizes.
When deciding on deploying Apache Kafka as a crucial part of your technology stack you have to consider capability management requirements. You need to decide if the right decision is to build the necessary in-house expertise, or should you outsource the responsibility to a managed service provider (MSP) who can deliver Kafka-as-a-service.
Our consulting package for Apache Kafka is designed to provide an independent review of your operational Apache Kafka deployment. The package is designed to help customers who already have deployed and have operational application and are looking for assurance that the solution is scalable and is free from any potential issues.