Apache Kafka® and Apache Pulsar™ are 2 popular message broker software options. Although they share certain similarities, there are big differences between them that impact their suitability for various projects.
In this comparison guide, we will explore the functionality of Kafka and Pulsar and explain how they differ, who would use each, and why.
What Is Message Broker Software?
Message broker software is used to enable communication and information exchange between apps, systems, and services. It works by translating messages so one app can easily communicate with another third-party app. This is achieved using various messaging protocols.
This type of intermediary software module is known as messaging “middleware”, allowing developers to manage the flow of data between a piece of software’s components while they focus on the core functionality. Message brokers are effectively distributed communication layers that let applications communicate on an internal level, even if they are hosted on different platforms or written in completely different languages.
As well as delivering messages, the software can also store, validate, and route them to a specific destination. Messages are usually ordered in a message queue, only being sent when the sender and recipient can process them, minimizing any impact on performance and ensuring data is not lost.
Message Broker Models
In the publish/subscribe (pub/sub) distribution pattern, each message is published to a topic. Users can subscribe to the topics they wish to follow, making them eligible to receive any messages published to those topics.
This model is effectively a broadcast system that allows a single message to be sent to many different subscribers, such as when an airline company informs customers of delays or flight changes.
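To make the broadcast idea concrete, here is a minimal in-memory sketch of the pub/sub pattern. It is not how Kafka or Pulsar are implemented; the `InMemoryBroker` class, the `flight-updates` topic name, and the inbox lists are all hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class InMemoryBroker:
    """Toy pub/sub broker: each topic maps to a list of subscriber callbacks."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        """Register a callback that is invoked for every message on the topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: str) -> int:
        """Deliver the message to every subscriber and return the delivery count."""
        callbacks = self._subscribers[topic]
        for callback in callbacks:
            callback(message)
        return len(callbacks)


# Two customers subscribe to the same topic; one published message reaches both.
alice_inbox: List[str] = []
bob_inbox: List[str] = []
broker = InMemoryBroker()
broker.subscribe("flight-updates", alice_inbox.append)
broker.subscribe("flight-updates", bob_inbox.append)
delivered = broker.publish("flight-updates", "Flight 101 delayed by 30 minutes")
```

A real broker would also persist messages and deliver them over the network, which this sketch deliberately leaves out.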
Kafka vs. Pulsar: An Overview
Apache Kafka is a highly scalable, open source message broker software designed to analyze, read, broadcast, and store data.
Kafka’s process is very simple: it receives data from one or multiple “producer” components (sender) before sending the analyzed data to consumer groups (recipient). One or more producers can send messages to Kafka, and a consumer group can contain multiple consumers.
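One way to picture a consumer group is as a set of consumers that split a topic's partitions between them, so each partition is read by exactly one member of the group. The sketch below uses a simple round-robin split; it is an illustrative assumption rather than Kafka's actual assignment logic, and `assign_partitions` and the consumer names are hypothetical.

```python
from typing import Dict, List


def assign_partitions(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Spread partitions across the consumers of one group, round-robin."""
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    assignment: Dict[str, List[int]] = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Five partitions shared by a group of two consumers.
assignment = assign_partitions([0, 1, 2, 3, 4], ["consumer-a", "consumer-b"])
# assignment == {"consumer-a": [0, 2, 4], "consumer-b": [1, 3]}
```

Adding a consumer to the group spreads the same partitions over more readers, which is how a group scales its consumption.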
The Benefits of Apache Kafka
Apache Kafka has a range of key benefits that have established it as one of the world’s most popular message brokers:
- Messages can be broadcast across multiple servers, creating significant scope in terms of scalability
- Consistent performance levels and a reliable level of service and flexibility due to its pub/sub model
- Simple to set up and relatively easy to learn
- Messages can be delivered via multiple servers with impressively low latency
- Kafka is durable, with potential faults minimized by 2 main features:
  - Server failure protection, which uses a fault-tolerant array to distribute streaming storage
  - Data replication within the cluster, ensuring that messages are written to disk and can be re-sent should a failure occur
- Real-time responses
Apache Pulsar is gaining popularity for its high performance and versatile features. Pulsar combines a multi-tenant solution that allows messages to be sent between servers with a queuing system that uses the publisher/subscriber model. It brings together many features of Kafka and other messaging systems like RabbitMQ.
Apache Pulsar can be scaled horizontally to help meet demand. It can expand by hundreds of nodes to support more messages, more topics, and additional storage.
It is also a cloud-native platform, making it an attractive proposition for organizations that are powered by the cloud and want to avoid physical infrastructure. In addition, Pulsar has a lower latency than Kafka and is less affected by peaks in load.
The Benefits of Apache Pulsar
- Uses IO connectors to ensure easy connections
- Enables messaging, streaming, and queuing functionality on one platform
- Compatible with cloud and Kubernetes architecture, offering greater levels of security
- Easily scalable, upwards and downwards
- It can support large numbers of users with its multi-tenant system
- Pulsar can be used to update legacy applications
Comparison of Kafka vs. Pulsar
Kafka uses a distributed commit log as its storage layer, and any new writes are added to the end of the log. The reads start from an offset and are sequential, with all data zero-copied from the disk buffer to the network buffer. This makes it effective as an event streaming solution.
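The append-and-read-by-offset behavior can be sketched in a few lines. This is a simplified in-memory model, not Kafka's on-disk format, and it does not attempt to show zero-copy transfer; the `CommitLog` class and its method names are hypothetical.

```python
from typing import List, Tuple


class CommitLog:
    """Toy append-only log: records are only ever added at the end."""

    def __init__(self) -> None:
        self._records: List[bytes] = []

    def append(self, record: bytes) -> int:
        """Add a record to the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int) -> List[Tuple[int, bytes]]:
        """Read sequentially from the given offset, returning (offset, record) pairs."""
        if offset < 0:
            raise ValueError("offset must be non-negative")
        end = min(offset + max_records, len(self._records))
        return [(i, self._records[i]) for i in range(offset, end)]


log = CommitLog()
offsets = [log.append(payload) for payload in (b"a", b"b", b"c")]
batch = log.read(offset=1, max_records=10)
# offsets == [0, 1, 2]; batch == [(1, b"b"), (2, b"c")]
```

Because a consumer only needs to remember the next offset it wants, many consumers can read the same log independently without changing it.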
Pulsar uses an index-based storage system that stores data in a hierarchical, tree-like structure to provide quick access to individual messages. Although this structure enables fast individual reads, it adds write overhead that hurts latency and throughput when compared to a log system.
Ease Of Use
The success of any platform is very reliant on its usability and the level of support offered. On this front, there is a clear winner: Kafka.
Kafka is the less complex of the 2, using cluster-based technology with a medium-weight architecture made up of 2 key components: ZooKeeper servers and Kafka’s own servers (brokers).
Although ZooKeeper adds some complexity to the platform, Kafka’s active community of developers has been working on ways to remove the component. We will talk more about community support in the next subsection.
Kafka also features 2 mature Kubernetes operators (an open source operator and a commercial one) to help simplify cluster management.
Pulsar, as previously mentioned, is more complex and built on a heavy-weight architecture that requires 4 components to be configured and managed. These components include the Pulsar servers, Apache BookKeeper™, Apache ZooKeeper™, and the RocksDB database.
Documentation and Support
Kafka’s documentation consists of over half a million words, different textbooks, numerous text and video tutorials, demos, and podcasts. Tens of thousands of Kafka-related questions have been asked on Stack Overflow alone, while there are also multiple online courses available from providers such as Udemy.
In comparison, there are noticeably fewer resources found online for Pulsar, including documentation and guidance.
Kafka has a very active community of developers with impressive levels of support provided by numerous Slack channels. Many areas of the platform are covered online, helping new users understand Kafka quickly.
The Pulsar community is much smaller in comparison, although it is getting stronger. Unfortunately, community-led support is still much harder to find. Only around 140 Pulsar-related questions can currently be found on Stack Overflow, and Pulsar has a Slack community size of 2,300+ members, compared to more than 23,000 for Kafka.
Managed Cloud Offerings
Instaclustr offers a 100% open source managed Kafka platform, with the added benefit of flexible hosting, either on-prem or in your cloud of choice. The 3 major cloud providers, Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP), also offer managed enterprise versions of Kafka. Pulsar’s most popular cloud service is provided by Clever Cloud.
Throughput, Latency, Scalability, and Availability
Kafka offers high network speeds and can process trillions of messages every day. A large Kafka deployment can be made up of hundreds of servers, as has been demonstrated by global players such as Netflix and LinkedIn.
Pulsar has excellent scalability but is somewhat hindered by its storage architecture, which is much more complex than Kafka’s.
Kafka and Pulsar both use fragmented partitions to provide high availability across all machines and zones.
Kafka features reliable data storage and works similarly to a traditional database, so information can be stored indefinitely. Data is retained based on how the user configures each topic.
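In Kafka, per-topic retention is controlled through topic-level settings such as `retention.ms`, where a value of -1 means no time limit. The sketch below models time-based retention only. It is a simplification: Kafka removes whole log segments rather than individual records, and the `apply_retention` function and `Record` alias are hypothetical.

```python
from typing import List, Tuple

Record = Tuple[int, str]  # (timestamp in milliseconds, payload)


def apply_retention(records: List[Record], now_ms: int, retention_ms: int) -> List[Record]:
    """Keep only records younger than the retention window.

    A negative retention_ms is treated as "keep everything", mirroring the
    meaning of retention.ms=-1.
    """
    if retention_ms < 0:
        return list(records)
    cutoff = now_ms - retention_ms
    return [record for record in records if record[0] >= cutoff]


records = [(1_000, "a"), (5_000, "b"), (9_000, "c")]
kept = apply_retention(records, now_ms=10_000, retention_ms=6_000)
kept_forever = apply_retention(records, now_ms=10_000, retention_ms=-1)
# kept == [(5_000, "b"), (9_000, "c")]; kept_forever keeps all three records
```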
Pulsar deletes messages as soon as they have been consumed by default, but the BookKeeper storage layer allows for the long-term storage of data. However, the ZooKeeper components’ use of metadata limits data retention on the platform.
Global Data Replication
Both platforms have global data replication features, whether that is between 2 data centers or 2 different geographical locations in the cloud. Kafka and Pulsar both replicate data in parallel, which allows for high throughput.
Built-In Stream Processing
Kafka comes with stream processing functionality, allowing developers to build flexible client applications with the Kafka Streams library. This allows for the support of valuable stream processing features such as aggregations, tables, state management, and more.
Pulsar only offers basic stream processing functions, allowing for simple callbacks, which means the platform cannot be deemed as having full stream processing capabilities.
Both platforms can offer permanent data storage, which allows this data to be reprocessed for A/B testing, analytical modeling, debugging, and auditing purposes.
Pulsar supports exactly-once processing, which is when a message is read and then written to a secondary topic. Kafka offers high-throughput exactly-once, at-least-once, and at-most-once processing which increases the number of possible use cases of the platform.
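The difference between at-most-once and at-least-once comes down to whether a consumer records its progress before or after it handles a message. The simulation below is a sketch under that assumption; it does not model exactly-once processing, and the `consume` function and its parameters are hypothetical.

```python
from typing import List


def consume(messages: List[str], commit_before_processing: bool, crash_at: int) -> List[str]:
    """Simulate a consumer that crashes once, between its two steps, at index crash_at.

    After the crash, the consumer restarts from the last committed offset.
    """
    processed: List[str] = []
    committed = 0
    crashed = False
    while committed < len(messages):
        offset = committed
        crash_now = offset == crash_at and not crashed
        if commit_before_processing:
            committed = offset + 1  # commit first (at-most-once)
            if crash_now:
                crashed = True
                continue  # crash before processing: the message is lost
            processed.append(messages[offset])
        else:
            processed.append(messages[offset])  # process first (at-least-once)
            if crash_now:
                crashed = True
                continue  # crash before commit: the message is redelivered
            committed = offset + 1
    return processed


messages = ["m0", "m1", "m2"]
at_most_once = consume(messages, commit_before_processing=True, crash_at=1)
at_least_once = consume(messages, commit_before_processing=False, crash_at=1)
# at_most_once == ["m0", "m2"]; at_least_once == ["m0", "m1", "m1", "m2"]
```

At-most-once never duplicates but can drop a message, while at-least-once never drops but can repeat one, which is why exactly-once needs additional coordination on top of either approach.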
Topic (Log) Compaction
Kafka offers native topic compaction, reducing the log down to the latest version of messages which share the same key (this function is also available for all brokers). Apache Pulsar offers similar functionality that runs over the network and streams the data from the storage layer to the broker layer. This is a less seamless option, effectively creating a snapshot of the previously compacted topic while the original topic remains unaltered.
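The core idea of compaction, keeping only the newest message per key, can be sketched as follows. This is an illustration of the concept rather than either platform's implementation, and it ignores details such as delete markers; the `compact` function and the sample keys are hypothetical.

```python
from typing import Dict, List, Tuple

KeyedRecord = Tuple[str, str]  # (key, value)


def compact(log: List[KeyedRecord]) -> List[KeyedRecord]:
    """Keep only the latest record for each key, preserving log order."""
    latest_offset: Dict[str, int] = {}
    for offset, (key, _value) in enumerate(log):
        latest_offset[key] = offset  # later records overwrite earlier ones
    return [record for offset, record in enumerate(log) if latest_offset[record[0]] == offset]


log_records = [
    ("user-1", "a@example.com"),
    ("user-2", "b@example.com"),
    ("user-1", "c@example.com"),
]
compacted = compact(log_records)
# compacted == [("user-2", "b@example.com"), ("user-1", "c@example.com")]
```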
Mission Critical Use Cases
Kafka is used by considerably more organizations across the world than Pulsar, spanning an impressive range of industries. These include stock exchanges and banking, internet companies, healthcare, manufacturing, etc.
Kafka excels in event streaming, incorporating tools that enable the creation of streaming pipelines, features for processing events, ordered parallel message delivery, and more. Pulsar offers the majority of these features, with the exception of event streaming pipelines.
Kafka delivers messages based on the order of the messages within the partition. At the same time, Pulsar supports a message queuing API that delivers messages to competing consumers in a turn-based system.
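These two delivery styles can be contrasted with a short sketch. Hashing a key to choose a partition keeps all messages for that key in one ordered partition, while turn-based dispatch hands messages to competing consumers one after another. The CRC32 hash here is a stand-in chosen for illustration, not the hash Kafka uses, and `partition_for` and `round_robin` are hypothetical helpers.

```python
import zlib
from itertools import cycle
from typing import Dict, List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition so the same key always lands in the same place."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def round_robin(messages: List[str], consumers: List[str]) -> Dict[str, List[str]]:
    """Hand messages to competing consumers in turn, queue-style."""
    delivered: Dict[str, List[str]] = {consumer: [] for consumer in consumers}
    for message, consumer in zip(messages, cycle(consumers)):
        delivered[consumer].append(message)
    return delivered


# Same key, same partition: ordering for "order-42" is preserved within it.
same_partition = partition_for("order-42", 3) == partition_for("order-42", 3)

# Turn-based delivery spreads work evenly but gives no per-key ordering.
queue_delivery = round_robin(["m1", "m2", "m3"], ["c1", "c2"])
# queue_delivery == {"c1": ["m1", "m3"], "c2": ["m2"]}
```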
Kafka is very effective when it comes to server-side message routing thanks to the Kafka® Connect and Kafka Streams components. These 2 components enable message transformation and enrichment, as well as content-based routing.
Pulsar is more restricted in terms of routing capabilities but runs these functions within the broker instead of on a separate layer like Kafka.
Apache Kafka vs. Apache Pulsar: Concluding Thoughts
Kafka has been designed as a market-leading distributed log to enable highly scalable event streaming. In contrast, Pulsar is somewhat of a hybrid between a distributed log and a traditional messaging system like RabbitMQ. Pulsar adopts some of the key features of the Kafka platform, but its core function is to quickly send messages and delete them as soon as they have been consumed.
When it comes to event streaming, Kafka is far and away the stronger choice, excelling in terms of throughput, scalability, and message storage. In the future, Apache Pulsar could be an option for solutions that require both event streaming and message queuing, but right now, it struggles to rival Kafka in most areas.