What is Apache Kafka?
Apache Kafka is a distributed event streaming platform designed for building real-time data pipelines and streaming applications. It is horizontally scalable, fault-tolerant, and fast, which makes it a popular choice for handling real-time data feeds. Kafka’s architecture enables massive streams of records to be stored and processed across clusters of servers.
Apache Kafka was originally developed at LinkedIn and open-sourced in 2011. Since then, it has been adopted by leading technology companies for a wide range of data streaming requirements. It is designed to handle data streams from websites, applications, sensors, and other sources, enabling users to process and analyze data in real time.
In Kafka, producers send records to topics, and consumers read records from topics. Each topic is split into partitions, and each partition is an ordered, immutable sequence of records, which allows data to be distributed across multiple servers for parallel processing.
Editor’s note: Updated to reflect Apache Kafka version 4.
This is part of a series of articles about Apache Kafka.
Best practices for Apache Kafka deployment and configuration
Deploying Apache Kafka in a production environment requires careful planning and adherence to best practices. These practices are designed to maximize the performance, reliability, and durability of your Kafka deployment.
1. Use a single topic per application
While Kafka allows you to use multiple topics, this is not always the best approach. Using multiple topics can increase the complexity of your application and make it harder to manage and monitor.
This is because every topic in Kafka is divided into partitions. Every partition is an independent unit of storage and processing. The more topics you have, the more partitions you’ll need to manage. This can lead to increased resource usage and potential performance issues.
Therefore, it’s recommended to use a single topic for each application. This approach can significantly reduce the complexity of your application and make it easier to manage. It can also improve the performance of your application, as fewer partitions mean less resource usage and better throughput.
2. Set appropriate retention for your topics
Retention is the period for which Kafka keeps messages in a topic before they are deleted. By default, Kafka retains messages for seven days (the broker default log.retention.hours=168), but this can be adjusted per topic via retention.ms to match your requirements.
Setting the correct retention period is crucial for ensuring the availability of data and the performance of your application. If the retention period is too short, you may lose important data. On the other hand, if it’s too long, it can lead to increased storage usage and potential performance issues.
Therefore, it’s recommended to carefully consider your data consumption patterns and set the retention period accordingly. For example, if your consumers consume data in real-time, a shorter retention period may be appropriate. But if your consumers need to access historical data, a longer retention period may be required.
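For example, topic-level retention can be changed at runtime with Kafka’s AdminClient. The following is a minimal sketch, assuming a broker at localhost:9092 and a hypothetical topic named orders:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class SetRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
            // 86400000 ms = 1 day; choose a value that matches your consumption patterns
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "86400000"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(setRetention))).all().get();
        }
    }
}
```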
3. Use parallel processing
Parallel processing is a key feature of Kafka that allows you to process data simultaneously across multiple threads or processes. This can significantly improve the performance of your application, especially when dealing with large volumes of data.
To leverage parallel processing in Kafka, use consumer groups. A consumer group can contain multiple consumers, and Kafka assigns each partition of the subscribed topics to exactly one consumer in the group. This allows you to process data in parallel across multiple consumers, thereby improving throughput and reducing processing time.
However, it’s important to ensure that the number of consumers in a group does not exceed the number of partitions in a topic. If there are more consumers than partitions, some consumers will be idle and won’t receive any data. Therefore, it’s recommended to carefully plan your consumer groups and partitions to maximize parallel processing.
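The sketch below shows the consumer side of this pattern: run several copies of this process with the same group.id and Kafka spreads the topic’s partitions across them. The topic and group names are hypothetical.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ParallelConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "order-processors"); // all parallel instances share this group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Each instance only sees records from its assigned partitions
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```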
4. Set log configuration parameters to keep logs manageable
In Kafka, each topic partition is stored on disk as an append-only log made up of segment files. Managing these logs effectively is essential for maintaining the performance and reliability of your Kafka deployment.
Kafka provides several configuration parameters that you can adjust to manage your logs. For example, you can set the log retention period, the maximum size of a log segment, and the frequency of log cleanup. These settings can help you control the size of your logs and prevent them from becoming too large.
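As a hedged illustration, these settings might look like the following in a broker’s server.properties; the values shown are the common defaults, not recommendations:

```properties
# How long log segments are retained before deletion (default 168 hours = 7 days)
log.retention.hours=168
# Roll a new segment file once the active one reaches 1 GiB
log.segment.bytes=1073741824
# How often the broker checks for segments eligible for deletion
log.retention.check.interval.ms=300000
# Delete expired segments; use "compact" for changelog-style topics
log.cleanup.policy=delete
```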
5. Run Kafka in KRaft mode
Kafka’s KRaft (Kafka Raft metadata) mode replaces the legacy ZooKeeper dependency with an internal Raft-based consensus mechanism, simplifying cluster architecture and operations. In KRaft mode, a subset of Kafka nodes act as controllers that maintain the metadata quorum and coordinate cluster state, while other nodes act as brokers serving producers and consumers.
Because metadata management is built directly into Kafka, you no longer need to deploy and maintain a separate ZooKeeper ensemble. This not only reduces operational complexity but also improves scalability, metadata performance, and failover behavior compared with ZooKeeper-based deployments.
To deploy effectively in KRaft mode in production:
- Configure controller quorum: Ensure an odd number of controller nodes (e.g., 3 or 5) to form a Raft quorum that tolerates controller failures without interrupting cluster availability.
- Separate roles when needed: For larger clusters, consider dedicating nodes specifically as controllers, brokers, or combined roles. Dedicated roles isolate metadata coordination from data traffic and can improve reliability under load.
- Use the correct Kafka version: Run a Kafka version where ZooKeeper has been fully removed (e.g., Kafka 4.x), and verify that tooling and clients are compatible with KRaft.
- Tune listeners and networking: Configure separate listeners for controller traffic (CONTROLLER) and client/broker traffic (PLAINTEXT/SSL) to avoid contention and improve observability.
- Monitor controller health: Track KRaft-specific metrics related to Raft consensus and metadata replication alongside standard Kafka metrics to ensure cluster health and performance.
Adopting KRaft mode streamlines deployment and aligns Kafka with modern distributed systems best practices, making your cluster easier to scale, operate, and maintain.
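As a minimal sketch, a combined-role KRaft node in a three-controller cluster might be configured like this; the host names and ports are hypothetical, and each node’s storage must be formatted once with bin/kafka-storage.sh before first start:

```properties
# Hypothetical server.properties for a combined-role KRaft node (node 1 of 3)
process.roles=broker,controller
node.id=1
# Static controller quorum: an odd number of voters (here 3) tolerates one controller failure
controller.quorum.voters=1@kafka-1:9093,2@kafka-2:9093,3@kafka-3:9093
# Separate listeners keep controller traffic apart from client traffic
listeners=PLAINTEXT://kafka-1:9092,CONTROLLER://kafka-1:9093
controller.listener.names=CONTROLLER
inter.broker.listener.name=PLAINTEXT
# Advertise only the client-facing listener to clients
advertised.listeners=PLAINTEXT://kafka-1:9092
```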
6. Configure and isolate Kafka with security in mind
Kafka offers several features to help secure your deployment, such as SSL/TLS for encrypted communication, Simple Authentication and Security Layer (SASL) for authentication, and access control lists (ACLs) for authorization. You should use these features to protect your Kafka cluster from both external and internal threats.
For example, you can use SSL/TLS to encrypt the traffic between your Kafka brokers and clients to prevent eavesdropping, and you can use ACLs to control who can produce and consume messages from your Kafka topics.
Isolating your Kafka cluster is also important for security. You should run your Kafka brokers in a separate network segment, and limit access to this segment to only the necessary clients and administrative tools. This will help mitigate the risk of a security breach and protect your Kafka data.
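For illustration, a client connecting over an encrypted, authenticated listener might be configured as follows. The listener address, credentials, and truststore path are assumptions and must match your broker setup:

```java
import java.util.Properties;

public class SecureClientConfig {
    // Builds client properties for an encrypted, authenticated connection.
    // The broker must expose a matching SASL_SSL listener.
    public static Properties secureProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-1:9093");
        props.put("security.protocol", "SASL_SSL");   // TLS encryption + SASL authentication
        props.put("sasl.mechanism", "SCRAM-SHA-512");
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.scram.ScramLoginModule required "
                + "username=\"app-user\" password=\"app-secret\";");
        props.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks");
        props.put("ssl.truststore.password", "changeit");
        return props; // pass to KafkaProducer, KafkaConsumer, or AdminClient
    }
}
```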
7. Avoid outages by raising the ulimit
A common issue that can lead to Kafka outages is running out of file descriptors. Each Kafka broker holds open file descriptors for every log segment file and network connection, and if a broker hits its file descriptor limit, it can crash or become unresponsive.
To avoid this issue, raise the ulimit (the limit on the number of file descriptors a process can open) on your Kafka brokers. The exact value depends on the size of your cluster and the number of topics and partitions you have, but a good rule of thumb is to set it much higher than the maximum number of log segments you expect to have.
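For example, on a systemd-managed broker the limit can be raised in the service unit. The unit name and the value 100000 below are illustrative, not a universal recommendation:

```bash
# Check the current open-file limit for the user running Kafka
ulimit -n

# For a systemd-managed broker, raise the limit in the unit file, e.g.
# /etc/systemd/system/kafka.service:
#   [Service]
#   LimitNOFILE=100000
sudo systemctl daemon-reload
sudo systemctl restart kafka
```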
8. Monitor your cluster
Monitoring your Kafka cluster is crucial for maintaining performance and reliability, and alerts can help you quickly identify and resolve issues before they impact your users.
You should monitor key Kafka metrics such as broker uptime, consumer lag, and message throughput. These metrics can give you insights into the health and performance of your Kafka cluster, and help you identify potential issues early.
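As one example, consumer lag can be inspected with the kafka-consumer-groups tool that ships with Kafka; the group name below is hypothetical:

```bash
# The LAG column shows how far each consumer is behind the latest
# offset in its assigned partition
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group order-processors
```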
9. Optimize replication and acknowledgment settings
Kafka’s durability and reliability depend heavily on how replication and acknowledgments are configured. Two key settings to consider are the replication factor for topics and the acks configuration for producers.
Setting a replication factor of at least 3 in production environments ensures that data remains available even if one broker fails. In addition, configuring producers with acks=all (or -1) ensures that messages are acknowledged only after all in-sync replicas have confirmed receipt. This significantly reduces the risk of data loss.
You should also configure min.insync.replicas to work in tandem with acks=all. This ensures that writes are rejected if the number of in-sync replicas falls below a safe threshold, protecting data integrity during broker failures. Carefully tuning these settings helps strike the right balance between performance, availability, and durability.
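A minimal producer sketch with these settings, assuming a hypothetical topic orders whose min.insync.replicas is set to 2 at the topic or broker level:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class DurableProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Wait for all in-sync replicas before considering a send successful
        props.put("acks", "all");
        // Idempotence prevents duplicates when the producer retries
        props.put("enable.idempotence", "true");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // With min.insync.replicas=2 on the topic, this send fails fast if
            // fewer than two replicas are available, instead of silently losing data
            producer.send(new ProducerRecord<>("orders", "order-1", "created"));
        }
    }
}
```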
Learn more in our detailed guide to Apache Kafka clusters
Best practices for managing Kafka consumers
Once Kafka is deployed, it’s important to set the right configurations for the Kafka consumers and consumer groups.
10. Choose the right number of partitions
When setting up your Kafka consumers, one of the most important decisions is choosing the right number of partitions for your topics. The number of partitions determines the maximum parallelism of your consumers, since each partition can be consumed by only one consumer thread in a group at a time.
If you have too few partitions, you won’t be able to fully utilize your consumer resources, and your message processing may be slower than necessary. On the other hand, if you have too many partitions, you can end up with too much overhead due to the increased coordination between consumers, and your Kafka cluster may become less stable.
A good rule of thumb is to start with a moderate number of partitions, monitor your consumer performance, and adjust the number of partitions as necessary based on your observations.
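For instance, a topic can be created with an explicit partition count and replication factor through the AdminClient; the topic name and counts below are illustrative starting points:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.List;
import java.util.Properties;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // 12 partitions caps the group at 12 parallel consumers;
            // replication factor 3 tolerates one broker failure (see tip 12)
            NewTopic topic = new NewTopic("orders", 12, (short) 3);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```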
11. Maintain consumer consistency
Consumer consistency means that the same consumer always consumes the same partition. Maintaining consistency can help improve the performance and reliability of your Kafka consumers. It allows consumers to keep their local caches warm, which can reduce the impact of network latency and improve message processing speed. It also avoids the need for consumers to re-fetch data they have already consumed, which can reduce network traffic and improve overall Kafka cluster performance.
To maintain consumer consistency, use Kafka’s consumer groups together with a sticky partition assignment strategy. Within a group, Kafka assigns each partition to exactly one consumer, and a sticky assignor such as the CooperativeStickyAssignor preserves those assignments across rebalances wherever possible, so the same consumer keeps consuming the same partitions.
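A minimal sketch of the relevant settings, added to the consumer configuration shown in tip 3 (the group name is hypothetical):

```java
// Consumer settings that keep partition assignments stable across rebalances;
// add these to the Properties passed to the KafkaConsumer
props.put("group.id", "order-processors");
props.put("partition.assignment.strategy",
        "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
```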
12. Ensure a replication factor greater than 2
When setting up your Kafka topics, you should use a replication factor greater than 2. The replication factor determines the number of copies of each message that Kafka stores.
Using a replication factor greater than 2 can significantly improve the reliability and fault tolerance of your Kafka cluster. If one of your Kafka brokers fails, Kafka can transparently switch to a replica, ensuring that your consumers can continue to consume messages without interruption.
However, a higher replication factor also means more network traffic and storage requirements, so you should find a balance that works for your specific needs. A good rule of thumb is to start with a replication factor of 3, and adjust as necessary based on your requirements and observations.
13. Commit offsets strategically
Offset management plays a critical role in ensuring reliable message processing. Kafka consumers can commit offsets automatically or manually, and choosing the right strategy depends on your processing guarantees.
Automatic offset commits are simpler to configure but may lead to message loss or duplication if a failure occurs between processing and committing. For applications that require stronger delivery guarantees, manual offset commits provide better control. By committing offsets only after successful message processing, you can achieve at-least-once delivery semantics.
For even stricter guarantees, you can combine manual commits with transactional processing to approach exactly-once semantics. Regardless of the strategy you choose, monitoring consumer lag and commit behavior is essential to ensure stable and predictable message consumption.
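A hedged sketch of manual commits for at-least-once processing, reusing the hypothetical topic and group names from the earlier examples:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "order-processors");
        props.put("enable.auto.commit", "false"); // take control of offset commits
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);
                }
                // Commit only after the whole batch is processed: at-least-once
                // semantics (records may be reprocessed after a failure, never skipped)
                if (!records.isEmpty()) {
                    consumer.commitSync();
                }
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        /* hypothetical business logic */
    }
}
```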
Fully managed Apache Kafka in the cloud with Instaclustr
Instaclustr offers a fully managed Apache Kafka service, providing a reliable, scalable, and SOC 2 certified solution either in the cloud or on-premises. This service allows you to concentrate on your application development by handling the configuration and optimization of your Kafka cluster.
Instaclustr Managed Kafka is the optimal choice for running Kafka in the cloud, delivering a production-ready and fully supported Apache Kafka cluster swiftly. This fully hosted and managed solution relieves you from the complexities of data infrastructure management, enabling you to focus on innovating your application stack. With Instaclustr, you receive around-the-clock support and a service level agreement (SLA) guaranteeing 99.999% uptime. The platform is SOC 2 certified and complies with PCI-DSS and HIPAA, ensuring top-tier security and reliability.
The service includes built-in monitoring, managed mirroring, and the option to easily integrate Kafka Connect. Instaclustr’s offering is 100% open source, allowing customization and flexibility. You have the choice to run it in your own cloud provider account or use Instaclustr’s, further enhancing its adaptability to your specific needs.