What Is Apache Kafka?

Apache Kafka is an open source, distributed event-streaming platform designed to handle real-time data feeds. Originally developed at LinkedIn and now maintained by the Apache Software Foundation, Kafka delivers high throughput and low latency, making it suitable for handling massive volumes of data. It also offers client libraries and integrations for a wide range of programming languages and platforms.

Its architecture is built around topics divided into partitions, which are replicated across brokers for fault tolerance and can be consumed in parallel by multiple consumers. Applications that use Kafka range from messaging services and real-time analytics to event sourcing and log aggregation, leveraging these capabilities to build scalable, responsive systems.

Apache Kafka is commonly used from Python code and applications. Combining Kafka with Python gives developers and data engineers a practical way to build real-time, event-driven applications and data pipelines. Python’s ease of use, readability, and vast library ecosystem make it one of the most popular languages for data analysis, machine learning, and backend application development.

This is part of a series of articles about Apache Kafka.

Why use Apache Kafka with Python?

Leveraging Kafka’s streaming capabilities alongside Python’s simplicity reduces development complexity and speeds up the delivery of scalable, real-time data solutions, improving both maintainability and developer productivity. Python’s ecosystem offers mature Kafka client libraries such as confluent-kafka-python and kafka-python, which provide straightforward integration with Kafka clusters.

These libraries abstract the complexities of Kafka producer-consumer interfaces, helping developers rapidly implement stream processing applications. Additionally, Python integrates with data analytics frameworks, making it simpler to analyze and visualize Kafka data streams in real time.

Learn more in our detailed guide to Apache Kafka clusters.

Tutorial: Using the Python client for Apache Kafka

To interact with Apache Kafka from Python, a widely used option is the kafka-python client library. Below is a basic tutorial showing how to produce and consume messages using this library. These instructions are adapted from the Kafka documentation.

Installing kafka-python

First, install the kafka-python package using pip:
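```bash
pip install kafka-python
```

Note that SASL/SCRAM authentication, used later in this tutorial, requires kafka-python 2.0 or later.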

Alternatively, add it to your requirements.txt file for dependency management.

Producing messages

To send messages to Kafka, create a KafkaProducer instance. Configure it with your Kafka cluster’s bootstrap servers and authentication settings.

For example, to connect with SSL and SASL authentication:
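The following is a minimal sketch using kafka-python, assuming a cluster that accepts SASL/SCRAM credentials over TLS; the broker address, port, username, password, CA file path, and topic name are placeholders to replace with your own values:

```python
from kafka import KafkaProducer

# Placeholder connection details -- replace with your cluster's values.
producer = KafkaProducer(
    bootstrap_servers="kafka-broker.example.com:9093",
    security_protocol="SASL_SSL",
    sasl_mechanism="SCRAM-SHA-256",        # match your cluster's SASL mechanism
    sasl_plain_username="myuser",
    sasl_plain_password="mypassword",
    ssl_cafile="cluster-ca-cert.pem",      # CA certificate used to verify the brokers
    value_serializer=lambda v: v.encode("utf-8"),
)

producer.send("test-topic", "Hello from Python")
producer.flush()  # block until buffered messages have been delivered
```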

If your Kafka cluster doesn’t use SSL, the configuration changes to use SASL_PLAINTEXT and omits the SSL parameters.
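For example (again a sketch; adjust the port and SASL mechanism to match your cluster):

```python
producer = KafkaProducer(
    bootstrap_servers="kafka-broker.example.com:9092",
    security_protocol="SASL_PLAINTEXT",   # no TLS: credentials and data travel unencrypted
    sasl_mechanism="SCRAM-SHA-256",
    sasl_plain_username="myuser",
    sasl_plain_password="mypassword",
    value_serializer=lambda v: v.encode("utf-8"),
)
```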

Consuming messages

To consume messages, use a KafkaConsumer. Configure it with the same bootstrap servers and authentication details:
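A matching consumer sketch, using the same placeholder connection details as the producer above:

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "test-topic",
    bootstrap_servers="kafka-broker.example.com:9093",
    security_protocol="SASL_SSL",
    sasl_mechanism="SCRAM-SHA-256",
    sasl_plain_username="myuser",
    sasl_plain_password="mypassword",
    ssl_cafile="cluster-ca-cert.pem",
    group_id="demo-group",          # consumers sharing this ID split the topic's partitions
    auto_offset_reset="earliest",   # start from the beginning if no offset has been committed
    value_deserializer=lambda v: v.decode("utf-8"),
)

for message in consumer:
    print(f"{message.topic}[{message.partition}]@{message.offset}: {message.value}")
```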

Again, if SSL is not used, update the security_protocol accordingly.

Handling client-to-broker encryption (SSL and mTLS)

For clusters with SSL or mutual TLS (mTLS) enabled, additional settings are required. For mTLS, provide both the client certificate and private key:
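A minimal mTLS sketch with kafka-python; the certificate and key file names are placeholders for the PEM files issued for your client:

```python
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka-broker.example.com:9093",
    security_protocol="SSL",
    ssl_cafile="cluster-ca-cert.pem",   # CA that signed the broker certificates
    ssl_certfile="client-cert.pem",     # client certificate presented to the brokers
    ssl_keyfile="client-key.pem",       # private key matching the client certificate
)
```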

The consumer would use a similar SSL configuration. Make sure to convert Java keystore (JKS) files to PEM format if needed, using tools like keytool and openssl.

Running the producer and consumer together

Start the consumer before the producer. By default, a consumer in a new consumer group only receives messages produced after it starts; set auto_offset_reset="earliest" if you also want it to read the messages already in the topic.

Once the producer sends a message, the consumer should print it shortly after:
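Assuming the producer and consumer sketches above, the consumer’s output would look something like this (the partition and offset depend on your topic):

```
test-topic[0]@0: Hello from Python
```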

This setup provides a simple but functional Python-based Kafka pipeline.

Related content: Learn more in our detailed Apache Kafka tutorial

5 Best practices for Kafka and Python integration

Here are some useful practices to consider when working with Kafka in Python.

1. Efficient data handling

Implement batch message processing by configuring Kafka producers with an appropriate batch size and linger time; sending messages in batches reduces network overhead and improves throughput compared with sending each record individually. Similarly, configure consumers to fetch and process records in batches to improve consumption throughput and reduce per-message overhead on consumer clients (see the sketch below).
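As a sketch (the values are illustrative starting points, not recommendations for every workload), batching is controlled on the kafka-python producer with batch_size and linger_ms, and on the consumer with max_poll_records and the fetch settings:

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: accumulate up to 32 KB per partition or wait up to 20 ms before sending a batch.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    batch_size=32_768,        # bytes per partition batch
    linger_ms=20,             # small delay to let batches fill up
    compression_type="gzip",  # compress whole batches to cut network traffic
)

# Consumer: pull up to 500 records per poll and wait for at least 1 KB of data per fetch.
consumer = KafkaConsumer(
    "example-topic",
    bootstrap_servers="localhost:9092",
    max_poll_records=500,
    fetch_min_bytes=1_024,
    fetch_max_wait_ms=500,
)
```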

Configure consumer groups carefully to achieve balanced load distribution and avoid duplicate processing. Pay particular attention to partition assignment, offset commits, and tracking the progress of the processing pipeline; a sketch of manual offset management appears below. Consuming Kafka messages asynchronously, or adding concurrency via threads or processes, also improves resource utilization.
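Below is a minimal sketch of a consumer group member that commits offsets manually, only after a batch has been fully processed; the topic, group name, and process_record function are hypothetical:

```python
from kafka import KafkaConsumer

def process_record(record):
    # Hypothetical processing step -- replace with your own logic.
    print(record.value)

consumer = KafkaConsumer(
    "example-topic",
    bootstrap_servers="localhost:9092",
    group_id="order-processors",   # members of this group share the topic's partitions
    enable_auto_commit=False,      # commit only after records are fully processed
    auto_offset_reset="earliest",
)

while True:
    batch = consumer.poll(timeout_ms=500, max_records=100)
    for tp, records in batch.items():
        for record in records:
            process_record(record)
    if batch:
        consumer.commit()          # commit offsets for the records processed in this batch
```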

2. Error handling and logging

Clearly defined exception handling enables applications to recover from transient broker issues, network disruptions, and other runtime errors. Adopting retry policies and backoff algorithms for failed sends improves the overall reliability and fault tolerance of Kafka producer and consumer applications written in Python.

Logging plays a significant role in monitoring and debugging. Python’s integrated logging mechanisms, supported extensively by popular libraries, provide a practical way to gain insight into Kafka consumer and producer behavior. Ensure that logs contain relevant contextual information including timestamps, transaction IDs, partition numbers, and offsets.
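A brief sketch combining client-side retries with contextual logging, assuming kafka-python; the topic name and timeout values are illustrative:

```python
import logging

from kafka import KafkaProducer
from kafka.errors import KafkaError

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("kafka-producer")

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    retries=5,              # let the client retry transient send failures
    retry_backoff_ms=200,   # back off between retries
    acks="all",
)

future = producer.send("example-topic", b"payload")
try:
    metadata = future.get(timeout=10)  # block until the broker acknowledges (or the send fails)
    logger.info("Delivered to %s[%d] at offset %d",
                metadata.topic, metadata.partition, metadata.offset)
except KafkaError:
    logger.exception("Send failed after retries; consider a dead-letter queue or an alert")
```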

3. Performance optimization

Achieving optimal performance when integrating Python applications and Kafka involves careful tuning of various components at different stages of the pipeline. Use Kafka partitions strategically, scaling them in line with the application throughput requirements and consumer parallelism.

By determining the appropriate partition count, producers and consumers can operate with maximum concurrency, eliminating bottlenecks that might otherwise negatively affect performance and latency. Additionally, establishing optimal settings for producer acknowledgments (acks), compression techniques, and buffer memory allocation contributes to improved throughput.

On the consumer side, adjusting fetch sizes, the heartbeat interval, and maximum poll records provides optimization opportunities for responsiveness and efficiency. Monitoring performance benchmarks periodically and executing performance tests enable identification of bottlenecks or inefficiencies.
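For illustration, here is a sketch of tuned producer and consumer settings in kafka-python; the values are starting points to benchmark against your own workload, not universal recommendations:

```python
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",                  # strongest durability; use 1 or 0 to trade safety for speed
    compression_type="lz4",      # reduce bandwidth at some CPU cost
    batch_size=65_536,
    linger_ms=10,
    buffer_memory=67_108_864,    # 64 MB of client-side buffering
)

consumer = KafkaConsumer(
    "example-topic",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    max_poll_records=1_000,
    max_partition_fetch_bytes=2_097_152,  # 2 MB per partition per fetch
    heartbeat_interval_ms=3_000,
    session_timeout_ms=30_000,
)
```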

4. Security and authentication

Securing Kafka clusters that are accessed from Python applications involves enabling authentication mechanisms and encrypted communication channels. Kafka natively supports several authentication mechanisms, such as SSL/TLS, SASL/Kerberos (GSSAPI), SASL/PLAIN, and SASL/SCRAM, all of which the Python Kafka clients can use.

Leveraging these options prevents unauthorized access and ensures confidentiality during message transmission. Clearly defining and enforcing appropriate authorization policies using Kafka ACLs or other access control frameworks further improves system security, restricting producer and consumer privileges effectively.
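As a sketch, a kafka-python client can also supply a pre-built SSLContext instead of individual certificate file paths; the host, credentials, and CA path below are placeholders:

```python
import ssl

from kafka import KafkaConsumer

# Build the TLS configuration explicitly so it can be audited and reused.
ssl_context = ssl.create_default_context(cafile="cluster-ca-cert.pem")

consumer = KafkaConsumer(
    "secure-topic",
    bootstrap_servers="kafka-broker.example.com:9093",
    security_protocol="SASL_SSL",
    sasl_mechanism="SCRAM-SHA-512",
    sasl_plain_username="myuser",
    sasl_plain_password="mypassword",
    ssl_context=ssl_context,
)
```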

5. Monitoring and metrics

Kafka exposes detailed metrics through its JMX interface, and Python client libraries such as confluent-kafka-python and kafka-python expose client-side statistics for monitoring consumer lag, throughput, latency, partition offset health, and other critical indicators. Employing external open source monitoring tools allows administrators to visualize cluster performance, identify bottlenecks, and respond quickly to incidents.

Frequently tracking consumer lag, message processing latency, and throughput rates highlights areas for potential tuning and reveals underlying issues early. Integrating alerting systems that trigger on defined thresholds ensures a rapid response to critical events, maintaining high availability and reliability.
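One common client-side check is computing consumer lag directly. Here is a minimal kafka-python sketch (the topic and group names are placeholders):

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "example-topic",
    bootstrap_servers="localhost:9092",
    group_id="lag-monitor",
    enable_auto_commit=False,
)

# Poll once so the consumer joins the group and receives a partition assignment.
consumer.poll(timeout_ms=1000)

for tp in consumer.assignment():
    latest = consumer.end_offsets([tp])[tp]   # last offset available in the partition
    current = consumer.position(tp)           # next offset this consumer will read
    print(f"{tp.topic}[{tp.partition}] lag = {latest - current}")
```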

Instaclustr for Apache Kafka: Powering data streams with ease

Apache Kafka is among the most powerful tools for managing real-time data streams, but operating and maintaining a Kafka cluster can get complex for even the most experienced teams. That’s where Instaclustr for Apache Kafka steps in. Offering a fully managed, enterprise-grade Apache Kafka experience, Instaclustr takes the operational burden off your shoulders so you can focus on building and scaling exceptional applications.

Instaclustr simplifies Kafka deployment and management through a resilient cloud architecture, automated scaling, and expert 24×7 support. Beyond that, it ensures high availability of your Kafka installations, secures your data with encryption, and keeps your clusters optimized with consistent updates and monitoring. With these features, developers can trust Instaclustr to handle the technical heavy lifting while they prioritize what matters most for their business.

Bridging Kafka and Python

For teams using Python within their technology stack, Instaclustr for Kafka is the perfect pairing. Python is one of the most versatile and widely used programming languages, ideal for building data-driven applications. When combined with Instaclustr for Kafka, Python developers gain access to an efficient, real-time data processing pipeline that integrates seamlessly into their existing workflows.

With Instaclustr handling the Kafka infrastructure, Python developers don’t have to worry about provisioning clusters, balancing loads, or configuring brokers. Instead, they can focus on leveraging rich Kafka libraries in Python to build robust features like live tracking systems, data analytics dashboards, and personalized user experiences.

Instaclustr eliminates the setup complexity of Kafka and helps ensure optimal performance, leaving Python developers with a simple but powerful toolkit to execute their ideas in real time. For anyone aiming to transform how data flows through their organization, this combination isn’t just complementary; it’s revolutionary.

For more information: