What is Apache Kafka Connect?
Apache Kafka Connect is a scalable tool for integrating Apache Kafka with other data systems. It serves as an interface that simplifies data transfer to and from Kafka, eliminating the need for custom integration code. Kafka Connect provides pre-built connectors for a wide range of systems, making it easier to stream data between platforms with low latency.
Kafka Connect’s architecture enables distributed data movement, building on Kafka’s ability to handle large-scale data streams. Instead of requiring custom code, it abstracts integration details behind connector configurations, so data pipelines are declared rather than programmed. Because it runs alongside Kafka and uses Kafka’s own messaging framework, it fits into diverse infrastructures with minimal operational overhead.
Key features and benefits of Kafka Connect
Kafka Connect includes the following features and capabilities:
- Pre-built connectors: Offers a range of connectors for databases, cloud services, and message queues, reducing the need for custom integration.
- Scalability: Has a distributed architecture for handling large-scale data streams with high throughput.
- Fault tolerance: Automatic task redistribution ensures resilience in distributed deployments.
- Flexible deployment: Supports standalone mode for testing and distributed mode for production use cases.
- Dynamic configuration: Allows users to update connectors and tasks without downtime, keeping operations uninterrupted.
- Inline transformations: Enables single message transforms (SMTs) for modifying data in transit, reducing post-processing effort.
- Schema management: Integrates with schema registries for formats like Avro and JSON schema, ensuring data consistency.
- Monitoring tools: Provides built-in metrics and logging for monitoring system performance and diagnosing issues.
- Active open source community: Provides access to community-driven connectors and ongoing feature development.
Related content: Read our guide to Kafka support
Kafka Connect architecture
Kafka Connect’s architecture includes the following components.
Connectors
Connectors in Kafka Connect enable interaction between Kafka and external data systems. They are reusable components that handle data ingestion and egress, transforming raw data into streams that Kafka processes. Each connector acts like a bridge, minimizing manual intervention required to move data in and out of Kafka.
Kafka Connect distinguishes between source and sink connectors. Source connectors ingest data from external systems into Kafka topics, while sink connectors take data from Kafka topics and push it into other systems. This dual role allows Kafka Connect to integrate into existing architectures, enabling real-time data processing and analytics.
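For illustration, here is a minimal sketch of both directions using the FileStream example connectors bundled with Kafka (in recent releases they may need to be added to the plugin path); topic names and file paths are placeholders:
# Source connector: reads lines from a local file into a Kafka topic
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/source.txt
topic=example-topic

# Sink connector: writes records from a Kafka topic to a local file
name=local-file-sink
connector.class=FileStreamSink
tasks.max=1
topics=example-topic
file=/tmp/sink.txt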
Workers
Workers are the engine of Kafka Connect, responsible for executing connectors and tasks. They operate as part of a Kafka Connect cluster and can be deployed in standalone or distributed mode. Workers provide manageability and scalability, ensuring each connector’s tasks are executed efficiently and can be scaled across multiple nodes.
In standalone mode, a single worker handles all tasks, which suits development and testing environments. In distributed mode, workers manage tasks collaboratively across a cluster, suitable for production settings needing fault tolerance and high availability.
Standalone mode
Standalone mode in Kafka Connect runs everything on a single node and is primarily used for testing or small, one-off deployments. In this setup, a single worker handles all connectors and tasks, offering a simple way to experiment with configurations while keeping deployment complexity low.
However, standalone mode is limited in scalability and fault tolerance. It lacks automatic load balancing, making it less effective for production environments demanding high availability.
Distributed mode
By deploying multiple workers across a cluster, distributed mode handles large volumes of data with load balancing and fault tolerance features. This ensures continuous data processing even if some nodes fail.
Distributed mode offers automatic task rebalancing and load distribution, making it suitable for large-scale, mission-critical applications. It supports dynamic updates, allowing connectors and tasks to be modified on-the-fly without downtime.
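For example, a connector’s configuration can be changed through the worker’s REST API without restarting the cluster. The sketch below assumes a worker listening on the default port 8083 and a hypothetical connector named my-connector; the PUT request creates the connector if it does not exist and updates it otherwise:
curl -X PUT -H "Content-Type: application/json" \
  --data '{"connector.class": "FileStreamSource", "tasks.max": "1", "file": "/tmp/input.txt", "topic": "example-topic"}' \
  http://localhost:8083/connectors/my-connector/config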
Tasks
Tasks in Kafka Connect are subordinate to connectors and perform the actual data movement between systems. They are instances that execute specific portions of a connector’s workload, enabling parallelism and efficiency in data flow operations. Tasks can be scaled to utilize system resources effectively, improving throughput.
By dividing connector workloads into multiple tasks, Kafka Connect achieves better resource utilization and performance. Each task can operate independently, reducing bottlenecks in data processing. This division allows fine-grained control over data streams, aligning operational capabilities with system demands, promoting optimized data handling.
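Parallelism is capped per connector with the tasks.max setting; the connector may create fewer tasks than this limit if the workload cannot be split further (for example, by table or by topic partition). A sketch of a configuration fragment, with illustrative names and values:
# Connector configuration fragment (illustrative).
# tasks.max is an upper bound; the connector decides how many tasks it actually creates.
name=orders-sink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
topics=orders
tasks.max=4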
Converters and serialization
Converters in Kafka Connect transform data from its native format into standardized Kafka records, and vice versa, depending on the direction of data flow. This transformation is crucial for data compatibility and ensures that information retains its context as it moves through pipelines. Popular converter formats include JSON, Avro, and Protobuf.
Serialization plays a role in converters by encoding and decoding data efficiently. Using serialization frameworks, Kafka Connect preserves data structure integrity and supports complex data models. This process ensures interoperability across connected systems, enabling accurate data transfer and storage, which is important for downstream analytics.
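Converters are set on the worker and can be overridden per connector. The fragment below is a sketch assuming Confluent’s Avro converter with a Schema Registry reachable at localhost:8081; the JSON converter used later in this article’s standalone example needs no registry:
# Worker configuration fragment: Avro serialization backed by a schema registry
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081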
Transformations
Transformations in Kafka Connect modify data as it travels through the pipeline, enabling adjustments to the stream content. They act on messages before they are published to a topic or sent to a sink, allowing organizations to tailor data to their requirements without altering source systems.
By implementing transformations, users can filter, mask, and enrich data inline, optimizing it for consumption. These capabilities simplify data workflows, reduce post-processing needs, and ensure the data aligns with business objectives. The flexibility of transformations allows Kafka Connect to cater to diverse data processing scenarios.
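As a sketch, the connector configuration fragment below chains two of the single message transforms that ship with Kafka: one masks a hypothetical ssn field and the other inserts a static field recording the data source. Field names and values are placeholders:
# Connector configuration fragment: chained single message transforms
transforms=mask,addSource
transforms.mask.type=org.apache.kafka.connect.transforms.MaskField$Value
transforms.mask.fields=ssn
transforms.addSource.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.addSource.static.field=data_source
transforms.addSource.static.value=crm-system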
Dead letter queues
Dead letter queues (DLQs) in Kafka Connect handle records that fail processing or delivery, ensuring that problematic data is not silently lost. DLQs capture these failed records in a separate Kafka topic, offering a systematic way to address errors after the fact. They enable the analysis of faulty records without impacting the primary data stream.
Through DLQs, Kafka Connect maintains an error management strategy, allowing operators to inspect, replay, or discard problematic records as appropriate. This approach simplifies troubleshooting and improves system resilience, which is especially important in time-sensitive applications.
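DLQs are configured per sink connector through Kafka Connect’s error-handling properties. A minimal sketch, assuming a DLQ topic named dlq-example:
# Sink connector configuration fragment: tolerate bad records and route them to a DLQ
errors.tolerance=all
errors.deadletterqueue.topic.name=dlq-example
errors.deadletterqueue.topic.replication.factor=1
errors.deadletterqueue.context.headers.enable=true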
Related content: Read our guide to Kafka architecture
Tips from the expert

Paul Brebner
Technology Evangelist at NetApp Instaclustr
Paul has extensive R&D and consulting experience in distributed systems, technology innovation, software architecture, and engineering, software performance and scalability, grid and cloud computing, and data analytics and machine learning.
In my experience, here are tips that can help you better leverage Apache Kafka Connect:
- Tune task parallelism dynamically: Adjust the tasks.max configuration for connectors based on real-time system performance metrics. Start with a smaller number and scale incrementally, observing CPU, memory, and Kafka throughput to optimize parallelism without overwhelming resources.
- Implement custom Single Message Transforms (SMTs): While Kafka Connect includes standard SMTs, writing custom SMTs allows you to preprocess or enrich data inline. For instance, you can add dynamic metadata, remove PII data, or transform records into custom formats, reducing post-processing workloads downstream.
- Leverage DLQs for better error diagnosis: For production systems, configure Dead Letter Queues (DLQs) with well-monitored Kafka topics. Extend DLQ handling to alerting tools like Prometheus or Grafana so you can proactively address faulty records without disrupting your data pipeline.
- Optimize converter serialization for large-scale data: For high-throughput environments, prefer Avro or Protobuf formats over JSON for better performance and schema evolution. Combine this with a schema registry for central management, ensuring compatibility across data systems.
- Integrate security with end-to-end encryption: Secure Kafka Connect clusters by enabling SSL/TLS for all connectors, workers, and topics. Use SASL for authentication and encrypt credentials using external tools like HashiCorp Vault to mitigate data breaches.
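As a rough illustration of the last tip, the worker configuration fragment below enables encrypted, authenticated connections to the Kafka cluster and registers a file-based config provider so that connector configurations can reference secrets with the ${file:...} syntax instead of embedding them in plain text. Hostnames, paths, and passwords are placeholders, and this is not a complete setup (SASL credentials via sasl.jaas.config, plus matching producer.- and consumer.-prefixed settings, are also needed):
# Worker configuration fragment (illustrative): secure the worker's Kafka connections
bootstrap.servers=kafka-1.example.com:9093
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
ssl.truststore.location=/etc/kafka-connect/truststore.jks
ssl.truststore.password=changeit

# Allow connector configs to pull secrets from an external file
config.providers=file
config.providers.file.class=org.apache.kafka.common.config.provider.FileConfigProvider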
Tutorial: Getting started with Kafka Connect
This tutorial will walk you through installing a Kafka Connect plugin and configuring workers in standalone and distributed modes. These instructions are adapted from the Confluent documentation.
Installing Kafka Connect plugins
Kafka Connect plugins enable the use of custom connectors, transforms, and converters. Each plugin is a set of JAR files that encapsulate the logic needed to integrate systems with Kafka.
Plugins are isolated, meaning their libraries do not interfere with others, ensuring stability even when using connectors from multiple providers.
Installation steps:
- Locate the plugin path: Ensure your Kafka Connect worker configuration specifies a valid plugin.path. This is a comma-separated list of directories where plugins reside:
plugin.path=/usr/local/share/kafka/plugins
- Place plugin files: Place the plugin directory or uber JAR (a single JAR containing all dependencies) into the directory specified by plugin.path.
- Verify plugin availability: When the Connect worker starts, it automatically detects all plugins in the specified path. Ensure the directories contain no duplicate versions of a plugin.
Example: Installing a custom connector
Assuming the chosen plugin is in the directory /path/to/my-connector, add this to the plugin.path:
plugin.path=/usr/local/share/kafka/plugins,/path/to/my-connector
Start the worker, and the new connector becomes available.
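If the worker’s REST API is reachable (it listens on port 8083 by default), you can confirm that the plugin was detected by listing the connector plugins known to the worker:
curl http://localhost:8083/connector-plugins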
Configuring and running workers
Kafka Connect workers can run in two modes: standalone or distributed. Standalone mode is suitable for development, testing, or single-node deployments. All tasks run on a single worker. Distributed mode is recommended for production environments. It uses multiple workers in a cluster for scalability and fault tolerance.
Standalone mode configuration example:
- Create a standalone.properties file for the worker configuration:
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
offset.storage.file.filename=/tmp/connect.offsets
- Launch the worker with the following command:
bin/connect-standalone standalone.properties connector1.properties
- standalone.properties: Configures the worker (e.g., Kafka brokers, serialization formats).
- connector1.properties: Specifies settings for the connector.
Standalone mode code example: Launching a FileSource connector
- This example reads data from a file and sends it to a Kafka topic. Save it as file-source.properties:
name=file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/input.txt
topic=example-topic
- Run the worker:
bin/connect-standalone standalone.properties file-source.properties
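To check that lines from /tmp/input.txt are reaching the topic, you can read it back with Kafka’s console consumer (the script may be named kafka-console-consumer.sh depending on your distribution):
bin/kafka-console-consumer --bootstrap-server localhost:9092 --topic example-topic --from-beginning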
Distributed mode configuration example
- Create a distributed.properties file:
bootstrap.servers=localhost:9092
group.id=connect-cluster
key.converter=io.confluent.connect.avro.AvroConverter
value.converter=io.confluent.connect.avro.AvroConverter
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status
- Launch a worker:
bin/connect-distributed.sh distributed.properties
Managing connectors in distributed mode:
In distributed mode, connectors are created and managed via REST API requests. Example:
curl -X POST -H "Content-Type: application/json" --data '{
  "name": "jdbc-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "2",
    "topics": "example-topic",
    "connection.url": "jdbc:mysql://localhost:3306/mydb",
    "connection.user": "user",
    "connection.password": "password",
    "insert.mode": "insert",
    "auto.create": "true"
  }
}' http://localhost:8083/connectors
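Once the connector is registered, the same REST API can be used to inspect it. For example, the following calls list all connectors on the cluster and show the status of the jdbc-sink connector and its tasks:
curl http://localhost:8083/connectors
curl http://localhost:8083/connectors/jdbc-sink/status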
Summary of Kafka Connect modes:
Aspect | Standalone Mode | Distributed Mode |
Use case | Development, Testing | Production |
Scalability | Limited | High |
Fault tolerance | None | Automatic task redistribution |
Configuration | CLI-based | REST API-based |
Configuring auto topic creation for source connectors
Kafka Connect can automatically create topics for source connectors if the required topics do not already exist on the Kafka broker. This feature, introduced in Confluent Platform version 6.0, simplifies the setup of source connectors by reducing manual topic creation steps.
To enable this feature, you need to configure the worker and the source connector:
- Worker configuration: Add the following property to the worker configuration file:
topic.creation.enable=true
This enables auto topic creation for all source connectors on the worker. By default, this property is set to true.
Note: This setting applies only to source connectors. Adding this property to sink connectors will result in a warning.
- Source connector configuration: For each source connector, define the required topic creation properties. At minimum, you need to specify:
topic.creation.default.replication.factor=
topic.creation.default.partitions=
These properties determine the replication factor and the number of partitions for any new topic created by the source connector.
- Optionally, you can customize topic creation using topic groups. Topic groups allow you to apply different configurations to specific sets of topics.
Here are some examples of configuring auto topic creation for source connectors:
Example 1: Basic configuration
topic.creation.default.replication.factor=3
topic.creation.default.partitions=5
This configuration creates all new topics with a replication factor of 3 and 5 partitions.
Example 2: Topic groups with custom partitions
topic.creation.groups=inorder
topic.creation.default.replication.factor=3
topic.creation.default.partitions=5
topic.creation.inorder.include=status, orders.*
topic.creation.inorder.partitions=1
- Default group: Topics have a replication factor of 3 and 5 partitions.
- Inorder group: Topics named status or matching orders.* are created with 1 partition.
Example 3: Advanced inclusion and exclusion
topic.creation.groups=highly_parallel, compacted
topic.creation.default.replication.factor=3
topic.creation.default.partitions=5
topic.creation.highly_parallel.include=hpc.*,parallel.*
topic.creation.highly_parallel.exclude=.*internal, .*metadata
topic.creation.highly_parallel.replication.factor=1
topic.creation.highly_parallel.partitions=100
topic.creation.compacted.include=configurations.*
topic.creation.compacted.cleanup.policy=compact
- Highly parallel group: Topics matching hpc.* or parallel.* are created with 100 partitions and a replication factor of 1, except those matching .*internal or .*metadata.
- Compacted group: Topics matching configurations.* are created with the compact cleanup policy.
Unlock the full potential of Kafka with Instaclustr for Apache Kafka Connect
For businesses leveraging Apache Kafka to handle massive data streams, Instaclustr for Kafka Connect offers a seamless way to extend and optimize your data integration capabilities. By eliminating the complexities of deploying and managing Kafka Connect, Instaclustr empowers organizations to focus on building value-driven applications and scaling their operations with confidence.
Simplified data integration at scale
Managing data pipelines can quickly become daunting, especially as infrastructure grows. Instaclustr for Apache Kafka Connect simplifies this process by managing the integration between Kafka and external systems, such as databases, cloud storage solutions, and SaaS tools. With fully managed connectors and an intuitive platform, Instaclustr ensures a reliable, scalable, and efficient flow of data across your organization.
Built-in reliability and performance
When mission-critical applications depend on real-time data pipelines, downtime is not an option. Instaclustr provides a fully managed infrastructure with 24×7 monitoring, automated backups, and proactive maintenance to ensure your system maintains peak performance. By leveraging a fault-tolerant architecture, Instaclustr minimizes risks while ensuring your Kafka Connect ecosystem is highly available and optimized for demanding workloads.
Focus on innovation, not maintenance
One of the standout benefits of Instaclustr for Kafka Connect is the ability to free up your teams from the operational overhead of managing complex connectors. Instaclustr automates the heavy lifting—from deployment and upgrades to scaling and troubleshooting. This allows your developers to redirect their energy into creating innovative solutions and improving customer-focused applications instead of wrangling infrastructure.
Enterprise-grade security
Instaclustr understands the importance of securing sensitive data. With enterprise-grade encryption, authentication, and access control policies built-in, you can be confident that your data streams are protected at every stage. Compliance and security are non-negotiable, and Instaclustr ensures you meet those standards with ease.
Why choose Instaclustr for Apache Kafka Connect?
By partnering with Instaclustr for Apache Kafka Connect, businesses can harness the power of Apache Kafka’s robust data streaming capabilities without the operational headaches. It’s the perfect solution for organizations ready to scale with reliability, security, and efficiency at the forefront.
Elevate your data integration strategy and maximize the potential of your Kafka infrastructure. Instaclustr for Apache Kafka Connect is more than a managed service—it’s your trusted guide to simplifying real-time data pipelines while driving innovation.
For more information: