Store Kafka Data to Amazon S3


This example demonstrates how to store messages from a Kafka topic into an Amazon S3 bucket.

Prerequisites

The Apache Kafka installation comes bundled with a number of tools; in particular, this example uses the connect-standalone.sh tool. To get it, download and install a Kafka release from here. This example has been tested with Kafka 1.1.0. The example also uses a third-party plugin, found here, that lets Kafka Connect write to S3 buckets. Once downloaded, extract the archive and copy the contents of the confluentinc-kafka-connect-s3-4.1.1/lib/ folder to a plugins/kafka-connect-s3/ folder somewhere on your computer. For example, assuming the archive was extracted into your current directory:
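# copy the S3 connector JARs into a dedicated plugins directory
mkdir -p ~/plugins/kafka-connect-s3
cp confluentinc-kafka-connect-s3-4.1.1/lib/* ~/plugins/kafka-connect-s3/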

Kafka Connect Configuration

Before you can use Kafka Connect you need to configure a number of things; see here for a full list of options. Make a file connect.properties along the following lines (a minimal sketch; your exact options may differ):
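# replace with the address of at least one node in your cluster
bootstrap.servers=<node IP address>:9092

# treat message keys and values as plain strings
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter

# standalone mode stores its offsets in a local file
offset.storage.file.filename=/tmp/connect.offsets

# directory containing the kafka-connect-s3 plugin folder
plugin.path=/path/to/plugins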

Make sure to replace the bootstrap.servers value with the IP address of at least one node in your cluster, and /path/to/plugins with the path to your plugins directory.

Note: To connect to your Kafka cluster over the private network, use port 9093 instead of 9092.

To use Kafka Connect with Instaclustr Kafka you also need to provide authentication credentials. Add settings like the following to your connect.properties file (a sketch assuming SASL/SCRAM authentication; check your cluster's connection details for the exact mechanism, username, and password):
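sasl.mechanism=SCRAM-SHA-256
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="<username>" password="<password>";

# sink connectors consume through their own client, which takes the same settings with a consumer. prefix
consumer.sasl.mechanism=SCRAM-SHA-256
consumer.sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="<username>" password="<password>";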

If your cluster does not have client ⇆ broker encryption enabled, add the following to your connect.properties file (assuming the SASL setup above):
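security.protocol=SASL_PLAINTEXT
consumer.security.protocol=SASL_PLAINTEXT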

If your cluster has client ⇆ broker encryption enabled you will also need to provide encryption settings. For more information on using certificates with Kafka, and where to find them, see here. Add settings like the following to your connect.properties file, ensuring the truststore location is correct:
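security.protocol=SASL_SSL
consumer.security.protocol=SASL_SSL

# truststore containing your cluster's CA certificate
ssl.truststore.location=/path/to/truststore.jks
ssl.truststore.password=<truststore password>
consumer.ssl.truststore.location=/path/to/truststore.jks
consumer.ssl.truststore.password=<truststore password>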

Create Kafka Topic

For Kafka Connect to work, sources and sinks must refer to specific Kafka topics. Before you can run Kafka Connect you need to create a topic to hold the messages that Kafka Connect will store to S3. Use the guide here to create a new topic called s3_topic.
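Alternatively, a sketch using the bundled kafka-topics.sh tool (Kafka 1.1.0 syntax; newer releases take --bootstrap-server instead of --zookeeper, and the address here is a placeholder), creating three partitions to match the example output later in this guide:

bin/kafka-topics.sh --create --zookeeper <zookeeper address>:2181 \
    --replication-factor 3 --partitions 3 --topic s3_topic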

Create S3 Bucket

If you do not already have an S3 bucket, follow the guide here to create one.

S3 Sink Configuration

Now that Kafka Connect is configured, you need to configure the sink for your data. This example uses the S3 Sink from Confluent; for more information on the S3 sink, including more configuration options, see here. The S3 sink takes all messages from a Kafka topic and stores them in an S3 bucket. Make a file s3-sink.properties along the following lines (a minimal sketch; the flush.size of 3 matches the example output later in this guide):
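name=s3-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
tasks.max=1
topics=s3_topic

# destination bucket; fill in your bucket's name and AWS region
s3.bucket.name=<bucket name>
s3.region=<bucket region>

# commit a file to S3 once 3 records have accumulated for a partition
flush.size=3

storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.json.JsonFormat
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner
schema.compatibility=NONE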

Make sure to replace <bucket name> and <bucket region> with the name and AWS region of the destination S3 bucket respectively.

AWS Credentials Configuration

To write to the specified S3 bucket, the Kafka Connect S3 plugin needs credentials for an AWS account with access to that bucket. The easiest way to provide these is through the AWS CLI. Download and install the AWS CLI, for example:

sudo apt install awscli

With the AWS CLI installed, follow the guide here to configure your account credentials.
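In short, running aws configure prompts for an access key ID, secret access key, and default region, and stores them in the shared credentials file, where the S3 plugin's default credential chain can find them:

aws configure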

Start Kafka Connect

Now that all the Kafka Connect components in this example are configured, you can start Kafka Connect from the command line. For example, from the Kafka installation directory, assuming both properties files were created there:
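# standalone worker: pass the worker config first, then one or more connector configs
bin/connect-standalone.sh connect.properties s3-sink.properties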

Test Kafka Connect

Once Kafka Connect has started, it’s time to test the configuration.

First, follow the guide here to set up a Kafka console producer, changing the topic name to s3_topic.

Once you’ve set up the console producer, send some messages to Kafka. For example (illustrative contents; with three partitions and a flush.size of 3, sending nine messages gives each partition enough to fill a file):
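>message 1
>message 2
>message 3
>message 4
>message 5
>message 6
>message 7
>message 8
>message 9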

Note: messages are only written to the S3 bucket once a partition has accumulated at least as many unwritten messages as the flush.size value from s3-sink.properties.

After producing the messages to Kafka, use the AWS CLI to list all objects in your S3 bucket:

aws s3api list-objects --bucket "<bucket name>"

Make sure to replace <bucket name> with the name of your S3 bucket. For example, with a hypothetical bucket named my-kafka-bucket:
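aws s3api list-objects --bucket "my-kafka-bucket"

With the default partitioner and the JSON format sketched above, the stored objects should appear under keys of the form topics/s3_topic/partition=<partition>/s3_topic+<partition>+<starting offset>.json.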

Finally, use the AWS CLI to download one of the files and verify the contents are correct, making sure to replace <bucket name> with the name of your S3 bucket:
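# the object key below is illustrative; take a real key from the list-objects output
aws s3 cp "s3://<bucket name>/topics/s3_topic/partition=0/s3_topic+0+0000000000.json" .
cat s3_topic+0+0000000000.json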

Note: the output file in this example contains only 3 messages because each file holds messages from a single partition, and s3_topic in this example has three partitions.
