One of the most popular use cases for Apache Kafka is streaming data between systems using the Kafka® Connect API. It lets you build pipelines around the technology you’re streaming to, instead of maintaining a separate pipeline for every permutation of data.
Take streaming Kafka data into OpenSearch as an example: you’d need a running Kafka and Kafka Connect cluster, an OpenSearch cluster to take in the data, and a sink connector to stream the data from Kafka into OpenSearch. You can set up all of this, even the managed OpenSearch sink connector, with NetApp Instaclustr.
However, I’m here to walk you through a different angle: what the sink configuration settings mean through an OpenSearch lens, and the steps you’ll need to take to ensure your cluster can handle the incoming data efficiently.
Configuring the OpenSearch sink connector
How you configure the connector can affect how you’ll want to configure your OpenSearch cluster, and vice versa. Here are some of the configuration options from the connector docs to keep an eye on:
- batch.size: The number of records to be processed by the connector before sending them to OpenSearch. This is important on the OpenSearch side because you want to make sure the incoming batch size won’t overwhelm your cluster (you can learn more about this from this blog post). The main thing you’ll want to keep in mind is the number of shards per index; one shard is not scalable, as I/O would nearly come to a halt.
- max.in.flight.requests: The number of indexing requests that can be considered “in-flight” (incomplete) before other outgoing requests are blocked. You’ll want this number low enough not to cause issues for any load balancing on your OpenSearch cluster, but high enough to keep the data flowing effectively.
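To make this concrete, here’s a rough sketch of how these two settings might appear in a sink configuration file. The values are placeholders for illustration, not recommendations; the right numbers depend on your topic throughput and your OpenSearch cluster’s shard layout:

# Illustrative values only; tune against your own cluster.
# Records per batch sent to OpenSearch:
batch.size=2000
# Concurrent incomplete indexing requests allowed before blocking:
max.in.flight.requests=5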
There are also settings that affect how your data is taken in: you can dynamically generate OpenSearch document IDs for records using key.ignore.id and key.ignore.id.strategy, or ignore the record keys of certain topics with topic.key.ignore.
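As a sketch, those key-handling options might be combined like this. The option names are the ones given above, but the values and the topic name are assumptions for illustration, so verify both against the connector’s README for your version:

# Sketch only; confirm option names and values in the README for your version.
# Generate document IDs for incoming records:
key.ignore.id=true
key.ignore.id.strategy=topic.partition.offset
# Ignore record keys for these topics (hypothetical topic name):
topic.key.ignore=noisy-logs-topic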
You’ll want to take a full look at all the settings in the docs before setting up your Apache Kafka and OpenSearch clusters, as they can affect the performance of your data streaming.
To configure the connector, you’ll first need to download the plugin (installation details, as well as a sample configuration file, can be found in the README.md). Then add the plugin to your Kafka Connect worker(s) by placing the project binary, however you get it, into the /kafka-connect-plugins folder of each worker.
Then, you’ll want to add your plugin to the plugin.path configuration variable:
plugin.path=/kafka-connect-plugins
Restart your workers and query the Kafka Connect REST API to confirm the connector has been installed correctly. Then you can set up your configuration file in the /config folder:
name=opensearch-sink
…
topics=event-opensearch-sink
This means the connector will only consume events from this topic. You’ll also need to point the connector at your OpenSearch cluster:
connection.url=https://your-OpenSearch-Instance.com:9200
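Putting those pieces together, a minimal end-to-end configuration might look something like the sketch below. It assumes the open source Aiven OpenSearch sink connector (hence the connector.class value) and uses placeholder credentials; swap in the class name, topic, URL, and auth settings for your own setup:

# A minimal sketch; verify connector.class and the auth settings against the
# README for the connector build you installed.
name=opensearch-sink
connector.class=io.aiven.kafka.connect.opensearch.OpensearchSinkConnector
tasks.max=1
topics=event-opensearch-sink
connection.url=https://your-OpenSearch-Instance.com:9200
# Placeholder credentials:
connection.username=admin
connection.password=change-me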
Preparing your OpenSearch cluster for Kafka event data
To prepare for Kafka event data to be fed into OpenSearch, you’ll want to keep a few things in mind:
- Shards/Index: This is something to watch on all OpenSearch clusters, but you’ll especially want to keep an eye on it when you’re sourcing from a very active Apache Kafka cluster. If you don’t have enough shards per index, you’ll eventually fall behind in processing the incoming Kafka indexing requests.
- Resources/Scalability: When you go this route, you’re tying your OpenSearch cluster’s size to the needs of the Apache Kafka sink. This may mean automated scaling, or manual monitoring and tweaking. The Instaclustr console makes scaling easy; if you run your project on a managed platform, the hosting service can take much of this off your hands.
- Data and key management: When you have millions of records coming in, managing how the data is stored, and making sure it’s stored under an easy-to-access key, is essential. There are quite a few ways you can use the Kafka sink connector to mutate the keys and data coming in from a Kafka instance, and those options are paramount to your ability to shape incoming data.
- Storage Management: A common use case for this setup is tracking Apache Kafka logs in OpenSearch. That can add up to a lot of data, fast! You’ll want to investigate features like automatic index rollover to make sure you don’t run out of storage.
Conclusion
There are different routes you can take to stream Apache Kafka data into systems like OpenSearch. I hope this article gave you a solid glimpse into the OpenSearch sink connector side of that equation.
Coming soon: we’ll have a new blog showing how to stream data to OpenSearch a different way, this time using the Kafka Connect API.
Ready to try this out yourself? Spin up your first OpenSearch cluster for free on the Instaclustr Managed Platform and get started today.