Technical — Kafka Technical Tuesday 18th February 2020

Migrate Your Kafka Cluster to Instaclustr With Zero Downtime

By Vinfred Solomon

Helping our customers migrate their Kafka application to Instaclustr with no downtime is a key part of our service here at Instaclustr. For Cassandra, you can read about our approach here.

For migrating Kafka cluster, many of our customers are able to migrate by a process of:

  1. Starting up a consumer group against their new Instaclustr cluster
  2. Cutting over producers to the Instaclustr cluster
  3. Running existing consumers against the old cluster until all messages are consumed and then shutting down

This approach is straightforward and works great where you do not have to copy historic data from your existing cluster and you can either live with the potential of messages being consumed out of order for a short period or a short interruption in message consumption (if you do step 1 after step 3 you keep guaranteed ordering at the expense of a processing interruption).

However, some customers need to copy all the historic data from their existing Kafka cluster as part of the migration. In this case, our go to approach for migrating Kafka is to use MirrorMaker to facilitate that copy in conjunction with a migration as outlined above.

However, there are still some scenarios where this approach of migrating Kafka cluster does not fulfill the requirements:

  • Maybe you can’t manage resetting your consumers to a new offset without a long downtime. 
  • Maybe you can’t make any application changes, and have relied on a custom partitioner that isn’t in the Java implementation, so mirrormaker can’t replicate it.
  • Maybe you are not able to tolerate any consumer or producer downtime to cutover to a new cluster.
  • Maybe your partitioning and the offsets of your partitions are built into your business logic outside Kafka.
  • Maybe your source of truth is in Kafka and keying/ordering always returning the same way is vital.

If any of these are true, you most likely want to carry out your Kafka migration by not actually migrating at all. 

So how do we achieve this migration? First we create nodes in our service to match your current Zookeeper and Kafka nodes, placed as near as possible in whichever cloud provider you choose. We then ensure that our config on these nodes matches your current setup, and then we open up network access between customer nodes and Instaclustr nodes. Next, we set the Zookeeper strings and ids to not clash and to be representative of your and our nodes, so that for a time the cluster will run both. Once you are ready, we start up the nodes on our side to join your nodes as zk followers, and the Kafka brokers as new brokers. We then ensure you can read and write to our brokers and everything looks healthy. 

Then we start moving your topics using a great open source tool by Shingo Omura (https://github.com/everpeace/kafka-reassign-optimizer). At first we move just followers, leaving leadership on your nodes to be safe as possible, while still being able to verify our nodes get your data when it is sent to your old nodes. Using the reassign optimizer gives us fine control on what is moved and how, and minimizes the partition movement between different arrangements of moving topics so that we make the absolute least moves in topic number and bytes. Once you are happy and the data flow is working as we expect we move the topics over to our brokers, including all the internal topics. 

At this point, our service is running your Kafka service with your old brokers only acting as proxies to address the topics from your old addresses. You can then work on switching over to address our nodes directly at whatever pace you wish, while we manage the cluster. Once you have stopped addressing your cluster via the old nodes they can be shut down, and we can remove the port access rules between our nodes and yours. Finally your cluster—including all your data and your exact state and history, with all ordering and the exact same offsets—are in a cluster in our service. 

Hold on a second I hear you say, MirrorMaker 2 solves most of these problems too, so why not use that?

It does, and is a magnificent tool for multi-region setups and running multiple clusters with synced data, but if you need to sync offsets you need to code it into your application or a service balancing layer to translate the offsets in each cluster. You would also still rely on a partitioner, so if you haven’t got a standard partitioner, (ruby-kafka I’m looking at you here https://github.com/zendesk/ruby-kafka/blob/master/lib/kafka/partitioner.rb), then you have to develop a custom partitioner in Kafka to replicate that too. You could also subvert the admin client protections and directly copy the consumer offsets from your source cluster to your destination cluster. But then you would need to switch consuming and producing from your old cluster to your new cluster at exactly the same time, and you have to be 100% happy with how your partitioner has sent out the data as you have removed the safeties built into the offsets matching the topics by default.

So that’s how it is a migration and not a migration: the same cluster is just being moved around different individual brokers (and zookeepers). Neat huh?

Need help with migrating Kafka cluster? Contact us now.