Over the last few months, NetApp has executed over 700 in‑place ZooKeeper‑to‑KRaft migrations on the NetApp Instaclustr Platform, with more being completed every day. These upgrades span Kafka clusters ranging from small development environments to large production deployments. The migrations were completed with no observed customer-visible application downtime, were designed to minimize manual effort and associated risk, and enable customers to leverage all the benefits of Kafka 4.x. With Kafka 4.x removing ZooKeeper entirely, operators now face a hard requirement: migrate safely or fall behind on support, security, and features. Understanding how Kafka got here, and how to cross this boundary without downtime, is critical.

History of ZooKeeper to KRaft

For over a decade, Apache ZooKeeper was inseparable from Apache Kafka. It coordinated Kafka’s distributed metadata: tracking brokers and topics, electing controllers and partition leaders, and managing cluster configuration. Every Kafka cluster depended on a ZooKeeper ensemble that administrators had to monitor and manage. But this dependency brought with it multiple issues.

In 2019, KIP-500 proposed eliminating ZooKeeper entirely, replacing it with an internal Raft-based consensus protocol called KRaft. Instead of relying on an external system, Kafka would manage its own metadata through a quorum of controllers using a log stored inside Kafka itself. Following early access in Apache Kafka 2.8, the feature reached general availability in Kafka 3.6, bringing with it numerous benefits:

  • Controller failovers that took minutes now complete in milliseconds, dramatically reducing metadata downtime during failures and keeping clients running uninterrupted
  • Simpler deployments since there is no longer a separate ZooKeeper ensemble to provision, secure, and tune
  • Partition count ceilings have risen by orders of magnitude (more than 20 times higher in our experiments)
  • Security configurations are unified under Kafka’s own authentication and authorization framework

On the NetApp Instaclustr Managed Platform, we added KRaft in preview (alongside ZooKeeper in GA) with Kafka 3.3.1, and made it available in GA with Kafka 3.6.1.

Why ZooKeeper-based clusters are now approaching end of life

With the release of Apache Kafka 4.0, ZooKeeper support was completely removed; subsequent Kafka versions run exclusively in KRaft mode, bringing further improvements including many bug fixes, increased stability, and a new consumer group rebalance protocol. Customers who want to take advantage of these capabilities and remain on a supported release will eventually need to migrate from ZooKeeper to KRaft.

Getting ready to cross the ZooKeeper boundary

Suppose you have a production ZooKeeper-based Kafka 3.6.1 cluster, and you want to reach Kafka 4.1.1 to take advantage of the latest features and continued community support.

You can’t jump directly to Kafka 4.x—it doesn’t include ZooKeeper support, so starting a 4.x broker against ZooKeeper metadata simply won’t work. The path forward requires a sequence of deliberate steps, as endorsed by the Kafka project:

  • Upgrade to Kafka 3.9.x—the last release line supporting both ZooKeeper and KRaft, making it the bridge version
  • Migrate from ZooKeeper to KRaft while running Kafka 3.9.x—this is the critical transition that we cover in more detail below
  • Upgrade to Kafka 4.1.x (and beyond)

Migrating to the future: ZooKeeper to KRaft

Once on Kafka 3.9.x, clusters reach the only point where ZooKeeper and KRaft can coexist. This is where the real migration begins—and where execution quality matters most.

The in-place migration without downtime—high-level steps

Fortunately, Apache Kafka provides a mechanism to migrate a running ZooKeeper cluster to KRaft in place, without downtime. At a high level, the process works as follows:

  1. Deploying the new KRaft quorum
  2. Setting the brokers to migration mode, restarting each broker, and loading metadata into the KRaft quorum
  3. Migrating brokers to KRaft-only mode, restarting each broker
  4. Finalizing the migration, restarting each KRaft controller and ensuring KRaft controllers are no longer in migration mode

Throughout this process, the cluster remains operational, with restarts occurring in a rolling fashion to ensure that a quorum of nodes is available at every step—resulting in no client application downtime. It is also possible, prior to finalizing the migration, to roll back the cluster to ZooKeeper mode if unexpected issues are encountered.
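As a rough illustration of step 2, migration mode is driven by broker configuration. The sketch below shows the kind of properties a Kafka 3.9.x broker carries during migration; it is not a complete configuration, and all node ids, hostnames, and listener names here are hypothetical:

```properties
# Illustrative broker overrides for ZooKeeper-to-KRaft migration mode
# (Kafka 3.9.x). Node ids, hostnames, and listeners are placeholders.

# Enable the ZooKeeper-to-KRaft migration on this broker
zookeeper.metadata.migration.enable=true

# Point the broker at the newly deployed KRaft controller quorum
controller.quorum.voters=3000@controller-0:9093,3001@controller-1:9093,3002@controller-2:9093
controller.listener.names=CONTROLLER
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT

# The existing ZooKeeper connection stays in place until the
# migration is finalized, which is what makes rollback possible
zookeeper.connect=zk-0:2181,zk-1:2181,zk-2:2181
```

Once every broker has been restarted with settings along these lines, the KRaft controllers copy the cluster metadata out of ZooKeeper; the later passes then remove the migration and ZooKeeper properties again.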

Why manual ZooKeeper-to-KRaft migration doesn’t scale

On paper, Kafka’s migration mechanism is straightforward. In practice, executing it manually at production scale is slow, error‑prone, and operationally risky.

After deploying the new KRaft quorum, we can boil down the rest of the migration (steps 2-4 above) to a series of rolling restarts. Under production load, each restart could easily take 8-10 minutes per node. As mentioned above, the migration process requires 3 separate rolling restarts across the cluster. For a smaller 9-node cluster with 3 controller nodes, that is 15 node restarts in total (2 for each of the 6 brokers and 1 per KRaft controller), or 120-150 minutes under ideal conditions.

With proper verification, this could easily extend by a couple of hours. To make matters worse, organizations might require multiple large clusters to be migrated—performing these sequentially could stretch the time required into days or even weeks.

How Instaclustr automates the migration

To make ZooKeeper to KRaft migration safe, fast, and repeatable at fleet scale, we built an automated migration system that encodes Kafka’s recommended process while significantly reducing the potential for human error.

Instaclustr’s automated migration follows a state-machine-driven process that mirrors Kafka’s endorsed steps. It includes pre- and post-migration backups, rack-aware restarts, and built-in concurrency that roughly halves migration time while greatly reducing the scope for human error.

  1. Initial backup of ZooKeeper nodes: Before the migration begins, a backup of the ZooKeeper nodes occurs to preserve existing metadata. It is retained for seven days.
  2. Deployment of the KRaft controller quorum: The deployment is tailored to your existing ZooKeeper configuration, e.g., dedicated KRaft controllers are used if you have dedicated ZooKeeper nodes.
  3. Phased broker restart into migration mode: Health checks are performed, and nodes are processed rack by rack to ensure availability. Broker metadata transfer to the KRaft quorum then commences.
  4. A rack-aware restart to move brokers into KRaft mode: Health checks are performed, and nodes are processed efficiently rack by rack, with multiple brokers within the same rack restarted concurrently. This is safe because Kafka’s rack-aware replica placement ensures no two replicas of the same partition share a rack, so at least one in-sync replica always remains available.
  5. Transition to full KRaft mode: The controllers have their migration-related properties removed and are restarted. The cluster now operates fully in KRaft mode and can no longer be reverted to ZooKeeper.
  6. ZooKeeper components are decommissioned: An additional ZooKeeper backup is performed just prior to decommissioning and retained for seven days.
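The rack-aware restart logic in steps 3 and 4 can be sketched as a simple batching function. This is an illustrative sketch, not Instaclustr’s actual implementation; the broker ids and rack labels are hypothetical:

```python
from collections import defaultdict

def rack_aware_batches(broker_racks: dict) -> list:
    """Group brokers into restart batches, one batch per rack.

    Brokers in the same rack can be restarted concurrently because
    rack-aware replica placement guarantees no two replicas of the
    same partition share a rack, so in-sync replicas in other racks
    stay available while one rack's brokers are down.
    """
    racks = defaultdict(list)
    for broker, rack in sorted(broker_racks.items()):
        racks[rack].append(broker)
    # Racks are processed sequentially; each batch restarts together.
    return [racks[r] for r in sorted(racks)]

# Hypothetical 6-broker cluster spread across 3 racks
brokers = {1: "rack-a", 2: "rack-b", 3: "rack-c",
           4: "rack-a", 5: "rack-b", 6: "rack-c"}
print(rack_aware_batches(brokers))  # [[1, 4], [2, 5], [3, 6]]
```

Each broker pass then needs only one restart slot per rack (3 here) instead of one per broker (6), which is where the time savings in the next section come from.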

Manual vs. automated: A concrete comparison

Consider the same 9-node cluster (6 brokers + 3 controllers) from earlier. In each broker pass, two nodes within the same rack can be restarted concurrently, reducing 6 sequential restarts to just 3 per pass. Across the two broker passes and one controller pass, the total drops to nine sequential restart steps, completing in approximately 90 minutes (assuming 10 minutes per restart)—far less than a manual migration, and without a single operator keystroke.

Pass                                  Manual (sequential)             Instaclustr’s automation (rack-aware concurrency)
Broker pass 1: migration mode         6 restarts × 10 min = 60 min    3 rack groups × 10 min = 30 min
Broker pass 2: KRaft-only mode        6 restarts × 10 min = 60 min    3 rack groups × 10 min = 30 min
Controller pass: finalize migration   3 restarts × 10 min = 30 min    3 restarts × 10 min = 30 min
Total                                 15 restarts, ~150 min           9 steps, ~90 min
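The table’s arithmetic can be reproduced with a small model. The figures here are the assumed numbers from the example above (10 minutes per restart, 6 brokers in 3 racks, 3 controllers), not measurements:

```python
def migration_minutes(brokers: int, racks: int, controllers: int,
                      minutes_per_restart: int = 10,
                      rack_concurrency: bool = False) -> int:
    """Estimate total migration time across the three rolling-restart passes.

    Brokers restart twice (into migration mode, then KRaft-only mode);
    controllers restart once to finalize. With rack-aware concurrency,
    each broker pass costs one restart slot per rack instead of per broker.
    """
    broker_steps_per_pass = racks if rack_concurrency else brokers
    total_steps = 2 * broker_steps_per_pass + controllers
    return total_steps * minutes_per_restart

manual = migration_minutes(6, 3, 3)                            # 15 restarts
automated = migration_minutes(6, 3, 3, rack_concurrency=True)  # 9 steps
print(manual, automated)  # 150 90
```

The gap widens with cluster size: for a 30-broker cluster in 3 racks, the same model gives 630 minutes manually versus 90 minutes with rack-aware concurrency.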

A common but riskier alternative: “lift and shift”

While some providers opt for a ‘lift and shift’ approach by creating a new KRaft cluster and migrating data, this method introduces higher operational risk, cost, and complexity. Instaclustr’s in-place migration avoids these pitfalls.

Instaclustr: Tailored migration and uncompromised safety

Instaclustr removes the risk and operational burden from the migration process by making safety, automation, and flexibility first‑class concerns—so teams can migrate with confidence, not caution. All of this is handled by our experienced operations team so that you don’t have to lift a finger during the process.

  1. Safety is built into the process
    • Automated backups of ZooKeeper data pre-migration and post-migration for both recovery and audit purposes.
    • In addition to our robust monitoring and alerting pipeline, automated retries are built in: transient failures are handled with configurable retry logic, so an operator doesn’t need to babysit the process.
    • Nodes are restarted in a rack-aware order with configurable batch sizes so that multiple racks are never simultaneously unavailable.
  2. Plan and migration flexibility
    • This migration is available across all Instaclustr plan types and cluster configurations.
    • We don’t restrict migrations to specific pre-defined hours. A maintenance window can be chosen, tailored to the needs of your clients and applications, and resumability is built into the process so that it can pause and resume according to your desired migration window.
  3. No infrastructure lock-in
    • The migration works on any Kafka cluster provisioned on the Instaclustr Managed Platform
    • There is no requirement to use a specific orchestration tool, container platform or configuration management system.

Learnings from migration at fleet scale

We’ve migrated over 700 clusters with diverse configurations. Early runs required manual intervention in ~5% of cases, which we have since reduced to ~1% through iterative improvements. These learnings informed smarter defaults, such as dynamic alert suppression windows and hung-state detection. These figures are based on internal operational data across production migrations on the Instaclustr Managed Platform.

Long-running migrations need resumability built in

Some customers restrict migrations to defined maintenance periods, which can be insufficient for larger clusters. To address this, we introduced resumability into the migration lifecycle, allowing operations to pause and safely resume (from predefined checkpoints). This also protects against transient failures—such as controller-election instability or brief external dependency outages—that would otherwise force risky manual recovery or full restarts, significantly increasing operational risk.
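One way to picture resumability: each phase of the migration is a checkpoint, and a paused or interrupted run picks up from the last completed checkpoint rather than starting over. A minimal sketch, where the phase names are hypothetical and only loosely based on the steps described above:

```python
# Strictly ordered migration phases (names are illustrative only)
PHASES = [
    "backup_zookeeper",
    "deploy_kraft_quorum",
    "brokers_migration_mode",
    "brokers_kraft_mode",
    "finalize_controllers",
    "decommission_zookeeper",
]

def resume_from(completed: list) -> list:
    """Return the phases still to run, given checkpoints already completed.

    Because phases are strictly ordered and each one is recorded as a
    checkpoint, a migration paused at the end of a maintenance window
    resumes exactly where it left off instead of from scratch.
    """
    done = set(completed)
    return [p for p in PHASES if p not in done]

# A run paused after the first broker pass resumes with the second pass
print(resume_from(["backup_zookeeper", "deploy_kraft_quorum",
                   "brokers_migration_mode"]))
```

The same mechanism absorbs transient failures: a failed phase simply re-runs from its own checkpoint rather than forcing a full restart.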

Fleet-scale execution data is essential for tuning defaults

Migration defaults cannot be set once and forgotten—they need to evolve as the fleet grows. By continuously analyzing runtime behaviour across hundreds of migrations, we were able to replace static assumptions with evidence-based tuning. For example, after more than 600 runs, we found that a fixed alert-suppression window expired too early for large clusters. We resolved this by scaling the duration linearly with cluster size, arriving at approximately 40 minutes per node as the effective default. It is this kind of iterative, data-driven refinement that keeps our defaults accurate as cluster configurations diversify.
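The tuning described above amounts to replacing a constant with a function of cluster size. A sketch: the 40-minutes-per-node figure comes from the text, while the function name and minimum-window floor are illustrative assumptions:

```python
def alert_suppression_minutes(node_count: int,
                              per_node_minutes: int = 40,
                              minimum_minutes: int = 60) -> int:
    """Scale the alert-suppression window linearly with cluster size.

    A fixed window expired too early on large clusters, so the window
    grows with the node count (~40 minutes per node); the floor keeps
    very small clusters from getting an unrealistically short window.
    """
    return max(node_count * per_node_minutes, minimum_minutes)

# The 9-node example cluster gets a 6-hour suppression window
print(alert_suppression_minutes(9))  # 360
```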

Alert suppression can hide legitimate issues

To reduce alert noise during migrations, temporary alert silencing is common—but it comes with a visibility trade-off. We’ve seen cases where long suppression windows allowed hung migrations to go unnoticed. To address this, we added hung-state detection to distinguish slow progress from stalled workflows, along with a two-tier alerting model that preserves operational visibility throughout the migration, enabling quick responses if required.
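Hung-state detection can be sketched as a staleness check on the migration’s last recorded progress event. This is an illustrative sketch, not the production implementation; the threshold value is a made-up example:

```python
import time

# Illustrative stall threshold: no progress for 30 minutes means "hung"
STALL_THRESHOLD_SECONDS = 30 * 60

def is_hung(last_progress_epoch: float, now: float = None) -> bool:
    """Flag a migration as hung if no progress has been recorded recently.

    This distinguishes a slow-but-advancing migration (recent progress
    events) from a stalled one (no progress within the threshold), so
    alert suppression cannot hide a genuinely stuck workflow.
    """
    now = time.time() if now is None else now
    return (now - last_progress_epoch) > STALL_THRESHOLD_SECONDS

# A migration whose last progress event was 45 minutes ago is flagged;
# one that progressed 10 minutes ago is merely slow.
print(is_hung(last_progress_epoch=0, now=45 * 60))  # True
print(is_hung(last_progress_epoch=0, now=10 * 60))  # False
```

In a two-tier model along these lines, routine per-node alerts stay suppressed during the window while the hung-state check escalates independently.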

Ready to migrate to KRaft?

The shift from ZooKeeper to KRaft is no longer optional—it’s the present and future of Apache Kafka. Eventually, every Kafka operator still running ZooKeeper-based clusters will need to make this transition to stay on supported versions and have access to the latest features.

If you’re already using Instaclustr for Apache Kafka, you have access to robust, battle-tested migration automation that handles the complexity of the in-place ZooKeeper to KRaft migration. Your cluster will stay online throughout, benefiting from automatic retries, rack-aware safety, and pre- and post-migration backups.

For support-only customers, our dedicated consulting team can work closely with you to provide similar automation-assisted migrations tailored to your environment, reducing the operational burden and risk of manual execution.

Teams running self‑managed ZooKeeper‑based Kafka clusters can have their workloads migrated onto the Instaclustr Managed Platform, where the transition to KRaft is handled safely and automatically.

Our support team is here to help, and you can find step‑by‑step details in our documentation. If you’d prefer personalized guidance, please get in touch to book a free consultation with one of our Kafka migration specialists.