What is Kafka monitoring?

Kafka monitoring involves continuously observing and analyzing the performance and behavior of a Kafka cluster to ensure smooth and optimal operation, especially in production environments. Key metrics include throughput, latency, consumer lag, and broker resource utilization. Monitoring is crucial for identifying and resolving issues promptly, preventing downtime, and maintaining data integrity and security.

Organizations monitor Kafka to:

  • Identify performance bottlenecks: Monitoring helps pinpoint slow consumers, overloaded brokers, or other issues impacting performance.
  • Ensure data integrity: Tracking consumer lag and other metrics helps ensure data is processed correctly and in a timely manner.
  • Prevent downtime: Proactive monitoring allows for the detection and resolution of potential problems before they lead to service disruptions.
  • Optimize resource utilization: Monitoring helps identify areas where resources can be better allocated for improved efficiency.

Editor’s note: This article has been updated to cover SLO reporting and to reflect the features and capabilities of Kafka monitoring solutions in 2026.

This is part of a series of articles about Apache Kafka.

Why is monitoring Kafka important?

Monitoring Kafka is essential for maintaining system stability, performance, and security. It enables teams to detect issues before they escalate and ensures the platform runs efficiently under varying loads.

  • Capacity planning: Tracking metrics like storage usage, message throughput, and consumer lag helps forecast future resource needs. With these insights, teams can plan infrastructure growth, scale Kafka clusters appropriately, and avoid disruptions due to resource exhaustion.
  • Performance optimization: Monitoring provides visibility into system-level metrics such as CPU load, disk I/O, and network traffic. This data is key to identifying bottlenecks and tuning configurations. For example, analyzing consumer lag allows teams to spot slow consumers and adjust consumer group settings to maintain real-time processing.
  • Efficient troubleshooting: Kafka’s distributed nature makes debugging difficult without continuous monitoring. By correlating logs and metrics, teams can pinpoint issues quickly. For instance, simultaneous drops in response rate and increased timeouts in logs may indicate a broker problem, enabling targeted investigation and faster resolution.
  • Security and compliance: Monitoring also aids in detecting abnormal activity, such as unauthorized access or unusual data flows. It helps enforce compliance by tracking data access, retention policies, and audit logs, ensuring the Kafka environment meets security and regulatory requirements.

Related content: Read our guide to Kafka management

Key Kafka metrics explained

JMX Monitoring

JMX (Java Management Extensions) is the primary interface Kafka uses to expose metrics from brokers and clients. Kafka brokers use Yammer Metrics for internal metrics, while Java clients use Kafka Metrics, both of which support JMX. These metrics can be visualized with tools like jconsole or exported to external monitoring platforms.

Key metrics:

  • MessagesInPerSec: Incoming message rate per topic or cluster-wide
  • BytesInPerSec: Bytes received from clients per topic or overall
  • BytesOutPerSec: Bytes sent to clients per topic or overall
  • RequestMetrics.RequestsPerSec: Request rate per request type and version
  • RequestMetrics.ErrorsPerSec: Error rate per request type and error code
  • BrokerTopicMetrics.FailedProduceRequestsPerSec: Failed produce request rate
  • BrokerTopicMetrics.FailedFetchRequestsPerSec: Failed fetch request rate
  • RequestQueueSize: Size of the request queue
  • LogFlushRateAndTimeMs: Log flush rate and time
  • UnderReplicatedPartitions: Number of under-replicated partitions
  • IsrShrinksPerSec / IsrExpandsPerSec: ISR shrink and expansion rates
  • records-lag-max: Max consumer lag (from client JMX)
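
On the broker side, these metrics are published as JMX MBeans. As a rough sketch, the helper below builds object names for a few of them, following the naming pattern used in the Apache Kafka documentation. The function name `broker_mbean` and the lookup table are illustrative only, and the resulting names should be checked against the Kafka version in use.

```python
# Illustrative lookup of JMX domain and type for a few broker metrics listed above.
# The pattern follows Kafka's documented MBean naming; verify it for your version.
BROKER_MBEANS = {
    "MessagesInPerSec": ("kafka.server", "BrokerTopicMetrics"),
    "BytesInPerSec": ("kafka.server", "BrokerTopicMetrics"),
    "BytesOutPerSec": ("kafka.server", "BrokerTopicMetrics"),
    "UnderReplicatedPartitions": ("kafka.server", "ReplicaManager"),
    "IsrShrinksPerSec": ("kafka.server", "ReplicaManager"),
    "RequestQueueSize": ("kafka.network", "RequestChannel"),
    "LogFlushRateAndTimeMs": ("kafka.log", "LogFlushStats"),
}


def broker_mbean(metric, topic=None):
    """Return the JMX object name for a broker metric (hypothetical helper)."""
    domain, mbean_type = BROKER_MBEANS[metric]
    name = f"{domain}:type={mbean_type},name={metric}"
    # Only the per-topic metrics accept a topic qualifier.
    if topic is not None and mbean_type == "BrokerTopicMetrics":
        name += f",topic={topic}"
    return name
```

A JMX exporter or a tool such as jconsole can then be pointed at these object names to read the values.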

Tiered storage monitoring

Tiered storage allows Kafka to offload older log segments to external storage, reducing local disk usage. Monitoring this feature ensures timely data movement and highlights any issues with fetch or copy operations between local and remote tiers.

Key metrics:

  • RemoteFetchBytesPerSec: Bytes fetched from remote storage per topic
  • RemoteCopyBytesPerSec: Bytes written to remote storage per topic
  • RemoteFetchRequestsPerSec: Read request rate to remote storage
  • RemoteCopyRequestsPerSec: Write request rate to remote storage
  • RemoteCopyLagBytes: Bytes not yet tiered to remote storage
  • RemoteDeleteLagBytes: Tiered bytes pending deletion
  • RemoteLogSizeBytes: Total size of remote log
  • RemoteLogMetadataCount: Count of metadata entries for remote storage
  • RemoteLogReaderTaskQueueSize: Queue size of remote read tasks
  • RemoteLogManagerTasksAvgIdlePercent: Idle time of tiering thread pool
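
One way to reason about RemoteCopyLagBytes is to look at its trend rather than its absolute value. The sketch below estimates how long the tiering backlog would take to clear from a series of samples. It assumes the backlog changes roughly linearly between the first and last sample, and the function name is hypothetical.

```python
def copy_lag_drain_seconds(samples):
    """Estimate seconds until the RemoteCopyLagBytes backlog reaches zero.

    samples: list of (timestamp_seconds, lag_bytes) tuples, oldest first.
    Returns 0.0 if there is no backlog and None if the backlog is not shrinking.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, lag0), (t1, lag1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    if lag1 <= 0:
        return 0.0
    # Assumption: a straight-line trend between the first and last sample.
    slope = (lag1 - lag0) / (t1 - t0)
    if slope >= 0:
        return None  # backlog is flat or growing, so it will not drain
    return lag1 / -slope
```

A `None` result is the interesting case for alerting, since it means copy operations are not keeping up with newly eligible segments.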

KRaft monitoring

KRaft (Kafka Raft Metadata Mode) replaces ZooKeeper in newer Kafka versions. Monitoring KRaft helps track metadata replication, controller state, quorum health, and election behavior.

Key metrics:

  • raft-metrics.CurrentState: Role of the node (e.g., leader, follower)
  • raft-metrics.CurrentLeader: ID of the current quorum leader
  • raft-metrics.HighWatermark: Quorum high watermark offset
  • raft-metrics.AppendRecordsRate: Record append rate
  • MetadataLoader.CurrentMetadataVersion: Active metadata version
  • SnapshotEmitter.LatestSnapshotGeneratedBytes: Size of latest metadata snapshot
  • KafkaController.ActiveControllerCount: Number of active controllers
  • KafkaController.FencedBrokerCount: Number of fenced brokers
  • KafkaController.MetadataErrorCount: Count of metadata processing errors
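
As an illustration of how these controller metrics can be combined, the following sketch flags common quorum problems from values that have already been collected. It assumes ActiveControllerCount has been read from every controller node; the function name and message wording are hypothetical.

```python
def controller_quorum_issues(active_controller_count_by_node,
                             fenced_broker_count,
                             metadata_error_count):
    """Return a list of human-readable issues found in sampled KRaft metrics."""
    issues = []
    # Exactly one node in the cluster should report itself as the active controller.
    active = sum(active_controller_count_by_node.values())
    if active == 0:
        issues.append("no active controller")
    elif active > 1:
        issues.append(f"{active} nodes report an active controller")
    if fenced_broker_count > 0:
        issues.append(f"{fenced_broker_count} fenced broker(s)")
    if metadata_error_count > 0:
        issues.append(f"{metadata_error_count} metadata error(s)")
    return issues
```

An empty list means none of the three checks fired. It does not prove the quorum is healthy in every other respect.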

Selector monitoring

Selector metrics help monitor I/O activity in Kafka clients and workers. These include network readiness checks and time spent in I/O operations.

Key metrics:

  • select-rate: Number of I/O select calls per second
  • select-total: Total I/O select calls
  • io-wait-time-ns-avg: Average time waiting for I/O readiness
  • io-wait-ratio: Fraction of time spent waiting for I/O
  • io-time-ns-avg: Average I/O time per select call
  • io-ratio: Fraction of time spent on actual I/O work
  • connection-count: Current number of active connections

Common node monitoring

Node-level metrics track client interactions with specific Kafka broker nodes. These metrics offer insight into per-node request volume, data transfer, and latency.

Key metrics:

  • outgoing-byte-rate: Average outgoing bytes per second for a node
  • incoming-byte-rate: Average incoming bytes per second for a node
  • request-rate: Request rate per node
  • request-size-avg: Average request size per node
  • request-latency-avg: Average latency of requests per node
  • response-rate: Response rate per node
  • connection-close-rate: Rate of connection closures

Producer monitoring

Producer monitoring tracks how clients produce data, including buffering behavior, error rates, retries, and request latencies. These metrics help identify issues like buffer exhaustion or high retry volumes.

Key metrics:

  • record-send-rate: Records sent per second
  • record-error-rate: Error rate of record sends
  • record-retry-rate: Retry rate of record sends
  • requests-in-flight: In-flight produce requests
  • buffer-available-bytes: Available buffer memory
  • batch-size-avg: Average batch size in bytes
  • produce-throttle-time-avg: Average broker throttle time for producers
  • record-queue-time-avg: Time records wait in the send buffer

Consumer monitoring

Consumer metrics track how data is fetched and committed by clients. They include polling behavior, fetch rates, consumer lag, and group coordination performance.

Key metrics:

  • records-consumed-rate: Number of records consumed per second
  • records-lag-max: Maximum lag in records
  • fetch-latency-avg: Average latency for fetch requests
  • fetch-size-avg: Average fetch size
  • commit-rate: Rate of offset commits
  • rebalance-latency-avg: Time taken to rebalance
  • assigned-partitions: Number of partitions currently assigned
  • heartbeat-rate: Heartbeats per second sent to the group coordinator
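
Consumer lag can also be computed outside the client, by comparing each partition's log end offset with the group's committed offset. The sketch below shows that arithmetic on plain dictionaries. How the offsets are fetched is left out, and the function names are hypothetical. Note that records-lag-max is reported by the consumer itself, so the two figures may differ slightly.

```python
def partition_lag(end_offsets, committed_offsets):
    """Compute per-partition lag as end offset minus committed offset.

    Both arguments map (topic, partition) to an offset. Partitions without
    a committed offset are reported as None rather than guessed.
    """
    lags = {}
    for tp, end in end_offsets.items():
        committed = committed_offsets.get(tp)
        lags[tp] = None if committed is None else max(end - committed, 0)
    return lags


def max_lag(lags):
    """Return the largest known lag, or None if no partition has a known lag."""
    known = [lag for lag in lags.values() if lag is not None]
    return max(known) if known else None
```

For example, a partition with end offset 120 and committed offset 100 has a lag of 20 records.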

Connect monitoring

Kafka Connect exposes metrics for worker-level operations, connectors, and individual tasks. These help monitor task lifecycle, rebalance events, and error handling.

Key metrics:

  • connector-count: Number of active connectors
  • task-count: Number of active tasks
  • rebalance-avg-time-ms: Average rebalance time
  • offset-commit-avg-time-ms: Average time to commit offsets
  • sink-record-lag-max: Max lag between consumer position and sink processing
  • sink-record-read-rate: Rate of records read from Kafka
  • source-record-write-rate: Rate of records written to Kafka by source connectors
  • deadletterqueue-produce-failures: Failed writes to dead-letter queue
  • total-record-errors: Number of record-level processing errors

Alerting and SLO-based monitoring

Setting up alerts and service level objectives (SLOs) ensures that issues in Kafka are detected early and resolved before they impact users or downstream systems. Instead of monitoring every metric, focus on those that indicate degraded service, data loss risk, or resource exhaustion.

Key areas to alert on:

  • Availability: Alert if UnderReplicatedPartitions is greater than 0 or if IsrShrinksPerSec spikes. These indicate replication issues that can lead to data unavailability.
  • Durability: Track LogFlushRateAndTimeMs and RemoteCopyLagBytes. Long delays in flushing logs or tiering data can increase the risk of data loss.
  • Throughput: Watch BytesInPerSec, BytesOutPerSec, and request rates. Sudden drops may signal client issues or bottlenecks.
  • Latency: Use metrics like request-latency-avg and fetch-latency-avg to catch rising response times. High latency often precedes timeouts or client failures.
  • Errors: Alert on ErrorsPerSec, FailedProduceRequestsPerSec, and record-error-rate. Persistent errors suggest broken producers, client misconfigurations, or broker instability.
  • Consumer lag: records-lag-max is critical for detecting slow consumers. Alert if lag grows continuously without reduction.
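
The "grows continuously without reduction" condition for consumer lag can be expressed as a check over recent samples. This is a minimal sketch; the window of five samples is an arbitrary assumption and the function name is hypothetical.

```python
def lag_growing(samples, min_samples=5):
    """Return True if the last `min_samples` lag readings strictly increase.

    samples: list of records-lag-max readings, oldest first.
    """
    if len(samples) < min_samples:
        return False  # not enough history to judge a trend
    recent = samples[-min_samples:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

A single dip inside the window resets the condition, which keeps short bursts from paging anyone.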

Learn more in our detailed guide to Apache Kafka cluster

Tips from the expert

Andrew Mills

Senior Solution Architect

Andrew Mills is an industry leader with extensive experience in open source data solutions and a proven track record in integrating and managing Apache Kafka and other event-driven architectures.

In my experience, here are tips that can help you better monitor Apache Kafka:

  • Monitor key broker metrics for cluster health: Keep a close eye on broker-level metrics such as CPU usage, disk I/O, and network throughput. Pay special attention to under-replicated partitions and offline partitions, as they indicate potential issues with data replication and availability.
  • Track consumer lag for performance insights: Consumer lag is a critical metric that shows the delay between message production and consumption. High lag can indicate slow consumers or bottlenecks in processing. Use Kafka’s built-in tools or the monitoring provided by a managed service like Instaclustr to track consumer group offsets and ensure consumers are keeping up with producers.
  • Track network-level congestion and TCP retransmissions: Kafka is sensitive to network performance. Monitoring packet drops, retransmissions, and interface queue lengths helps identify issues like overloaded NICs or faulty firewalls that impair broker communication.
  • Leverage end-to-end monitoring for data flow visibility: Monitor the entire data pipeline, from producers to brokers to consumers, to identify bottlenecks or failures at any stage. Where Kafka Connect is part of the pipeline, use its metrics to track the performance of connectors.

Notable Kafka monitoring tools

1. NetApp Instaclustr

Instaclustr for Apache Kafka includes top-tier monitoring capabilities that simplify managing Kafka clusters, ensuring streaming data pipelines perform at their best with minimal effort.

Instaclustr for Kafka monitoring provides real-time visibility into the health and performance of clusters and proactively identifies potential issues before they escalate. Key metrics, such as throughput, partition distribution, consumer lag, and broker health, are tracked using advanced monitoring tools, providing insights needed to make data-driven decisions with confidence. Instaclustr’s monitoring capabilities make navigating Kafka’s intricate architecture straightforward, reducing downtime and keeping applications running smoothly.

From automated platform alerts to detailed reporting, every feature is crafted to simplify workflows while maximizing Kafka’s potential. This empowers teams to focus on innovation rather than troubleshooting.

License: Apache-2.0
Repo: https://github.com/instaclustr

Key features include:

  • User-friendly dashboard: Provides a ready-to-use dashboard that displays Kafka metrics from all major components (brokers, producers, consumers, KRaft, and Kafka Connect) along with general node metrics for cluster health, such as CPU, disk, and memory usage.
  • Automated monitoring and alerts: Stay ahead of potential problems with real-time monitoring and automated alerts that help ensure the reliability of your deployments.
  • Customized scaling: Easily scale your Kafka clusters to match your business needs, handling increasing workloads effortlessly.
  • Comprehensive reporting: Access detailed performance reports to analyze and optimize your Kafka workloads effectively.
  • Flexible integrations: Instaclustr’s Monitoring API lets you integrate monitoring information from your Instaclustr managed cluster with the monitoring tool used for the entire application, such as Prometheus or Datadog.
  • Fully managed service: Instaclustr takes care of the entire Kafka operation, from provisioning to maintenance, ensuring a seamless experience with minimal downtime.

Instaclustr dashboard screenshot

2. Prometheus

Prometheus is an open-source monitoring and alerting toolkit for collecting and storing time series metrics. It uses a pull-based model to gather metrics over HTTP and provides a query language called PromQL for analyzing collected data. Prometheus is commonly used for infrastructure, application, and microservices monitoring, including Kafka environments.

License: Apache-2.0
Repo: https://github.com/prometheus/prometheus

Key features include:

  • Time series data collection: Stores metrics as time series data with timestamps and optional key-value labels for flexible querying.
  • PromQL query language: Provides a query language for filtering, aggregating, and analyzing metrics across distributed systems.
  • Pull-based monitoring model: Collects metrics by scraping HTTP endpoints exposed by monitored services.
  • Service discovery support: Automatically discovers monitoring targets through service discovery integrations or static configuration.
  • Standalone architecture: Each Prometheus server operates independently without relying on distributed storage systems.
  • Alerting capabilities: Integrates with Alertmanager to generate and route alerts based on metric conditions.

Prometheus dashboard screenshot

Source: Prometheus

3. CMAK (Cluster Manager for Apache Kafka)

CMAK (Cluster Manager for Apache Kafka), previously known as Kafka Manager, is an open-source tool for managing and monitoring Apache Kafka clusters. It provides a web-based interface for inspecting cluster health, managing topics and partitions, monitoring brokers and consumers, and performing operational tasks such as replica elections and partition reassignment.

License: Apache-2.0
Repo: https://github.com/yahoo/CMAK

Key features include:

  • Multi-cluster management: Supports administration and monitoring of multiple Kafka clusters from a single interface.
  • Cluster state inspection: Displays information about brokers, topics, consumers, offsets, partition distribution, and replica assignments.
  • Topic management: Allows users to create, update, delete, and expand Kafka topics.
  • Partition reassignment tools: Generates partition assignments and supports partition reassignment operations across brokers.
  • Preferred replica election: Enables execution of preferred replica elections to rebalance leadership across brokers.
  • Batch partition operations: Supports bulk partition assignment and reassignment tasks for multiple topics.

CMAK dashboard screenshot

Source: CMAK

4. Burrow

Burrow is an open-source Kafka monitoring tool focused on consumer lag analysis. Developed by LinkedIn, it evaluates Kafka consumer health by monitoring committed offsets and calculating consumer status dynamically without requiring manually defined lag thresholds. Burrow exposes monitoring data through HTTP endpoints and supports configurable alerting integrations.

License: Apache-2.0
Repo: https://github.com/linkedin/Burrow

Key features include:

  • Consumer lag monitoring: Tracks Kafka consumer lag using committed offsets from Kafka topics.
  • Threshold-free evaluation model: Uses sliding window analysis instead of fixed alert thresholds to evaluate consumer health.
  • Multiple cluster support: Monitors multiple Kafka clusters within a single deployment.
  • Automatic consumer monitoring: Detects and monitors Kafka consumers automatically.
  • HTTP monitoring API: Provides HTTP endpoints for consumer group status and Kafka cluster information.
  • Alert notification support: Supports configurable notifications through email and HTTP integrations.
  • ZooKeeper offset support: Includes optional monitoring for ZooKeeper-committed offsets.

5. Datadog

Datadog provides monitoring and observability tools for Kafka environments through dashboards, metrics collection, alerting, and data stream monitoring. The platform tracks metrics across Kafka brokers, producers, consumers, JVM processes, and ZooKeeper to help identify bottlenecks, latency issues, replication problems, and throughput trends.

License: Commercial

Key features include:

  • Kafka performance dashboards: Provides prebuilt dashboards for visualizing Kafka metrics across brokers, consumers, producers, and ZooKeeper.
  • Broker health monitoring: Tracks broker metrics such as leader elections, network throughput, replication health, and offline partitions.
  • Consumer lag tracking: Monitors consumer lag by group and tracks consumption throughput over time.
  • Producer performance metrics: Measures request rates, latency, throughput, and producer I/O wait times.
  • JVM monitoring: Tracks Java garbage collection metrics and JVM performance for Kafka brokers.
  • ZooKeeper monitoring: Collects metrics related to ZooKeeper latency, active connections, outstanding requests, and synchronization activity.

Datadog dashboard screenshot

Source: Datadog

Best practices for effective Kafka monitoring

Here are some monitoring best practices to consider when using Apache Kafka.

1. Define an essential metrics set aligned with SLOs/SLAs

Kafka emits hundreds of metrics, but not all are critical. Begin by identifying a core set that directly maps to business goals and operational commitments. For example, if the SLO guarantees delivery within five seconds, then consumer lag, end-to-end latency, and throughput metrics are essential.
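
To make the five-second delivery example concrete, the sketch below computes SLO compliance and the remaining error budget from measured end-to-end latencies. The 99% target, the function names, and the way latencies are collected are all assumptions for illustration.

```python
def slo_compliance(latencies_s, threshold_s=5.0):
    """Fraction of messages delivered within threshold_s, or None if no data."""
    if not latencies_s:
        return None
    good = sum(1 for latency in latencies_s if latency <= threshold_s)
    return good / len(latencies_s)


def error_budget_remaining(compliance, target=0.99):
    """Share of the error budget still unused (negative means the SLO is breached)."""
    allowed_failures = 1 - target
    actual_failures = 1 - compliance
    return 1 - actual_failures / allowed_failures
```

With a 99% target, a measured compliance of 99.5% leaves about half of the error budget unspent.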

Include indicators of health for key components—such as under-replicated partitions (brokers), error rate (producers), and commit rate (consumers). Use dimensioned metrics (tagged by topic, partition, or client ID) to allow granular filtering. Custom metrics, like event processing latency from consumer applications, can also be added to align monitoring with application-level objectives.

This targeted approach prevents data overload and ensures monitoring efforts remain focused on what matters most to system reliability and customer impact.

2. Set meaningful alert thresholds

Alerts must be both timely and actionable. Set thresholds based on the service behavior under normal and degraded conditions. For instance, trigger an alert only if consumer lag exceeds a predefined threshold for more than 5 minutes, rather than on every spike.
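
The "exceeds a threshold for more than 5 minutes" rule can be sketched as follows. The function name is hypothetical, and the samples are assumed to arrive as timestamped values sorted by time.

```python
def sustained_breach(samples, threshold, duration_s=300):
    """Return True if values stay above threshold for at least duration_s.

    samples: list of (timestamp_seconds, value) tuples, oldest first.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # first sample of a new breach
            if timestamp - breach_start >= duration_s:
                return True
        else:
            breach_start = None  # any recovery resets the timer
    return False
```

Most alerting systems offer an equivalent "for" or "duration" setting, so in practice this logic usually lives in the alert rule rather than in custom code.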

Use dynamic thresholds where possible, such as those based on statistical baselines (e.g., 95th percentile latency) or moving averages. Prioritize alert severity based on business impact: use warnings for early detection and critical alerts when SLAs are at risk.
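
A percentile-based threshold might look like the sketch below, which derives an alert level from recent history using the nearest-rank method. The 20% safety margin and both function names are assumptions, not recommended values.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile (p between 0 and 100) of a non-empty list."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def dynamic_threshold(history, p=95, margin=1.2):
    """Alert threshold set at the p-th percentile of history times a margin."""
    return percentile(history, p) * margin
```

Recomputing the threshold on a schedule lets it follow gradual changes in normal behavior.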

Group alerts by component to reduce noise. For example, if multiple brokers report errors, consolidate them into a single incident. Regularly review and tune thresholds to prevent alert fatigue and ensure incidents are meaningful.

3. Use historical baselines for anomaly detection and capacity planning

Establish historical baselines by collecting time-series data over weeks or months. This allows admins to define what “normal” looks like for metrics such as throughput, lag, and broker CPU usage. Use this baseline to detect anomalies—like a sudden drop in consumer fetch rate—which might not breach static thresholds but still indicate issues.
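
One simple way to compare a current reading with a baseline is a z-score, sketched below. The cutoff of three standard deviations is a common convention rather than a rule, and the function name is hypothetical.

```python
import statistics


def is_anomalous(baseline, current, z_threshold=3.0):
    """Return True if current deviates from the baseline mean by more than
    z_threshold population standard deviations."""
    mean = statistics.fmean(baseline)
    stdev = statistics.pstdev(baseline)
    if stdev == 0:
        return current != mean  # a perfectly flat baseline: any change stands out
    return abs(current - mean) / stdev > z_threshold
```

Metrics with strong daily or weekly cycles usually need a baseline drawn from the same time of day or day of week, which this sketch does not handle.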

For capacity planning, track trends in disk usage, topic growth, and message rates. Analyze peak loads and growth curves to predict when infrastructure will need to scale. This approach supports proactive planning and helps avoid last-minute outages due to resource exhaustion.
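
As a worked example of trend-based planning, the sketch below fits a least-squares line to daily disk usage and projects when capacity would be reached. It assumes one evenly spaced sample per day and linear growth, and the function name is hypothetical.

```python
def days_until_full(daily_usage_bytes, capacity_bytes):
    """Project days until disk usage reaches capacity using a linear fit.

    daily_usage_bytes: one reading per day, oldest first.
    Returns None if usage is flat or shrinking.
    """
    n = len(daily_usage_bytes)
    if n < 2:
        raise ValueError("need at least two daily readings")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage_bytes) / n
    # Ordinary least-squares slope, in bytes per day.
    numerator = sum((x - mean_x) * (y - mean_y)
                    for x, y in enumerate(daily_usage_bytes))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = numerator / denominator
    if slope <= 0:
        return None
    return (capacity_bytes - daily_usage_bytes[-1]) / slope
```

For instance, usage growing by 10 units a day with 70 units of headroom gives a projection of 7 days.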

Baselines are also useful in evaluating the impact of application deployments or configuration changes, enabling safer rollouts and performance tuning.

4. Implement real-time alerting

Kafka systems often require quick responses to prevent data loss or processing delays. Implement real-time alerting using metric collectors with short collection intervals (e.g., Prometheus scraping JMX exporters). Configure alerts to trigger within seconds of detecting anomalies.

Integrate these alerts with on-call systems like PagerDuty or Slack, ensuring that critical information—such as broker ID, topic name, and exact metric value—is included. Real-time dashboards should support drill-down from high-level alerts to detailed metrics and logs for fast diagnosis.

Run synthetic checks (e.g., produce-consume tests) at regular intervals and alert on failures to detect issues not captured by native Kafka metrics.
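
A produce-consume check can be sketched without committing to a particular client library. In the example below, `produce` and `consume` are hypothetical callables that the caller wires to a real Kafka producer and consumer on a dedicated test topic; only the round-trip logic is shown.

```python
import time
import uuid


def synthetic_check(produce, consume, timeout_s=5.0):
    """Send a unique canary message and wait for it to come back.

    produce(payload) sends one message; consume(timeout_s) returns the next
    payload or None. Both are supplied by the caller (assumed interfaces).
    """
    payload = f"canary-{uuid.uuid4()}"
    start = time.monotonic()
    produce(payload)
    while True:
        remaining = timeout_s - (time.monotonic() - start)
        if remaining <= 0:
            return {"ok": False, "latency_s": None}
        # Ignore unrelated or stale messages and keep polling until timeout.
        if consume(remaining) == payload:
            return {"ok": True, "latency_s": time.monotonic() - start}
```

The returned latency doubles as an end-to-end measurement that can be recorded alongside broker metrics.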

5. Automate periodic health checks

In addition to reactive alerting, automate regular health checks that validate Kafka’s operational integrity. These can include:

  • Verifying that all partitions have leaders and replicas are in sync
  • Checking that consumer groups are committing offsets regularly
  • Ensuring no broker is overwhelmed or isolated
  • Running produce-consume tests to validate end-to-end message flow

Schedule these checks using cron jobs, monitoring frameworks, or CI/CD tools. Surface the results in dashboards and integrate failures with ticketing systems to enable tracking and resolution.
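
The first check in the list above, leaders and in-sync replicas, could be sketched like this. The input shape (a list of dictionaries with `topic`, `partition`, `leader`, `replicas`, and `isr` keys) is an assumption about how partition metadata has been collected, not the format of any specific tool.

```python
def partition_health(partitions):
    """Return (partition_name, problem) tuples for unhealthy partitions.

    partitions: list of dicts with keys topic, partition, leader (broker ID
    or None), replicas (list of broker IDs), and isr (list of broker IDs).
    """
    problems = []
    for info in partitions:
        name = f"{info['topic']}-{info['partition']}"
        leader = info["leader"]
        if leader is None or leader < 0:
            problems.append((name, "no leader"))
        if len(info["isr"]) < len(info["replicas"]):
            problems.append((name, "under-replicated"))
    return problems
```

A scheduled job can run this against fresh metadata and open a ticket whenever the returned list is not empty.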

Automated health checks provide an added layer of defense, catching slow-developing problems before they impact production workflows.

Conclusion

Effective Kafka monitoring is critical for maintaining the performance, reliability, and security of streaming data pipelines. A well-designed monitoring strategy ensures early detection of issues, supports capacity planning, and helps maintain service-level objectives by providing real-time visibility into system behavior. By focusing on key metrics, implementing meaningful alerts, leveraging historical baselines, and automating health checks, organizations can proactively manage Kafka infrastructure and deliver robust, scalable data processing systems.