Kafka monitoring: Key metrics and 5 tools to know in 2026

What is Kafka monitoring?

Kafka monitoring involves continuously tracking cluster health, throughput, and latency to prevent data loss, broker overloads, and pipeline bottlenecks. Key metrics are exposed natively via JMX, and the standard observability stack combines the Prometheus JMX Exporter with Grafana for visualization, alongside dedicated tools for tracking consumer lag. In production environments, monitoring is crucial for identifying and resolving issues promptly, preventing downtime, and maintaining data integrity and security.

Effective Kafka observability focuses on a few core areas of cluster operations:

Consumer lag: The difference between the latest produced offset and the consumed offset. High lag means downstream systems are falling behind.
Under-replicated partitions: The number of partition replicas not in sync with the leader. A non-zero value often indicates broker failure.
Request handler and network idle time: The percentage of time broker threads sit idle. Low idle time points to an overloaded cluster.
Offline partitions: The number of partitions without an active leader, which renders that data inaccessible.
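The consumer lag definition above reduces to a subtraction per partition. The following is a minimal sketch with made-up offsets rather than values read from a live cluster; the function name and the choice to treat a missing committed offset as zero are assumptions for illustration.

```python
from typing import Dict


def partition_lag(log_end_offsets: Dict[int, int],
                  committed_offsets: Dict[int, int]) -> Dict[int, int]:
    """Return per-partition lag: latest produced offset minus committed offset."""
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        # Assumption: a partition with no committed offset counts from offset 0.
        committed = committed_offsets.get(partition, 0)
        # Clamp at zero: the two offsets are sampled at different moments, so
        # the committed offset can briefly appear ahead of the sampled end offset.
        lag[partition] = max(end_offset - committed, 0)
    return lag


# Hypothetical offsets for a three-partition topic.
end_offsets = {0: 1500, 1: 980, 2: 2210}
committed = {0: 1490, 1: 980, 2: 1710}

lags = partition_lag(end_offsets, committed)
total_lag = sum(lags.values())
print(lags, total_lag)
```

Summing across partitions gives a single group-level number, but the per-partition view is usually what reveals a single stuck consumer.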

Common monitoring setups include:

NetApp Instaclustr: A fully managed service with built-in dashboards and automated alerts across brokers, producers, consumers, KRaft, and Kafka Connect.
Prometheus and Grafana: The open-source standard, using exporters to pull JMX metrics and community dashboards to visualize them.
CMAK: A web-based manager for inspecting cluster health and managing topics, partitions, and brokers across multiple clusters.
Burrow by LinkedIn: A dedicated, open-source consumer lag checker that evaluates consumer health without hardcoded offset thresholds.
Datadog: A commercial platform with prebuilt Kafka dashboards, alerting, and tracing across brokers, producers, and consumers.

Editor’s note: This article has been updated to cover SLO reporting and to reflect the features and capabilities of Kafka monitoring solutions in 2026.

This is part of a series of articles about Apache Kafka

Why is monitoring Kafka important?

Monitoring Kafka is essential for maintaining system stability, performance, and security. It enables teams to detect issues before they escalate and ensures the platform runs efficiently under varying loads.

Capacity planning: Tracking metrics like storage usage, message throughput, and consumer lag helps forecast future resource needs. With these insights, teams can plan infrastructure growth, scale Kafka clusters appropriately, and avoid disruptions due to resource exhaustion.
Performance optimization: Monitoring provides visibility into system-level metrics such as CPU load, disk I/O, and network traffic. This data is key to identifying bottlenecks and tuning configurations. For example, analyzing consumer lag allows teams to spot slow consumers and adjust consumer group settings to maintain real-time processing.
Efficient troubleshooting: Kafka’s distributed nature makes debugging difficult without continuous monitoring. By correlating logs and metrics, teams can pinpoint issues quickly. For instance, simultaneous drops in response rate and increased timeouts in logs may indicate a broker problem, enabling targeted investigation and faster resolution.
Security and compliance: Monitoring also aids in detecting abnormal activity, such as unauthorized access or unusual data flows. It helps enforce compliance by tracking data access, retention policies, and audit logs, ensuring the Kafka environment meets security and regulatory requirements.

Related content: Read our guide to Kafka management

Key Kafka metrics explained

JMX Monitoring

JMX (Java Management Extensions) is the primary interface Kafka uses to expose metrics from brokers and clients. Kafka brokers use Yammer Metrics for internal metrics, while Java clients use Kafka Metrics, both of which support JMX. These metrics can be visualized with tools like jconsole or exported to external monitoring platforms.

Key metrics:

MessagesInPerSec: Incoming message rate per topic or cluster-wide
BytesInPerSec: Bytes received from clients per topic or overall
BytesOutPerSec: Bytes sent to clients per topic or overall
RequestMetrics.RequestsPerSec: Request rate per request type and version
RequestMetrics.ErrorsPerSec: Error rate per request type and error code
BrokerTopicMetrics.FailedProduceRequestsPerSec: Failed produce request rate
BrokerTopicMetrics.FailedFetchRequestsPerSec: Failed fetch request rate
RequestQueueSize: Size of the request queue
LogFlushRateAndTimeMs: Log flush rate and time
UnderReplicatedPartitions: Number of under-replicated partitions
IsrShrinksPerSec / IsrExpandsPerSec: ISR shrink and expansion rates
records-lag-max: Max consumer lag (from client JMX)
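When wiring these metrics into an exporter or jconsole, it helps to know the JMX object names behind the short names in the list. The sketch below builds object names following the patterns commonly documented for Kafka brokers (`kafka.server:type=BrokerTopicMetrics`, `kafka.network:type=RequestMetrics`, `kafka.server:type=ReplicaManager`). The helper names and the `orders` topic are hypothetical, and the patterns should be verified against your broker version.

```python
from typing import Optional


def broker_topic_mbean(name: str, topic: Optional[str] = None) -> str:
    """Object name for a BrokerTopicMetrics meter, cluster-wide or per topic."""
    base = f"kafka.server:type=BrokerTopicMetrics,name={name}"
    # Adding a topic key narrows the meter to a single topic.
    return f"{base},topic={topic}" if topic else base


def request_mbean(name: str, request: str) -> str:
    """Object name for a RequestMetrics meter of one request type."""
    return f"kafka.network:type=RequestMetrics,name={name},request={request}"


# Gauge for under-replicated partitions on a broker.
UNDER_REPLICATED = "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"

print(broker_topic_mbean("MessagesInPerSec"))
print(broker_topic_mbean("BytesInPerSec", topic="orders"))  # "orders" is a made-up topic
print(request_mbean("RequestsPerSec", "Produce"))
print(UNDER_REPLICATED)
```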

Tiered storage monitoring

Tiered storage allows Kafka to offload older log segments to external storage, reducing local disk usage. Monitoring this feature ensures timely data movement and highlights any issues with fetch or copy operations between local and remote tiers.

Key metrics:

RemoteFetchBytesPerSec: Bytes fetched from remote storage per topic
RemoteCopyBytesPerSec: Bytes written to remote storage per topic
RemoteFetchRequestsPerSec: Read request rate to remote storage
RemoteCopyRequestsPerSec: Write request rate to remote storage
RemoteCopyLagBytes: Bytes not yet tiered to remote storage
RemoteDeleteLagBytes: Tiered bytes pending deletion
RemoteLogSizeBytes: Total size of remote log
RemoteLogMetadataCount: Count of metadata entries for remote storage
RemoteLogReaderTaskQueueSize: Queue size of remote read tasks
RemoteLogManagerTasksAvgIdlePercent: Idle time of tiering thread pool

KRaft monitoring

KRaft (Kafka Raft Metadata Mode) replaces ZooKeeper in newer Kafka versions. Monitoring KRaft helps track metadata replication, controller state, quorum health, and election behavior.

Key metrics:

raft-metrics.CurrentState: Role of the node (e.g., leader, follower)
raft-metrics.CurrentLeader: ID of the current quorum leader
raft-metrics.HighWatermark: Quorum high watermark offset
raft-metrics.AppendRecordsRate: Record append rate
MetadataLoader.CurrentMetadataVersion: Active metadata version
SnapshotEmitter.LatestSnapshotGeneratedBytes: Size of latest metadata snapshot
KafkaController.ActiveControllerCount: Number of active controllers
KafkaController.FencedBrokerCount: Number of fenced brokers
KafkaController.MetadataErrorCount: Count of metadata processing errors

Selector monitoring

Selector metrics help monitor I/O activity in Kafka clients and workers. These include network readiness checks and time spent in I/O operations.

Key metrics:

select-rate: Number of I/O select calls per second
select-total: Total I/O select calls
io-wait-time-ns-avg: Average time waiting for I/O readiness
io-wait-ratio: Fraction of time spent waiting for I/O
io-time-ns-avg: Average I/O time per select call
io-ratio: Fraction of time spent on actual I/O work
connection-count: Current number of active connections

Common node monitoring

Node-level metrics track client interactions with specific Kafka broker nodes. These metrics offer insight into per-node request volume, data transfer, and latency.

Key metrics:

outgoing-byte-rate: Average outgoing bytes per second for a node
incoming-byte-rate: Average incoming bytes per second for a node
request-rate: Request rate per node
request-size-avg: Average request size per node
request-latency-avg: Average latency of requests per node
response-rate: Response rate per node
connection-close-rate: Rate of connection closures

Producer monitoring

Producer monitoring tracks how clients produce data, including buffering behavior, error rates, retries, and request latencies. These metrics help identify issues like buffer exhaustion or high retry volumes.

Key metrics:

record-send-rate: Records sent per second
record-error-rate: Error rate of record sends
record-retry-rate: Retry rate of record sends
requests-in-flight: In-flight produce requests
buffer-available-bytes: Available buffer memory
batch-size-avg: Average batch size in bytes
produce-throttle-time-avg: Average broker throttle time for producers
record-queue-time-avg: Time records wait in the send buffer

Consumer monitoring

Consumer metrics track how data is fetched and committed by clients. They include polling behavior, fetch rates, consumer lag, and group coordination performance.

Key metrics:

records-consumed-rate: Number of records consumed per second
records-lag-max: Maximum lag in records
fetch-latency-avg: Average latency for fetch requests
fetch-size-avg: Average fetch size
commit-rate: Rate of offset commits
rebalance-latency-avg: Time taken to rebalance
assigned-partitions: Number of partitions currently assigned
heartbeat-rate: Heartbeats per second sent to the group coordinator

Connect monitoring

Kafka Connect exposes metrics for worker-level operations, connectors, and individual tasks. These help monitor task lifecycle, rebalance events, and error handling.

Key metrics:

connector-count: Number of active connectors
task-count: Number of active tasks
rebalance-avg-time-ms: Average rebalance time
offset-commit-avg-time-ms: Average time to commit offsets
sink-record-lag-max: Max lag between consumer position and sink processing
sink-record-read-rate: Rate of records read from Kafka
source-record-write-rate: Rate of records written to Kafka by source connectors
deadletterqueue-produce-failures: Failed writes to dead-letter queue
total-record-errors: Number of record-level processing errors

Alerting and SLO-based monitoring

Setting up alerts and service level objectives (SLOs) ensures that issues in Kafka are detected early and resolved before they impact users or downstream systems. Instead of monitoring every metric, focus on those that indicate degraded service, data loss risk, or resource exhaustion.

Key areas to alert on:

Availability: Alert if UnderReplicatedPartitions is greater than 0 or if IsrShrinksPerSec spikes. These indicate replication issues that can lead to data unavailability. Also alert if OfflinePartitionsCount is greater than 0, which means partitions have no active leader and are not readable or writable.
Durability: Track LogFlushRateAndTimeMs and RemoteCopyLagBytes. Long delays in flushing logs or tiering data can risk data loss.
Throughput: Watch BytesInPerSec, BytesOutPerSec, and request rates. Sudden drops may signal client issues or bottlenecks.
Latency: Use metrics like request-latency-avg and fetch-latency-avg to catch rising response times. High latency often precedes timeouts or client failures.
Errors: Alert on ErrorsPerSec, FailedProduceRequestsPerSec, and record-error-rate. Persistent errors suggest broken producers, client misconfigurations, or broker instability.
Consumer Lag: records-lag-max is critical for detecting slow consumers. Alert if lag grows continuously without reduction.
Saturation: Watch RequestHandlerAvgIdlePercent and NetworkProcessorAvgIdlePercent. Sustained low idle time indicates broker or network saturation.
OfflinePartitionsCount: Number of partitions without an active leader, meaning that data is not readable or writable (alert if greater than 0)
RequestHandlerAvgIdlePercent: Fraction of time request handler (I/O) threads are idle; low values signal an overloaded broker
NetworkProcessorAvgIdlePercent: Fraction of time network threads are idle; low values signal a network bottleneck
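The availability and saturation rules above can be expressed as a small evaluation function over a metrics snapshot. This is a sketch only: the snapshot is a plain dictionary with made-up values, the function name is hypothetical, and the 0.3 idle floor is an illustrative assumption, not a recommended value.

```python
from typing import Dict, List


def evaluate_broker_alerts(metrics: Dict[str, float],
                           min_idle: float = 0.3) -> List[str]:
    """Return alert messages for a single broker metrics snapshot."""
    alerts = []
    # Availability: any non-zero count is worth an alert.
    if metrics.get("UnderReplicatedPartitions", 0) > 0:
        alerts.append("availability: under-replicated partitions")
    if metrics.get("OfflinePartitionsCount", 0) > 0:
        alerts.append("availability: offline partitions")
    # Saturation: idle metrics are treated as fractions between 0 and 1;
    # metrics missing from the snapshot are skipped rather than assumed.
    for name in ("RequestHandlerAvgIdlePercent", "NetworkProcessorAvgIdlePercent"):
        if name in metrics and metrics[name] < min_idle:
            alerts.append(f"saturation: {name} below {min_idle}")
    return alerts


# Hypothetical snapshot for one broker.
snapshot = {
    "UnderReplicatedPartitions": 2,
    "OfflinePartitionsCount": 0,
    "RequestHandlerAvgIdlePercent": 0.12,
    "NetworkProcessorAvgIdlePercent": 0.65,
}
for alert in evaluate_broker_alerts(snapshot):
    print(alert)
```

In practice these rules would more likely live in the alerting system itself; the function form is mainly useful for testing thresholds against recorded snapshots.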

Learn more in our detailed guide to Apache Kafka cluster

Tips from the expert

Andrew Mills

Senior Solution Architect

Andrew Mills is an industry leader with extensive experience in open source data solutions and a proven track record in integrating and managing Apache Kafka and other event-driven architectures.

In my experience, here are tips that can help you better monitor Apache Kafka:

Monitor key broker metrics for cluster health: Keep a close eye on broker-level metrics such as CPU usage, disk I/O, and network throughput. Pay special attention to under-replicated partitions and offline partitions, as they indicate potential issues with data replication and availability.
Track consumer lag for performance insights: Consumer lag is a critical metric that shows the delay between message production and consumption. High lag can indicate slow consumers or bottlenecks in processing. Use Kafka’s built-in tools or a managed service like Instaclustr to track consumer group offsets and ensure consumers are keeping up with producers.
Track network-level congestion and TCP retransmissions: Kafka is sensitive to network performance. Monitoring packet drops, retransmissions, and interface queue lengths helps identify issues like overloaded NICs or faulty firewalls that impair broker communication.
Leverage end-to-end monitoring for data flow visibility: Monitor the entire data pipeline, from producers to brokers to consumers, to identify bottlenecks or failures at any stage. Where Kafka Connect is part of the pipeline, track its connector and task metrics as well.

Notable Kafka monitoring tools

1. NetApp Instaclustr

Instaclustr for Apache Kafka includes top-tier monitoring capabilities that simplify managing Kafka clusters, ensuring streaming data pipelines perform at their best with minimal effort.

Instaclustr for Kafka monitoring provides real-time visibility into the health and performance of clusters and proactively identifies potential issues before they escalate. Key metrics, such as throughput, partition distribution, consumer lag, and broker health, are tracked using advanced monitoring tools, providing insights needed to make data-driven decisions with confidence. Instaclustr’s monitoring capabilities make navigating Kafka’s intricate architecture straightforward, reducing downtime and keeping applications running smoothly.

From automated platform alerts to detailed reporting, every feature is designed to simplify workflows while maximizing Kafka’s potential. This lets teams focus on innovation rather than troubleshooting.

License: Apache-2.0
Repo: https://github.com/instaclustr

Key features include:

User-friendly dashboard: Provides a ready-to-use dashboard that displays Kafka metrics from all major components (brokers, producers, consumers, KRaft, and Kafka Connect) along with general node metrics for cluster health, such as CPU, disk, and memory usage.
Automated monitoring and alerts: Stay ahead of potential problems with real-time monitoring and automated alerts that help ensure the reliability of your deployments.
Customized scaling: Easily scale your Kafka clusters to match your business needs, handling increasing workloads effortlessly.
Comprehensive reporting: Access detailed performance reports to analyze and optimize your Kafka workloads effectively.
Flexible integrations: Instaclustr’s Monitoring API lets you integrate monitoring data from your Instaclustr managed cluster with the monitoring tool used for the rest of your application, such as Prometheus or Datadog.
Fully managed service: Instaclustr takes care of the entire Kafka operation, from provisioning to maintenance, ensuring a seamless experience with minimal downtime.

Instaclustr dashboard screenshot

2. Prometheus

Prometheus is an open-source monitoring and alerting toolkit for collecting and storing time series metrics. It uses a pull-based model to gather metrics over HTTP and provides a query language called PromQL for analyzing collected data. Prometheus is commonly used for infrastructure, application, and microservices monitoring, including Kafka environments. For Kafka, Prometheus is typically paired with the JMX Exporter to pull broker metrics and with Grafana to visualize them on community dashboards.

License: Apache-2.0
Repo: https://github.com/prometheus/prometheus

Key features include:

Time series data collection: Stores metrics as time series data with timestamps and optional key-value labels for flexible querying.
PromQL query language: Provides a query language for filtering, aggregating, and analyzing metrics across distributed systems.
Pull-based monitoring model: Collects metrics by scraping HTTP endpoints exposed by monitored services.
Service discovery support: Automatically discovers monitoring targets through service discovery integrations or static configuration.
Standalone architecture: Each Prometheus server operates independently without relying on distributed storage systems.
Alerting capabilities: Integrates with Alertmanager to generate and route alerts based on metric conditions.
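To illustrate how collected metrics can be read back programmatically, the sketch below builds a URL for the Prometheus HTTP API instant-query endpoint (`/api/v1/query`) and parses a response body. The response here is a canned example rather than output from a real server, and the metric name `kafka_server_replicamanager_underreplicatedpartitions` is an assumption: the actual name depends on how the JMX Exporter rules are configured.

```python
import json
from typing import Dict
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(body: str) -> Dict[str, float]:
    """Map each series' 'instance' label to its sample value."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    values = {}
    for sample in payload["data"]["result"]:
        # Each sample value is a [timestamp, "value-as-string"] pair.
        _timestamp, value = sample["value"]
        values[sample["metric"].get("instance", "unknown")] = float(value)
    return values


# Canned response in the shape of an instant-vector result; the metric name
# and instance labels are hypothetical.
canned = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "kafka_server_replicamanager_underreplicatedpartitions",
                        "instance": "broker-1:7071"}, "value": [1767225600, "0"]},
            {"metric": {"__name__": "kafka_server_replicamanager_underreplicatedpartitions",
                        "instance": "broker-2:7071"}, "value": [1767225600, "3"]},
        ],
    },
})

print(build_query_url("http://localhost:9090", "up"))
print(parse_instant_vector(canned))
```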

Prometheus dashboard screenshot

Source: Prometheus

3. CMAK (Cluster Manager for Apache Kafka)

CMAK (Cluster Manager for Apache Kafka), previously known as Kafka Manager, is an open-source tool for managing and monitoring Apache Kafka clusters. It provides a web-based interface for inspecting cluster health, managing topics and partitions, monitoring brokers and consumers, and performing operational tasks such as replica elections and partition reassignment.

License: Apache-2.0
Repo: https://github.com/yahoo/CMAK

Key features include:

Multi-cluster management: Supports administration and monitoring of multiple Kafka clusters from a single interface.
Cluster state inspection: Displays information about brokers, topics, consumers, offsets, partition distribution, and replica assignments.
Topic management: Allows users to create, update, delete, and expand Kafka topics.
Partition reassignment tools: Generates partition assignments and supports partition reassignment operations across brokers.
Preferred replica election: Enables execution of preferred replica elections to rebalance leadership across brokers.
Batch partition operations: Supports bulk partition assignment and reassignment tasks for multiple topics.

CMAK dashboard screenshot

Source: CMAK

4. Burrow

Burrow is an open-source Kafka monitoring tool focused on consumer lag analysis. Developed by LinkedIn, it evaluates Kafka consumer health by monitoring committed offsets and calculating consumer status dynamically without requiring manually defined lag thresholds. Burrow exposes monitoring data through HTTP endpoints and supports configurable alerting integrations.

License: Apache-2.0
Repo: https://github.com/linkedin/Burrow

Key features include:

Consumer lag monitoring: Tracks Kafka consumer lag using committed offsets from Kafka topics.
Threshold-free evaluation model: Uses sliding window analysis instead of fixed alert thresholds to evaluate consumer health.
Multiple cluster support: Monitors multiple Kafka clusters within a single deployment.
Automatic consumer monitoring: Detects and monitors Kafka consumers automatically.
HTTP monitoring API: Provides HTTP endpoints for consumer group status and Kafka cluster information.
Alert notification support: Supports configurable notifications through email and HTTP integrations.
ZooKeeper offset support: Includes optional monitoring for ZooKeeper-committed offsets.
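The idea of judging consumer health from a sliding window of lag samples, rather than from a fixed threshold, can be sketched as follows. This is a deliberately simplified illustration and not Burrow's actual evaluation rules; the function name and status labels are assumptions.

```python
from typing import Sequence


def classify_lag_window(lags: Sequence[int]) -> str:
    """Classify a window of lag samples, ordered oldest to newest."""
    if len(lags) < 2:
        return "UNKNOWN"  # not enough samples to see a trend
    if any(lag == 0 for lag in lags):
        return "OK"  # the consumer caught up at least once in the window
    # Lag never decreased between consecutive samples and ended higher.
    never_decreased = all(later >= earlier for earlier, later in zip(lags, lags[1:]))
    if never_decreased and lags[-1] > lags[0]:
        return "WARNING"
    return "OK"


print(classify_lag_window([10, 20, 35, 50]))
print(classify_lag_window([50, 40, 45, 30]))
print(classify_lag_window([5, 0, 8, 12]))
```

The appeal of this style of check is that a consumer with a large but shrinking lag is not flagged, while a consumer with a small but steadily growing lag is.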

5. Datadog

Datadog provides monitoring and observability tools for Kafka environments through dashboards, metrics collection, alerting, and data stream monitoring. The platform tracks metrics across Kafka brokers, producers, consumers, JVM processes, and ZooKeeper to help identify bottlenecks, latency issues, replication problems, and throughput trends.

License: Commercial

Key features include:

Kafka performance dashboards: Provides prebuilt dashboards for visualizing Kafka metrics across brokers, consumers, producers, and ZooKeeper.
Broker health monitoring: Tracks broker metrics such as leader elections, network throughput, replication health, and offline partitions.
Consumer lag tracking: Monitors consumer lag by group and tracks consumption throughput over time.
Producer performance metrics: Measures request rates, latency, throughput, and producer I/O wait times.
JVM monitoring: Tracks Java garbage collection metrics and JVM performance for Kafka brokers.
ZooKeeper monitoring: Collects metrics related to ZooKeeper latency, active connections, outstanding requests, and synchronization activity.

Datadog dashboard screenshot

Source: Datadog

Best practices for effective Kafka monitoring

Here are some monitoring best practices to consider when using Apache Kafka.

1. Define an essential metrics set aligned with SLOs/SLAs

Kafka emits hundreds of metrics, but not all are critical. Begin by identifying a core set that directly maps to business goals and operational commitments. For example, if the SLO guarantees delivery within five seconds, then consumer lag, end-to-end latency, and throughput metrics are essential.

Include indicators of health for key components—such as under-replicated partitions (brokers), error rate (producers), and commit rate (consumers). Use dimensioned metrics (tagged by topic, partition, or client ID) to allow granular filtering. Custom metrics, like event processing latency from consumer applications, can also be added to align monitoring with application-level objectives.

This targeted approach prevents data overload and ensures monitoring efforts remain focused on what matters most to system reliability and customer impact.
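For the five-second delivery example, SLO compliance can be measured as the share of messages whose end-to-end latency falls within the target. The sketch below uses made-up latency samples; the 99% objective and the function name are assumptions for illustration.

```python
from typing import Sequence


def slo_compliance(latencies_s: Sequence[float], target_s: float = 5.0) -> float:
    """Fraction of samples delivered within the latency target."""
    if not latencies_s:
        raise ValueError("no latency samples")
    within_target = sum(1 for latency in latencies_s if latency <= target_s)
    return within_target / len(latencies_s)


# Hypothetical end-to-end latencies in seconds (produce time to consume time).
samples = [0.8, 1.2, 4.9, 5.0, 6.3, 2.1, 0.4, 7.5, 3.3, 1.0]

ratio = slo_compliance(samples)
objective = 0.99  # assumed objective: 99% of messages within five seconds
print(f"compliance={ratio:.2f} meets_objective={ratio >= objective}")
```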

2. Set meaningful alert thresholds

Alerts must be both timely and actionable. Set thresholds based on the service behavior under normal and degraded conditions. For instance, trigger an alert only if consumer lag exceeds a predefined threshold for more than 5 minutes, rather than on every spike.
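The "exceeds a threshold for more than 5 minutes" rule can be sketched as a check over timestamped samples. The sample data, threshold, and function name below are hypothetical.

```python
from typing import List, Tuple


def sustained_breach(samples: List[Tuple[float, float]],
                     threshold: float,
                     min_duration_s: float = 300.0) -> bool:
    """Return True if any uninterrupted run above threshold lasts min_duration_s.

    samples are (timestamp_seconds, value) pairs sorted by time.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # a new run above the threshold begins
            if timestamp - breach_start >= min_duration_s:
                return True
        else:
            breach_start = None  # the run was interrupted, so start over
    return False


# A single spike at t=60s does not fire; six minutes of high lag does.
spike = [(0, 100), (60, 5000), (120, 100), (180, 100)]
sustained = [(0, 2000), (60, 2500), (120, 3000), (180, 2800),
             (240, 3100), (300, 3300), (360, 3500)]

print(sustained_breach(spike, threshold=1000))
print(sustained_breach(sustained, threshold=1000))
```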

Use dynamic thresholds where possible, such as those based on statistical baselines (e.g., 95th percentile latency) or moving averages. Prioritize alert severity based on business impact: use warnings for early detection and critical alerts when SLAs are at risk.
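One simple form of dynamic threshold derives the alert level from recent history instead of a fixed number. The sketch below uses the mean of a recent window plus a multiple of its standard deviation; the multiplier of 3, the sample values, and the function name are illustrative assumptions, and a percentile-based baseline would follow the same shape.

```python
import statistics
from typing import Sequence


def dynamic_threshold(history: Sequence[float], k: float = 3.0) -> float:
    """Threshold = mean of the recent window + k population standard deviations."""
    if not history:
        raise ValueError("history is empty")
    return statistics.fmean(history) + k * statistics.pstdev(history)


# Hypothetical request latencies (ms) from the most recent window.
recent_latency_ms = [10, 12, 11, 13, 9, 10, 11, 12]

threshold = dynamic_threshold(recent_latency_ms)
for observed in (13, 20):
    print(observed, "anomalous" if observed > threshold else "normal")
```

Recomputing the threshold as the window slides lets the alert level follow gradual load changes while still catching sharp departures from recent behavior.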

Group alerts by component to reduce noise. For example, if multiple brokers report errors, consolidate them into a single incident. Regularly review and tune thresholds to prevent alert fatigue and ensure incidents are meaningful.

3. Use historical baselines for anomaly detection and capacity planning

Establish historical baselines by collecting time-series data over weeks or months. This allows admins to define what “normal” looks like for metrics such as throughput, lag, and broker CPU usage. Use this baseline to detect anomalies—like a sudden drop in consumer fetch rate—which might not breach static thresholds but still indicate issues.

For capacity planning, track trends in disk usage, topic growth, and message rates. Analyze peak loads and growth curves to predict when infrastructure will need to scale. This approach supports proactive planning and helps avoid last-minute outages due to resource exhaustion.
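A growth curve can be turned into a rough forecast by fitting a straight line to recent usage. The sketch below fits a least-squares slope to daily disk usage and estimates the days left before a capacity limit; the figures and function name are made up, and linear growth is an assumption that real workloads may not follow.

```python
from typing import Optional, Sequence


def days_until_full(daily_usage_gb: Sequence[float],
                    capacity_gb: float) -> Optional[float]:
    """Estimate days until capacity is reached, or None if usage is not growing."""
    n = len(daily_usage_gb)
    if n < 2:
        raise ValueError("need at least two daily samples")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage_gb) / n
    # Least-squares slope in GB per day, with day index as the x value.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage_gb))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = numerator / denominator
    if slope <= 0:
        return None
    return (capacity_gb - daily_usage_gb[-1]) / slope


# Hypothetical disk usage on one broker over five days.
usage = [500, 510, 520, 530, 540]
print(days_until_full(usage, capacity_gb=1000))
```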

Baselines are also useful in evaluating the impact of application deployments or configuration changes, enabling safer rollouts and performance tuning.

4. Implement real-time alerting

Kafka systems often require quick responses to prevent data loss or processing delays. Implement near-real-time alerting by collecting metrics at short intervals (e.g., Prometheus scraping JMX exporters with a short scrape interval). Configure alerts to trigger within seconds of detecting anomalies.

Integrate these alerts with on-call systems like PagerDuty or Slack, ensuring that critical information—such as broker ID, topic name, and exact metric value—is included. Real-time dashboards should support drill-down from high-level alerts to detailed metrics and logs for fast diagnosis.

Run synthetic checks (e.g., produce-consume tests) at regular intervals and alert on failures to detect issues not captured by native Kafka metrics.
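A produce-consume check boils down to sending a uniquely tagged probe and timing how long it takes to come back. The sketch below keeps the Kafka client out of the picture by accepting `produce` and `consume` callables, which are hypothetical hooks you would wire to your own producer and consumer; an in-memory queue stands in for a topic so the example runs without a broker.

```python
import queue
import time
import uuid
from typing import Callable, Optional


def synthetic_round_trip(produce: Callable[[str], None],
                         consume: Callable[[float], Optional[str]],
                         timeout_s: float = 10.0) -> Optional[float]:
    """Return round-trip time in seconds, or None if the probe never arrived."""
    probe = f"probe-{uuid.uuid4()}"  # unique payload so stale probes are ignored
    started = time.monotonic()
    produce(probe)
    deadline = started + timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None
        # consume() waits up to `remaining` seconds and returns a message or None.
        if consume(remaining) == probe:
            return time.monotonic() - started


# In-memory stand-in for a topic (not a real Kafka client).
topic = queue.Queue()


def fake_produce(message: str) -> None:
    topic.put(message)


def fake_consume(timeout: float) -> Optional[str]:
    try:
        return topic.get(timeout=timeout)
    except queue.Empty:
        return None


latency = synthetic_round_trip(fake_produce, fake_consume, timeout_s=1.0)
print("check failed" if latency is None else f"round trip took {latency:.4f}s")
```

Returning `None` on timeout gives the scheduler a clear failure signal to alert on, separate from a slow but successful round trip.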

5. Automate periodic health checks

In addition to reactive alerting, automate regular health checks that validate Kafka’s operational integrity. These can include:

Verifying that all partitions have leaders and replicas are in sync
Checking that consumer groups are committing offsets regularly
Ensuring no broker is overwhelmed or isolated
Running produce-consume tests to validate end-to-end message flow

Schedule these checks using cron jobs, monitoring frameworks, or CI/CD tools. Surface the results in dashboards and integrate failures with ticketing systems to enable tracking and resolution.

Automated health checks provide an added layer of defense, catching slow-developing problems before they impact production workflows.
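The first check in the list, that every partition has a leader and all replicas are in sync, can be sketched as a function over partition metadata. The dictionary layout below is a hypothetical shape for metadata you would fetch with your own admin tooling, not the output format of any specific client.

```python
from typing import Dict, List


def check_partitions(partitions: List[Dict]) -> List[str]:
    """Return a problem description for each leaderless or out-of-sync partition."""
    problems = []
    for partition in partitions:
        name = f"{partition['topic']}-{partition['partition']}"
        if partition.get("leader") is None:
            problems.append(f"{name}: no leader")
        # Replicas assigned to the partition but absent from the in-sync set.
        out_of_sync = set(partition["replicas"]) - set(partition["isr"])
        if out_of_sync:
            problems.append(f"{name}: replicas out of sync {sorted(out_of_sync)}")
    return problems


# Hypothetical metadata for one topic with three partitions.
metadata = [
    {"topic": "orders", "partition": 0, "leader": 1, "replicas": [1, 2, 3], "isr": [1, 2, 3]},
    {"topic": "orders", "partition": 1, "leader": 2, "replicas": [1, 2, 3], "isr": [2]},
    {"topic": "orders", "partition": 2, "leader": None, "replicas": [1, 2, 3], "isr": []},
]

for problem in check_partitions(metadata):
    print(problem)
```

An empty result means the check passed, which makes the function easy to call from a cron job or pipeline step that fails on any output.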

Conclusion

Effective Kafka monitoring is critical for maintaining the performance, reliability, and security of streaming data pipelines. A well-designed monitoring strategy ensures early detection of issues, supports capacity planning, and helps maintain service-level objectives by providing real-time visibility into system behavior. By focusing on key metrics, implementing meaningful alerts, leveraging historical baselines, and automating health checks, organizations can proactively manage Kafka infrastructure and deliver robust, scalable data processing systems.
