Kafka Metrics

Kafka-specific metrics in the monitoring API begin with the k:: prefix, e.g. k::underReplicatedPartitions.

Authentication

All requests to the API must use Basic Authentication and contain a valid username and monitoring API key. API keys are created per user account and can be retrieved via the Instaclustr Console from the Account > API Key tab.

Metrics

Metrics are requested by constructing a GET request, consisting of:

  • type: Either ‘clusters’, ‘datacentres’ or ‘nodes’.
    • ‘clusters’ returns the metrics for each node in the cluster.
    • ‘datacentres’ returns the metrics for each node belonging to the datacenter.
    • ‘nodes’ returns the metrics for a specific node.
  • UUID or public IP: If the type is set to ‘clusters’ or ‘datacentres’, then the UUID of the cluster or datacentre must be specified. If the type is set to ‘nodes’, then either the node’s UUID or public IP may be specified.
  • metrics: The metrics to return are specified as a comma-delimited query string parameter. Up to 20 metrics may be specified.
  • reportNaN: (true|false) Determines whether the API reports NaN or null metric values as NaN. The default is false: NaN and null values are reported as 0. Setting ‘reportNaN=true’ will return NaN values in the API response.
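Putting these pieces together, a request URL can be assembled as in the sketch below. The base URL and path layout are assumptions for illustration only; check the Monitoring API reference for the exact format.

```python
# Sketch of constructing a monitoring API request URL from the
# components described above. BASE_URL is an assumption, not the
# documented endpoint.
from urllib.parse import urlencode

BASE_URL = "https://api.instaclustr.com/monitoring/v1"  # assumed

def build_metrics_url(type_, target, metrics, report_nan=False):
    """type_: 'clusters', 'datacentres' or 'nodes'.
    target: a UUID, or a public IP when type_ is 'nodes'."""
    if type_ not in ("clusters", "datacentres", "nodes"):
        raise ValueError("type must be clusters, datacentres or nodes")
    if len(metrics) > 20:
        raise ValueError("at most 20 metrics may be specified")
    query = {"metrics": ",".join(metrics)}
    if report_nan:
        query["reportNaN"] = "true"
    # keep ':' and ',' readable rather than percent-encoding them
    return f"{BASE_URL}/{type_}/{target}?{urlencode(query, safe=':,')}"
```

For example, `build_metrics_url("clusters", "7b58eae9-2b72-420a-a544-32a404b70fd7", ["n::cpuUtilization", "n::osLoad"])` yields a single request for two metrics across every node in the cluster.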

General Metrics

Aside from the Kafka-specific metrics, there are a number of generic node metrics which apply to all nodes, regardless of the application running on them.

These metrics are:

  • n::cpuUtilization: Current CPU used as a percentage of total available. Maximum value is 100%, regardless of the number of cores on the node.
  • n::osLoad: Current OS load. Generally, a node is overloaded if os load >= the number of cores on the node.
  • n::diskUtilization: Total disk space used by Kafka, as a percentage of total disk space available.
  • n::cpuguestpercent: Percentage of time spent running a virtual CPU for guest operating systems under control of the kernel.
  • n::cpuguestnicepercent: Percentage of time spent by niced guest processes executing in user mode in a virtual OS.
  • n::cpusystempercent: Percentage of time spent by processes executing in kernel mode.
  • n::cpuidlepercent: Percentage of time the CPU was idle and the system had no outstanding I/O operations.
  • n::cpuiowaitpercent: Percentage of CPU time spent waiting for I/O operations, such as socket reads or writes, to complete.
  • n::cpuirqpercent: Percentage of time the kernel spent servicing hardware interrupts.
  • n::cpunicepercent: Percentage of time spent by processes executing in user mode with a positive nice value.
  • n::cpusoftirqpercent: Percentage of time the kernel spent servicing software interrupts.
  • n::cpustealpercent: Percentage of time the hypervisor allocated to tasks other than the current virtual CPU.
  • n::cpuuserpercent: Percentage of time spent by processes executing in user mode, including application processes.
  • n::memavailable: An estimate of how much memory is available to start new applications without swapping, taking into account the page cache and the reclaimability of slab memory.
  • n::networkoutdelta: Delta count of bytes transmitted.
  • n::networkindelta: Delta count of bytes received.
  • n::networkouterrorsdelta: Delta count of transmit errors detected.
  • n::networkinerrorsdelta: Delta count of receive errors detected.
  • n::networkoutdroppeddelta: Delta count of transmitted packets dropped.
  • n::networkindroppeddelta: Delta count of received packets dropped.
  • n::tcpall: Total number of TCP connections in all states.
  • n::tcpestablished: Number of open TCP connections.
  • n::tcplistening: Number of TCP sockets waiting for a connection request from any remote TCP and port.
  • n::tcptimewait: Number of TCP sockets waiting for enough time to pass to be sure the remote TCP received the acknowledgment of its connection termination request.
  • n::tcpclosewait: Number of TCP sockets whose connection is in the process of being closed.
  • n::filedescriptorlimit: Maximum number of open files allowed by the node OS.
  • n::filedescriptoropencount: Current number of open files on the node.

Example: Endpoint to return the partition count for each node in the cluster with a UUID of 7b58eae9-2b72-420a-a544-32a404b70fd7.
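A sketch of issuing such a request with Basic Authentication using Python's standard library. The base URL and path layout are assumptions for illustration, and the username and API key shown are placeholders; substitute your own account credentials from the Account > API Key tab.

```python
# Sketch: build an authenticated request for the example above.
# The URL format is assumed; the Authorization header carries the
# username and monitoring API key as standard HTTP Basic credentials.
import base64
import urllib.request

def make_request(url, username, api_key):
    token = base64.b64encode(f"{username}:{api_key}".encode()).decode()
    req = urllib.request.Request(url)
    req.add_header("Authorization", f"Basic {token}")
    return req  # send with urllib.request.urlopen(req)

req = make_request(
    "https://api.instaclustr.com/monitoring/v1/clusters/"
    "7b58eae9-2b72-420a-a544-32a404b70fd7?metrics=k::partitionCount",
    "user@example.com",           # placeholder username
    "my-monitoring-api-key")      # placeholder monitoring API key
```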

Kafka Metrics

Kafka specific metrics follow the format k::{metricName}.

The currently available metrics are:

  • k::kafkaBrokerState: The current state of the broker represented as an Integer:
    0. Not running
    1. Starting
    2. Recovering from unclean shutdown
    3. Running as broker
    6. Pending controlled shutdown
    7. Broker shutting down
  • k::underReplicatedPartitions: The number of partitions that do not have enough replicas to meet the desired replication factor.
  • k::activeControllerCount: The number of active controllers in the cluster. The active controller of a cluster is usually the first node to start up.
  • k::offlinePartitions: The number of partitions without an active leader. Any partitions that are offline will not be accessible, since read and write operations are only performed on the leader of a partition.
  • k::leaderElectionRate: The count, average, max, and one minute rate of leader elections per second.
  • k::uncleanLeaderElections: The number of failures to elect a suitable leader per second. In the case that no suitable leader can be chosen (i.e. no available replicas are in sync), an out-of-sync replica will be elected as leader, resulting in data loss proportional to how out-of-sync the newly elected leader is.
  • k::produceRequestTime: The count, average, 99th percentile distribution, and max time taken to process requests from producers to send data. This is the sum of time spent waiting in request, time spent being processed by the leader, time spent waiting for the follower response (when acks = -1), and time taken to send the response.
  • k::fetchConsumerRequestTime: The count, average, 99th percentile distribution and max amount of time taken while processing, and the number of requests from consumers to get new data. This is the sum of time spent waiting in request, time spent being processed by the leader, time spent waiting for the leader to trigger sending the response (determined by fetch.min.bytes and fetch.wait.max.ms in the consumer configuration), and time taken to send the response.
  • k::fetchFollowerRequestTime: The count, average, and max amount of time taken while processing requests from Kafka brokers to get new data from partition leaders. This is the sum of time spent waiting in request, time spent being processed by the leader, and time taken to send the response.
  • k::metadataRequestTime: The 99th percentile distribution and max amount of time taken while processing requests from Kafka brokers to retrieve metadata. This is the sum of time spent waiting in request, time spent being processed by the leader, and time taken to send the response.
  • k::produceRequestLocalTime: The 99th percentile distribution and max amount of time taken by the leader to process requests from producers to send data.
  • k::fetchConsumerRequestLocalTime: The 99th percentile distribution and max amount of time spent being processed by the leader for consumer requests to get new data.
  • k::metadataRequestLocalTime: The 99th percentile distribution and max amount of time spent being processed by the leader while processing requests from Kafka brokers to retrieve metadata.
  • k::produceRequestRemoteTime: The 99th percentile distribution and max amount of time spent waiting for the follower while processing requests from producers to send data.
  • k::fetchConsumerRequestRemoteTime: The 99th percentile distribution and max amount of time spent waiting for the follower for consumer requests to get new data.
  • k::metadataRequestRemoteTime: The 99th percentile distribution and max amount of time spent waiting for the follower while processing requests from Kafka brokers to retrieve metadata.
  • k::produceRequestQueueTime: The 99th percentile distribution and max amount of time a request waits in the request queue while processing requests from producers to send data.
  • k::fetchConsumerRequestQueueTime: The 99th percentile distribution and max amount of time a request waits in the request queue for consumer requests to get new data.
  • k::metadataRequestQueueTime: The 99th percentile distribution and max amount of time a request waits in the request queue while processing requests from Kafka brokers to retrieve metadata.
  • k::produceResponseQueueTime: The 99th percentile distribution and max amount of time a request waits in the response queue while processing requests from producers to send data.
  • k::fetchConsumerResponseQueueTime: The 99th percentile distribution and max amount of time a request waits in the response queue for consumer requests to get new data.
  • k::metadataResponseQueueTime: The 99th percentile distribution and max amount of time a request waits in the response queue while processing requests from Kafka brokers to retrieve metadata.
  • k::brokerTopicMessagesIn: The mean and one minute rate of incoming messages per second.
  • k::zooKeeperExpires: The count of ZooKeeper client session expirations from the ensemble.
  • k::brokerTopicBytesOut: The mean and one minute rate of outgoing bytes from the cluster.
  • k::replicationBytesOutPerSec: The count of outgoing bytes due to internal replication.
  • k::produceMessageConversionsPerSec: The one minute rate, mean rate, and number of produce requests per second that require message format conversion.
  • k::fetchMessageConversionsPerSec: The one minute rate, mean rate, and number of fetch requests per second that require message format conversion.
  • k::isrShrinkRate: The one minute rate, mean rate, and number of decreases in the number of In-Sync Replicas (ISR) per second. This metric is expected to change when adding or removing nodes from a cluster.
  • k::isrExpandRate: The one minute rate, mean rate, and number of increases in the number of In-Sync Replicas (ISR) per second. This metric is expected to change when adding or removing nodes from a cluster.
  • k::underMinIsrPartitions: The number of partitions where the number of In-Sync Replicas (ISR) is less than the minimum number of in-sync replicas specified.
  • k::partitionCount: The number of partitions on a node. The number of partitions should be evenly distributed across all nodes in a cluster.
  • k::leaderCount: The number of partitions that a node is a leader for. The number of partition leaders should be evenly distributed across all nodes in a cluster.
  • k::producePurgatorySize: The number of produce requests currently waiting in purgatory.
  • k::fetchPurgatorySize: The number of fetch requests currently waiting in purgatory.
  • k::networkProcessorAvgIdlePercent: The average percentage of time the network processors are idle, expressed as a number between 0 and 1. Kafka’s network processor threads are responsible for reading and writing data to Kafka clients across the network.
  • k::requestHandlerAvgIdlePercent: The average percentage of time Kafka’s request handler threads are idle, expressed as a number between 0 and 1. Kafka’s request handler threads are responsible for servicing client requests, including reading and writing messages to disk.
  • k::slaProducerMessagesProcessed: The number of synthetic transaction messages being successfully produced to each broker.
  • k::slaProducerLatencyMs: The average and maximum time taken in milliseconds to send a synthetic transaction message to each broker that is successfully replicated to the required number of minimum in-sync replicas.
  • k::slaProducerErrors: The number of errors encountered when producing synthetic transaction messages.
  • k::slaConsumerLatency: The average and maximum time in milliseconds between a synthetic transaction message being sent by the producer and being received by the consumer.
  • k::slaConsumerRecordsProcessed: The number of synthetic transaction messages being successfully consumed and processed on each broker.
  • k::youngGenLastGC: Time taken by the most recent young generation GC event.
  • k::oldGengcCollectionTime: Total time taken by old generation GC.
  • k::logFlushRate: The total count, one minute rate, and mean rate of Kafka log flushes.
  • k::logFlushTime: The average and maximum time of a Kafka log flush.
  • k::brokerFetcherLagConsumerLag: The lag in the number of messages per follower replica, aggregated at the broker level. Please note that a broker will not report this metric if it is not following any partitions, for example when all topics in the cluster are created with a replication factor of 1.
  • k::controlPlaneNetworkProcessorAvgIdlePercent: The idle percentage of the pinned control plane network thread.
  • k::controlPlaneRequestHandlerAvgIdlePercent: The one minute rate and mean rate of the idle percentage of the pinned control plane request handler thread.
  • k::controlPlaneExpiredConnectionsKilledCount: The number of expired connections disconnected on the control plane.
  • k::produceRequestsPerSec: The one minute rate, mean rate, and number of produce requests since the broker started. This metric is only available for periods of 3 hours or less.
  • k::fetchConsumerRequestsPerSec: The one minute rate, mean rate, and number of requests from consumers to get new data since the broker started. This metric is only available for periods of 3 hours or less.
  • k::fetchFollowerRequestsPerSec: The one minute rate, mean rate, and number of requests from Kafka brokers to get new data from partition leaders since the broker started. This metric is only available for periods of 3 hours or less.
  • k::partitionLoadTimeAvg: The average time taken by the Consumer Group Coordinator to load the commit offset partition, over a 30 second interval. Only available for Kafka 2.5.1+.
  • k::partitionLoadTimeMax: The maximum time taken by the Consumer Group Coordinator to load the commit offset partition, over a 30 second interval. Only available for Kafka 2.5.1+.
  • k::replicaFetcherMaxLag: The maximum message count lag across all fetchers/topics/partitions. Only available for Kafka 2.5.1+.
  • k::replicaFetcherFailedPartitionsCount: The count of partitions that failed during fetch: partition truncation failed, a storage exception was encountered, the partition has an older epoch than the current leader, or any other error occurred during the fetch request. Only available for Kafka 2.5.1+.
  • k::replicaFetcherMinFetchRate: The minimum number of messages fetched in a one minute interval across all fetchers/topics/partitions. Only available for Kafka 2.5.1+.
  • k::replicaFetcherDeadThreadCount: The number of failed fetcher threads. Only available for Kafka 2.5.1+.
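The kafkaBrokerState integers listed above can be turned into readable labels with a simple lookup table, for example when building alert messages from raw metric values:

```python
# The kafkaBrokerState values documented above as a lookup table.
# Values 4 and 5 are intentionally absent from the documented list.
BROKER_STATES = {
    0: "Not running",
    1: "Starting",
    2: "Recovering from unclean shutdown",
    3: "Running as broker",
    6: "Pending controlled shutdown",
    7: "Broker shutting down",
}

def describe_broker_state(value):
    """Map a raw kafkaBrokerState metric value to a readable label."""
    return BROKER_STATES.get(int(value), f"Unknown state ({value})")
```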

Example: Endpoint to return the partition count and leader count for every node in the cluster with a UUID of 7b58eae9-2b72-420a-a544-32a404b70fd7.

Response:

List Topics

To list all the topics for a cluster, use the following endpoint:

Example output

Broker Level Per-Topic Metrics

Per-topic metric names follow the format kt::{topic}::{metricName}. Optionally, a ‘sub-type’ may be specified to return a specific part of the metric.

  • kt::{topic}::messagesInPerTopic: The rate of messages received by the topic. One sub-type must be specified. Available sub-types:
    • mean_rate: The average rate of messages received by the topic per second.
    • one_minute_rate: The one minute rate of messages received by the topic.
  • kt::{topic}::bytesOutPerTopic: The rate of outgoing bytes from the topic. One sub-type must be specified. Available sub-types:
    • mean_rate: The average rate of outgoing bytes from the topic per second.
    • one_minute_rate: The one minute rate of outgoing bytes from the topic.
  • kt::{topic}::bytesInPerTopic: The rate of incoming bytes to the topic per second. One sub-type must be specified. Available sub-types:
    • mean_rate: The average rate of incoming bytes to the topic per second.
    • one_minute_rate: The one minute rate of incoming bytes to the topic.
  • kt::{topic}::fetchMessageConversionsPerTopic: The amount and rate of fetch request messages which required message format conversions for the topic. One sub-type must be specified. Available sub-types:
    • count: The number of fetch request messages which required message format conversion.
    • mean_rate: The average rate of fetch request messages which required message format conversion per second.
    • one_minute_rate: The one minute rate of fetch request messages which required message format conversion.
  • kt::{topic}::produceMessageConversionsPerTopic: The amount and rate of produce request messages which required message format conversions for the topic. One sub-type must be specified. Available sub-types:
    • count: The number of produce request messages which required message format conversion.
    • mean_rate: The average rate of produce request messages which required message format conversion per second.
    • one_minute_rate: The one minute rate of produce request messages which required message format conversion.
  • kt::{topic}::failedFetchMessagePerTopic: The amount and rate of failed fetch requests to the topic. One sub-type must be specified. Available sub-types:
    • count: The number of failed fetch requests to the topic.
    • mean_rate: The average rate of failed fetch requests to the topic per second.
    • one_minute_rate: The one minute rate of failed fetch requests to the topic.
  • kt::{topic}::failedProduceMessagePerTopic: The amount and rate of failed produce requests to the topic. One sub-type must be specified. Available sub-types:
    • count: The number of failed produce requests to the topic.
    • mean_rate: The average rate of failed produce requests to the topic per second.
    • one_minute_rate: The one minute rate of failed produce requests to the topic.
  • kt::{topic}::diskUsage: The total size of the files on disk associated with the topic, summed across all partitions. Available sub-types:
    • disk_usage_kilobytes: The total size, in kilobytes, of the files on disk associated with the topic, summed across all partitions.

Example: Endpoint to return the mean rate of incoming messages for the topic ‘instaclustr-sla’ in the cluster with a UUID of 418b62a1-5831-41f3-b74d-f670c2b5cf18.
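The metric name for this example can be assembled from the kt::{topic}::{metricName}::{sub-type} format described above. A minimal helper sketch:

```python
# Sketch: assemble a per-topic metric name in the documented
# kt::{topic}::{metricName} format, with an optional sub-type suffix.
def topic_metric(topic, metric, sub_type=None):
    name = f"kt::{topic}::{metric}"
    if sub_type is not None:
        name += f"::{sub_type}"
    return name
```

For example, `topic_metric("instaclustr-sla", "messagesInPerTopic", "mean_rate")` produces the metric name used in the example above.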

Response:

Cluster Level Per-Topic Metrics

Cluster level per-topic metrics are exposed via the following endpoint.

This endpoint accepts the following query parameters:

  • metrics: The metrics to return, specified as a comma-delimited query string parameter. Up to 20 metrics may be specified. Formatted as “metrics=<metric-name>::<sub-type>,<metric-name>”.
  • period and type: The period of time from which monitoring information is returned, together with a period type. Formatted as “period=<period>&type=<period type>”. Accepted values are described in https://www.instaclustr.com/support/api-integrations/api-reference/monitoring-api/. If not specified, the latest metric is retrieved.
  • format: If you require the metrics to be formatted as Prometheus output, specify “format=prometheus”.
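A sketch of assembling these query parameters. The parameter names are taken from the list above; the period value shown is a placeholder, since accepted period values are described in the linked Monitoring API reference.

```python
# Sketch: build the query string for cluster level per-topic metrics.
from urllib.parse import urlencode

def topic_metrics_query(metrics, period=None, period_type=None,
                        prometheus=False):
    if len(metrics) > 20:
        raise ValueError("at most 20 metrics may be specified")
    params = {"metrics": ",".join(metrics)}
    if period is not None:
        params["period"] = period          # e.g. a documented period value
        params["type"] = period_type       # its matching period type
    if prometheus:
        params["format"] = "prometheus"
    # keep ':' and ',' readable rather than percent-encoding them
    return urlencode(params, safe=":,")
```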

To request the same metrics for all topics, do not include the topic in the path.

If the number of metrics retrieved by the query exceeds 20, the endpoint will paginate through the topics using the pageNumber query parameter.

Available Metrics

  • topicMessageDistribution: Metrics derived by analysing the message distribution among the partitions of a topic. Metrics are reported for non-internal topics only.
    • outliers: The number of partitions identified as outliers using the MADe statistical method (https://dipot.ulb.ac.be/dspace/bitstream/2013/139499/1/Leys_MAD_final-libre.pdf), with the high and low fences defined by median ± 2 * 1.4826 * MAD. The metric also returns a JSON array of outlier partitions and their message counts. Retrieval of this metric is limited to periods of 1 hour or below.
    • standard_deviation: The population standard deviation of the message distribution across partitions for the topic.
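The MADe outlier rule described above can be sketched as follows: a partition is flagged when its message count falls outside the median ± 2 × 1.4826 × MAD fences. This is an illustrative re-implementation, not the server-side code.

```python
# Sketch of the MADe outlier rule: fences at median +/- 2 * 1.4826 * MAD.
import statistics

def made_outliers(partition_counts):
    """partition_counts: dict mapping partition id -> message count.
    Returns the partitions whose counts fall outside the fences."""
    values = list(partition_counts.values())
    med = statistics.median(values)
    # MAD: median of absolute deviations from the median
    mad = statistics.median(abs(v - med) for v in values)
    fence = 2 * 1.4826 * mad
    return {p: v for p, v in partition_counts.items()
            if abs(v - med) > fence}
```

For a topic whose partitions hold {0: 100, 1: 101, 2: 99, 3: 100, 4: 500} messages, only partition 4 is flagged as an outlier.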

Example: a metric defined without specifying a sub-type will return all sub-types for that metric.

Response :

Consumer Groups

To list the consumer groups for a Kafka cluster, use the following endpoint:

Response :

To retrieve information regarding the consumed topics and their clients, use the following endpoint:

Response: The response is a JSON object with the consumed topics as keys and the consuming clients as values. In the following example, client-1 and client-2 are consuming from test-topic.

Example :

Please note that this value is always an up-to-date view of the current live clients and does not expose historical data. To retrieve consumer group client metrics for clients that have been decommissioned, you would need to keep track of the client schema yourself.

Consumer Group Client Metrics

All metrics are reported under a consumer group and the consumed topic aggregated at a client level. A client within a consumer group is a logical grouping defined by setting the client.id configuration on a consumer.

The endpoint :

Available Metrics :

  • consumerLag: Defined as the sum of consumer lag reported by all consumers with the same client id.
  • partitionCount: Defined as the total number of partitions assigned to consumers with the same client id.
  • consumerCount: Defined as the total number of consumers with the same client id.

To retrieve metrics, the consumerGroup parameter and topic parameter must be defined.
To control which metrics are retrieved, define the metrics parameter using the metric names above, joined by commas. For example, to retrieve both consumer lag and partition count, define the metrics parameter as consumerLag,partitionCount. To retrieve only consumer lag, define the metrics parameter simply as consumerLag.
The clientID parameter is optional; if it is not defined, metrics for all live clients are retrieved. If the consumer group has a large number of unique clients, defining the clientID is recommended for faster metric retrieval.
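A sketch of assembling these query parameters: consumerGroup and topic are required, metrics is a comma-joined list, and clientID is optional. The parameter names are taken from this page; the endpoint path itself is not shown here.

```python
# Sketch: build the query string for consumer group client metrics.
from urllib.parse import urlencode

def client_metrics_query(consumer_group, topic, metrics, client_id=None):
    params = {
        "consumerGroup": consumer_group,
        "topic": topic,
        "metrics": ",".join(metrics),   # e.g. consumerLag,partitionCount
    }
    if client_id is not None:
        params["clientID"] = client_id  # omit to retrieve all live clients
    return urlencode(params, safe=",")
```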

Response :

Consumer Group Metrics

All metrics are reported under a consumer group and the consumed topic aggregated at a group level.

To retrieve consumer group level metrics use the following endpoint

Available Metrics :

  • consumerGroupLag: Defined as the sum of consumer lag reported by all consumers within the consumer group.
  • clientCount: Defined as the total number of unique clients within the consumer group.

Response :
