Kafka Metrics

Kafka-specific metrics in the monitoring API begin with the k:: prefix, e.g. k::underReplicatedPartitions.

Authentication

All requests to the API must use Basic Authentication and contain a valid username and monitoring API key. API keys are created per user account and can be retrieved via the Instaclustr Console from the Account > API Key tab.
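
As a minimal sketch of the Basic Authentication requirement above, the following Python builds the Authorization header from a username and monitoring API key. The function name and the sample credentials are hypothetical placeholders, not values from the Instaclustr Console.

```python
import base64


def basic_auth_header(username: str, api_key: str) -> dict[str, str]:
    """Build an HTTP Basic Authentication header.

    The monitoring API key takes the place of the password.
    """
    # Basic Authentication encodes "username:password" as base64.
    credentials = f"{username}:{api_key}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder credentials for illustration only.
headers = basic_auth_header("example-user", "example-api-key")
```

The resulting dictionary can be passed as request headers to whichever HTTP client you use.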

Metrics

Metrics are requested by constructing a GET request, consisting of:

  • type: Either ‘clusters’, ‘datacentres’ or ‘nodes’.
    • ‘clusters’ returns the metrics for each node in the cluster.
    • ‘datacentres’ returns the metrics for each node belonging to the datacentre.
    • ‘nodes’ returns the metrics for a specific node.
  • UUID or public IP: If the type is ‘clusters’ or ‘datacentres’, the UUID of the cluster or datacentre must be specified. If the type is ‘nodes’, either the node’s UUID or its public IP may be specified.
  • metrics: The metrics to return are specified as a comma-delimited query string parameter. Up to 20 metrics may be specified.
  • reportNaN: (true|false) Determines how the API reports a metric value that is NaN or null. The default is false, in which case NaN and null values are reported as 0. Setting ‘reportNaN=true’ returns them as NaN in the API response.
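
The request parts listed above can be sketched as a small URL builder. This is an illustration under stated assumptions: the base URL and the path layout (type, then UUID or IP) are not given in this section and are assumed here, and the function name is hypothetical. The limit of 20 metrics and the reportNaN flag come from the list above.

```python
from urllib.parse import urlencode

# Assumption: base URL and path layout are not stated in this section.
BASE_URL = "https://api.instaclustr.com/monitoring/v1"
VALID_TYPES = {"clusters", "datacentres", "nodes"}
MAX_METRICS = 20  # "Up to 20 metrics may be specified."


def build_metrics_url(resource_type, identifier, metrics, report_nan=False):
    """Build a GET URL for the given type, UUID (or node IP), and metrics."""
    if resource_type not in VALID_TYPES:
        raise ValueError(f"type must be one of {sorted(VALID_TYPES)}")
    if not metrics or len(metrics) > MAX_METRICS:
        raise ValueError(f"specify between 1 and {MAX_METRICS} metrics")
    # Metrics are passed as one comma-delimited query string parameter.
    params = {"metrics": ",".join(metrics)}
    if report_nan:
        params["reportNaN"] = "true"
    # Keep ':' and ',' unescaped so names such as k::partitionCount stay readable.
    query = urlencode(params, safe=":,")
    return f"{BASE_URL}/{resource_type}/{identifier}?{query}"


url = build_metrics_url(
    "clusters",
    "7b58eae9-2b72-420a-a544-32a404b70fd7",
    ["k::partitionCount", "k::leaderCount"],
)
```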

General Metrics

Aside from the Kafka-specific metrics, there are three generic node metrics that apply to all nodes, regardless of which application is running on them.

These metrics are:

  • n::cpuUtilization: Current CPU used as a percentage of total available. Maximum value is 100%, regardless of the number of cores on the node.
  • n::osLoad: Current OS load. Generally, a node is overloaded if the OS load is greater than or equal to the number of cores on the node.
  • n::diskUtilization: Total disk space used by Kafka, as a percentage of total disk space available.
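
The n::osLoad rule of thumb above can be written as a one-line check. This is only a sketch: the function name is hypothetical, and the core count must come from your own knowledge of the node size, since it is not one of the metrics listed here.

```python
def is_overloaded(os_load: float, core_count: int) -> bool:
    """Rule of thumb: a node is overloaded when OS load >= number of cores."""
    if core_count <= 0:
        raise ValueError("core_count must be positive")
    return os_load >= core_count


# Example: an OS load of 4.2 on a 4-core node counts as overloaded.
overloaded = is_overloaded(4.2, 4)
```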

Example: Endpoint to return the partition count for each node in the cluster with a UUID of 7b58eae9-2b72-420a-a544-32a404b70fd7.

Kafka Metrics

Kafka-specific metrics follow the format k::{metricName}.

The currently available metrics are:

  • k::kafkaBrokerState: The current state of the broker, represented as an integer:
    0. Not running
    1. Starting
    2. Recovering from unclean shutdown
    3. Running as broker
    6. Pending controlled shutdown
    7. Broker shutting down
  • k::underReplicatedPartitions: The number of partitions that do not have enough replicas to meet the desired replication factor.
  • k::activeControllerCount: The number of active controllers in the cluster. The active controller of a cluster is usually the first node to start up.
  • k::offlinePartitions: The number of partitions without an active leader. Offline partitions are not accessible, since read and write operations are only performed on the leader of a partition.
  • k::leaderElectionRate: The count, average, max, and one minute rate of leader elections per second.
  • k::uncleanLeaderElections: The number of failures to elect a suitable leader per second. If no suitable leader can be chosen (i.e. no available replicas are in sync), an out-of-sync replica is elected as leader, resulting in data loss proportional to how far out of sync the newly elected leader is.
  • k::produceRequestTime: The count, average, and max time taken to process requests from producers to send data. This is the sum of time spent waiting in the request queue, time spent being processed by the leader, time spent waiting for follower responses (only if the producer's acks setting, formerly request.required.acks, is all, i.e. -1), and time taken to send the response.
  • k::fetchConsumerRequestTime: The count, average, and max time taken to process requests from consumers to get new data. This is the sum of time spent waiting in the request queue, time spent being processed by the leader, time spent waiting for the leader to trigger sending the response (determined by fetch.min.bytes and fetch.wait.max.ms in the consumer configuration), and time taken to send the response.
  • k::fetchFollowerRequestTime: The count, average, and max time taken to process requests from Kafka brokers to get new data from partition leaders. This is the sum of time spent waiting in the request queue, time spent being processed by the leader, and time taken to send the response.
  • k::brokerTopicMessagesIn: The mean and one minute rate of incoming messages per second.
  • k::brokerTopicBytesIn: The mean and one minute rate of incoming bytes to the cluster.
  • k::brokerTopicBytesOut: The mean and one minute rate of outgoing bytes from the cluster.
  • k::produceMessageConversionsPerSec: The one minute rate, mean rate, and number of produce requests per second that require message format conversion.
  • k::fetchMessageConversionsPerSec: The one minute rate, mean rate, and number of fetch requests per second that require message format conversion.
  • k::isrShrinkRate: The one minute rate, mean rate, and number of decreases in the number of In-Sync Replicas (ISR) per second. This metric is expected to change when adding or removing nodes from a cluster.
  • k::isrExpandRate: The one minute rate, mean rate, and number of increases in the number of In-Sync Replicas (ISR) per second. This metric is expected to change when adding or removing nodes from a cluster.
  • k::underMinIsrPartitions: The number of partitions where the number of In-Sync Replicas (ISR) is less than the minimum number of in-sync replicas specified.
  • k::partitionCount: The number of partitions on a node. The number of partitions should be evenly distributed across all nodes in a cluster.
  • k::leaderCount: The number of partitions that a node is a leader for. The number of partition leaders should be evenly distributed across all nodes in a cluster.
  • k::producePurgatorySize: The number of produce requests currently waiting in purgatory.
  • k::fetchPurgatorySize: The number of fetch requests currently waiting in purgatory.
  • k::networkProcessorAvgIdlePercent: The average percentage of time the network processors are idle, expressed as a number between 0 and 1. Kafka’s network processor threads are responsible for reading and writing data to Kafka clients across the network.
  • k::requestHandlerAvgIdlePercent: The average percentage of time Kafka’s request handler threads are idle, expressed as a number between 0 and 1. Kafka’s request handler threads are responsible for servicing client requests, including reading and writing messages to disk.
  • k::slaProducerMessagesProcessed: The number of synthetic transaction messages being successfully produced to each broker.
  • k::slaProducerLatencyMs: The average and maximum time taken in milliseconds to send a synthetic transaction message to each broker that is successfully replicated to the required number of minimum in-sync replicas.
  • k::slaProducerErrors: The number of errors encountered when producing synthetic transaction messages.
  • k::slaConsumerLatency: The average and maximum time in milliseconds between a synthetic transaction message being sent by the producer and being received by the consumer.
  • k::slaConsumerRecordsProcessed: The number of synthetic transaction messages being successfully consumed and processed on each broker.
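
Because k::kafkaBrokerState is reported as an integer, a lookup table makes it easier to read in dashboards or alerts. The sketch below uses the state codes listed for that metric; the function name is hypothetical, and it assumes the value may arrive as a number or a numeric string.

```python
# State codes as listed for k::kafkaBrokerState.
BROKER_STATES = {
    0: "Not running",
    1: "Starting",
    2: "Recovering from unclean shutdown",
    3: "Running as broker",
    6: "Pending controlled shutdown",
    7: "Broker shutting down",
}


def describe_broker_state(value) -> str:
    """Translate a broker state value (int, float, or numeric string) to text."""
    code = int(float(value))
    return BROKER_STATES.get(code, f"Unknown state ({code})")


state = describe_broker_state(3)
```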

Example: Endpoint to return the partition count and leader count for every node in the cluster with a UUID of 7b58eae9-2b72-420a-a544-32a404b70fd7.

Response:

List Topics

To list all the topics for a cluster, use the following endpoint:

Example output

Per-Topic Metrics

Per-topic metric names follow the format kt::{topic}::{metricName}. Optionally, a ‘sub-type’ may be specified to return a specific part of the metric.

  • kt::{topic}::messagesInPerTopic: The rate of messages received by the topic. One sub-type must be specified. Available sub-types:
    • mean_rate: The average rate of messages received by the topic per second.
    • one_minute_rate: The one minute rate of messages received by the topic.
  • kt::{topic}::bytesOutPerTopic: The rate of outgoing bytes from the topic. One sub-type must be specified. Available sub-types:
    • mean_rate: The average rate of outgoing bytes from the topic per second.
    • one_minute_rate: The one minute rate of outgoing bytes from the topic.
  • kt::{topic}::bytesInPerTopic: The rate of incoming bytes to the topic per second. One sub-type must be specified. Available sub-types:
    • mean_rate: The average rate of incoming bytes to the topic per second.
    • one_minute_rate: The one minute rate of incoming bytes to the topic.
  • kt::{topic}::fetchMessageConversionsPerTopic: The amount and rate of fetch request messages which required message format conversions for the topic. One sub-type must be specified. Available sub-types:
    • count: The number of fetch request messages which required message format conversion.
    • mean_rate: The average rate of fetch request messages which required message format conversion per second.
    • one_minute_rate: The one minute rate of fetch request messages which required message format conversion.
  • kt::{topic}::produceMessageConversionsPerTopic: The amount and rate of produce request messages which required message format conversions for the topic. One sub-type must be specified. Available sub-types:
    • count: The number of produce request messages which required message format conversion.
    • mean_rate: The average rate of produce request messages which required message format conversion per second.
    • one_minute_rate: The one minute rate of produce request messages which required message format conversion.
  • kt::{topic}::failedFetchMessagePerTopic: The amount and rate of failed fetch requests to the topic. One sub-type must be specified. Available sub-types:
    • count: The number of failed fetch requests to the topic.
    • mean_rate: The average rate of failed fetch requests to the topic per second.
    • one_minute_rate: The one minute rate of failed fetch requests to the topic.
  • kt::{topic}::failedProduceMessagePerTopic: The amount and rate of failed produce requests to the topic. One sub-type must be specified. Available sub-types:
    • count: The number of failed produce requests to the topic.
    • mean_rate: The average rate of failed produce requests to the topic per second.
    • one_minute_rate: The one minute rate of failed produce requests to the topic.
  • kt::{topic}::diskUsage: The total size of the files on disk associated with the topic, summed across all partitions. Available sub-types:
    • disk_usage_kilobytes: The total size, in kilobytes, of the files on disk associated with the topic, summed across all partitions.
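
The per-topic metrics and their sub-types can be captured in a table so that invalid combinations are caught before a request is sent. This is a sketch under a loudly stated assumption: this section does not show how the sub-type is attached to the metric name, so the code assumes it is appended as a fourth ::-separated segment. The function name is hypothetical.

```python
RATES = {"mean_rate", "one_minute_rate"}
COUNT_AND_RATES = RATES | {"count"}

# Sub-types per metric, as listed above.
SUB_TYPES = {
    "messagesInPerTopic": RATES,
    "bytesOutPerTopic": RATES,
    "bytesInPerTopic": RATES,
    "fetchMessageConversionsPerTopic": COUNT_AND_RATES,
    # Spelling assumed by analogy with fetchMessageConversionsPerTopic.
    "produceMessageConversionsPerTopic": COUNT_AND_RATES,
    "failedFetchMessagePerTopic": COUNT_AND_RATES,
    "failedProduceMessagePerTopic": COUNT_AND_RATES,
    "diskUsage": {"disk_usage_kilobytes"},
}


def topic_metric_name(topic: str, metric: str, sub_type: str) -> str:
    """Build a per-topic metric name after checking the sub-type is allowed."""
    allowed = SUB_TYPES.get(metric)
    if allowed is None:
        raise ValueError(f"unknown per-topic metric: {metric}")
    if sub_type not in allowed:
        raise ValueError(f"{metric} supports only {sorted(allowed)}")
    # Assumption: the sub-type is appended as a fourth '::' segment.
    return f"kt::{topic}::{metric}::{sub_type}"


name = topic_metric_name("instaclustr-sla", "messagesInPerTopic", "mean_rate")
```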

Example: Endpoint to return the mean rate of incoming messages for the topic ‘instaclustr-sla’ in the cluster with a UUID of 418b62a1-5831-41f3-b74d-f670c2b5cf18.

Response:

Consumer Groups

To list the consumer groups for a Kafka cluster, use the following endpoint:

Response:

To retrieve the consumed topics and their clients, use the following endpoint:

Response: A JSON object with the consumed topics as keys and the clients consuming each topic as values. In the following example, client-1 and client-2 are consuming from test-topic.

Example:

Note that this value is always a view of the clients that are currently live; it does not expose historical data. To retrieve consumer group client metrics for clients that have since been decommissioned, you need to keep track of the client schema yourself.
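
One way to keep track of clients over time is to merge each live snapshot into a running record. The sketch below is illustrative only: the payloads are hypothetical, it assumes each topic's value is a JSON array of client IDs (the exact response shape is not shown here), and the function names are made up for this example.

```python
import json


def clients_by_topic(body: str) -> dict[str, list[str]]:
    """Parse a response body into {topic: [client IDs]}."""
    data = json.loads(body)
    return {topic: list(clients) for topic, clients in data.items()}


def remember_clients(history: dict[str, set[str]], body: str) -> dict[str, set[str]]:
    """Merge one live snapshot into a record of every client seen per topic."""
    for topic, clients in clients_by_topic(body).items():
        history.setdefault(topic, set()).update(clients)
    return history


# Hypothetical payloads; assumes each value is a JSON array of client IDs.
snapshot_1 = '{"test-topic": ["client-1", "client-2"]}'
snapshot_2 = '{"test-topic": ["client-2"]}'  # client-1 is no longer live

history: dict[str, set[str]] = {}
remember_clients(history, snapshot_1)
remember_clients(history, snapshot_2)
print(sorted(history["test-topic"]))
```

The record still contains client-1 after the second snapshot, so its client ID remains available for later metric queries.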

Consumer Group Client Metrics

All metrics are reported under a consumer group and the consumed topic aggregated at a client level. A client within a consumer group is a logical grouping defined by setting the client.id configuration on a consumer.

The endpoint:

Available Metrics:

  • consumerLag: The sum of consumer lag reported by all consumers with the same client ID.
  • partitionCount: The total number of partitions assigned to consumers with the same client ID.
  • consumerCount: The total number of consumers with the same client ID.

To retrieve metrics, the consumerGroup and topic parameters must be defined.
To control which metrics are retrieved, set the metrics parameter to a comma-separated list of the names above. For example, to retrieve both consumer lag and partition count, set metrics to consumerLag,partitionCount. To retrieve only consumer lag, set metrics to consumerLag.
The clientID parameter is optional; if it is not defined, metrics for all live clients are retrieved. If the consumer group has a large number of unique clients, defining clientID is recommended for faster metric retrieval.
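
The parameter rules above can be sketched as a query string builder. The parameter names (consumerGroup, topic, metrics, clientID) and the metric names come from this section; treating them as URL query parameters is an assumption, the endpoint path is deliberately left out because it is not shown here, and the function name is hypothetical.

```python
from urllib.parse import urlencode

CLIENT_METRICS = {"consumerLag", "partitionCount", "consumerCount"}


def client_metrics_query(consumer_group, topic, metrics, client_id=None):
    """Build the query string for consumer group client metrics."""
    if not consumer_group or not topic:
        raise ValueError("consumerGroup and topic must be defined")
    unknown = set(metrics) - CLIENT_METRICS
    if not metrics or unknown:
        raise ValueError(f"metrics must be chosen from {sorted(CLIENT_METRICS)}")
    params = {
        "consumerGroup": consumer_group,
        "topic": topic,
        "metrics": ",".join(metrics),
    }
    # clientID is optional; omitting it requests metrics for all live clients.
    if client_id is not None:
        params["clientID"] = client_id
    return urlencode(params, safe=",")


query = client_metrics_query("group-1", "test-topic", ["consumerLag", "partitionCount"])
```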

Response:

Consumer Group Metrics

All metrics are reported under a consumer group and the consumed topic aggregated at a group level.

To retrieve consumer group level metrics, use the following endpoint:

Available Metrics:

  • consumerGroupLag: The sum of consumer lag reported by all consumers within the consumer group.
  • clientCount: The total number of unique clients within the consumer group.

Response:
