Kafka Metrics Exposed in the Monitoring API

Kafka-specific metrics in the monitoring API begin with the k:: prefix, e.g. k::underReplicatedPartitions.

Authentication

All requests to the API must use Basic Authentication and contain a valid username and monitoring API key. API keys are created per user account and can be retrieved via the Instaclustr Console from the Account > API Key tab.
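
As a minimal sketch of what Basic Authentication means in practice, the header can be built by base64-encoding the username and API key joined by a colon. The function name and the credentials below are hypothetical placeholders, not part of the API.

```python
import base64


def basic_auth_header(username: str, api_key: str) -> dict:
    """Return an HTTP Basic Authentication header for the given credentials."""
    # Basic auth is "username:password" base64-encoded; here the
    # monitoring API key takes the place of the password.
    raw = f"{username}:{api_key}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder credentials for illustration only.
headers = basic_auth_header("my-user", "my-api-key")
```

Most HTTP clients can also build this header themselves when given the username and key directly.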

Metrics

Metrics are requested by constructing a GET request, consisting of:

  • type: Either ‘clusters’, ‘datacentres’ or ‘nodes’.
    • ‘clusters’ returns the metrics for each node in the cluster.
    • ‘datacentres’ returns the metrics for each node belonging to the datacentre.
    • ‘nodes’ returns the metrics for a specific node.
  • UUID or public IP: If the type is set to ‘clusters’ or ‘datacentres’, then the UUID of the cluster or datacentre must be specified. However, if the type is set to ‘nodes’, then either the node’s UUID or public IP may be specified.
  • metrics: The metrics to return are specified as a comma-delimited query string parameter. Up to 20 metrics may be specified.
  • reportNaN: (true|false) Determines how NaN or null metric values are reported. The default is false, in which case NaN and null values are reported as 0. Setting ‘reportNaN=true’ returns them as NaN in the API response.
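
The request components above could be assembled roughly as follows. This is a sketch under stated assumptions: the base URL is a placeholder, and the path layout (type, then UUID or IP) is assumed rather than stated in this document; only the parameter names `metrics` and `reportNaN` and the 20-metric limit come from the list above.

```python
from urllib.parse import urlencode

# Assumption: placeholder base URL; substitute the real monitoring API address.
BASE_URL = "https://api.example.com/monitoring/v1"
VALID_TYPES = {"clusters", "datacentres", "nodes"}
MAX_METRICS = 20  # limit stated in the parameter list above


def metrics_url(resource_type, identifier, metrics, report_nan=False):
    """Build a GET URL for the given type, UUID (or node IP) and metric names."""
    if resource_type not in VALID_TYPES:
        raise ValueError(f"type must be one of {sorted(VALID_TYPES)}")
    if not 1 <= len(metrics) <= MAX_METRICS:
        raise ValueError(f"between 1 and {MAX_METRICS} metrics may be requested")
    query = {"metrics": ",".join(metrics)}
    if report_nan:
        query["reportNaN"] = "true"
    # Keep ':' and ',' unescaped so metric names stay readable in the URL.
    # Assumed path layout: {base}/{type}/{identifier}.
    return f"{BASE_URL}/{resource_type}/{identifier}?{urlencode(query, safe=':,')}"


url = metrics_url(
    "clusters",
    "7b58eae9-2b72-420a-a544-32a404b70fd7",
    ["k::partitionCount", "k::leaderCount"],
)
```

Validating the metric count on the client side avoids sending a request that exceeds the documented limit.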

General Metrics

Aside from the Kafka-specific metrics, there are three generic node metrics which apply to all nodes regardless of which application is running on them.

These metrics are:

  • n::cpuUtilization: Current CPU used as a percentage of total available. Maximum value is 100%, regardless of the number of cores on the node.
  • n::osLoad: Current OS load. Generally, a node is overloaded if the OS load is greater than or equal to the number of cores on the node.
  • n::diskUtilization: Total disk space used by Kafka, as a percentage of total disk space available.
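
The n::osLoad rule of thumb can be expressed as a small check. This is only an illustrative sketch: the function name is hypothetical, and the core count is assumed to be supplied by the caller, since none of the metrics above report it.

```python
def is_overloaded(os_load: float, core_count: int) -> bool:
    """Apply the rule of thumb: overloaded when OS load >= number of cores."""
    if core_count < 1:
        raise ValueError("core_count must be at least 1")
    return os_load >= core_count


# A 4-core node with a load of 4.2 is flagged; the same node at 2.5 is not.
flagged = is_overloaded(4.2, 4)
healthy = not is_overloaded(2.5, 4)
```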

Example: Endpoint to return the partition count for each node in the cluster with a UUID of 7b58eae9-2b72-420a-a544-32a404b70fd7.

Kafka Metrics

Kafka-specific metrics follow the format k::{metricName}.

The currently available metrics are:

  • k::kafkaBrokerState: The current state of the broker represented as an Integer:
    0. Not running
    1. Starting
    2. Recovering from unclean shutdown
    3. Running as broker
    6. Pending controlled shutdown
    7. Broker shutting down
  • k::underReplicatedPartitions: The number of partitions that do not have enough replicas to meet the desired replication factor.
  • k::activeControllerCount: The number of active controllers in the cluster. The active controller of a cluster is usually the first node to start up.
  • k::offlinePartitions: The number of partitions without an active leader. Any partitions that are offline will not be accessible, since read and write operations are only performed on the leader of a partition.
  • k::leaderElectionRate: The count, average, max, and one minute rate of leader elections per second.
  • k::uncleanLeaderElectionsPerSec: The number of failures to elect a suitable leader per second. In the case that no suitable leader can be chosen (i.e. no available replicas are in sync), an out-of-sync replica will be elected as leader, resulting in data loss that is proportional to how out-of-sync the newly elected leader is.
  • k::produceRequestTime: The count, average, and max time taken to process requests from producers to send data. This is the sum of time spent waiting in the request queue, time spent being processed by the leader, time spent waiting for follower responses (only if requests.required.acks = -1, i.e. all in-sync replicas must acknowledge), and time taken to send the response.
  • k::fetchConsumerRequestTime: The count, average, and max time taken to process requests from consumers to get new data. This is the sum of time spent waiting in the request queue, time spent being processed by the leader, time spent waiting for the leader to trigger sending the response (determined by fetch.min.bytes and fetch.max.wait.ms in the consumer configuration), and time taken to send the response.
  • k::fetchFollowerRequestTime: The count, average, and max time taken to process requests from Kafka brokers to get new data from partition leaders. This is the sum of time spent waiting in the request queue, time spent being processed by the leader, and time taken to send the response.
  • k::brokerTopicMessagesIn: The mean and one minute rate of incoming messages per second.
  • k::brokerTopicBytesIn: The mean and one minute rate of incoming bytes to the cluster.
  • k::brokerTopicBytesOut: The mean and one minute rate of outgoing bytes from the cluster.
  • k::produceMessageConversionsPerSec: The one minute rate, mean rate, and number of produce requests per second that require message format conversion.
  • k::fetchMessageConversionsPerSec: The one minute rate, mean rate, and number of fetch requests per second that require message format conversion.
  • k::isrShrinkRate: The one minute rate, mean rate, and number of decreases in the number of In-Sync Replicas (ISR) per second. This metric is expected to change when adding or removing nodes from a cluster.
  • k::isrExpandRate: The one minute rate, mean rate, and number of increases in the number of In-Sync Replicas (ISR) per second. This metric is expected to change when adding or removing nodes from a cluster.
  • k::underMinIsrPartitions: The number of partitions where the number of In-Sync Replicas (ISR) is less than the minimum number of in-sync replicas specified.
  • k::partitionCount: The number of partitions on a node. The number of partitions should be evenly distributed across all nodes in a cluster.
  • k::leaderCount: The number of partitions that a node is a leader for. The number of partition leaders should be evenly distributed across all nodes in a cluster.
  • k::producePurgatorySize: The number of produce requests currently waiting in purgatory.
  • k::fetchPurgatorySize: The number of fetch requests currently waiting in purgatory.
  • k::networkProcessorAvgIdlePercent: The average percentage of time the network processors are idle, expressed as a number between 0 and 1. Kafka’s network processor threads are responsible for reading and writing data to Kafka clients across the network.
  • k::requestHandlerAvgIdlePercent: The average percentage of time Kafka’s request handler threads are idle, expressed as a number between 0 and 1. Kafka’s request handler threads are responsible for servicing client requests, including reading and writing messages to disk.
  • k::replicationLag: The number of messages by which a replica lags behind the partition leader.
  • k::slaProducerMessagesProcessed: The number of synthetic transaction messages being successfully produced to each broker.
  • k::slaProducerLatencyMs: The average and maximum time taken in milliseconds to send a synthetic transaction message to each broker that is successfully replicated to the required number of minimum in-sync replicas.
  • k::slaProducerErrors: The number of errors encountered when producing synthetic transaction messages.
  • k::slaConsumerLatency: The average and maximum time in milliseconds between a synthetic transaction message being sent by the producer and being received by the consumer.
  • k::slaConsumerRecordsProcessed: The number of synthetic transaction messages being successfully consumed and processed on each broker.
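
Because k::kafkaBrokerState is reported as an integer, a lookup table makes the values easier to read in alerts or dashboards. The sketch below uses only the state codes listed above; the function name is hypothetical, and it assumes the value may arrive as a number or as a numeric string.

```python
# State codes as listed for k::kafkaBrokerState above (4 and 5 are not listed).
BROKER_STATES = {
    0: "Not running",
    1: "Starting",
    2: "Recovering from unclean shutdown",
    3: "Running as broker",
    6: "Pending controlled shutdown",
    7: "Broker shutting down",
}


def describe_broker_state(value) -> str:
    """Translate a reported broker state into its description."""
    # Assumption: the value may be an int, a float, or a numeric string.
    code = int(float(value))
    return BROKER_STATES.get(code, f"Unknown state ({code})")


state = describe_broker_state("3")
```

A healthy broker is expected to report state 3; any other value is worth investigating.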

Example: Endpoint to return the partition count and leader count for every node in the cluster with a UUID of 7b58eae9-2b72-420a-a544-32a404b70fd7.

Response:

Per-Topic Metrics

Per-topic metric names follow the format kt::{topic}::{metricName}. Optionally, a ‘sub-type’ may be specified to return a specific part of the metric.

  • kt::{topic}::messagesInPerTopic: The rate of messages received by the topic. One sub-type must be specified. Available sub-types:
    • mean_rate: The average rate of messages received by the topic per second.
    • one_minute_rate: The one minute rate of messages received by the topic.
  • kt::{topic}::bytesOutPerTopic: The rate of outgoing bytes from the topic. One sub-type must be specified. Available sub-types:
    • mean_rate: The average rate of outgoing bytes from the topic per second.
    • one_minute_rate: The one minute rate of outgoing bytes from the topic.
  • kt::{topic}::bytesInPerTopic: The rate of incoming bytes to the topic per second. One sub-type must be specified. Available sub-types:
    • mean_rate: The average rate of incoming bytes to the topic per second.
    • one_minute_rate: The one minute rate of incoming bytes to the topic.
  • kt::{topic}::fetchMessageConversionsPerTopic: The amount and rate of fetch request messages which required message format conversions for the topic. One sub-type must be specified. Available sub-types:
    • count: The number of fetch request messages which required message format conversion.
    • mean_rate: The average rate of fetch request messages which required message format conversion per second.
    • one_minute_rate: The one minute rate of fetch request messages which required message format conversion.
  • kt::{topic}::produceMessageConversionsPerTopic: The amount and rate of produce request messages which required message format conversions for the topic. One sub-type must be specified. Available sub-types:
    • count: The number of produce request messages which required message format conversion.
    • mean_rate: The average rate of produce request messages which required message format conversion per second.
    • one_minute_rate: The one minute rate of produce request messages which required message format conversion.
  • kt::{topic}::failedFetchMessagePerTopic: The amount and rate of failed fetch requests to the topic. One sub-type must be specified. Available sub-types:
    • count: The number of failed fetch requests to the topic.
    • mean_rate: The average rate of failed fetch requests to the topic per second.
    • one_minute_rate: The one minute rate of failed fetch requests to the topic.
  • kt::{topic}::failedProduceMessagePerTopic: The amount and rate of failed produce requests to the topic. One sub-type must be specified. Available sub-types:
    • count: The number of failed produce requests to the topic.
    • mean_rate: The average rate of failed produce requests to the topic per second.
    • one_minute_rate: The one minute rate of failed produce requests to the topic.
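
The per-topic metrics and their sub-types lend themselves to a small validating helper. This is a hedged sketch: the table of allowed sub-types is taken from the list above, but the way a sub-type is attached to the metric name (appended with a further `::` separator) is an assumption, as is the helper's name.

```python
RATES = {"mean_rate", "one_minute_rate"}
COUNT_AND_RATES = RATES | {"count"}

# Allowed sub-types per metric, as listed above.
SUB_TYPES = {
    "messagesInPerTopic": RATES,
    "bytesOutPerTopic": RATES,
    "bytesInPerTopic": RATES,
    "fetchMessageConversionsPerTopic": COUNT_AND_RATES,
    "produceMessageConversionsPerTopic": COUNT_AND_RATES,
    "failedFetchMessagePerTopic": COUNT_AND_RATES,
    "failedProduceMessagePerTopic": COUNT_AND_RATES,
}


def topic_metric(topic: str, metric_name: str, sub_type: str) -> str:
    """Build a per-topic metric name after validating the sub-type."""
    allowed = SUB_TYPES.get(metric_name)
    if allowed is None:
        raise ValueError(f"unknown per-topic metric: {metric_name}")
    if sub_type not in allowed:
        raise ValueError(f"{metric_name} supports only {sorted(allowed)}")
    # Assumption: the sub-type is appended with another '::' separator.
    return f"kt::{topic}::{metric_name}::{sub_type}"


name = topic_metric("instaclustr-sla", "messagesInPerTopic", "mean_rate")
```

Rejecting an unsupported sub-type locally (for example `count` on messagesInPerTopic) gives a clearer error than a failed request would.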

Example: Endpoint to return the mean rate of incoming messages for the topic ‘instaclustr-sla’ in the cluster with a UUID of 418b62a1-5831-41f3-b74d-f670c2b5cf18.

Response:
