Cassandra Metrics


Non-table metrics follow the format n::{metricName}.

Each metric type will contain the latest available measurement.
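As a brief illustration of how these metrics can be retrieved, the sketch below requests two of the node metrics described in this section. It assumes a node-level monitoring endpoint of the form https://api.instaclustr.com/monitoring/v1/nodes/{nodeId} with HTTP basic authentication; the endpoint path, credentials, and response layout are assumptions to adapt to your own account.

    # Sketch: fetch the latest n::reads and n::writes measurements for one node.
    # The endpoint path and authentication details below are assumptions.
    import requests

    NODE_ID = "your-node-id"             # placeholder
    USERNAME = "your-username"           # placeholder
    MONITORING_API_KEY = "your-api-key"  # placeholder

    response = requests.get(
        f"https://api.instaclustr.com/monitoring/v1/nodes/{NODE_ID}",
        params={"metrics": "n::reads,n::writes"},
        auth=(USERNAME, MONITORING_API_KEY),
    )
    response.raise_for_status()
    print(response.json())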

  • n::reads – Reads per second by Cassandra.
    • Expected range: 1-10000 reads per second (dependent on node type)
    • Impacting factors: Hardware capacity and configuration, client request load, compaction strategy, overall cluster health
    • Troubleshooting: Focus on a problematic area – e.g. unusual load, cluster operations, high compaction, GC activity.
  • n::writes – Writes per second by Cassandra.
    • Expected range: 1-10000 writes per second (dependent on node type)
    • Impacting factors: Hardware capacity and configuration, client request load, compaction strategy, overall cluster health 
    • Troubleshooting: Focus on a problematic area – e.g. unusual load, cluster operations, high compaction, GC activity. 
  • n::compactions: Number of pending compactions.
    • Expected range: 10 → 100 (depends on the node size)
    • Impacting factors: Cluster migrations or a high-volume data write.
    • Troubleshooting: Adjust compaction throughput using nodetool setcompactionthroughput (a value of 0 removes the throttle).
  • n::clientRequestRead: Offers the percentile distribution and average latency per client read request (i.e. the period from when a node receives a client request, gathers the records, and responds to the client). Available sub-types:
    • 95thPercentile – 95th percentile distribution of clientRequestRead
    • 99thPercentile – 99th percentile distribution of clientRequestRead
    • Expected range: 5 ms → 200 ms 
    • Impacting factors: Hardware capacity and configuration, client request load, compaction strategy, overall cluster health
    • Troubleshooting: Focus on a problematic area – e.g. unusual load, cluster operations, high compaction, GC activity.
  • n::clientRequestWrite: Offers the percentile distribution and average latency per client write request (i.e. the period from when a node receives a client request, gathers the records, and responds to the client). Available sub-types:
    • 95thPercentile – 95th percentile distribution of clientRequestWrite
    • 99thPercentile – 99th percentile distribution of clientRequestWrite
    • Expected range: 5 ms → 200 ms 
    • Impacting factors: Hardware capacity and configuration, client request load, compaction strategy, overall cluster health 
    • Troubleshooting: Focus on a problematic area – e.g. unusual load, cluster operations, high compaction, GC activity. 
  • n::rangeSlices – Range Slice reads by Cassandra
  • n::casReads – Compare and Set reads by Cassandra
  • n::casWrites – Compare and Set writes by Cassandra
  • n::clientRequestRangeSlice – Offers the percentile distribution and average latency per client range slice read request (i.e. the period from when a node receives a client request, gathers the records, and responds to the client). Available sub-types:
    • latency_per_operation – Latency per clientRequestRangeSlices read
    • 95thPercentile – 95th percentile distribution of clientRequestRangeSlices
    • 99thPercentile – 99th percentile distribution of clientRequestRangeSlices
    • Expected range: 10ms → 300ms
    • Troubleshooting: Evaluate range queries usage
    • Impacting factors: Data modelling, overall cluster health, configuration.
  • n::clientRequestCasRead – Offers the percentile distribution and average latency per client CAS read request (i.e. the period from when a node receives a client request, gathers the records, and responds to the client). Available sub-types:
    • 95thPercentile – 95th percentile distribution of clientRequestCasRead
    • 99thPercentile – 99th percentile distribution of clientRequestCasRead
    • Expected range: 10ms → 300ms 
    • Impacting factors: CAS query, data modelling, overall cluster health, configuration 
  • n::clientRequestCasWrite – Offers the percentile distribution and average latency per client CAS write request (i.e. the period from when a node receives a client request, gathers the records, and responds to the client). Available sub-types:
    • 95thPercentile – 95th percentile distribution of clientRequestCasWrite
    • 99thPercentile – 99th percentile distribution of clientRequestCasWrite
    • Expected range: 10ms – 300ms 
    • Impacting factors: CAS query, data modelling, overall cluster health, configuration. 
  • n::slalatency – Monitors our SLA latency and alerts when it is above a threshold level. Available sub-types:
    • sla_read – Latency of synthetic read queries against an Instaclustr canary table.
    • sla_write – Latency of synthetic write queries against an Instaclustr canary table.
    • Expected range: up to 120ms for SLA reads, up to 20ms for SLA writes (if the cluster is not under excessive load)
    • Impacting factors: High values can indicate that Cassandra is struggling to perform reads/writes in a timely manner. This implies that normal queries will also be experiencing higher-than-normal latency.
    • Troubleshooting: Check CPU, IO wait, levels of traffic, etc. May not be an issue depending on latency requirements; a minimal polling sketch is shown after this list.
  • Thread pool metrics
    • Description: Cassandra maintains distinct thread pools for different stages of execution. Each of the thread pools provides statistics on the number of tasks that are active, pending, and completed. The pending and blocked tasks in each pool are an indicator of some error or lack of capacity. The thread pool values can be monitored to establish a baseline for normal usage and peak load.
    • Impacting factors: Load on the node and cluster, hardware capacity, configuration tuning for each individual functionality.
    • Troubleshooting: If there is any specific error with a stage it should be resolved. The persistence of blocked and pending tasks indicates lack of capacity. The hardware capacity for a cluster should be carefully scaled considering Cassandra horizontal scaling guidelines. 

    Below are the thread pool names

    • n::readstage – The Read Stage metric represents Cassandra conducting reads from the local disk or cache.
    • n::mutationstage – The Mutation Stage metric represents local writes (mutations) applied by Cassandra.
    • n::nativetransportrequest – The Native Transport Request metric represents client CQL requests. If the requests are blocked by other Cassandra operations, this metric will display abnormal values.
    • n::countermutationstage – Responsible for counter writes.
  • n::rpcthread – The maximum number of concurrent requests from clients.
    • Expected range: Varies based on activity. Blocked native transport requests should be 0.
    • Impacting factor: Level and nature of traffic, CPU available.
    • Troubleshooting:
      • Read and mutation stages: there should be 0 items pending. If this number is higher, the cluster is struggling with the current level of traffic and some reads/writes are waiting to be processed – high latency can be expected.
      • Native transport requests: native transport requests are any requests made via the CQL protocol – i.e. normal client traffic. There is a limited number of threads available to process incoming requests. When all threads are in use, some requests wait in a queue (shown as pending in the graph). If the queue fills up, some requests are silently rejected (shown as blocked in the graph). The server never replies, so this eventually causes a client-side timeout. The main way to prevent blocked native transport requests is to throttle load so the requests are spread over a longer period. If that is not possible, scaling horizontally can help. (Tuning queue settings on the server side can sometimes help, but only if the queues are saturated for an extremely brief period.)
  • n::droppedmessage – The Dropped Messages metric represents the total number of dropped messages across all stages of the SEDA (staged event-driven architecture).
    • Expected range: 0
    • Impacting factors: Load on the cluster or a particular node, configuration settings, data model
    • Troubleshooting: Identify root cause from ‘Impacting factors’ above. Possible solutions:
      • Increase hardware capacity for a node or the number of nodes in the cluster.
      • Tune buffers and caches.
      • Revisit the data model if the issue originates from the data model.
  • n::hintstotal – Number of hint messages written on the node since the Cassandra service last started.
    • n::hintssucceeded –  Number of hints successfully delivered.
      • Expected range: Usually zero.
      • Impacting factor: A non-zero value indicates nodes are being restarted or experiencing problems.
      • Troubleshooting: Often caused by long GC pauses, causing nodes to be seen as down for brief periods. Can indicate a serious problem.
    • n::hintsfailed – Number of hints that failed delivery.
      • Expected range: 0
      • Impacting factors: Overall load, cluster health, node uptime.
      • Troubleshooting: Check the corresponding Dropwizard metric name, JMX MBean, and metric type for further detail.
    • n::hintstimedout – Number of hints that timed out during delivery.
      • Expected range: 0
      • Troubleshooting:
          • Repair the node to make data consistent and use regular repairs on the cluster
          • Increase max_hint_window_in_ms on all nodes if a node is expected to be unavailable for longer during a scheduled migration or disaster recovery
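Building on the SLA latency thresholds above (roughly 120 ms for SLA reads and 20 ms for SLA writes), the sketch below polls n::slalatency and prints an alert when either sub-type exceeds its limit. It reuses the assumed node endpoint from the earlier example, and the response parsing reflects an assumed payload shape that should be adjusted to the actual API output.

    # Sketch: threshold alerting on n::slalatency sub-types.
    # The endpoint path and payload shape are assumptions.
    import requests

    THRESHOLDS_MS = {"sla_read": 120.0, "sla_write": 20.0}  # from the expected ranges above

    def check_sla_latency(node_id, username, api_key):
        resp = requests.get(
            f"https://api.instaclustr.com/monitoring/v1/nodes/{node_id}",  # assumed path
            params={"metrics": "n::slalatency"},
            auth=(username, api_key),
        )
        resp.raise_for_status()
        # Assumed shape: [{"payload": [{"type": "sla_read", "values": [{"value": "12.3"}]}]}]
        for entry in resp.json():
            for metric in entry.get("payload", []):
                sub_type = metric.get("type")
                values = metric.get("values") or []
                if sub_type in THRESHOLDS_MS and values:
                    latest_ms = float(values[-1]["value"])
                    if latest_ms > THRESHOLDS_MS[sub_type]:
                        print(f"ALERT: {sub_type} = {latest_ms:.1f} ms "
                              f"(limit {THRESHOLDS_MS[sub_type]:.0f} ms)")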

Hints Created Metrics

Hints Created metrics return the number of hints created on a node for each of the other nodes in the cluster. Metric results can be requested at a cluster/node level. Hint Created metrics follow the format hc with no additional parameters.

Cluster level:
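Assuming the monitoring endpoint layout used above, a cluster-level request would take a form like the following (the path is an assumption):

    GET https://api.instaclustr.com/monitoring/v1/clusters/{clusterId}?metrics=hc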

Node level: 
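Under the same assumed layout:

    GET https://api.instaclustr.com/monitoring/v1/nodes/{nodeId}?metrics=hc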

An example node level request and response:
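Using the node referenced in the result description below, and the same assumed endpoint, the request would look something like:

    GET https://api.instaclustr.com/monitoring/v1/nodes/6a81fe78-cfe8-480d-bd0c-ebb52ac6ab70?metrics=hc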

In the example result below, hints have been created for the node with IP address 3.229.114.4, on node 6a81fe78-cfe8-480d-bd0c-ebb52ac6ab70 (IP address 3.230.52.66).
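An illustrative response for that request is sketched below. The node ID and the two IP addresses come from the description above; the field names and the hint count are placeholders for an assumed response layout.

    [
      {
        "id": "6a81fe78-cfe8-480d-bd0c-ebb52ac6ab70",
        "publicIp": "3.230.52.66",
        "payload": [
          {
            "metric": "hintsCreated",
            "targetNodeIp": "3.229.114.4",
            "values": [ { "value": "12", "time": "..." } ]
          }
        ]
      }
    ]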

Table Metrics

Table metric names follow the format cf::{keyspace}::{table}::{metricType}. Optionally, a ‘sub-type’ may be specified to return a specific part of the metric. For example, requesting cf::{keyspace}::{table}::readLatencyDistribution will return the various distributions of the read latency metric, while specifying the 50thPercentile sub-type will only return the 50th percentile distribution of the read latency metric.
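For illustration, assuming the node-level endpoint used earlier and that a sub-type is appended to the metric name, such a request could look like:

    GET https://api.instaclustr.com/monitoring/v1/nodes/{nodeId}?metrics=cf::{keyspace}::{table}::readLatencyDistribution::50thPercentile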

Each metric type will contain the latest available measurement.

  • cf::{keyspace}::{table}::readLatencyDistribution: Measurement of local read latency for the table, on the individual node. Available sub-types:
    • 50thPercentile: 50th percentile distribution of read latency
    • 75thPercentile: 75th percentile distribution of read latency
    • 95thPercentile: 95th percentile distribution of read latency
    • 99thPercentile: 99th percentile distribution of read latency
    • Expected range: 5 ms → 200 ms
    • Impacting factors: Hardware capacity and configuration, client request load, compaction strategy, overall cluster health
    • Troubleshooting: Focus on a problematic area – e.g. unusual load, cluster operations, high compaction, GC activity.
  • cf::{keyspace}::{table}::reads: General measurements of local read latency for the table, on the individual node. Available sub-types:
    • latency_per_operation: Average local read latency per second
    • count_per_second: Reads of the table performed on the individual node
  • cf::{keyspace}::{table}::writeLatencyDistribution: Metrics for local write latency for the table, on the individual node. Available sub-types:
    • 50thPercentile: 50th percentile distribution of write latency
    • 75thPercentile: 75th percentile distribution of write latency
    • 95thPercentile: 95th percentile distribution of write latency
    • 99thPercentile: 99th percentile distribution of write latency
    • Expected range: 5 ms → 200 ms 
    • Monitoring: Continuous monitoring, with alerting if the latency exceeds the expected range.
    • Impacting factors: Hardware capacity and configuration, client request load, compaction strategy, overall cluster health
    • Troubleshooting: Focus on a problematic area – e.g. unusual load, cluster operations, high compaction, GC activity.
  • cf::{keyspace}::{table}::writes: General measurements of local write latency for the table, on the individual node. Available sub-types:
    • latency_per_operation: Average local write latency per second
    • count_per_second: Writes to the table performed on the individual node
  • cf::{keyspace}::{table}::sstablesPerRead: SSTables accessed per read of the table on the individual node. Available sub-types:
    • average: Average SSTables accessed per read
    • max: Maximum SSTables accessed per read
    • Expected range: Less than 10.
    • Impacting factors: Data model, compaction strategy, write volume, repair operation.
    • Troubleshooting: Configure an optimal compaction strategy for the table and use compaction-specific tools. Repair the cluster regularly. Revisit the data model if this is a frequent issue and other solutions do not rectify it.
  • cf::{keyspace}::{table}::tombstonesPerRead: Tombstoned cells accessed per read of the table on the individual node. Available sub-types:
    • average: Average tombstones accessed per read
    • max: Maximum tombstones accessed per read
    • Expected range: Zero if there are no data deletion queries; otherwise low.
    • Impacting factors: Data model, query pattern, compaction strategy, repair status
    • Troubleshooting: Consider an optimal compaction strategy for the table, or configure the compaction strategy for aggressive tombstone eviction; use methods for tombstone eviction such as major compaction or nodetool garbagecollect (beware of all implications, as these methods are resource-intensive and drastically change the SSTable storage); revisit the data model and access pattern.
  • cf::{keyspace}::{table}::liveCellsPerRead: Live cells accessed per read of the table on the individual node. Available sub-types:
    • average: Average live cells accessed per read
    • max: Maximum live cells accessed per read
  • cf::{keyspace}::{table}::partitionSize: The size of partitions in the specified table, in KB. Available sub-types:
    • average: Average partition size
    • max: Maximum partition size
    • Expected range: 1KB → 10MB ideal range (100MB at maximum) 
    • Impacting factors: Data model, query pattern.
    • Troubleshooting: Revisit query pattern for amount of data included in a single partition for the table. If no quick fix is applicable, revisit data model. 
  • cf::{keyspace}::{table}::diskUsed: Live and total disk used by the table. Available sub-types:
    • livediskspaceused: Disk used by live cells
    • totaldiskspaceused: Disk used by both live cells and tombstones

Listing Monitored Tables

A list of monitored tables, grouped by keyspace, can be generated by making a GET request to:
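The path here is assumed to follow the cluster-level monitoring layout used above, with a tables-listing segment appended:

    GET https://api.instaclustr.com/monitoring/v1/clusters/{clusterId}/tables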

The API will respond with the following packet:

Example: Response packet listing monitored tables
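An illustrative packet, assuming a simple mapping of keyspace names to their monitored tables (the field layout, keyspace names, and table names are placeholders):

    {
      "mykeyspace": ["mytable", "another_table"],
      "system_auth": ["roles", "role_permissions"]
    }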

Clusters

Requesting ‘cluster’ metrics returns the requested measurements for each provisioned node in the cluster and follows the same format as the ‘nodes’ endpoint. All node metrics are available for use.

For example, this request:
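Assuming the cluster endpoint layout sketched above, with two of the node metrics from this page:

    GET https://api.instaclustr.com/monitoring/v1/clusters/{clusterId}?metrics=n::reads,n::writes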

would return the following response packet:
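An illustrative response, assuming one such entry per provisioned node with the latest measurement for each requested metric (the node ID, field names, and values are placeholders):

    [
      {
        "id": "node-id-1",
        "payload": [
          { "metric": "n::reads",  "values": [ { "value": "...", "time": "..." } ] },
          { "metric": "n::writes", "values": [ { "value": "...", "time": "..." } ] }
        ]
      }
    ]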
