Monitoring API Metrics Troubleshooting


This page contains information that will assist in using, interpreting and troubleshooting the Cassandra metrics returned by the Monitoring API.

More information on the API definitions for the metrics and Monitoring API endpoints can be found here.

Non-table metrics

  • n::reads
    • Expected range: 1-10000 reads a second (dependent on node type)
    • Impacting factors: Hardware capacity and configuration, client request load, compaction strategy, overall cluster health
    • Troubleshooting: Focus on a problematic area – e.g. unusual load, cluster operations, high compaction, GC activity.

  • n::writes
    • Expected range: 1-10000 writes a second (dependent on node type)
    • Impacting factors: Hardware capacity and configuration, client request load, compaction strategy, overall cluster health 
    • Troubleshooting: Focus on a problematic area – e.g. unusual load, cluster operations, high compaction, GC activity. 

  • n::compactions
    • Expected range: 10 → 100 (depends on the node size)
    • Impacting factors: Cluster migrations or a high-volume data write.
    • Troubleshooting: Adjust the compaction throughput limit using nodetool setcompactionthroughput <value in MB/s>. A value of 0 (nodetool setcompactionthroughput 0) disables throttling.

  • n::clientRequestRead
    • Expected range: 5 ms → 200 ms 
    • Impacting factors: Hardware capacity and configuration, client request load, compaction strategy, overall cluster health
    • Troubleshooting: Focus on a problematic area – e.g. unusual load, cluster operations, high compaction, GC activity.

  • n::clientRequestWrite
    • Expected range: 5 ms → 200 ms 
    • Impacting factors: Hardware capacity and configuration, client request load, compaction strategy, overall cluster health 
    • Troubleshooting: Focus on a problematic area – e.g. unusual load, cluster operations, high compaction, GC activity. 

  • n::clientRequestRangeSlice
    • Expected range: 10ms → 300ms
    • Impacting factors: Data modelling, overall cluster health, configuration.
    • Troubleshooting: Evaluate how range queries are used.

  • n::clientRequestCasRead
    • Expected range: 10ms → 300ms 
    • Impacting factors: CAS query, data modelling, overall cluster health, configuration 

  • n::clientRequestCasWrite
    • Expected range: 10ms → 300ms
    • Impacting factors: CAS query, data modelling, overall cluster health, configuration. 

  • n::slalatency
    • Expected range: up to 120ms for SLA reads, up to 20ms for SLA writes (if the cluster is not under excessive load)
    • Impacting factor: High values can indicate that Cassandra is struggling to perform reads/writes in a timely manner. This implies that normal queries will also be experiencing higher-than-normal latency.
    • Troubleshooting: Check CPU, IO wait, levels of traffic, etc. May not be an issue depending on latency requirements.

  • Thread pool metrics
    • Description: Cassandra maintains distinct thread pools for different stages of execution. Each of the thread pools provides statistics on the number of tasks that are active, pending, and completed. The pending and blocked tasks in each pool are an indicator of some error or lack of capacity. The thread pool values can be monitored to establish a baseline for normal usage and peak load.
    • Impacting factors: Load on the node and cluster, hardware capacity, configuration tuning for each individual functionality.
    • Troubleshooting: If there is any specific error with a stage it should be resolved. The persistence of blocked and pending tasks indicates lack of capacity. The hardware capacity for a cluster should be carefully scaled considering Cassandra horizontal scaling guidelines. 

      Below are the thread pool names:
    • n::readstage
    • n::mutationstage
    • n::nativetransportrequest
    • n::countermutationstage

  • n::rpcthread
    • Expected range: Varies based on activity. Blocked native transport requests should be 0.
    • Impacting factor: Level and nature of traffic, CPU available.
    • Troubleshooting:
      • Read and mutation stages: there should be 0 items pending. If this number is higher, the cluster is struggling with the current level of traffic and some reads/writes are waiting to be processed, so high latency can be expected.
      • Native transport requests: these are any requests made via the CQL protocol, i.e. normal client traffic. There are a limited number of threads available to process incoming requests. When all threads are in use, some requests wait in a queue (shown as pending in the graph). If the queue fills up, some requests are silently rejected (shown as blocked in the graph). The server never replies, so this eventually causes a client-side timeout.
      • The main way to prevent blocked native transport requests is to throttle load, so that the requests are performed over a longer period. If that is not possible, scaling horizontally can help. Tuning queue settings on the server side can sometimes help, but only if the queues are saturated over an extremely brief period.

  • n::droppedmessage
    • Expected range: 0
    • Impacting factors: Load on the cluster or a particular node, configuration settings, data model
    • Troubleshooting: Identify root cause from ‘Impacting factors’ above. Possible solutions:
      • Increase hardware capacity for a node or number of nodes in the cluster.
      • Tune buffers, caches.
      • Revisit data model if the issue originates from the data model.

  • n::hintssucceeded
    • Expected range: Usually zero.
    • Impacting factor: A non-zero value indicates nodes are being restarted or experiencing problems.
    • Troubleshooting: Often caused by long GC pauses, causing nodes to be seen as down for brief periods. Can indicate a serious problem.

  • n::hintsfailed – Number of hints that failed delivery.
    • Expected range: 0
    • Impacting factors: Overall load, cluster health, node uptime.
    • Troubleshooting: Check Dropwizard Metric name, JMX Mbean, Metric type

  • n::hintstimedout – Number of hints that timed out during delivery.
    • Expected range: 0
    • Troubleshooting:
      • Repair the node to make data consistent and use regular repairs on the cluster
      • Increase max_hint_window_in_ms on all nodes if a node is expected to be unavailable for longer than the current hint window during a scheduled migration or disaster recovery
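
The expected ranges above can be turned into simple checks. The following is a minimal sketch, not part of the Monitoring API: the function names and return strings are hypothetical, and it assumes latency readings have already been converted to milliseconds and that thread pool readings expose pending and blocked counts as integers.

```python
# Upper bounds in milliseconds, copied from the expected ranges listed above.
LATENCY_UPPER_MS = {
    "n::clientRequestRead": 200,
    "n::clientRequestWrite": 200,
    "n::clientRequestRangeSlice": 300,
    "n::clientRequestCasRead": 300,
    "n::clientRequestCasWrite": 300,
}

# Metrics whose expected value is 0 according to the list above.
EXPECT_ZERO = {"n::droppedmessage", "n::hintsfailed", "n::hintstimedout"}


def check_metric(name: str, value: float) -> str:
    """Return 'ok', 'investigate', or 'unknown' for a single reading.

    Assumption: latency values are passed in milliseconds.
    """
    if name in LATENCY_UPPER_MS:
        return "ok" if value <= LATENCY_UPPER_MS[name] else "investigate"
    if name in EXPECT_ZERO:
        return "ok" if value == 0 else "investigate"
    return "unknown"


def check_thread_pool(pending: int, blocked: int) -> str:
    """Classify a thread pool reading following the guidance above."""
    if blocked > 0:
        # Blocked requests are rejected and surface as client-side timeouts.
        return "blocked"
    if pending > 0:
        # Pending tasks mean work is queuing behind busy threads.
        return "pending"
    return "ok"


if __name__ == "__main__":
    print(check_metric("n::clientRequestRead", 150))
    print(check_metric("n::droppedmessage", 3))
    print(check_thread_pool(pending=12, blocked=0))
```

A result of "investigate", "pending" or "blocked" is only a prompt to look at the impacting factors listed for that metric; whether it is a real problem depends on the latency requirements of the application.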

Table Metrics

  • cf::{keyspace}::{table}::readLatencyDistribution
    • Expected range: 5 ms → 200 ms
    • Impacting factors: Hardware capacity and configuration, client request load, compaction strategy, overall cluster health
    • Troubleshooting: Focus on a problematic area – e.g. unusual load, cluster operations, high compaction, GC activity.

  • cf::{keyspace}::{table}::writeLatencyDistribution
    • Expected range: 5 ms → 200 ms 
    • Monitoring: Continuous monitoring with alerting if exceeds expected range.
    • Impacting factors: Hardware capacity and configuration, client request load, compaction strategy, overall cluster health
    • Troubleshooting: Focus on a problematic area – e.g. unusual load, cluster operations, high compaction, GC activity.

  • cf::{keyspace}::{table}::sstablesPerRead
    • Expected range: Less than 10.
    • Impacting factors: Data model, compaction strategy, write volume, repair operation.
    • Troubleshooting: Configure an optimal compaction strategy for the table and use compaction-specific tools. Repair the cluster regularly. Revisit the data model if this is a frequent issue and other solutions do not rectify it.

  • cf::{keyspace}::{table}::tombstonesPerRead
    • Expected range: Zero if there are no data deletion queries; otherwise low.
    • Impacting factors: Data model, query pattern, compaction strategy, repair status
    • Troubleshooting: Consider a compaction strategy better suited to the table, or configure the existing strategy for more aggressive tombstone eviction. Tombstones can also be evicted with a major compaction or nodetool garbagecollect; be aware of the implications, as both are resource-intensive and drastically change the SSTables storage. Revisit the data model and access pattern.

  • cf::{keyspace}::{table}::partitionSize
    • Expected range: 1KB → 10MB ideal range (100MB at maximum) 
    • Impacting factors: Data model, query pattern.
    • Troubleshooting: Revisit query pattern for amount of data included in a single partition for the table. If no quick fix is applicable, revisit data model. 
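
Table metric names follow the cf::{keyspace}::{table}::{metric} pattern shown above, so a name can be split into its parts before a reading is compared with the expected range. This is a minimal sketch under stated assumptions: the helper names are hypothetical, partition sizes are assumed to be in bytes, and 1MB is taken as 1024 * 1024 bytes.

```python
from typing import NamedTuple

MB = 1024 * 1024  # assumption: binary megabytes


class TableMetric(NamedTuple):
    keyspace: str
    table: str
    metric: str


def parse_table_metric(name: str) -> TableMetric:
    """Split 'cf::{keyspace}::{table}::{metric}' into its three parts."""
    parts = name.split("::")
    if len(parts) != 4 or parts[0] != "cf":
        raise ValueError(f"not a table metric name: {name!r}")
    return TableMetric(parts[1], parts[2], parts[3])


def classify_partition_size(size_bytes: int) -> str:
    """Apply the 10MB ideal limit and 100MB maximum listed above."""
    if size_bytes <= 10 * MB:
        return "ideal"
    if size_bytes <= 100 * MB:
        return "large"
    return "too large"


def sstables_per_read_ok(value: float) -> bool:
    """The expected range above is fewer than 10 SSTables per read."""
    return value < 10


if __name__ == "__main__":
    m = parse_table_metric("cf::shop::orders::partitionSize")
    print(m.keyspace, m.table, m.metric)
    print(classify_partition_size(25 * MB))
```

The keyspace and table names in the usage lines (shop, orders) are made up for illustration.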
