Monitoring API

The monitoring API currently provides the following monitoring information:

  • Long-term cluster health indicators
  • Metrics for:
    • Cassandra status
    • reads and writes operations per second
    • CPU utilization
    • disk utilization
    • pending compactions

Metrics are provided either for an individual node or for all nodes in a cluster or data centre.

The API also provides key statistics for each table in the cluster (similar to what is available through “nodetool tablehistograms”):

  • read & write counts (mean, distribution)
  • read & write latency (mean, distribution)
  • live cells & tombstones per read (mean, max)
  • number of sstables read for each read operation (mean, max)

The set of available metrics will expand as we build out this API. Descriptions of each of the metrics can be found in the monitoring section of this support site:
https://www.instaclustr.com/support/documentation/monitoring-information/

Authentication

All requests to the API must use Basic Authentication and contain a valid username and the monitoring API key. API keys are created per user account and can be retrieved via the Instaclustr Console from the Account > API Key tab.
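
As a minimal sketch of the authentication step, the following Python snippet builds a GET request carrying a Basic Authentication header using only the standard library. The function name build_authenticated_request and the placeholder credentials are assumptions made for illustration; the request is constructed but not sent.

```python
import base64
import urllib.request


def build_authenticated_request(url: str, username: str, api_key: str) -> urllib.request.Request:
    """Return a GET request with an HTTP Basic Authentication header.

    The username and monitoring API key are joined as "username:api_key"
    and base64-encoded, as HTTP Basic Authentication requires.
    """
    token = base64.b64encode(f"{username}:{api_key}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Placeholder credentials for illustration only; substitute your own
# username and the API key from the Instaclustr Console.
example_request = build_authenticated_request(
    "https://api.instaclustr.com/monitoring/v1/clusters/"
    "e7342f08-d32f-41af-95be-cfaa0a433a26/indicators",
    "my-username",
    "my-api-key",
)
```

Passing the returned object to urllib.request.urlopen would perform the call; any HTTP client that supports Basic Authentication should work equally well.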

All available metrics are updated every 20 seconds (i.e. requesting the same metric twice in 20 seconds will always return the same response).
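
Because a metric does not change within the 20-second update window, a client can avoid redundant calls by reusing a recent response. The sketch below shows one possible approach; the MetricCache class and its fetch callback are hypothetical helpers, not part of the API, and the window is measured from the client's own fetch time rather than from the server's update schedule.

```python
import time
from typing import Any, Callable, Dict, Tuple

REFRESH_INTERVAL_SECONDS = 20  # update interval stated above


class MetricCache:
    """Cache responses per URL and reuse them until the refresh interval passes."""

    def __init__(self, fetch: Callable[[str], Any],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch    # hypothetical callable that performs the request
        self._clock = clock    # injectable so the cache can be tested
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, url: str) -> Any:
        now = self._clock()
        cached = self._entries.get(url)
        if cached is not None and now - cached[0] < REFRESH_INTERVAL_SECONDS:
            return cached[1]   # fetched less than 20 seconds ago, reuse it
        response = self._fetch(url)
        self._entries[url] = (now, response)
        return response
```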

Cluster Health Indicator

The Cluster Health Indicator API provides a summary of indicators of the long-term health of your cluster. It is retrieved by making a GET request to https://api.instaclustr.com/monitoring/v1/clusters/<clusterId>/indicators

The API will respond with status 200 OK and a JSON packet containing the following information:

Example: Response packet showing cluster health

The output JSON consists of:

  • type: The name of the indicator being returned. The API returns five indicator types: REPLICATION_STRATEGY and REPLICATION_FACTOR for each keyspace, DISK_USAGE for each node, and PARTITION_SIZE and TOMBSTONE_LIVECELL for each table.
  • stateDetails: The state of the indicator type. stateDetails can be PASS, UNKNOWN, FAIL, WARN with further details provided in the form of a message.

A detailed description of cluster health indicators can be found in this support article:

https://www.instaclustr.com/support/documentation/monitoring-information/cluster-health-check/
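
To request the indicators described in this section, the endpoint URL can be assembled from the cluster ID. The sketch below validates the ID locally before building the URL, since a malformed UUID would be rejected by the API; the helper name indicators_url is an assumption made for illustration.

```python
import uuid

INDICATORS_URL = "https://api.instaclustr.com/monitoring/v1/clusters/{cluster_id}/indicators"


def indicators_url(cluster_id: str) -> str:
    """Build the Cluster Health Indicator URL for a cluster.

    uuid.UUID raises ValueError for a malformed ID, so mistakes are
    caught before any request is made.
    """
    canonical = str(uuid.UUID(cluster_id))  # canonical lower-case hyphenated form
    return INDICATORS_URL.format(cluster_id=canonical)
```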

Metrics

Metrics are requested by constructing a GET request, consisting of the following attributes:

  • class: Either ‘clusters’, ‘datacentres’ or ‘nodes’.
    • ‘clusters’: Returns the metrics for each node in the cluster/s.
    • ‘datacentres’: Returns the metrics for each node belonging to the specified data centre/s.
    • ‘nodes’: Returns the metrics for the specific node/s.
  • UUID or public IP: If the class is set to ‘clusters’ or ‘datacentres’, then the UUID of the cluster or data centre must be specified. Alternatively, if the class is set to ‘nodes’, then either the node’s UUID or public IP may be specified.
  • metrics: The metrics to return, specified as a comma-delimited querystring parameter. Up to 20 metrics may be specified. For a complete list of available metrics, refer to the Reference section.
    Formatted as: “metrics=<metric_1>,<metric_2>,…”
  • period: The period of time from which monitoring information is returned. Each period is combined with a period type, either ‘latest’ or ‘aggregate’.
    Formatted as: “period=<period>&type=<period type>”
    • ‘1m’: ‘latest’ returns the most recent monitoring value. ‘aggregate’ is not available (NA).
    • ‘15m’: ‘latest’ returns the most recent monitoring value. ‘aggregate’ returns the average of all monitoring results from 15 minutes ago to now.
    • ‘1h’: ‘latest’ returns the most recent monitoring value. ‘aggregate’ returns the average of all monitoring results from 1 hr ago to now.
    • ‘3h’: ‘latest’ returns the most recent monitoring value. ‘aggregate’ returns the average of all monitoring results from 3 hrs ago to now.
    • ‘1d’: ‘latest’ returns the most recent monitoring value. ‘aggregate’ returns the average of all monitoring results from 1 day ago to now.
    • ‘7d’: ‘latest’ returns the most recent monitoring value. ‘aggregate’ returns the average of all monitoring results from 7 days ago to now.
    • ‘30d’: ‘latest’ returns the most recent monitoring value. ‘aggregate’ returns the average of all monitoring results from 30 days ago to now.
  • reportNaN: Either ‘true’ or ‘false’. If a metric value is NaN or null, reportNaN determines whether the API should report it as NaN. The default is false, in which case NaN and null are reported as 0. Setting ‘reportNaN=true’ will return NaN values in the API response.
    Formatted as: “reportNaN=<true or false>”

Request format:

https://api.instaclustr.com/monitoring/v1/{{class}}/{{UUID or Public IP}}?{{metrics}}&{{period}}&{{reportNaN}}
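
The request format above can be sketched as a small URL builder. This is an illustrative helper under stated assumptions, not an official client: the function name build_metrics_url and its parameter names are hypothetical, and the base host is the api.instaclustr.com host used by the health indicator endpoint and most of the examples.

```python
from typing import Optional, Sequence
from urllib.parse import urlencode

BASE_URL = "https://api.instaclustr.com/monitoring/v1"  # assumed production host
VALID_CLASSES = ("clusters", "datacentres", "nodes")
MAX_METRICS = 20  # limit stated in the attribute list above


def build_metrics_url(resource_class: str, identifier: str, metrics: Sequence[str],
                      period: Optional[str] = None, period_type: Optional[str] = None,
                      report_nan: Optional[bool] = None) -> str:
    """Assemble a metrics request URL from its attributes."""
    if resource_class not in VALID_CLASSES:
        raise ValueError(f"class must be one of {VALID_CLASSES}")
    if not 1 <= len(metrics) <= MAX_METRICS:
        raise ValueError(f"between 1 and {MAX_METRICS} metrics may be requested")
    params = {"metrics": ",".join(metrics)}
    if period is not None:
        params["period"] = period
    if period_type is not None:
        params["type"] = period_type
    if report_nan is not None:
        params["reportNaN"] = "true" if report_nan else "false"
    # Keep ':' and ',' unescaped so metric names stay readable in the URL.
    return f"{BASE_URL}/{resource_class}/{identifier}?{urlencode(params, safe=':,')}"
```

Optional attributes that are left unset are simply omitted from the query string, so the API's own defaults apply.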

Examples:

  • Scenario: Return the CPU and disk utilization for each node in the cluster with a UUID of e7342f08-d32f-41af-95be-cfaa0a433a26.
    Request: https://api.instaclustr.com/monitoring/v1/clusters/e7342f08-d32f-41af-95be-cfaa0a433a26?metrics=n::cpuUtilization,n::diskUtilization
  • Scenario: Return the latest results of disk utilization for each node in the cluster with a UUID of 10e837bd-47a1-4e39-b7d4-5137e145491d.
    Request: https://api.instaclustr.com/monitoring/v1/clusters/10e837bd-47a1-4e39-b7d4-5137e145491d?metrics=n::diskUtilization&period=1m&type=latest
  • Scenario: Return the average reads and writes per second by Cassandra for each node belonging to the datacentre with a UUID of 001224dc-989c-4ad0-8b37-1ce345065b8f, from 15 minutes ago to now.
    Request: https://api.instaclustr.com/monitoring/v1/datacentres/001224dc-989c-4ad0-8b37-1ce345065b8f?metrics=n::cassandraReads,n::cassandraWrites&period=15m&type=aggregate
  • Scenario: Return the list of all read latency distribution values for the ‘tcf1’ table in the ‘tk1’ keyspace, for just the 52.70.191.97 node, from 7 days ago to now, reporting NaN values as well.
    Request: https://api.instaclustr.com/monitoring/v1/nodes/52.70.191.97?metrics=cf::tk1::tcf1::readlatencydistribution&period=7d&type=range&reportNaN=true

Successfully processed metric API requests will return a 200 status code and accompanying JSON packet. JSON packets follow the same basic structure as listed in the following example:

e.g. Response with CPU Utilization for a single node

Each payload item represents an individual metric and will consist of:

  • metric: The name of the metric being returned.
  • type: The sub-type of the metric being measured (e.g. for the diskUsed metric, the available types are livediskspaceused and totaldiskspaceused).
  • unit: The unit of measurement. The following unit abbreviations are used:
    • GB: Gigabyte
    • MB: Megabyte
    • B: Byte
    • s: Second
    • ms: Millisecond
    • us: Microsecond
    • 1: Non-standard unit (e.g. percentage)
    • us/1: Microseconds per non-standard unit (e.g. latency per read operation)
    • 1/s: Non-standard unit per second (e.g. write operations per second)
  • values: An array of time/value maps containing the measurements as recorded by Instaclustr.
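
As a rough sketch of how the payload items described above might be consumed, the following Python function picks the last recorded entry for each metric. It relies on several assumptions that should be checked against a real response: that the payload items have already been extracted from the decoded JSON as a list, that the field names are exactly metric, type, unit and values, and that the values array is ordered from oldest to newest. The sample data is invented purely for illustration.

```python
from typing import Any, Dict, List, Optional, Tuple


def latest_measurements(payload_items: List[Dict[str, Any]]
                        ) -> Dict[Tuple[str, Optional[str]], Tuple[Optional[str], Any]]:
    """Map (metric, type) to (unit, last entry of the values array)."""
    result = {}
    for item in payload_items:
        values = item.get("values") or []
        latest = values[-1] if values else None  # assumes oldest-to-newest order
        result[(item["metric"], item.get("type"))] = (item.get("unit"), latest)
    return result


# Hypothetical sample; the key names inside "values" and all numbers
# are assumptions, not taken from a real API response.
sample_items = [
    {
        "metric": "diskUsed",
        "type": "livediskspaceused",
        "unit": "GB",
        "values": [{"time": "t0", "value": "1.5"}, {"time": "t1", "value": "2.0"}],
    },
]
latest = latest_measurements(sample_items)
```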

If multiple metrics are requested, the response will include multiple payload entries:

e.g. Get CPU Utilization and Disk Utilization for a single node

Unsuccessful calls will return the following responses, depending upon the issue:

  • 400 Bad Request: Returned when the expected node or cluster ID is not a valid UUID or an incorrect metric name has been supplied.
  • 401 Unauthorized: Returned when no or incorrect username and/or API key details are provided.
  • 404 Not Found: Returned when accessing an incorrect URL or trying to access a cluster/node not owned by the authenticated user.
  • 415 Unsupported Media Type: Returned when the payload is in an unsupported format. Possibly resolved by specifying content-type as application/json.
  • 429 Too Many Requests: Returned when more than 70 requests per second are being received by your user.
  • 500 Server Error: All other errors
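
A client can translate these status codes into readable messages and decide whether a call is worth repeating. The sketch below is one possible approach; the helper names are hypothetical, and treating 429 and 500 as retryable is a policy assumption rather than documented API behaviour.

```python
# Meanings summarised from the list of error responses above.
STATUS_MEANINGS = {
    200: "OK",
    400: "Bad Request: invalid UUID or incorrect metric name",
    401: "Unauthorized: missing or incorrect username and/or API key",
    404: "Not Found: incorrect URL, or cluster/node not owned by the user",
    415: "Unsupported Media Type: payload in an unsupported format",
    429: "Too Many Requests: more than 70 requests per second",
    500: "Server Error: all other errors",
}

# Assumed policy: rate limiting and server errors may succeed on a later attempt.
RETRYABLE_STATUSES = frozenset({429, 500})


def describe_status(status_code: int) -> str:
    """Return a readable description of a monitoring API status code."""
    return STATUS_MEANINGS.get(status_code, f"Unexpected status {status_code}")


def is_retryable(status_code: int) -> bool:
    """Return True when repeating the request later might succeed."""
    return status_code in RETRYABLE_STATUSES
```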

e.g. Error response

 

Reference

Nodes

General Metrics

Non-table metrics follow the format n::{metricName}.

Each metric type will contain the latest available measurement.

  • n::nodeStatus: Whether Cassandra is available on the node. Returns a “warn” value if no check-in has been logged in the last 30 seconds.
  • n::cpuUtilization: Current CPU utilisation as a percentage of total available. Maximum value is 100%, regardless of the number of cores on the node.
  • n::osload: Current OS load. Generally, a node is overloaded if the OS load >= the number of cores on the node.
  • n::diskUtilization: Total disk space utilisation, by Cassandra, as a percentage of total available.
  • n::reads: Reads per second by Cassandra.
  • n::writes: Writes per second by Cassandra.
  • n::cassandraReads: Reads per second by Cassandra. (Deprecated, please use n::reads)
  • n::cassandraWrites: Writes per second by Cassandra. (Deprecated, please use n::writes)
  • n::compactions: Number of pending compactions.
  • n::repairs (deprecated): Number of active and pending repair tasks.
  • n::clientRequestRead: Offers the percentile distribution and average latency per client read request (i.e. the period from when a node receives a client request until it has gathered the records and responded to the client). Available sub-types:
    • 95thPercentile: 95th percentile distribution of clientRequestRead
    • 99thPercentile: 99th percentile distribution of clientRequestRead
  • n::clientRequestWrite: Offers the percentile distribution and average latency per client write request (i.e. the period from when a node receives a client request until it has gathered the records and responded to the client). Available sub-types:
    • 95thPercentile: 95th percentile distribution of clientRequestWrite
    • 99thPercentile: 99th percentile distribution of clientRequestWrite
  • n::rangeSlices: Range slice reads by Cassandra.
  • n::casReads: Compare and Set reads by Cassandra.
  • n::casWrites: Compare and Set writes by Cassandra.
  • n::clientRequestRangeSlice: Offers the percentile distribution and average latency per client range slice read request (i.e. the period from when a node receives a client request until it has gathered the records and responded to the client). Available sub-types:
    • latency_per_operation: Latency per clientRequestRangeSlice read
    • 95thPercentile: 95th percentile distribution of clientRequestRangeSlice
    • 99thPercentile: 99th percentile distribution of clientRequestRangeSlice
  • n::clientRequestCasRead: Offers the percentile distribution and average latency per client CAS read request (i.e. the period from when a node receives a client request until it has gathered the records and responded to the client). Available sub-types:
    • 95thPercentile: 95th percentile distribution of clientRequestCasRead
    • 99thPercentile: 99th percentile distribution of clientRequestCasRead
  • n::clientRequestCasWrite: Offers the percentile distribution and average latency per client CAS write request (i.e. the period from when a node receives a client request until it has gathered the records and responded to the client). Available sub-types:
    • 95thPercentile: 95th percentile distribution of clientRequestCasWrite
    • 99thPercentile: 99th percentile distribution of clientRequestCasWrite
  • n::slalatency: Monitors our SLA latency and alerts when it is above a threshold level. Available sub-types:
    • sla_read: Synthetic read queries against an Instaclustr canary table.
    • sla_write: Synthetic write queries against an Instaclustr canary table.
  • n::elassandra: Monitoring metric for Elassandra clusters (only available if you have an Elassandra cluster). Available sub-types:
    • document_count: The total number of documents for the node.
    • query_per_second: Number of queries on a node, calculated over the last 20 seconds.
    • index_per_second: Number of writes to the indexes, calculated over the last 20 seconds.

Note: All deprecated metrics and endpoints will be removed in the future.
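
Since deprecated metrics will eventually be removed, it may help to rewrite metric lists before sending a request. The sketch below swaps the two deprecated names that have documented replacements and flags n::repairs, which is deprecated without a listed replacement; the function name modernize_metrics is a hypothetical helper.

```python
from typing import Iterable, List, Tuple

# Replacements named in the metric list above.
REPLACEMENTS = {
    "n::cassandraReads": "n::reads",
    "n::cassandraWrites": "n::writes",
}
# Deprecated with no replacement named in this document.
DEPRECATED_WITHOUT_REPLACEMENT = {"n::repairs"}


def modernize_metrics(metrics: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (updated metric names, deprecated names left unchanged).

    Order is preserved and duplicates created by the substitution are dropped.
    """
    updated: List[str] = []
    flagged: List[str] = []
    for name in metrics:
        current = REPLACEMENTS.get(name, name)
        if current in DEPRECATED_WITHOUT_REPLACEMENT and current not in flagged:
            flagged.append(current)
        if current not in updated:
            updated.append(current)
    return updated, flagged
```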

Table Metrics

Table metric names follow the format cf::{keyspace}::{table}::{metricType}. Optionally, a ‘sub-type’ may be specified to return a specific part of the metric. For example, requesting cf::{keyspace}::{table}::readLatencyDistribution without a sub-type will return the various distributions of the read latency metric, whereas specifying the 50thPercentile sub-type will only return the 50th percentile distribution of the read latency metric.

Each metric type will contain the latest available measurement.

  • cf::{keyspace}::{table}::readLatencyDistribution: Measurement of local read latency for the table, on the individual node. Available sub-types:
    • 50thPercentile: 50th percentile distribution of read latency
    • 75thPercentile: 75th percentile distribution of read latency
    • 95thPercentile: 95th percentile distribution of read latency
    • 99thPercentile: 99th percentile distribution of read latency
  • cf::{keyspace}::{table}::reads: General measurements of local read latency for the table, on the individual node. Available sub-types:
    • latency_per_operation: Average local read latency per operation
    • count_per_second: Reads of the table performed on the individual node
  • cf::{keyspace}::{table}::writeLatencyDistribution: Metrics for local write latency for the table, on the individual node. Available sub-types:
    • 50thPercentile: 50th percentile distribution of write latency
    • 75thPercentile: 75th percentile distribution of write latency
    • 95thPercentile: 95th percentile distribution of write latency
    • 99thPercentile: 99th percentile distribution of write latency
  • cf::{keyspace}::{table}::writes: General measurements of local write latency for the table, on the individual node. Available sub-types:
    • latency_per_operation: Average local write latency per operation
    • count_per_second: Writes to the table performed on the individual node
  • cf::{keyspace}::{table}::sstablesPerRead: SSTables accessed per read of the table on the individual node. Available sub-types:
    • average: Average SSTables accessed per read
    • max: Maximum SSTables accessed per read
  • cf::{keyspace}::{table}::tombstonesPerRead: Tombstoned cells accessed per read of the table on the individual node. Available sub-types:
    • average: Average tombstones accessed per read
    • max: Maximum tombstones accessed per read
  • cf::{keyspace}::{table}::liveCellsPerRead: Live cells accessed per read of the table on the individual node. Available sub-types:
    • average: Average live cells accessed per read
    • max: Maximum live cells accessed per read
  • cf::{keyspace}::{table}::partitionSize: The size of partitions in the specified table, in KB. Available sub-types:
    • average: Average partition size
    • max: Maximum partition size
  • cf::{keyspace}::{table}::diskUsed: Live and total disk used by the table. Available sub-types:
    • livediskspaceused: Disk used by live cells
    • totaldiskspaceused: Disk used by both live cells and tombstones
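
The cf::{keyspace}::{table}::{metricType} naming format can be sketched as a pair of helpers that build and parse table metric names. The names format_table_metric and parse_table_metric are assumptions made for illustration, and sub-types are deliberately not handled because the exact syntax for appending them is not shown in this section.

```python
from typing import NamedTuple


class TableMetric(NamedTuple):
    keyspace: str
    table: str
    metric_type: str


def format_table_metric(keyspace: str, table: str, metric_type: str) -> str:
    """Build a table metric name in the cf::{keyspace}::{table}::{metricType} format."""
    for part in (keyspace, table, metric_type):
        if not part or "::" in part:
            raise ValueError(f"invalid name component: {part!r}")
    return f"cf::{keyspace}::{table}::{metric_type}"


def parse_table_metric(name: str) -> TableMetric:
    """Split a table metric name (without sub-type) into its components."""
    parts = name.split("::")
    if len(parts) != 4 or parts[0] != "cf" or not all(parts[1:]):
        raise ValueError(f"not a table metric name: {name!r}")
    return TableMetric(*parts[1:])
```

For instance, format_table_metric("tk1", "tcf1", "readLatencyDistribution") yields the metric name for the ‘tcf1’ table in the ‘tk1’ keyspace used in the request examples.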

Listing Monitored Tables

A list of monitored tables, grouped by keyspace, can be generated by making a GET request to:

The API will respond with the following packet:

Example: Response packet listing monitored tables

Clusters

Requesting ‘clusters’ metrics returns the requested measurements for each provisioned node in the cluster and follows the same format as the ‘nodes’ endpoint. All node metrics are available for use.

For example, this request:

would return the following response packet:
