One of the challenges we are constantly tackling at Instaclustr is how to effectively monitor clusters, particularly read and write latency, given the wide variety of use cases that customers use our service for. Our recent introduction of synthetic transaction monitoring adds significantly to our ability to reliably monitor cluster performance regardless of use case.
Synthetic transaction monitoring
Synthetic transaction monitoring is a reasonably common monitoring technique in which an automated system regularly replays the same transaction and compares its performance against a known baseline. Compared with monitoring based only on observed user transactions, this has some key advantages for any system:
- There is a known baseline performance with variation only caused by performance in the system being monitored, not variance in the types of transactions. This allows alerting rules to be tuned for significantly smaller variances.
- The system produces a reliable, consistent level of transactions. Monitoring based on user transactions can be difficult where transactions are sporadic, resulting in a need to average over a wider scope (and a loss of fidelity).
- The regular nature of synthetic transactions makes it easy to distinguish between “the system stopped processing transactions because the source stopped sending them” and “the system stopped processing transactions because it is broken (in some otherwise undetectable way)”.
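As a rough illustration of how these properties could translate into alerting logic, here is a minimal sketch in Python. The function name, the status labels and the 50% tolerance are all hypothetical choices made for the example, not a description of our actual alerting rules.

```python
from statistics import mean


def evaluate_probe_window(latencies_ms, expected_count, baseline_ms, tolerance=0.5):
    """Classify one reporting window of synthetic transaction results.

    latencies_ms   -- latencies of the probes that succeeded in this window
    expected_count -- number of probes the scheduler was supposed to run
    baseline_ms    -- known-good average latency for this fixed transaction
    tolerance      -- allowed fractional rise above baseline (hypothetical value)
    """
    # The probe rate is known in advance, so an empty window cannot mean
    # "nobody sent traffic"; it has to mean that something is broken.
    if not latencies_ms:
        return "DOWN"
    # Fewer successes than scheduled probes means some transactions failed.
    if len(latencies_ms) < expected_count:
        return "DEGRADED"
    # The transaction never changes, so a fairly tight threshold is reasonable.
    if mean(latencies_ms) > baseline_ms * (1 + tolerance):
        return "SLOW"
    return "OK"
```

With user traffic, an empty window is ambiguous. With a fixed probe schedule it is a clear signal, which is why the empty case is checked first.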
As a managed service provider, we gain an additional advantage from synthetic transactions: they allow us to establish and monitor baseline performance regardless of customer use case. For our customers running Cassandra as an operational system, read latencies in the 10-20 ms range are the norm. For more analytical usage, it’s not unusual to see latencies 5-10 times that. Sometimes this range can be seen on different tables in the same cluster, or even on the same table at different times of the day (when batches run). Using synthetic transactions allows us to establish a baseline that is consistent across all clusters and to monitor that baseline closely.
The synthetic transactions we use are fairly simple (but very reliably test the proper functioning of Cassandra). We have a simple table set up with a replication factor of three in each data center. Each node reads from and writes to that table at LOCAL_QUORUM consistency level 30 times every twenty seconds (1.5 reads/writes per node per second is a trivial load compared to the tens of thousands of simple transactions per second a node can support). The success and average duration of reads and writes over the twenty-second period are reported to our monitoring system for display on our dashboard and use in alerting.
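The probe loop described above can be sketched roughly as follows. This is an illustrative Python outline rather than our production code: `run_probe_window`, `write_fn` and `read_fn` are hypothetical names, and the two callables stand in for whatever client code performs one write and one read against the probe table at LOCAL_QUORUM.

```python
import time


def run_probe_window(write_fn, read_fn, probes=30, window_s=20.0, sleep=time.sleep):
    """Run `probes` write+read pairs spread over roughly `window_s` seconds.

    write_fn / read_fn are hypothetical callables that take the probe index,
    perform one operation against the probe table, and raise on failure.
    """
    interval = window_s / probes
    latencies = {"write": [], "read": []}
    failures = {"write": 0, "read": 0}

    for i in range(probes):
        for name, fn in (("write", write_fn), ("read", read_fn)):
            start = time.perf_counter()
            try:
                fn(i)
                # Record latency in milliseconds for successful operations only.
                latencies[name].append((time.perf_counter() - start) * 1000.0)
            except Exception:
                failures[name] += 1
        # Simple pacing; the time spent in the operations themselves is ignored.
        sleep(interval)

    # Summarise success counts and average duration for the whole window.
    summary = {}
    for name in ("write", "read"):
        ok = latencies[name]
        summary[name] = {
            "success": len(ok),
            "failed": failures[name],
            "avg_ms": sum(ok) / len(ok) if ok else None,
        }
    return summary
```

The returned summary (success count, failure count and average duration per operation type) corresponds to what would be sent to the monitoring system at the end of each window. Passing `sleep` as a parameter keeps the pacing easy to replace when trying the sketch out against in-memory stand-ins.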
It’s worth noting that synthetic transactions are an addition to our existing monitoring rather than a replacement. We monitor read and write latency across all customer keyspaces (with relatively high thresholds to allow for use case variations), and for enterprise-level customers we can establish custom alerts for individual tables, tuned to their specific use case. We also plan to continue enhancing our capabilities in this area, particularly to provide better coverage for customer tables irrespective of variance in use cases.