• Apache Cassandra
  • Technical
Speculative Query Executions with GoCql

Speculative query executions in Cassandra are an interesting solution for some use cases, such as faulty/slow/unresponsive nodes or network issues. This kind of execution allows a client to launch a DB request for multiple endpoints at the same time and let these requests compete for the winning response. This would only work if the query itself is defined as idempotent, that is, it renders the same result no matter how many times it is run. We’ve written about idempotence and speculative query execution before, so head over https://www.instaclustr.com/query-idempotence-in-gocql-driver/ to refresh your memory if needed.

In most cases, using speculative query execution would not lead to performance improvement in normal execution (even though in some edge cases it might), but improve the reliability of queries to get a response from the server. In other cases, it might improve overall execution time for queries, as getting the response from the fastest node does save time. One should also remember, that while this can improve the reliability of the queries or overall execution time, it will be at the cost of using more resources (mostly CPU/Network) to do so.

A few use cases for using speculative queries:

  1. A node you are querying is down and your SLAs require a response from the server within a timeframe that is lower than Cassandra timeout.
  2. A node is flaky, leading to inconsistent response times or dropped queries.
  3. A node is returning timeout errors requiring a client application to retry the query on another node.

Speculative query execution has been a feature of the Java driver for quite some time now, and we have recently included similar functionality into the GoCql driver. As mentioned above, it allows a user to run the same query on multiple hosts at the same time, letting the executions to compete for the winning spot. The first query to get a response will win and be returned to the client.

In order to use the speculative query execution with this change, one must define the query as Idempotent and provide an execution policy:

As can be seen from the example above, we’ve used a SimpleSpeculativeExecution policy, which is the one implemented in the driver. It is a very simple policy, one that defines the number of additional executions (that is in addition to the original request), and the constant delay between these executions. One could implement their own policy easily for example, to have a policy that pauses incrementally longer between additional executions, one could build the following policy:

And then use it in the query execution:

To show an example for using speculative query executions, we’ll use a 3-node Cassandra cluster. The use case that we’re going to explore is going to be a slow node, which we are going to simulate using a simple tc tool that comes as a part of the iproute2 package. Our example is going to be a bit extreme, but hopefully, it conveys the idea of when speculative queries might be useful.

To simulate a slow node, run the following command on one of the nodes:

This adds 250ms delay to all outbound packets for the given physical device (eth0 in the above case). Then we use the following client code to run the tests:

 

This code is also available at https://github.com/instaclustr/sample-GoCql-Speculative-Execution.

This client code will insert 10000 entries into the cluster. As we’re using random numbers for the key column (id), all the queries are expected to be distributed more or less evenly among the nodes. Now, when we start the cluster and execute the client, we notice the following:

So it takes about half an hour to run the queries because one of the nodes has a constant delay of about half a second. When running with speculative execution mode, we get:

Not only we don’t ever see the “problematic” node responding to the query, but all queries are also split between the 2 other “fast” nodes, taking only about a 1.5 minutes to complete.

As for the overhead of this improvement, there’s a jump in open sockets while running in speculative mode vs non-speculative:

Open sockets while running in speculative mode vs non-speculative

As can be seen, the number of open sockets jumps from about 147 to 153 (about 5% jump). There are no significant increases in CPU, memory or io utilisation.

This might be considered a good trade-off, but remember that this is what happens on a single, very simple client on a performant host (m4-2xlarge) that only ran this test. In other use cases, such as smaller instance, busy client, or client running many queries in parallel, this may have a more significant effect on system resources utilisation. As a result, it is highly recommended to test the benefits of using speculative queries in your pre-production environment before deploying in production.

After this feature was released, a few issues were identified and have already been fixed, thanks to the community of contributors. It is in good shape now and is ready for general use. Enhancements planned for this feature include running the query observer (useful for monitoring query executions) in the main goroutine, which will make it more convenient to debug failures in the client code.