Avoiding range slice issues in Cassandra

Certain types of query in Cassandra will lead to performing an expensive operation known as a range slice. Under some circumstances, range slices can cause high latency, long GC pauses, and node instability. This article provides advice for identifying and minimising the impact of range slices.

What is a range slice?

A range slice is a read that covers a range of partitions rather than a single partition identified by its partition key. Because the query cannot be routed to the replicas of one partition, the coordinator may need to collect data from many nodes, or even the whole cluster, to answer it.

Range slices may be tempting for several reasons:

They can retrieve data in a way that is otherwise difficult (eg, looking for partition keys based on the value of a clustering key).
They can return multiple partitions, so a single query can retrieve data that would normally require several queries.
They can be used for scanning whole tables.

However, range slices are one of the most common causes of high latency and timeouts in Cassandra. The coordinator may be forced to hold a lot of data in memory before it can return the result, leading to long and frequent GC pauses. This increase in GC activity can cause high latency and timeouts for other queries running on the cluster. In some cases, range slice queries can crash nodes or make the cluster unstable.

Where possible, we recommend avoiding range slices in production clusters.

How do I know when I am performing range slices?

For clusters running in our managed service fleet, range slice traffic can be found in the metrics graphs on our console – it is one of the metrics shown in the ‘Client multi partition reads and latency’ section. Alternatively, in our monitoring API, the endpoint is ‘n::rangeSlices’.

If you are running Cassandra in your own environment, the metric is available via JMX. It is a client request metric – the MBean is org.apache.cassandra.metrics:type=ClientRequest,scope=RangeSlice,name=Latency and the attribute is Count. Despite the name of the MBean, this attribute is a running total of the range slices performed since Cassandra started; it has nothing to do with latency.
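Because the Count attribute is a running total, it only becomes useful for monitoring once two samples are compared. The following is a minimal sketch of that arithmetic; it assumes you already have some way of sampling the attribute (the sampling itself is not shown), and the function name and the choice to return None after a restart are illustrative only.

```python
def range_slice_rate(previous_count, current_count, interval_seconds):
    """Return range slices per second between two samples of the Count attribute.

    Returns None when the counter has gone backwards, which is what you would
    expect to see if Cassandra restarted between the two samples.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if current_count < previous_count:
        # The total resets when Cassandra starts, so no rate can be derived.
        return None
    return (current_count - previous_count) / interval_seconds


if __name__ == "__main__":
    # Two samples taken 60 seconds apart.
    print(range_slice_rate(1200, 1500, 60))
```

A rate that stays at zero suggests the application is not issuing range slices; a sudden rise is worth correlating with latency and GC pause graphs.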

It is also possible to find out whether you are performing range slices by examining your queries. A query will be processed as a range slice if it meets any of the following criteria:

no partition key in the WHERE clause
the IN operator is used on a column that is not a partition key
the TOKEN function is used.

As an example, consider a table with the following schema.

CREATE TABLE IF NOT EXISTS demo.users (
name text,
age int,
data text,
PRIMARY KEY ((name), age)
);

When run on this table, the following queries would all be processed as range slices:

SELECT * FROM demo.users
(no partition key in WHERE clause)
SELECT * FROM demo.users WHERE age > 50 ALLOW FILTERING
(no partition key in WHERE clause)
SELECT * FROM demo.users WHERE (age) IN ((29), (30), (31)) ALLOW FILTERING
(IN operator used on a column that is not a partition key)
SELECT * FROM demo.users WHERE token(name) > 15535
(TOKEN function used)

Several of these examples require the ALLOW FILTERING clause. This is an additional problem – the performance of an ALLOW FILTERING query is affected by the amount of data returned by the query (eg, the proportion of items filtered out by a WHERE clause on a non-partition key). As the composition of the data stored in the cluster changes over time, ALLOW FILTERING queries may suddenly start to cause severe performance issues, despite no application code changes and no visible changes in metrics or logs.

How can I avoid problems with range slices?

Where possible, the ideal approach is to stop using range slices. This usually involves:

splitting up one range slice query into multiple queries that hit one partition each, or
creating a new table with different partition keys so that the necessary data can be easily read via partition keys.
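As a rough illustration of both options, consider the IN query on age from the earlier examples. The sketch below assumes a hypothetical second table, demo.users_by_age, that holds the same columns as demo.users but is partitioned by age; the table name, its layout and the helper function are assumptions made for this example, and the application would need to write to both tables. The %s placeholder follows the convention the DataStax Python driver uses for unprepared statements, so adjust it for your client.

```python
# Hypothetical lookup table: same columns as demo.users, but partitioned by age.
USERS_BY_AGE_DDL = """
CREATE TABLE IF NOT EXISTS demo.users_by_age (
    age int,
    name text,
    data text,
    PRIMARY KEY ((age), name)
);
""".strip()

# Each execution of this statement reads exactly one partition.
SELECT_BY_AGE = "SELECT * FROM demo.users_by_age WHERE age = %s"


def single_partition_queries(ages):
    """Return one (statement, parameters) pair per distinct age.

    This stands in for a single query such as
    SELECT * FROM demo.users WHERE (age) IN ((29), (30), (31)) ALLOW FILTERING
    by issuing several queries that each hit one partition.
    """
    distinct_ages = sorted(set(ages))
    return [(SELECT_BY_AGE, (age,)) for age in distinct_ages]


if __name__ == "__main__":
    for statement, params in single_partition_queries([29, 30, 31]):
        print(statement, params)
```

Each generated query names its partition key, so none of them is processed as a range slice, and they can be run concurrently or one after another depending on how much load the cluster can absorb.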

There are a few situations where it is hard to avoid using the TOKEN function, and so range slices cannot be avoided. This is usually where it is necessary to iterate through a table and read all the data – eg, migrating data with a custom script, or performing queries with the Spark Cassandra Connector that involve scanning a whole table. In these cases, it may not be possible to move away from range slices entirely, but using a LIMIT clause with a smaller value reduces the amount of data that is fetched in each query. Throttling the rate at which these queries are performed can also help.
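For a custom script, one way to combine these two ideas is to break the scan into many small token sub-ranges and pause between them. The sketch below is only an outline under stated assumptions: it assumes the default Murmur3Partitioner (whose tokens are signed 64-bit integers), the function names and default values are illustrative, and `execute` stands for whatever function your client uses to send CQL to the cluster.

```python
import time

# Token bounds for Murmur3Partitioner (the Cassandra default); other
# partitioners use a different range, so treat these as an assumption.
MIN_TOKEN = -(2 ** 63)
MAX_TOKEN = 2 ** 63 - 1


def split_token_range(pieces):
    """Split the full token range into contiguous, inclusive (start, end) pairs."""
    if pieces < 1:
        raise ValueError("pieces must be at least 1")
    step = (MAX_TOKEN - MIN_TOKEN + 1) // pieces
    ranges = []
    for i in range(pieces):
        start = MIN_TOKEN + i * step
        # The last piece absorbs any remainder left by integer division.
        end = MAX_TOKEN if i == pieces - 1 else start + step - 1
        ranges.append((start, end))
    return ranges


def token_range_query(start, end, limit):
    """Build a bounded scan of demo.users for one token sub-range."""
    return (
        "SELECT * FROM demo.users "
        f"WHERE token(name) >= {start} AND token(name) <= {end} "
        f"LIMIT {limit}"
    )


def throttled_scan(execute, pieces=1024, limit=500, pause_seconds=0.1, sleep=time.sleep):
    """Run one small query per sub-range, pausing between queries.

    `execute` is a caller-supplied function that sends the CQL to the cluster.
    If a sub-range holds more than `limit` rows the result is truncated, so a
    real script would need to split that range further or resume from the
    last token it saw.
    """
    for start, end in split_token_range(pieces):
        execute(token_range_query(start, end, limit))
        sleep(pause_seconds)


if __name__ == "__main__":
    # Print the queries instead of sending them anywhere.
    throttled_scan(print, pieces=4, limit=100, pause_seconds=0)
```

Each query is still a range slice, but it covers a small part of the ring and returns a bounded number of rows, which limits how much data the coordinator has to hold at once. The number of pieces, the limit and the pause are tuning values that depend on the size of the table and the capacity of the cluster.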

Range slices interact badly with large partitions and high numbers of tombstones, so keeping the rest of your data model healthy can also help limit the performance impact.

By Instaclustr Support