By Anup Shirolkar Tuesday 31st October 2017

Apache Cassandra Scalability: Allow Filtering and Partition Keys

Technical Technical — Cassandra

The ‘ALLOW FILTERING’ clause in Cassandra  CQL provides greatly increased flexibility of querying. However, this flexibility comes at a substantial performance cost that should be aware of before using ‘ALLOW FILTERING’.  This post explains the costs and benefits of ALLOW FILTERING.

The data storage and query operations in Cassandra work on top of partitioned data which is similar to most of other Big Data management systems. As most people would be familiar with from relational databases, each table definition has a primary key defined which can be a single attribute or combination of 2 or more. One or more attributes from the primary key are used to specify the ‘partition key’ for the table.

Example:

Here, the primary key is composed of (machine, cpu,mtime) out of which (machine, cpu) are used as partition key and (mtime) is used as clustering key.

The purpose of a partition key is to split the data into partitions where an entire partition is stored on a single node in the cluster (with each node storing many partitions). When data is read or written from the cluster, a function called Partitioner is used to compute the hash value of the partition key. This hash value is used to determine the node/partition which contains that row. The clustering key is used further to search for a row within a given partition.

Select queries in Apache Cassandra look a lot like select queries from a relational database. However, they are significantly more restricted. The attributes allowed in where clause of Cassandra query must include the full partition key and additional clauses may only reference the clustering key columns or a secondary index of the table being queried.

Requiring the partition key attributes in the where helps Cassandra to maintain constant result-set retrieval time as the cluster is scaled-out by allowing Cassandra to determine the partition, and thus the node (and even data files on disk), that the query must be directed to.

If a query does not specify the values for all the columns from the primary key  in the ‘where’ clause, Cassandra will not execute it and give the following warning :

InvalidRequest: Error from server: code=2200 [Invalid query] message=”Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING” ‘

You can use execute queries that use a secondary index without ALLOW FILTERING – more on that later.

The reason behind this warning is that when the complete partition key is not included in the WHERE clause, there is no way for Cassandra to identify the node which contains the required results, and thus it will need to scan the complete dataset on each node to ensure it has found the required data. Hence, the query execution time is not proportional to the size of the result returned but rather to the total amount of data stored in the referenced table. One implication of this, other than initial poor performance, is that query performance is likely to get worse as your cluster scales (unlike partition key queries which exhibit constant response times while scaling)  

As I mentioned early, you can also execute queries that reference a secondary index without specifying a partition key value or using ALLOW FILTERING.

Let’s take an example to understand this:

Consider the following table holding user profiles with their year of birth (with a secondary index on it) and country of residence:

CREATE INDEX ON users(birth_year);

Then the following queries are valid:

Query 1: SELECT * FROM users;

Query 2: SELECT * FROM users WHERE birth_year = 1981;

However, the following query will be rejected:

Query 3: SELECT * FROM users WHERE birth_year = 1981 AND country = ‘AU’;

This is because Cassandra cannot guarantee that it won’t have to scan a large amount of data even if the result of the query is small. Typically, it will scan all the index entries for users born in 1981 even if only a handful are actually from Australia. However, if you “know what you are doing”, you can force the execution of this query by using ALLOW FILTERING and so the following query is valid:

Query 3 (with ALLOW FILTERING):SELECT * FROM users WHERE birth_year = 1981 AND country = ‘AU’ ALLOW FILTERING;

When using secondary indexes, it is important to understand that all three of these queries will need to hit every node in the cluster (or at least a complete set of replicas) in order to return results. In queries like this in Cassandra, one node in the cluster will act as the coordinator node, send out the query to the other nodes in the cluster that need to participate and collating the results to send back to the client. For Query 2, the coordinator will need to send a subquery  to each node in the cluster which in turn will look up the index for the data it stores and return matching rows to the coordinator which in turn returns them to the client. Query 3 is executed in much the same manner but the individual nodes not only look up the index and return data but also filter it before returning.

Thus for Query 3 the size of the result set is not related to the amount of data scanned and ALLOW FILTERING is required. However, both Query 2 and Query 3 require all nodes in the cluster to participate in the query and thus the amount of work required to complete the query increases as the cluster size grows and you will not see the linear horizontal scaling that Cassandra is famous for.

So, while ALLOW FILTERING should definitely be avoided, any query that does not specify a partition key should also be avoided to enable your Cassandra cluster to scale.

Site by Swell Design Group