For quite some time, Instaclustr has been tuning the number of vnodes that we use when deploying large clusters. Recently, we have extended this to make 16 vnodes the default for all new Cassandra 3+ clusters deployed. This blog post explains the background and benefits of this change.
The concept of virtual nodes (otherwise known as vnodes) has been a major feature of Apache Cassandra since it was introduced in version 1.2, back at the start of 2013. As far as major new Cassandra features go it has been very successful, however, a full understanding of the impact on real production clusters (and therefore the best practices regarding usage) has evolved over time.
Vnodes were originally implemented to solve several overlapping problems caused by only having a single contiguous token-range per node:
- For any given replication factor r, there were at most only r nodes available to stream copies of each node’s data. This limited the maximum throughput for tasks like repair or when bootstrapping new nodes. When a node is offline or inconsistent we generally want to get it to a consistent state as quickly as possible.
- Adding (and removing) nodes to an established cluster would either require a series of expensive token movements or require the cluster to double (or halve) in size to ensure it remained balanced.
- It was not a trivial task to tune the load amongst nodes in order to take advantage of different (more powerful) hardware on some nodes. This required some educated guesswork, some trial-and-error, and a lot of time.
The idea of splitting each node’s token-range into multiple smaller ranges (virtual nodes) and spreading these around the ring has mostly proven to be an effective solution to those problems. Each node can now be responsible for multiple token-ranges, instead of only one. This means that a single node now shares replicas with a much larger proportion of nodes within the cluster, and is able to stream data to many other nodes, rather than only a limited few.
To ensure that deployment remained simple, a token generation algorithm was implemented that randomly generated the set of tokens for each node when that node bootstrapped (alternatively, it is also possible to set manually-crafted token values). A default value of 256 tokens per node was chosen in order to ensure that the randomly generated tokens would provide an even distribution of the token space. This value can be configured via the num_tokens option.
As time went by, it was discovered that vnodes also caused some negative performance impacts in certain cases and that they don’t always ensure that the cluster is evenly balanced. The large number of vnodes required to achieve a balanced cluster introduces performance overheads on many operational tasks (such as repair) and can increase overheads when using analytical tools like Apache Spark or Hadoop map-reduce with Cassandra, causing some tasks to take more time to complete. Similar performance overheads have been encountered when using search engines like Cassandra-Lucene-index, DSE Search, or Elassandra with Cassandra. In some use-cases these overheads are not significant enough to trade-off the consequences of not using vnodes, however, in many cases, the impact can be very large. Reducing the number of vnodes can help in this situation, however, with older Cassandra versions this often caused the cluster to become more unbalanced.
The biggest problem with vnodes appears once a cluster grows to a relatively large number of nodes. As the size of a cluster increases, so does the chance of hot spots – nodes that happen to own more token-space than the others – forming around the cluster. Since the number of vnodes cannot be easily modified once a cluster is running (especially on a production cluster), choosing the optimal number of vnodes when first deploying a cluster can be an important decision.
A new approach
In Cassandra 3.0, this issue was addressed with a new approach to vnode allocation. Rather than allocating tokens randomly, new tokens may instead be allocated such that the largest token ranges are split, with the goal of reducing hot spots. This new approach is optional and requires that a target keyspace is specified. The replication factor of the target keyspace is used to optimise the placement of new tokens. To achieve the best results, all keyspaces that hold a large amount of data should use the same replication factor as the target keyspace.
While the main goal of the new allocation algorithm is to reduce token-space imbalance for larger clusters, it also reduces imbalance for smaller clusters. This allows the number of vnodes to be significantly reduced, while still achieving a similar level of balance.
We have performed testing on both moderate sized clusters (20-500 nodes, with multiple racks and multiple DCs) and smaller clusters, and have found that a num_tokens value of 16 is optimal for the majority of use cases when used in conjunction with the new allocation algorithm. This appears to be the smallest value which still provides an acceptable distribution of the token-space. When using the old random-allocation algorithm (which remains the default), we recommend that the default value of 256 is used in order to achieve a balanced token allocation.
In our testing, we found that a variance of roughly 15% between the nodes with the smallest and largest token space allocation can be expected when using a num_tokens value of 16 along with the new allocation algorithm. Moving up to a num_tokens value of 32 provided only a modest decrease, producing a variance of 11%, whereas a cluster with only 8 tokens would have a significant increase with 22% variance.
An interesting anomaly was discovered when testing with a num_tokens value of 8. The balance variance was smaller for small clusters of 6-12 nodes and increased steadily up to 22% for a 60 node cluster. It did not appear to rise from this point onwards, as this value remained steady for a 120 node cluster. This could be due to the fact that with only 8 tokens per node there is still a high chance that new token ranges will be unbalanced when adding nodes to smaller clusters.
All Cassandra clusters recently deployed with Instaclustr are already using the new algorithm with a num_tokens value of 16, and this will remain the default configuration for all new Instaclustr clusters going forward. The target keyspace used for the allocation algorithm is automatically created based on the target replication factor chosen at cluster creation. Existing managed clusters will remain using the default randomised allocation algorithm as change this setting would require a full data center migration.