We often get questions from customers about the best way to add capacity to their cluster. Is it better to add nodes, or simply to increase the capacity in their nodes? Unfortunately, the truth is there is no best way—like all complex issues in distributed systems, there are benefits and drawbacks to each scaling approach.
There can be many reasons for needing to scale your cluster. However, we have found that the 2 most common are:
- Not enough disk capacity
- Not enough processing capacity
What is Horizontal Scaling?
Horizontal scaling, or scaling out, is the act of adding more of the same size nodes to your cluster. For example, this is going from a 6-node cluster to a 9-node cluster. This form of scaling can be broken up into 3 stages:
Before Adding Nodes: At this stage, you are running a 6-node cluster. You may realize that you need to increase capacity, so you make the decision to add 3 nodes to your cluster.
Re-Streaming Data: During this stage you have an additional 3 nodes coming online to your cluster to add capacity. However, before these new nodes can begin servicing requests, they first need to stream the relevant partitions of data onto their own disk. Cassandra refers to this stage as “Joining”. Depending on the amount of data in your cluster, and size of your nodes, this can take hours or days to complete.
The major thing to consider during this stage is that although you have added 3 nodes to your cluster, you actually have reduced processing capacity! That’s because your original 6 nodes are still servicing all your requests, but now also servicing the streaming requests from the joining nodes! It’s for this reason that we never recommend scaling horizontally on a cluster which is under a high processing load.
Cleaning Up Old Data: At this stage, you have copied all your data to your new nodes, and they are now servicing both read and write requests from your application. However, your original nodes are now storing an additional copy of the data that they streamed to the newly joined nodes! We need to reclaim that used disk space. In Cassandra, this is known as a “Cleanup” and will need to be run in order to get back that disk space.
Scaling Back Down: If you decide later that you wish to reduce the capacity of your cluster back down to 6, then you will need to run a decommissioning process. This process involves ensuring that the copies of data that are on your node you wish to remove are relocated to the correct remaining nodes. Depending on the size of your cluster, and the amount of data, this can take hours to days to complete.
What is Vertical Scaling?
Vertical scaling, or scaling up, is when you increase the size of the nodes in your cluster, providing increased processing capacity, or disk size. This can be broken up into 4 main stages.
Before Increasing Capacity: At this stage, you are running a 6-node cluster. You again realize that you need to add capacity and decide to double the storage and processing capacity on each node. You will perform the next 2 steps one by one on each node:
Taking the Instance Down (Instance Increase): If you are increasing the instance size you are utilizing then you will first need to stop the instance before you can resize it in your chosen cloud provider. It’s again worth highlighting that at this stage you will be reducing both processing and disk capacity in your cluster while the instances are down.
If you do wish to only increase the size of your storage, this can be achieved without taking down the application, which does not result in any reduction in processing capacity.
Performing the Resize: During this stage you will resize the instance and/or disk, in your chosen cloud provider. Once that has been completed, you will need to restart the instance and ensure the cluster is healthy before moving onto the next instance until they are all completed. Depending on the application, you may need to modify some of the application settings to utilize the additional resources.
Generally, you can complete the vertical scaling in around 15 minutes per node.
It’s worth highlighting that if you are using an instance type with attached storage, vertical scaling generally requires replacing each instance with a larger size, and re-streaming the data. We have implemented the Advanced Node Replace mechanism, which is able to reduce the performance impact during these operations for instance store nodes running Cassandra.
Scaling Back Down: If you decide later that you wish to reduce the capacity of your cluster back down to the smaller instance size, it depends on if you need to reduce the instance size or the disk size.
Instance Only: If you are only reducing the size of the instance and not the size of the disk, you can essentially repeat the process, but instead of increasing the instance size, you can decrease it.
Reducing Disk: Reducing the size of a disk is not as straightforward—some cloud providers will not let you reduce the size of networked disk storage. If you need to scale the disk down also, then you will need to essentially replace your node, which usually involves a re-streaming process moving around a large amount of data.
Horizontal vs vertical Scaling – Which Is Right for You?
Both horizontal and vertical scaling are great options, and both suit different use cases.
Vertical scaling is great when you need to quickly add capacity to a cluster. When you only increase the size of the instance, without modifying the underlying storage, you also retain the ability to quickly reduce the cluster back down in size. Vertical scaling is what our Customer-Initiated Resize feature automates for customers. We recommend vertical scaling for short-term or seasonal peaks in processing requirements. This is because after the operation is completed, you can reduce the size of your cluster quickly.
Additionally, when running on smaller instances (generally around the m5.large and below instance size), we would always recommend scaling vertically before scaling horizontally.
The operating system and other tooling on instances carry an overhead which is larger compared to the processing capacity on the smaller sizes. So you could expect more than double the performance by doubling the instance size.
On the other end of the spectrum, nodes with storage that are too large can cause operational issues during periods of maintenance, as when an instance does fail, you are required to re-stream a large amount of data. Additionally, because you have fewer nodes, you are likely to put a large strain on all the other replicas in the cluster to stream the data.
Horizontal scaling is great for long term baseline increases, where you are unlikely to need to reduce the size of your cluster. This could be due to increased usage of your application, change in workload, or other factors. By having more instances with smaller storage, you reduce the amount of data that has to be re-streamed in the event of an instance failure. This means that when an instance does fail, you are able to bring your application online much faster, as it is able to stream from more replicas. If you’re an Instaclustr customer using our Managed Platform, you are able to add nodes to your cluster via the Console.
|Vertical Scaling||Horizontal Scaling|
|Time To Scale Up||Faster (15 minutes per node)||Slower (hours to days per node)|
|Time To Scale Down||Faster (15 minutes per node) if no disk reduction||Slower (hours to days per node)|
|Operational Considerations||Storage dense nodes can cause longer recovery periods, and more reduced performance, after maintenance||Should always scale vertically first if using small instance sizes|
|When to Start Scaling||Before your cluster reaches capacity||Before your cluster reaches capacity|
|Best Use Cases||Short term increased loadRequirement to scale quickly||Sustained increased load|
No matter which scaling method you use, we always recommend scaling up the capacity of your cluster well before it is required. This is because any scaling method will reduce the processing capacity of your cluster, which can significantly overload an already strained cluster.