A number of recent events have had us thinking more than usual about maximising availability with Apache Cassandra.
The first of these was the recent AWS S3 outage. While this outage did not affect any of our managed clusters (other than a few backup jobs being skipped), it obviously did affect many services on the Internet. The lack of impact on our customer clusters was primarily because we make only limited use of S3, but the cascading impact of the S3 outage illustrates some of the difficulty in determining your actual availability exposure when using highly integrated services.
The second event was when we recently got involved in providing emergency support to an organisation running their own cluster that had got into a serious cascading failure state. While this organisation had followed many of the basic recommendations for building a highly available Cassandra cluster, they still suffered a significant period of effective outage, stemming from gaps in the broader technical strategies needed to maintain a healthy Cassandra cluster.
Both of these events got me thinking: what can our customers and others do to minimise the chance of issues like this occurring, and how can they minimise the impact if they do occur?
Firstly, let’s cover the basics of Cassandra HA. Anyone running a Cassandra cluster should understand these, and if you don’t have most of them in place in your cluster you’re likely running a significant availability risk.
It goes without saying that clusters provisioned through the Instaclustr Managed Service are configured with these best practices by default:
- Run at least 2 (but probably 3) nodes. OK, that’s really basic, but we have seen customers ignore it. In most cases you’ll want to operate with strong consistency, which requires at least 3 nodes. This allows your Cassandra service to continue uninterrupted if you suffer a hardware failure or some other loss of the Cassandra service on a single node (and this will happen sooner or later).
- Make sure your replication factor and replication strategy are correct. There is not much point in having three nodes if your replication factor is set to one (again obvious, and again we have seen it). A replication factor of 3 is the standard for most use cases. Also, make sure you’re using NetworkTopologyStrategy as your replication strategy: it’s necessary in order to take advantage of other features such as racks and data centres. Even if you don’t think you need NetworkTopologyStrategy now, there are many reasons you may want to use it in the future and no real disadvantage to enabling it from the start.
- Use Cassandra racks aligned to actual failure zones: rack configuration in Cassandra is designed to place the replicas of your data so as to minimise the chance of losing access to more than one replica in a single incident. In AWS, map your racks to availability zones. Other cloud providers have similar concepts. In your own data centre, actual physical racks are probably a good choice.
- Use a fault-tolerant consistency level: all the replicas in the world won’t help your availability at a consistency level of ALL. Most people use QUORUM or LOCAL_QUORUM, which allow operations to continue with strong consistency despite a failed node. The main mistake we see here (at the small end) is not realising that QUORUM with a replication factor of two still requires both nodes, so you have no protection against the failure of a single node.
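The replication factor and consistency level arithmetic above can be sketched in a few lines of Python (hypothetical helper names, not part of any driver): a QUORUM operation must reach ⌊RF/2⌋ + 1 replicas, so the number of replicas that can be down while operations continue is RF minus that quorum size.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas a QUORUM read or write must reach."""
    return replication_factor // 2 + 1

def tolerable_failures(replication_factor: int) -> int:
    """Replicas that can be unavailable while QUORUM operations still succeed."""
    return replication_factor - quorum(replication_factor)

# RF=3: quorum is 2, so one replica can be lost without impact.
assert quorum(3) == 2 and tolerable_failures(3) == 1

# RF=2: quorum is 2 -- *both* replicas are required, so QUORUM gives
# no protection at all against the failure of a single node.
assert quorum(2) == 2 and tolerable_failures(2) == 0
```

This is why a replication factor of 2 with QUORUM is a common trap: it doubles your storage cost without buying any fault tolerance.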
All of the points above are really the basic hygiene factors that should apply to any production Cassandra cluster unless you have carefully understood the risk and are prepared to accept it. These basic factors take care of single instance hardware failure, processes crashing on a single machine and other similar basic outage scenarios.
Following are further steps you can take to maximise availability at the architecture level and provide protection against more significant physical infrastructure level failure:
- Multi-datacentre: one scenario many people look to guard against is failure of a complete cloud provider region or, if on-prem, a physical data centre. Cassandra has you covered here: its native multi-datacentre support allows a hot standby of your cluster to be maintained remotely using native Apache Cassandra features.
- Multi-provider: taking the multi-datacentre scenario a step further, it is possible to run a Cassandra cluster spanning not just cloud provider regions but also multiple cloud providers (or even on-prem and cloud). Our Instaclustr Apache Cassandra managed service provides support for both multi-region and multi-provider clusters.
- Replication factor 5: running a replication factor of 3 protects you against loss of a single replica (including, with 3 racks, a full rack). However, this still leaves you in a high-risk situation any time you have one replica down for maintenance. Some Cassandra maintenance operations, particularly streaming data to populate a replaced node, can take a long time (hours or even days), making this risk significant in some scenarios. Running a replication factor of 5 (and operating at QUORUM consistency level) means you still have protection from single-replica hardware failure, even during these maintenance operations.
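The maintenance-window argument for RF 5 can be checked with a small sketch (pure Python, hypothetical names): QUORUM succeeds as long as the replicas still up meet or exceed ⌊RF/2⌋ + 1, so the interesting case is one replica down for maintenance plus one unexpected failure.

```python
def quorum(rf: int) -> int:
    """Replicas a QUORUM operation must reach: floor(RF/2) + 1."""
    return rf // 2 + 1

def can_survive(rf: int, down: int) -> bool:
    """True if QUORUM operations still succeed with `down` replicas unavailable."""
    return rf - down >= quorum(rf)

# RF=3: fine with one replica down, but a hardware failure *during*
# maintenance (two replicas down) breaks QUORUM.
assert can_survive(3, 1) and not can_survive(3, 2)

# RF=5: one replica down for maintenance plus one unexpected failure
# still leaves the three replicas that QUORUM needs.
assert can_survive(5, 2) and not can_survive(5, 3)
```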
There are other scenarios that can impact the availability of a cluster. The first relatively common cause is overload of the cluster. This can be caused by a sudden increase in client load (particularly when exacerbated by poor retry policies; see the application strategies section below) or by other events in the cluster such as a dramatic increase in tombstones or a rogue repair operation.
There are three main infrastructure-level approaches to mitigate the risk of these types of issues:
- Keep spare capacity: if your cluster has spare capacity it is more likely to be able to service an unexpected increase in load without getting into an unstable state. For example, we recommend that disk usage is kept below 70% under normal load. Similarly, 60-70% CPU utilisation is a reasonable ceiling for safe ongoing operation. It’s possible to push harder if you understand your cluster, but doing so increases the risk of unexpected events causing instability or even catastrophic failure.
- Prefer smaller nodes: smaller nodes have two significant availability advantages. Firstly, maintenance operations such as replacing nodes or upgrading sstables will complete more quickly on smaller nodes, reducing the risk of an outage during such operations. Secondly, the impact of losing a single node on the overall processing capacity of the cluster is smaller when there are more, smaller nodes. To give a specific example, a cluster of 6 x m4.xl AWS instances will have similar (and possibly greater) processing capacity to 3 x m4.2xl for the same price. However, losing a single node removes ⅙ of the capacity of the m4.xl cluster but ⅓ of the m4.2xl cluster. There is a balance here: nodes can get too small, at which point the base OS and other overhead becomes too significant. We like the m4.xl and r4.xl as good production node sizes for small to medium clusters. Running more, smaller nodes increases the need for an automated provisioning and management system such as the one we use in the Instaclustr Managed Service.
- Have sufficient monitoring (and pay attention!): Most times when we see a cluster in real trouble, there have been plenty of warning signs beforehand – increasing pending compactions, tombstone warnings, large partition warnings, latency spikes, dropped mutations. Paying attention to these issues and correcting them when they first appear as inconveniences can prevent them progressing to major failures.
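The spare-capacity and smaller-nodes points interact: the utilisation ceiling exists partly so that the load absorbed from a failed node doesn’t tip the survivors into overload. A rough sketch (pure Python, assuming the failed node’s load redistributes evenly, which real token allocation only approximates):

```python
def load_after_node_loss(nodes: int, utilisation: float) -> float:
    """Average per-node utilisation after one node fails, assuming the
    surviving nodes absorb its share of the load evenly."""
    return utilisation * nodes / (nodes - 1)

# 6 smaller nodes at 65% CPU: survivors rise to 78% -- tight but workable.
assert round(load_after_node_loss(6, 0.65), 3) == 0.78

# 3 larger nodes at 65% CPU: survivors rise to 97.5% -- effectively overloaded,
# and retry storms can push this past the point of recovery.
assert round(load_after_node_loss(3, 0.65), 3) == 0.975
```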
The strategies above (with the possible exception of consistency levels) can all be applied at the Cassandra architecture level, relatively transparently to application logic. There are some further steps that can be taken at the application level to mitigate the risk and impact of Cassandra issues:
- Driver configuration: the Cassandra drivers contain several features that help mitigate cluster outages, for example routing queries away from slow nodes and falling back to lower consistency levels when required. Understanding these features and ensuring the driver is optimally configured for your requirements can help deal with many issues.
- Retry strategies: above the level of the driver, many applications implement a strategy to retry failed queries. Poor retry policies without a back-off strategy can be a significant problem. Firstly, queries that time out due to issues such as a build-up of tombstones will be retried, putting additional load on the cluster. Secondly, if load gets to the point where many queries are failing, application retries will multiply the load on the cluster until it inevitably reaches a point of catastrophic failure.
- Use multiple clusters: even if you’ve applied all the measures above, there is still a chance (however small) that something will cause your cluster to fail. Particular risks include application bugs and human error by someone with administrator-level privileges (of course, there is plenty you can do to reduce these risks, but that’s outside the scope of this article). A key mitigation against any of these remaining risks is to use multiple clusters. Consider, for example, a global application where presence in 6 regions is desired. You could build a single cluster with six data centres. However, with that architecture a single action could still affect your entire global cluster. Building 3 clusters of two data centres each would not entirely protect you from this kind of disaster, but it does mean that a global Cassandra outage is effectively impossible (possibly at the cost of increased latency for users who move around the globe). Similarly, splitting your cluster by application function may be possible: for example, one cluster for raw data ingestion and one for storing and retrieving analytics results derived from that raw data. This pattern should mean that part of your application will continue to function (at least for a period) even if one cluster fails, making a complete application outage as a result of Cassandra failure next to impossible.
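The back-off point above can be illustrated with a minimal retry helper (a pure-Python sketch with hypothetical names; in practice you would lean on your driver’s retry and speculative execution policies rather than rolling your own). Capped exponential back-off with full jitter spreads retries out instead of multiplying load on an already-struggling cluster:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Run `operation`, retrying on failure with capped exponential
    back-off and full jitter so retries don't arrive in synchronised waves."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the failure to the caller.
            # Sleep a random amount up to base * 2^attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))

# Example: an operation that fails twice before succeeding.
calls = {"n": 0}
def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("timeout")
    return "ok"

assert retry_with_backoff(flaky_query, base_delay=0.001) == "ok"
assert calls["n"] == 3  # two failures, then success
```

Contrast this with a naive immediate-retry loop: under partial failure, every client multiplies its own load by the retry count at exactly the moment the cluster can least afford it.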
Many leading Internet services have shown that, by deploying strategies like these, it is possible to achieve extremely high levels of availability with Cassandra. At Instaclustr, we see a big part of our job as making these levels of availability easy and accessible so you can focus on building your application rather than spending time and effort at the infrastructure and operations layers. Visit our Managed Solutions and Cassandra Consulting page to learn more.