The last major version release of Apache Cassandra was 3.11.0, and that was more than 2 years ago in 2017. So what has the Cassandra developer community been doing over the last 2 years? Well, let me tell you, it’s good, real good. It’s Apache Cassandra 4.0! The final release is not here yet, but with the release of the first alpha version, we now have a pretty solid idea of the features and capabilities that will be included in the final release.
In this series of blog posts, we’ll take a meandering tour of some of the important changes, cool features, and nice ergonomics that are coming with Apache Cassandra 4.0.
The first blog of this series focuses on stability and testing.
Apache Cassandra 4.0: Stability and Testing
One of the explicit goals for Apache Cassandra 4.0 was to be the “most stable major release of Cassandra ever”.
As those who’ve run Cassandra in production know, it was generally advisable to wait for up to 5 or 6 minor versions before switching production clusters to a new major version. As a result, adoption only occurred late in the supported cycle for a given major version. All in all, this was not a great user experience, and frankly a pretty poor look for a database, which is the one piece of your infrastructure that really needs to operate correctly.
In order to support a stable and safe major release, a significant amount of effort was put into improving Apache Cassandra testing.
The first improvement is the ability to run multi-node/coordinator tests in a single JVM (https://issues.apache.org/jira/browse/CASSANDRA-14821). This allows us to test distributed behavior with Java unit tests for quicker, more immediate feedback, rather than having to rely on the longer-running, more intensive DTests. This paid off immediately, identifying typically hard-to-catch distributed bugs such as https://issues.apache.org/jira/browse/CASSANDRA-14807 and https://issues.apache.org/jira/browse/CASSANDRA-14812. It also resulted in a number of folks backporting this to earlier versions to assist in debugging tricky issues.
Interestingly, the implementation makes nice use of distinct Java class loaders to get around Cassandra’s horrid use of singletons everywhere, which allows it to fire up multiple Cassandra instances in a single JVM.
From the ticket: “In order to be able to pass some information between the nodes, a common class loader is used that loads up Java standard library and several helper classes. Tests look a lot like CQLTester tests would usually look like.
Each Cassandra Instance, with its distinct class loader is using serialization and class loading mechanisms in order to run instance-local queries and execute node state manipulation code, hooks, callbacks etc.”
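To illustrate the class loader trick in isolation, here is a minimal sketch and not the code from the ticket. NodeState, InstanceClassLoader, and ClassLoaderIsolationSketch are hypothetical names invented for this example. NodeState stands in for a singleton with static state, and each "instance" gets its own class loader that defines a private copy of that class while delegating everything else to a shared parent. It assumes the compiled classes are readable as resources from the parent class loader.

```java
import java.io.IOException;
import java.io.InputStream;
import java.lang.reflect.Method;

class ClassLoaderIsolationSketch {
    // Calls NodeState.start() on the copy of NodeState owned by the given loader.
    static int startOn(ClassLoader loader) {
        try {
            Class<?> cls = loader.loadClass(NodeState.class.getName());
            Method start = cls.getMethod("start");
            start.setAccessible(true); // NodeState is not a public class
            return (Integer) start.invoke(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Starts "node 1" twice and "node 2" once, returning the three counter values.
    static int[] runDemo() {
        ClassLoader shared = ClassLoaderIsolationSketch.class.getClassLoader();
        String name = NodeState.class.getName();
        ClassLoader node1 = new InstanceClassLoader(shared, name);
        ClassLoader node2 = new InstanceClassLoader(shared, name);
        return new int[] { startOn(node1), startOn(node1), startOn(node2) };
    }

    public static void main(String[] args) {
        int[] counts = runDemo();
        System.out.println("node1: " + counts[0] + ", " + counts[1] + "; node2: " + counts[2]);
    }
}

// Hypothetical stand-in for process-wide singleton state (not a Cassandra class).
final class NodeState {
    private static int startedCount = 0;

    public static int start() {
        return ++startedCount;
    }
}

// Defines its own copy of one class; every other class is delegated to the parent.
final class InstanceClassLoader extends ClassLoader {
    private final String isolatedName;

    InstanceClassLoader(ClassLoader parent, String isolatedName) {
        super(parent);
        this.isolatedName = isolatedName;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        if (!name.equals(isolatedName)) {
            return super.loadClass(name, resolve);
        }
        synchronized (getClassLoadingLock(name)) {
            Class<?> loaded = findLoadedClass(name);
            if (loaded == null) {
                // Read the class bytes via the parent, but define the class in this loader.
                String resource = name.replace('.', '/') + ".class";
                try (InputStream in = getParent().getResourceAsStream(resource)) {
                    if (in == null) {
                        throw new ClassNotFoundException(name);
                    }
                    byte[] bytes = in.readAllBytes();
                    loaded = defineClass(name, bytes, 0, bytes.length);
                } catch (IOException e) {
                    throw new ClassNotFoundException(name, e);
                }
            }
            if (resolve) {
                resolveClass(loaded);
            }
            return loaded;
        }
    }
}
```

Because each loader defines its own NodeState, the static counter is per "node": the second call on node 1 sees the value left by the first, while node 2 starts from scratch. Anything that has to cross between instances must go through classes loaded by the shared parent, which is why the ticket describes a common class loader plus serialization.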
On top of this, the community has started adopting QuickTheories as a library for introducing property-based testing. Property-based testing is a nice middle ground between unit tests and fuzzing. It allows you to define a range of inputs and explore that input space (and beyond) in a repeatable and reproducible manner.
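To make the idea concrete, here is a hand-rolled sketch of a property-based test using only the Java standard library. It deliberately does not use the QuickTheories API, and it is not taken from the Cassandra test suite. The frame format (payload followed by a 4-byte CRC32) and all names are assumptions made for illustration. Inputs are generated from a fixed seed, so any failure can be reproduced by rerunning with the same seed.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Random;
import java.util.zip.CRC32;

class ChecksumPropertySketch {
    // Assumed frame layout for this sketch: payload bytes followed by a 4-byte CRC32.
    static byte[] encode(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        ByteBuffer frame = ByteBuffer.allocate(payload.length + 4);
        frame.put(payload);
        frame.putInt((int) crc.getValue());
        return frame.array();
    }

    // Verifies the trailing checksum and returns the payload, or throws on a mismatch.
    static byte[] decode(byte[] frame) {
        if (frame.length < 4) {
            throw new IllegalArgumentException("frame too short");
        }
        byte[] payload = Arrays.copyOf(frame, frame.length - 4);
        int stored = ByteBuffer.wrap(frame, frame.length - 4, 4).getInt();
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        if ((int) crc.getValue() != stored) {
            throw new IllegalStateException("checksum mismatch");
        }
        return payload;
    }

    // Random payload of length 0..maxLength, fully determined by the Random's seed.
    static byte[] randomPayload(Random random, int maxLength) {
        byte[] payload = new byte[random.nextInt(maxLength + 1)];
        random.nextBytes(payload);
        return payload;
    }

    // Property 1: decode(encode(x)) equals x. Returns a failing input, or null if none found.
    static byte[] findRoundTripCounterexample(long seed, int trials, int maxLength) {
        Random random = new Random(seed);
        for (int i = 0; i < trials; i++) {
            byte[] payload = randomPayload(random, maxLength);
            if (!Arrays.equals(payload, decode(encode(payload)))) {
                return payload;
            }
        }
        return null;
    }

    // Property 2: flipping any single bit of a frame makes decode reject it.
    static boolean detectsBitFlips(long seed, int trials, int maxLength) {
        Random random = new Random(seed);
        for (int i = 0; i < trials; i++) {
            byte[] frame = encode(randomPayload(random, maxLength));
            int bit = random.nextInt(frame.length * 8);
            frame[bit / 8] ^= (byte) (1 << (bit % 8));
            try {
                decode(frame);
                return false; // corruption went unnoticed
            } catch (IllegalStateException expected) {
                // the corrupted frame was rejected, as the property requires
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long seed = 42L;
        System.out.println("round trip holds: " + (findRoundTripCounterexample(seed, 1000, 256) == null));
        System.out.println("bit flips detected: " + detectsBitFlips(seed, 1000, 256));
    }
}
```

A real property-based testing library adds a good deal on top of this, such as richer generators and shrinking of failing inputs to a minimal case, but the core loop is the same: state a property, generate many inputs, and report the seed or input that breaks it.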
Currently, in trunk there are two test classes that have adopted property-based testing: EncodingStatsTest and ChecksummingTransformerTest. However, community members are also using it in their own internal validation test frameworks for Cassandra and have been contributing bug reports and patches back to the community as well.
Moving beyond correctness testing, a significant amount of effort has gone into performance testing, especially with the change to adopt Netty as the framework for internode messaging. So far testing has included, but definitely has not been limited to:
- https://issues.apache.org/jira/browse/CASSANDRA-14765 (Node recovery time)
- https://issues.apache.org/jira/browse/CASSANDRA-15175 (Large cluster performance)
Probably the best indication of the amount of work that has gone into testing the Netty rewrite can be seen in CASSANDRA-15066 (https://issues.apache.org/jira/browse/CASSANDRA-15066), which is well worth a read if you are into that kind of thing.