News Thursday 23rd September 2021

Updating Instaclustr’s Open Source Tooling for Apache Cassandra 4.0

By Stefan Miklosovic

With Apache Cassandra 4.0 recently released, we have updated the open source tooling to support Cassandra 2.0 through 4.0. In some cases that merely required the tool to be tested with Cassandra 4.0, but in others, significant refactoring of the code has occurred to support all three versions. Below I walk through all the different tools that have been updated for Cassandra 4.0. If you want to find out more about what’s new in this release you can read this blog from our CTO Ben Bromhead which pulls together his series of articles on Cassandra 4.0.

Cassandra Lucene Index Plugin

We spent a great deal of time updating the Cassandra Lucene plugin to Cassandra 4. The plugin brings Lucene queries to Cassandra, which is a very powerful way to get maximum insight from your data. It is implemented as a Cassandra secondary index.

Enumerating all the changes required to achieve this would take more than a blog post. But we can proudly say that you can now run Lucene queries against a Cassandra 4.0 cluster.

The most time-consuming refactoring was around package names and the correct imports of classes that changed between Cassandra 3 and 4. Changes were also needed in the serialization and deserialization of collections (maps, lists, and sets), including the recursive cases such as maps in a map or lists in sets; you can search on these too. There were also changes related to the initialization of the secondary index plugin.

We have also updated Scala to 2.13 and greatly improved testability: the plugin is now fully self-contained when it comes to tests and does not need any external runtime dependencies. This greatly simplifies development, as the plugin is testable with standard JUnit 5 tests, where a Cassandra node running the plugin is started automatically as part of the test suite.

Esop: Cloud-Enabled Backup and Restore Tool for Apache Cassandra

Esop is a tool for Apache Cassandra backup and restore. It is open sourced under the Apache 2.0 License and you can find it on GitHub. Esop has a comprehensive set of backup and restore features for Apache Cassandra in a diverse set of environments:

  • Backup and restore of Cassandra SSTables and commit logs
  • Live backup as well as restore on a running cluster and restoration of a node which is stopped
  • Compatible with AWS S3, GCP, Azure, Ceph, and Oracle storage
  • The tool is aware of what is already stored and can perform incremental backups
  • Backup bandwidth can be throttled
  • Backup management with configurable periodic removal of older backups
  • Supports renaming of a table to be restored from backup and cross-keyspace restoration
  • Supports backups in a cluster-wide or per-DC manner
  • Verification of downloaded data
  • Takes care of details like initial_token, auto bootstrapping
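
The incremental backup behavior in the list above can be illustrated with a small sketch. This is not Esop's actual implementation: the function name `files_to_upload`, the comparison by file name only, and the example file names are all assumptions made for illustration. The idea is that SSTable files are immutable once written, so a file already present in remote storage does not need to be sent again.

```python
from typing import Iterable, List, Set


def files_to_upload(local_files: Iterable[str], already_stored: Set[str]) -> List[str]:
    """Return the local SSTable files that are missing from remote storage.

    Hypothetical helper: a real tool would likely also compare sizes or
    checksums rather than relying on names alone.
    """
    return sorted(f for f in local_files if f not in already_stored)


# Example file names are made up for illustration.
local = ["nb-1-big-Data.db", "nb-2-big-Data.db", "nb-3-big-Data.db"]
remote = {"nb-1-big-Data.db", "nb-2-big-Data.db"}
print(files_to_upload(local, remote))  # ['nb-3-big-Data.db']
```

Only the file that is not yet stored remotely is selected, which is what keeps a repeated backup cheap.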

Icarus: Sidecar for Cassandra With Integrated Backup/Restore

Icarus is a cloud-ready and Kubernetes-enabled sidecar for Apache Cassandra. Icarus is used as a sidecar in Instaclustr’s Kubernetes operator and Orange’s Kubernetes operator for Apache Cassandra. It pairs with Esop (above) to enable backup and restore operations for Apache Cassandra when deployed on Kubernetes.

Updating Icarus to run against Cassandra 4.0 was rather easy, as the communication from Esop to Cassandra is done via JMX, so nothing had to be changed. One thing is different from Cassandra 3, though. In Cassandra 4.0 you can import SSTables that live somewhere on the local disk on the fly via JMX. One minor drawback of this feature during the release candidates of Cassandra 4.0 was that loading such SSTables would effectively move them instead of copying them, so you would lose your backup after loading it. This was addressed by our engineers in CASSANDRA-16407, where copying capability was introduced.
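
The difference between moving and copying on import can be shown with plain file operations. The sketch below does not call Cassandra or JMX at all; `load_sstables` and `demo` are hypothetical helpers and the file name is invented. It only demonstrates why a move-based import loses the local backup while a copy-based import keeps it.

```python
import shutil
import tempfile
from pathlib import Path


def load_sstables(source: Path, target: Path, copy: bool) -> None:
    """Place every file from `source` into `target`, by copy or by move."""
    target.mkdir(parents=True, exist_ok=True)
    # sorted() materializes the listing before any file is moved.
    for f in sorted(source.iterdir()):
        if not f.is_file():
            continue
        if copy:
            shutil.copy2(f, target / f.name)
        else:
            shutil.move(str(f), str(target / f.name))


def demo(copy: bool) -> bool:
    """Return True if the backup file still exists after loading."""
    with tempfile.TemporaryDirectory() as tmp:
        backup = Path(tmp) / "backup"
        data = Path(tmp) / "data"
        backup.mkdir()
        (backup / "example-Data.db").write_bytes(b"sstable bytes")
        load_sstables(backup, data, copy=copy)
        return (backup / "example-Data.db").exists()


print(demo(copy=False))  # False: the move removed the backup file
print(demo(copy=True))   # True: the copy left the backup file in place
```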

Instaclustr Minotaur: Cluster Rebuilding

Instaclustr’s Minotaur is a command line tool for Apache Cassandra cluster rebuilding. It rebuilds clusters via read-repairing. With standard rebuilds, Cassandra does not have any special logic with regard to which replica it streams from. So in the case of joining and rebuilding a DC, all replicas in the new DC might stream from the same replica in the source DC. This means that if that source replica is inconsistent and missing some data, the new DC will be missing that data too.

Without the tool, ensuring that all data is streamed from the source DC to the new DC would require running a repair, which significantly increases the amount of time it takes to complete a cluster migration. Minotaur aims to ensure that each replica in the new DC streams from a different replica in the source DC. In that way the new DC will have at least the same level of consistency as the DC it is streaming from.
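
The goal of giving each new replica a different source can be sketched as a simple round-robin pairing. This is an illustration of the idea only, not Minotaur's actual algorithm; the function `assign_sources` and the node names are hypothetical.

```python
from typing import Dict, List


def assign_sources(new_replicas: List[str], source_replicas: List[str]) -> Dict[str, str]:
    """Pair each replica in the new DC with a source replica, round-robin.

    When both DCs hold the same number of replicas, every new replica
    gets a distinct source, so no single source is the only one read.
    """
    if not source_replicas:
        raise ValueError("at least one source replica is required")
    return {
        new: source_replicas[i % len(source_replicas)]
        for i, new in enumerate(new_replicas)
    }


# Node names are invented for illustration.
plan = assign_sources(["dc2-a", "dc2-b", "dc2-c"], ["dc1-a", "dc1-b", "dc1-c"])
print(plan)
```

With three replicas on each side, the plan uses all three source replicas once, in contrast to a standard rebuild where all new replicas might pick the same source.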

Verifying that this tool works with Cassandra 4.0 was very easy, as we talk to a node only via CQL using the standard Cassandra Java driver.

Cassandra TTL Remover

Instaclustr’s Cassandra TTL Remover is a tool that enables you to remove data annotated as temporary with a Time-to-Live (TTL) from backups. Many data layer technologies allow ephemeral data to be annotated with a TTL so that it will expire after a given time period. However, if you later decide that you don’t want the data to expire, you need a way to remove the TTL from the record (unless you want to rewrite all the data). That’s where this tool comes in. We used it to create a static test data set for analytics development, and also for recovery in an environment where a TTL was accidentally set and not found until data started being dropped.
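
Conceptually, removing a TTL means rewriting each cell with its value and write timestamp intact but without its expiry metadata. The sketch below models that on an in-memory record. The `Cell` type and its field names are assumptions for illustration and do not reflect Cassandra's internal classes or how the tool processes SSTables.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Cell:
    """Hypothetical, simplified model of a stored cell."""
    column: str
    value: bytes
    write_timestamp: int
    ttl_seconds: Optional[int] = None  # None means the cell never expires
    expires_at: Optional[int] = None   # absolute expiry time, if any


def remove_ttl(cell: Cell) -> Cell:
    """Return a copy of the cell with its expiry metadata cleared."""
    return replace(cell, ttl_seconds=None, expires_at=None)


def remove_ttls(cells: List[Cell]) -> List[Cell]:
    """Clear expiry metadata on every cell, leaving values untouched."""
    return [remove_ttl(c) for c in cells]


cells = [
    Cell("name", b"alice", write_timestamp=1000, ttl_seconds=3600, expires_at=4600),
    Cell("email", b"a@example.com", write_timestamp=1000),
]
for c in remove_ttls(cells):
    print(c.column, c.ttl_seconds, c.expires_at)
```

The value and write timestamp are preserved, so the data reads the same as before; only the expiry is gone.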

Upgrading this tool from Cassandra 3 to Cassandra 4.0 was very easy. The code differs mostly at the import and package level; functionally, the logic is very similar.

Cassandra SSTable Generator: Tool for Programmatic SSTable Generation

The Cassandra SSTable Generator is a command line tool for Apache Cassandra that allows you to generate SSTables programmatically so they can be bulk-loaded into Cassandra keyspaces. You can, for example, implement a custom generator that produces test data for Apache Cassandra. As with the other tools, updating this one to work with Cassandra 4.0 was seamless, and an end user will not notice any difference in how the tool behaves.

It’s a great pleasure to be able to work on these open source tools for the whole community of Apache Cassandra users. To get started with Cassandra 4.0, start your free trial of our Managed Cassandra offering today.