Cassandra Collections: Hidden Tombstones and How to Avoid Them


Multi-value data types (sets, lists and maps) are a powerful feature of Cassandra, aiding you in denormalisation while allowing you to still retrieve and set data at a very fine-grained level. However, some of Cassandra’s behaviour when handling these data types is not always as expected and can cause issues.

In particular, there can be hidden surprises when you update the value of a collection-type column. For simple-type columns, Cassandra performs an update by simply writing a new value for the cell, and the most recently written value wins when the data is read. However, when you overwrite a collection, Cassandra can’t simply write the new elements: every existing element of the collection (a map, say) has its own individual cell, and those cells would still be returned alongside the new elements whenever the collection is read.

The options

This leaves Cassandra with two options:

  1. Perform a read and discover all the existing map elements and either delete them or update them if they were specified in the overwrite.
  2. Forget about all existing elements in the map by deleting them.

Option 1 doesn’t sound very optimised, does it? A read for every write you perform? Ouch.
Cassandra chooses option 2 because it just can’t resist those performance gains. It knows you’re performing an overwrite, and that you obviously don’t care about the contents of those columns, so it will delete them for you, and we can all pretend they never existed in the first place.
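
Option 2 can be pictured in CQL terms. The sketch below assumes a hypothetical table demo.profile in an already existing keyspace demo (neither is part of the walkthrough that follows). The last two statements are only a rough equivalent of what a single overwrite does, not a description of Cassandra's exact write path.

```cql
-- Hypothetical table for illustration; assumes keyspace "demo" already exists.
CREATE TABLE IF NOT EXISTS demo.profile (
  id     text PRIMARY KEY,
  emails map<text, text>   -- non-frozen: each map entry is stored as its own cell
);

-- A full overwrite of the map...
UPDATE demo.profile SET emails = {'work': 'a@example.com'} WHERE id = 'x';

-- ...behaves roughly like these two steps, with no read involved:
DELETE emails FROM demo.profile WHERE id = 'x';      -- tombstone covering every existing entry
UPDATE demo.profile
   SET emails = emails + {'work': 'a@example.com'}   -- then write only the new entries
 WHERE id = 'x';
```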

Or so we thought… until one day your queries start failing because you’ve hit 100k tombstones (the default tombstone_failure_threshold in cassandra.yaml is 100,000 per query). Didn’t expect that, especially when you never delete any data.
In most cases, compactions will just handle this problem for you and the tombstones will be gone before you even get close to the query failure limit. However, compaction strategies aren’t perfect and depending on how much you overwrite, plus how well compactions remove those tombstones, there are many cases where this behaviour can become a huge issue. If you are performing many writes, and all of them are overwrites where a collection type is involved, you will be generating a tombstone for every single write.

Examples for avoiding the issue

I’ve created a very basic schema with a map and a few fields, as below:
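
A schema along these lines would fit that description. The tombs keyspace, the staff table, and the locations map are named in the text; the replication settings, the remaining column names, and the map's key and value types are assumptions made for illustration.

```cql
-- "tombs", "staff" and "locations" come from the article;
-- everything else here is an assumption.
CREATE KEYSPACE IF NOT EXISTS tombs
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS tombs.staff (
  id        text PRIMARY KEY,     -- hypothetical partition key column
  name      text,                 -- hypothetical
  role      text,                 -- hypothetical
  locations map<text, text>       -- non-frozen map: one cell per element
);
```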

I then inserted a single row and performed a flush:
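
As a sketch, assuming a table tombs.staff with a text primary key column id, text columns name and role, and locations map<text, text> (all column names except locations are hypothetical, as are the values), the write could look like this. The key a is the one that shows up in the SSTable afterwards.

```cql
-- Assumed schema: tombs.staff (id text PRIMARY KEY, name text, role text,
--                              locations map<text, text>)
-- All values are illustrative.
INSERT INTO tombs.staff (id, name, role, locations)
VALUES ('a', 'Alice', 'engineer', {'home': 'Canberra', 'office': 'Sydney'});

-- Flushing is done from a shell, not from cqlsh:
--   nodetool flush tombs staff
```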

And I now have an SSTable in my tombs.staff data directory.

Analysing the data with sstable2json shows, as expected, a single key, a. However, that key has two locations entries, even though we only did one write.

This is to do with the map, and the whole overwrite thing I was talking about earlier. Already we can see that C* has written a range tombstone for the locations cell immediately before writing the value that I inserted.

Now, this is kind of a spoiler, as we haven’t actually done any “overwrites” yet, but we’ve identified the behaviour we’re talking about. It shows up this early because in Cassandra overwrites, updates, and inserts are really all just the same thing: the insert against the map does the same thing whether the key already exists or not.

Anyway, we can see how this delete-first strategy begins to work if we simply insert another record with the same key:
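
A second write to the same key might look like the sketch below, under the same assumed schema (tombs.staff with id, name, role, and locations map<text, text>; only locations is named in the text). The values are illustrative.

```cql
-- Same partition key 'a' as the first write; the map literal replaces the whole
-- collection, so another range tombstone is written ahead of the new elements.
INSERT INTO tombs.staff (id, name, role, locations)
VALUES ('a', 'Alice', 'engineer', {'home': 'Melbourne', 'office': 'Brisbane'});

-- Flush again from a shell so the write lands in a second SSTable:
--   nodetool flush tombs staff
```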

We now have two SSTables: tombs-staff-ka-1-Data.db and tombs-staff-ka-2-Data.db. If we run sstable2json on the new SSTable, we see a very similar entry:

Nothing surprising. Furthermore, if we trigger a major compaction against our two SSTables (for example with nodetool compact tombs staff):

And run sstable2json against our new SSTable…

We have the latest range tombstone plus the latest insert, and the compaction has, as expected, got rid of the previous insert, since everything older than the latest range tombstone is moot.

Now you can start to see where issues can arise when overwriting a key with a collection type. If it weren’t for the compaction, I’d have two tombstones for that single row across two SSTables. It’s very likely those SSTables will compact and the tombstones will get cleared out. However, things are not always so clear-cut, especially when you frequently overwrite keys and the tombstones get spread across many SSTables of differing sizes, causing tombstone bloat that may not be removed when left up to minor compactions.

So how can we avoid this potential catastrophe? A simple solution would be to store JSON instead and leave the updates to your application. However, there is an alternative: you can use the provided append and subtraction operators. These operators modify the collection without performing a read (removal from a list is an exception, covered further down), and they don’t create any range tombstones. This works for specific use cases where you simply need to insert, append, or prepend. If you frequently find yourself having to rewrite a whole collection, you will need to take a different approach. You can also declare a collection as frozen, which stores it as a single value and so gives the desired overwrite behaviour, but you will no longer be able to add and remove elements using the +, -, and [] operators.
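
For the map in this walkthrough, element-level updates and the frozen alternative could be sketched as follows. The id column, the tombs.staff_frozen table, and all values are assumptions made for illustration.

```cql
-- Assumes tombs.staff has a text primary key column "id" (hypothetical name)
-- and the non-frozen column locations map<text, text>.

-- Add or replace entries without touching the rest of the map (no range tombstone):
UPDATE tombs.staff SET locations = locations + {'office': 'Melbourne'} WHERE id = 'a';
UPDATE tombs.staff SET locations['home'] = 'Canberra' WHERE id = 'a';

-- Remove a single entry; this writes a tombstone for that one cell only:
DELETE locations['office'] FROM tombs.staff WHERE id = 'a';

-- Frozen alternative (hypothetical table): the whole map is stored as a single
-- value, so an overwrite is a plain write, but +, - and [] updates are rejected.
CREATE TABLE IF NOT EXISTS tombs.staff_frozen (
  id        text PRIMARY KEY,
  locations frozen<map<text, text>>
);
UPDATE tombs.staff_frozen SET locations = {'office': 'Melbourne'} WHERE id = 'a';
```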

Here is an example of performing collection operations on a list.
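
A minimal sketch, assuming a hypothetical table tombs.staff_skills with a list column (the table, its columns, and the values are not from the original schema and are used only for illustration):

```cql
-- Hypothetical table; assumes keyspace "tombs" already exists.
CREATE TABLE IF NOT EXISTS tombs.staff_skills (
  id     text PRIMARY KEY,
  skills list<text>
);

UPDATE tombs.staff_skills SET skills = skills + ['cql']  WHERE id = 'a';  -- append
UPDATE tombs.staff_skills SET skills = ['java'] + skills WHERE id = 'a';  -- prepend
UPDATE tombs.staff_skills SET skills[0] = 'scala'        WHERE id = 'a';  -- set by index: reads the list first
UPDATE tombs.staff_skills SET skills = skills - ['cql']  WHERE id = 'a';  -- remove by value: reads the list first
```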

Be careful when using addition and subtraction on list types, as removing elements from a list can be an expensive operation: Cassandra has to read in the entire list in order to remove a single entry. This is not true for sets. Removing a single entry from a set requires no reads, as Cassandra simply writes a tombstone for the matching cell.

See the below trace for deletion from a list, where we can clearly see C* performing a read query before making the modifications.
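
A trace like this can be captured in cqlsh by switching tracing on before the removal. The sketch below assumes a hypothetical table tombs.staff_skills with a list<text> column named skills. TRACING ON and TRACING OFF are cqlsh commands rather than CQL statements, and the trace output will vary by cluster and version.

```cql
-- Hypothetical table; assumes keyspace "tombs" already exists.
CREATE TABLE IF NOT EXISTS tombs.staff_skills (
  id     text PRIMARY KEY,
  skills list<text>
);

TRACING ON;   -- cqlsh only: print a request trace after each statement
UPDATE tombs.staff_skills SET skills = skills - ['cql'] WHERE id = 'a';
TRACING OFF;
```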

The following statements for the SET type result in similar functionality. Note that sets have no notion of appending or prepending; elements are simply added and removed.
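
As a sketch, assuming a hypothetical table tombs.staff_teams with a set column (names and values are illustrative only):

```cql
-- Hypothetical table; assumes keyspace "tombs" already exists.
CREATE TABLE IF NOT EXISTS tombs.staff_teams (
  id    text PRIMARY KEY,
  teams set<text>
);

UPDATE tombs.staff_teams SET teams = teams + {'platform'} WHERE id = 'a';  -- add an element
UPDATE tombs.staff_teams SET teams = teams - {'platform'} WHERE id = 'a';  -- remove: tombstone for that cell, no read
```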