Cassandra Collections: Hidden Tombstones and How to Avoid Them

Overview

Multi-value data types (sets, lists and maps) are a powerful feature of Cassandra, aiding you in denormalisation while allowing you to still retrieve and set data at a very fine-grained level. However, some of Cassandra’s behaviour when handling these data types is not always as expected and can cause issues.

In particular, there can be hidden surprises when you update the value of a collection type column. For simple-type columns, Cassandra performs an update by simply writing a new value for the cell, and the most recently written value wins when the data is read. However, when you overwrite a collection, Cassandra can’t simply write the new elements: every existing element in the map has its own individual cell, and those cells would still be returned alongside the new elements whenever a read is performed on the map.

The options

This leaves Cassandra with two options:

1. Perform a read to discover all the existing map elements, then either delete them or update them if they were specified in the overwrite.
2. Forget about all existing elements in the map by deleting them.

Option 1 doesn’t sound very optimised, does it? A read for every write you perform? Ouch.
Cassandra chooses option 2 because it just can’t resist those performance gains. It knows you’re performing an overwrite, and that you obviously don’t care about the contents of those columns, so it will delete them for you, and we can all pretend they never existed in the first place.

Or so we thought… until one day your queries start failing because you’ve hit 100k tombstones. Didn’t expect that, especially when you never delete any data.
In most cases, compactions will just handle this problem for you and the tombstones will be gone before you even get close to the query failure limit. However, compaction strategies aren’t perfect and depending on how much you overwrite, plus how well compactions remove those tombstones, there are many cases where this behaviour can become a huge issue. If you are performing many writes, and all of them are overwrites where a collection type is involved, you will be generating a tombstone for every single write.
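
To make the reasoning above concrete, here is a minimal Python sketch of the idea. It is a toy model, not Cassandra's storage code: the names read_map and overwrite and the small integer timestamps are made up for illustration. It shows why an overwrite that only wrote the new cells would leak old elements into reads, and how a range tombstone written just before the new cells avoids that without a read.

```python
# Toy model of one map column: each element is its own cell, keyed by map key.
# Every write carries a timestamp and the newest timestamp per key wins.

def read_map(writes):
    """Merge a list of writes into the map a reader would see.

    Each write is (kind, payload, timestamp):
      ("cell", (key, value), ts)     -> one map element
      ("range_tombstone", None, ts)  -> shadows every element written at or before ts
    """
    deleted_before = max(
        (ts for kind, _, ts in writes if kind == "range_tombstone"),
        default=-1,
    )
    newest = {}
    for kind, payload, ts in writes:
        if kind != "cell" or ts <= deleted_before:
            continue  # not an element, or shadowed by the range tombstone
        key, value = payload
        if key not in newest or ts > newest[key][1]:
            newest[key] = (value, ts)
    return {key: value for key, (value, _) in newest.items()}


def overwrite(new_map, ts, with_tombstone):
    """Return the writes produced by replacing the whole map at time ts."""
    writes = []
    if with_tombstone:
        # Written just before the new cells, so it only shadows older data.
        writes.append(("range_tombstone", None, ts - 1))
    writes += [("cell", (k, v), ts) for k, v in new_map.items()]
    return writes


first = overwrite({"bldg1": "4a"}, ts=100, with_tombstone=True)

# Naive overwrite: the old element leaks into the result.
naive = read_map(first + overwrite({"bldg4": "4c"}, ts=200, with_tombstone=False))
# Delete-first overwrite: only the new element survives.
clean = read_map(first + overwrite({"bldg4": "4c"}, ts=200, with_tombstone=True))

print(naive)  # {'bldg1': '4a', 'bldg4': '4c'}
print(clean)  # {'bldg4': '4c'}
```

The price of the second variant is exactly the point of this article: every overwrite adds one more tombstone that has to be carried around until compaction removes it.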

Examples for avoiding the issue

I’ve created a very basic schema with a map and a few fields, as below:

CREATE KEYSPACE tombs WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;

CREATE TABLE tombs.staff (
    id text PRIMARY KEY,
    name text,
    age int,
    locations map<text, text>
);

I then inserted a single row and performed a flush:

insert into staff (id, locations) values ('a', {'bldg1':'4a'});
$ nodetool flush

And I now have an SSTable in my tombs.staff data directory.

/var/lib/cassandra/data/tombs/staff/ $ ls -l
-rw-r--r-- 1 root root 43 Jun 20 11:53 tombs-staff-ka-1-CompressionInfo.db
-rw-r--r-- 1 root root 82 Jun 20 11:53 tombs-staff-ka-1-Data.db
-rw-r--r-- 1 root root 10 Jun 20 11:53 tombs-staff-ka-1-Digest.sha1
-rw-r--r-- 1 root root 16 Jun 20 11:53 tombs-staff-ka-1-Filter.db
-rw-r--r-- 1 root root 15 Jun 20 11:53 tombs-staff-ka-1-Index.db
-rw-r--r-- 1 root root 4449 Jun 20 11:53 tombs-staff-ka-1-Statistics.db
-rw-r--r-- 1 root root 83 Jun 20 11:53 tombs-staff-ka-1-Summary.db
-rw-r--r-- 1 root root 91 Jun 20 11:53 tombs-staff-ka-1-TOC.txt

Analysing the data with sstable2json shows, as expected, a single key, a. However, it has two locations entries, despite the fact that we only did one write.

This is to do with the map, and the whole overwrite thing I was talking about earlier. Already we can see that C* has written a range tombstone for the locations cell immediately before writing the value that I inserted.

$ sstable2json tombs-staff-ka-1-Data.db
[
{"key": "a",
 "cells": [["","",1466423546119508],
           # Below is a range tombstone against locations.
           # Note the timestamp occurs just before the next entry and the "t" - for tombstone.
           ["locations:_","locations:!",1466423546119507,"t",1466423546],
           # and here we have the map entry, hex-encoded, where 626c646731="bldg1" and 3461="4a".
           ["locations:626c646731","3461",1466423546119508]]}
]
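
If you want to check the encoded values in a dump like this yourself, a few lines of Python are enough. This is only a convenience sketch: the helper names decode_hex, decode_cell_name and decode_timestamp are mine, and it assumes the cell names and values are hex-encoded UTF-8 text and that write timestamps are microseconds since the Unix epoch, which matches the output above.

```python
from datetime import datetime, timezone


def decode_hex(text):
    """Turn a hex string from the dump back into readable text."""
    return bytes.fromhex(text).decode("utf-8")


def decode_cell_name(cell_name):
    """Split 'column:hexkey' into the column name and the decoded map key."""
    column, hex_key = cell_name.split(":", 1)
    return column, decode_hex(hex_key)


def decode_timestamp(micros):
    """Convert a microsecond write timestamp to a UTC datetime (whole seconds)."""
    return datetime.fromtimestamp(micros // 1_000_000, tz=timezone.utc)


print(decode_cell_name("locations:626c646731"))        # ('locations', 'bldg1')
print(decode_hex("3461"))                              # 4a
print(decode_timestamp(1466423546119508).isoformat())  # 2016-06-20T11:52:26+00:00
```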

Now, this is kind of a spoiler, as we haven’t actually done any “overwrites” yet, but we’ve identified the feature we’re talking about. This is because in Cassandra, overwrites, updates, and inserts, are really all just the same thing. The insert against the map will do the same thing whether the key already exists or not.

Anyway, we can see how this delete-first strategy begins to work if we simply insert another record with the same key:

insert into staff (id, locations) values ('a', {'bldg4':'4c'});
$ nodetool flush

We now have 2 sstables: tombs-staff-ka-1-Data.db and tombs-staff-ka-2-Data.db. And if we run sstable2json on the new SSTable, we see a very similar entry:

$ sstable2json tombs-staff-ka-2-Data.db
[
{"key": "a",
 "cells": [["","",1466424601968750],
           ["locations:_","locations:!",1466424601968749,"t",1466424601],
           ["locations:626c646731","3464",1466424601968750]]}
]

Nothing surprising, and furthermore, if we trigger a major compaction against our 2 SSTables:

$ nodetool compact tombs staff

And run sstable2json against our new SSTable…

$ sstable2json tombs-staff-ka-3-Data.db
[
{"key": "a",
 "cells": [["","",1466424601968750],
           ["locations:_","locations:!",1466424601968749,"t",1466424601],
           ["locations:626c646731","3464",1466424601968750]]}
]

We have the latest range tombstone plus the latest insert. As expected, compaction has gotten rid of the previous insert, as it knows that everything older than the latest range tombstone is moot.
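
The merge that compaction performed here can be sketched roughly as follows. This is a simplified model under my own assumptions, not Cassandra's compaction code: the compact function and the tuple layout are hypothetical, it only handles a single row and column, and it ignores gc_grace_seconds entirely. The timestamps are the ones from the two dumps above, and the map entries are the ones from the two insert statements.

```python
def compact(sstables):
    """Merge the entries of several SSTables for one row.

    Each entry is (kind, payload, timestamp). The result keeps the newest
    range tombstone and only the cells written after it.
    """
    entries = [entry for table in sstables for entry in table]
    tombstones = [e for e in entries if e[0] == "range_tombstone"]
    newest_tombstone = max(tombstones, key=lambda e: e[2], default=None)
    cutoff = newest_tombstone[2] if newest_tombstone else -1

    live = {}
    for kind, payload, ts in entries:
        if kind == "cell" and ts > cutoff:
            key, value = payload
            if key not in live or ts > live[key][2]:
                live[key] = ("cell", (key, value), ts)

    merged = [newest_tombstone] if newest_tombstone else []
    return merged + sorted(live.values(), key=lambda e: e[1][0])


sstable_1 = [
    ("range_tombstone", None, 1466423546119507),
    ("cell", ("bldg1", "4a"), 1466423546119508),
]
sstable_2 = [
    ("range_tombstone", None, 1466424601968749),
    ("cell", ("bldg4", "4c"), 1466424601968750),
]

for entry in compact([sstable_1, sstable_2]):
    print(entry)
# ('range_tombstone', None, 1466424601968749)
# ('cell', ('bldg4', '4c'), 1466424601968750)
```

Note that the newest tombstone itself survives the merge in this model, just as it does in the compacted SSTable above.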

Now you can start to see where issues can arise when overwriting a key with a collection type. If it weren’t for the compaction, I’d have 2 tombstones for that single row across 2 SSTables. Obviously, it’s very likely those SSTables will compact and the tombstones will get cleared out, however things are not always as clear cut, especially when you are frequently overwriting keys and the tombstones get spread across many SSTables of differing sizes, causing tombstone bloat that may not be removed when left up to minor compactions.

So how can we avoid this potential catastrophe? A simple solution would be to store JSON instead and leave the updates to your application. However, there is an alternative: you can use the provided addition and subtraction operators (+ and -). These operators modify the collection without having to perform a read, and they won’t create any range tombstones. This works for specific use cases where you simply need to insert, append or prepend. If you frequently find yourself having to rewrite a whole collection, you will need to take a different approach. You can also specify a collection as frozen, which gives the desired overwrite behaviour, but you will no longer be able to add and remove elements using the +, - and [] operators.
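
As a rough comparison of the three approaches, the following Python sketch lists what each kind of write would put on disk. It is a toy model with hypothetical function names (overwrite, add_elements, overwrite_frozen), not output from Cassandra, and it assumes a frozen collection is serialised into a single cell.

```python
# A mutation is modelled as a list of (kind, cell_name, value) tuples.

def overwrite(column, new_map):
    """col = {...} on a non-frozen map: range tombstone, then one cell per element."""
    mutation = [("range_tombstone", column, None)]
    mutation += [("cell", f"{column}:{key}", value) for key, value in new_map.items()]
    return mutation


def add_elements(column, added):
    """col = col + {...}: one cell per added element, nothing is deleted."""
    return [("cell", f"{column}:{key}", value) for key, value in added.items()]


def overwrite_frozen(column, new_map):
    """col = {...} on a frozen map: the whole map goes into a single cell."""
    return [("cell", column, repr(sorted(new_map.items())))]


def count_tombstones(mutation):
    return sum(1 for kind, _, _ in mutation if kind == "range_tombstone")


new_value = {"bldg1": "4a", "bldg4": "4c"}
for name, mutation in [
    ("overwrite", overwrite("locations", new_value)),
    ("add", add_elements("locations", new_value)),
    ("frozen overwrite", overwrite_frozen("locations", new_value)),
]:
    print(f"{name}: {len(mutation)} entries, {count_tombstones(mutation)} range tombstones")
# overwrite: 3 entries, 1 range tombstones
# add: 2 entries, 0 range tombstones
# frozen overwrite: 1 entries, 0 range tombstones
```

The frozen variant behaves like a simple-type column: the newest cell wins on read, which is why no tombstone is needed, and also why individual elements can no longer be modified.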

Here is an example of performing collection operations on a list.

ALTER TABLE staff ADD leave_dates list<text>;

# Creates a tombstone and an entry in the list
insert into staff (id, leave_dates) values ('c', ['20160620']);
$ nodetool flush
$ sstable2json tombs-staff-ka-6-Data.db
[
{"key": "c",
 "cells": [["","",1466427765961455],
           ["leave_dates:_","leave_dates:!",1466427765961454,"t",1466427765],
           ["leave_dates:484b79b036e711e681757906eb0f5a6e","3230313630363230",1466427765961455]]}
]

# Prepends an element to the list without creating any additional tombstones
UPDATE staff SET leave_dates = [ '20160621' ] + leave_dates where id='c';
$ nodetool flush

# The new SSTable has only a single entry in the list, no extra tombstone.
# This works the same for appending to the list as well.
$ sstable2json tombs-staff-ka-7-Data.db
[
{"key": "c",
 "cells": [["leave_dates:af13b22fb5e911d781757906eb0f5a6e","3230313630363231",1466427869996855]]}
]

Be careful when using addition and subtraction on list types, as removing elements from a list can be an expensive operation: Cassandra will have to read in the entire list in order to remove a single entry. Note that this is not true for sets. Removing a single entry from a set requires no reads, as Cassandra will simply write a tombstone for the matching cell.
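
The difference comes down to how the cells are named. A minimal sketch, assuming set cells are named after the element itself while list cells are named after a generated identifier (a simple counter stands in for the timeuuid seen in the dumps above). The ToyRow class and the skills column are hypothetical and exist only for this illustration.

```python
import itertools

TOMBSTONE = object()  # marker for a deleted cell


class ToyRow:
    """One row whose cells are stored by name; counts the reads it needs."""

    def __init__(self):
        self.cells = {}
        self.reads = 0
        self._ids = itertools.count()  # stand-in for list cell timeuuids

    def add_to_set(self, column, element):
        # The cell name is derived from the element itself.
        self.cells[(column, element)] = ""

    def remove_from_set(self, column, element):
        # The name is known up front, so a tombstone can be written blindly.
        self.cells[(column, element)] = TOMBSTONE

    def append_to_list(self, column, element):
        # The cell name is a generated id that the client never sees.
        self.cells[(column, next(self._ids))] = element

    def remove_list_index(self, column, index):
        # The list must be read first to find which cell sits at this index.
        self.reads += 1
        names = sorted(
            name for name, value in self.cells.items()
            if name[0] == column and value is not TOMBSTONE
        )
        self.cells[names[index]] = TOMBSTONE


row = ToyRow()
row.add_to_set("skills", "cql")
row.append_to_list("leave_dates", "20160620")
row.append_to_list("leave_dates", "20160621")

row.remove_from_set("skills", "cql")
reads_after_set_removal = row.reads
row.remove_list_index("leave_dates", 1)
reads_after_list_removal = row.reads

print(reads_after_set_removal, reads_after_list_removal)  # 0 1
```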

See the below trace for deletion from a list, where we can clearly see C* performing a read query before making the modifications.

# Removing a single index
instaclustr@cqlsh:tombs> DELETE leave_dates[1] FROM staff where id='c';
# Trace:
activity                                               | timestamp               | source      | source_elapsed
Execute CQL3 query                                     | 2016-06-20 13:34:26.214 | 52.37.11.43 | 0
Parsing DELETE leave_dates[1] FROM staff where id='c'; | 2016-06-20 13:34:26.217 | 52.37.11.43 | 148
Preparing statement                                    | 2016-06-20 13:34:26.217 | 52.37.11.43 | 272
Executing single-partition query on staff              | 2016-06-20 13:34:26.217 | 52.37.11.43 | 794
Acquiring sstable references                           | 2016-06-20 13:34:26.218 | 52.37.11.43 | 829
Merging memtable tombstones                            | 2016-06-20 13:34:26.218 | 52.37.11.43 | 845
Partition index with 0 entries found for sstable 13    | 2016-06-20 13:34:26.219 | 52.37.11.43 | 905
Seeking to partition beginning in data file            | 2016-06-20 13:34:26.219 | 52.37.11.43 | 927
Partition index with 0 entries found for sstable 12    | 2016-06-20 13:34:26.219 | 52.37.11.43 | 1037
Seeking to partition beginning in data file            | 2016-06-20 13:34:26.219 | 52.37.11.43 | 1049
Skipped 0/2 non-slice-intersecting sstables            | 2016-06-20 13:34:26.220 | 52.37.11.43 | 3391
Merging data from memtables and 2 sstables             | 2016-06-20 13:34:26.220 | 52.37.11.43 | 3409
Read 3 live and 1 tombstone cells                      | 2016-06-20 13:34:26.220 | 52.37.11.43 | 3523
Determining replicas for mutation                      | 2016-06-20 13:34:26.221 | 52.37.11.43 | 3899
Appending to commitlog                                 | 2016-06-20 13:34:26.222 | 52.37.11.43 | 3950
Adding to staff memtable                               | 2016-06-20 13:34:26.222 | 52.37.11.43 | 3968
Request complete                                       | 2016-06-20 13:34:26.218 | 52.37.11.43 | 4877
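
For a rough sense of how much of this request went into the read, the source_elapsed column can be compared directly. The snippet below is just arithmetic on three rows copied from the trace above, assuming source_elapsed is reported in microseconds. A single trace on a tiny table is not a benchmark, so treat the figure as indicative only.

```python
# (activity, source_elapsed in microseconds), copied from the trace above.
TRACE = [
    ("Executing single-partition query on staff", 794),
    ("Read 3 live and 1 tombstone cells", 3523),
    ("Request complete", 4877),
]

elapsed = dict(TRACE)
read_us = (
    elapsed["Read 3 live and 1 tombstone cells"]
    - elapsed["Executing single-partition query on staff"]
)
share = read_us / elapsed["Request complete"]

print(f"read phase: {read_us} us, {share:.0%} of the request")
# read phase: 2729 us, 56% of the request
```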


The SET type supports similar operations. Note that appending and prepending do not exist for sets; elements are simply added and removed.
