Cassandra generates tombstones when you delete data. Under some circumstances, excess tombstones can cause long GC pauses, latency, read failures, or out of heap errors. This article provides advice for managing tombstones.
What is a tombstone?
In Cassandra, deleted data is not immediately purged from the disk. Instead, Cassandra writes a special value, known as a tombstone, to indicate that data has been deleted. Tombstones prevent deleted data from being returned during reads, and will eventually allow the data to be dropped via compaction.
Tombstones are writes – they go through the normal write path, take up space on disk, and make use of Cassandra’s consistency mechanisms. Tombstones can be propagated across the cluster via hints and repairs. If a cluster is managed properly, this ensures that data will remain deleted even if a node is down when the delete is issued.
Tombstones are generated by:
- DELETE statements
- Setting TTLs
- Inserting null values
- Inserting data into parts of a collection.
What is the normal lifecycle of tombstones?
Tombstones are written with a timestamp. Under ideal circumstances, tombstones (and their associated data) will be dropped during compactions after a certain amount of time has passed.
The following three criteria must be met for tombstones to be removed:
- The tombstones were created more than gc_grace_seconds ago.
- The table containing the tombstone is involved in a compaction.
- All sstables that could contain the relevant data are involved in the compaction.
Each table has a gc_grace_seconds setting. By default, this is set to 864000, which is equivalent to 10 days. The intention is to provide time for the cluster to achieve consistency via repairs (and hence, prevent the resurrection of deleted data).
Tombstones will only be dropped via a compaction if all sstables that could contain the relevant data are involved in the compaction. If a lot of time has elapsed between writing the original data and issuing the DELETE, this becomes less likely:
- Size-Tiered Compaction Strategy will compact sstables of similar size together. Data tends to move into larger sstables as it ages, so the tombstone (in a new, small sstable) is unlikely to be compacted with the data (in an old, large sstable).
- Leveled Compaction Strategy is split into many levels that are compacted separately. The tombstone will be written into level 0 and will effectively ‘chase’ the data through the levels – it should eventually catch up.
- Time-Window Compaction Strategy (or Date-Tiered Compaction Strategy) will never compact the tombstone with the data if they are written into different time windows.
When do tombstones cause problems?
When data is deleted, the space will not actually be freed for at least the gc_grace period set in the table settings. This can cause problems if a cluster is rapidly filling up.
Under some circumstances, the space will never be freed without manual intervention.
Serious performance problems can occur if reads encounter large numbers of tombstones.
Performance issues are most likely to happen with the following types of query:
- Queries that run over all partitions in a table (“select * from keyspace.table”)
- Range queries (“select * from keyspace.table WHERE value > x”, or “WHERE value IN (value1, value2, …)”
- Any query that can only be run with an “ALLOW FILTERING” clause.
These performance issues occur because of the behaviour of tombstones during reads. In a range query, your Cassandra driver will normally use paging, which allows nodes to return a limited number of responses at a time. When tombstones are involved, the node needs to keep the tombstones that it has encountered in memory and return them to the coordinator, in case one of the other replicas is unaware that the relevant data has been deleted. The tombstones cannot be paged because it is essential to return all of them, so latency and memory use increase proportionally to the number of tombstones encountered.
Whether the tombstones will be encountered depends on the way the data is stored and retrieved. For example, if Cassandra is used to store data in a queue (which is not recommended), queries may encounter tens of thousands of tombstones to return a few rows of data.
How can I diagnose tombstone-related problems?
Queries that encounter large numbers of tombstones will show up in the logs. By default, a read encountering more than a thousand tombstones will generate a warning:
WARN org.apache.cassandra.db.ReadCommand Read 0 live rows and 87051 tombstone cells for query SELECT * FROM example.table
By default, encountering more than 100,000 tombstones will cause the query to fail with a TombstoneOverwhelmingException.
To verify whether tombstone reads are causing performance problems, check whether the reads correlate with an increase in read latency and GC pause duration.
If it is clear that tombstones are the issues, the following techniques can help narrow down the scope of the problem:
- The number of tombstones returned in a particular query can be found by running the query in cqlsh with tracing enabled.
- Statistics for the number of tombstones encountered recently in each table are available in the output from
- For clusters in our managed service, statistics for recently encountered tombstones are available on the cluster page in Metrics Lists > Table Info. This includes live cells per read and average and max tombstones per read, broken down by node or table for a given time period.
- More detailed information on stored tombstones can be found using ic-tools.
How can I avoid tombstone issues?
The following options can help:
- Avoid queries that will run on all partitions in the table (eg queries with no WHERE clause, or any query that requires ALLOW FILTERING).
- Alter range queries to avoid querying deleted data, or operate on a narrower range of data. Performance problems only occur if the tombstones are read, and scale with the number of tombstones read.
- Design the data model to avoid deleting large amounts of data.
- If planning to delete all the data in a table, truncate or drop the table to remove all the data without generating tombstones.
- Use a default time-to-live value. This only works efficiently if the primary key of your data is time-based, your data is written in chronological order, and the data will be deleted at a known date. To do this, set a default TTL in the table-level options, and set a time-based compaction strategy (TimeWindowCompactionStrategy if available, DateTieredCompactionStrategy otherwise). This will still create tombstones, but whole sstables will be efficiently dropped once the TTL on all of their contents have passed.
How can I get rid of existing tombstones?
Under most circumstances, the best approach is to wait for the tombstone to compact away normally. If urgent performance or disk usage issues require more immediate action, there are two nodetool commands that can be used to force compactions, which can assist in dropping tombstones. These should be considered a last resort – in a healthy cluster with a well-designed data model, it is not necessary to run manual compactions.
nodetool compact forces a compaction of all sstables. This requires a large amount of free disk space. Keyspace and table arguments should be used to limit the compaction to the tables where tombstones are a problem. On tables where Size-Tiered Compaction Strategy is used, this command can lead to the creation of one enormous sstable that will never have peers to compact with; if the
–split-output flag is available, it should be used.
nodetool garbagecollect command is available from Cassandra 3.10 onwards. This command runs a series of smaller compactions that also check overlapping sstables. It is more CPU intensive and time-consuming than
nodetool compact, but requires less free disk space.
Tombstones will only be removed if gc_grace_seconds have elapsed since the tombstones were created. The intended purpose of gc_grace_seconds is to provide time for repairs to restore consistency to the cluster, so be careful when modifying it – prematurely removing tombstones can result in the resurrection of deleted data. Also, the gc_grace_seconds setting affects expiration of hints generated for hinted handoff, so it is dangerous to reduce gc_grace_seconds below the duration of the hinted handoff window (by default, 3 hours).
Incremental repair may complicate this process. When an incremental repair is run, the sstables it has affected are marked as repaired; in subsequent compactions, these tables will be compacted separately from sstables that have not been repaired. If tombstones are in unrepaired sstables and the shadowed data is in repaired sstables (or vice versa), the data cannot be dropped. It is possible to overcome this by marking all tables as unrepaired with sstablerepairedset.