TTLs (time to live) are a very useful feature of Cassandra. They essentially provide an automatic deletion timer on any piece of data inserted into Cassandra. A TTL can be specified as part of an insert or update statement, or set as a default in a table's schema, and applies down to the individual column/value level in the data. Once the TTL expires, it is as if the data had been deleted, with the added advantage that no tombstone record is required and the data can be completely purged through compaction relatively quickly.
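To illustrate, here is a small CQL sketch of both ways of setting a TTL (the keyspace, table, and column names are hypothetical, just for illustration):

```sql
-- Table-level default: every row expires 24 hours after it is written
CREATE TABLE metrics.raw_events (
    host text,
    bucket_time timestamp,
    metric text,
    value double,
    PRIMARY KEY ((host, bucket_time), metric)
) WITH default_time_to_live = 86400;

-- A per-statement TTL overrides the table default (here, 1 hour)
INSERT INTO metrics.raw_events (host, bucket_time, metric, value)
VALUES ('web-01', '2016-01-01 00:00:00', 'cpu_load', 0.75)
USING TTL 3600;

-- The TTL() function shows the remaining seconds before a value expires
SELECT TTL(value) FROM metrics.raw_events
WHERE host = 'web-01' AND bucket_time = '2016-01-01 00:00:00';
```

Note that the TTL is tracked per value, not per row, which is why the expiry survives in the SSTable data itself rather than in any separate metadata.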
However, in some circumstances TTLs can complicate matters if you wish to restore data, whether for disaster recovery or for testing purposes. For example, we recently wanted to do some tuning of the Apache Spark jobs used for our Instametrics cluster. To do this, we wanted a production-size data set that was relatively static. The problem was that our raw Instametrics data had quite a short TTL (24 hours for low-value data), so by the time we had copied a snapshot to a separate cluster and loaded it, the data had already expired!
We could probably have hacked around this problem by winding back the clock on the test cluster, but we figured there had to be a better solution, and it was also a good opportunity for some of our developers to get more hands-on experience with core Cassandra code and SSTable formats. The result is a tool called TTLRemover, which streams data from an existing SSTable to a new SSTable, stripping the TTL fields along the way. You end up with a set of SSTables without TTLs, so your data doesn't disappear before you have loaded it.
We figured we probably weren't the only people looking for a solution to this problem, so we've published the tool to our GitHub here: https://github.com/instaclustr/TTLRemover. The tool is still pretty basic and, to be honest, a bit of a hack, but if you're stuck trying to retrieve TTL'd data then it may well prove very useful to you.
This is a first, small step in our plans to increase our contributions to the open source community, both by publishing our internal tools where that makes sense and through direct contributions to the major open source projects that we use.