Data Modelling Recommended Practices

The following provides a summary of the database schema features that Instaclustr checks when reviewing customer Cassandra schemas.

We recommend that you consider these practices when designing you Cassandra data model and, depending on your level of Cassandra expertise, consider requesting a full schema review from our support team. Depending on your support level, schema reviews can be conducted free of charge or at our standard rate. Contact [email protected] to schedule a review.

KeySpace Review

Item	Rationale
Is the replication factor set to at least 3?	Replication factor of at least 3 is required for Instaclustr SLAs to apply and highly recommended for data protection and high availability.
Is the replication strategy set to network topology strategy?	Network topology strategy is highly recommended to ensure data is replicated to minimise impact of likely failures in underlying infrastructure environment (eg replicate across AWS availability zones) and to enable additional data centers to be added to the cluster without table rebuilds.
Is the data center name correct?	Data center name specified in replication strategy must match the configured Cassandra data center name (viewable in Instaclustr console).
Durable writes set to true (default)	Setting durable writes to false introduces a risk of lost writes in the event of failure (for a small improvement in write performance).

Table / Column Family Review

Item	Rationale
*Schema*
Is it properly denormalised? Does it require multiple queries to fetch information, or could the table just include info from the other table? Is there potential to consolidate data from multiple tables?	Developers from relational background may tend to normalised models resulting in inefficient use of Cassandra.
Partition key cardinality allows high number of partitions (minimum 100,000 possible preferred)	A low number of partitions will lead to inefficient read and writes and increase risk of unevenly sized partitions
Partition key prevents substantial skewing of partitions?	If it is possible for a small number of partitions to have vastly higher numbers of rows than average (say 100x) then this can cause significantly uneven performance and disk usage.
Using collections (maps,list,set)? Number of elements is 64k, keep the total size of the collect small (<1MB) as the map is not paged.	Very large collections can negatively impact read/write performance.
Is gc_grace_seconds changed from default (864000)? If so, is that appropriate and impact considered?	Lowering gc_crace_seconds results in space being reclaimed more quickly after deletes but runs small risk of “resurrected deletes” given we only run repairs weekly.
Is caching set to KEYS_ONLY or NONE?	Row caching for Cassandra 2.0 is often not effective. 2.1 row caching features may be effective if tuned correctly (see row-caching-in-cassandra-2-1 and Cassandra Docs)
Is chosen compaction strategy appropriate?	SizeTieredCompactionStrategy: default and suitable as a starting point for most uses cases with balance of reads and writes LevelledCompactionStrategy: does more compaction work to improve read performance. Generally used if high ratio of reads to writes. DateTieredCompactionStrategy: useful for data where data is “hot” when first written but sees less access over time. Check that the compaction strategy is appropriately tuned (see Cassandra Docs) defaults are usually ok, but DTCS requires specific compaction options set to be effective.
Are counters used?	Instaclustr only supports the use of counters with Cassandra 2.1 as Cassandra 2.0 counters are unreliable in many circumstances.

*Secondary Indexes*
Is cardinality of secondary index low?	Cardinality of index should be at least an order of magnitude lower and preferable at least 100x lower than indexed table. Also secondary indexes on boolean columns are not effective. See Cassandra Docs
Is the indexed column frequently updated/deleted?	Overhead of maintaining index will be incurred on each update/delete and may also result in excessive tombstones in the index table.

*Queries*
Are there logged batches used? If so, are they relatively small (<100)	Logged batches require coordinate node to control all operations and can result in very high load on coordinator node for large batches. Logged batches are only required for atomic operations across multiple rows/tables (not performance).
Are there unlogged batches? If so, are they small (<100) or on the same partition key?	Unlogged batches can improve performance but need to either be small or on a single partition key otherwise they can negatively impact performance. Note that unlogged batches do not provide atomic operations.
For large range queries, is the client paging through results?	Paging is necessary to read large results sets without memory constraints. Most drivers have inbuilt paging support but needs to be explicitly turned on in query code.
Does the query on the index lookup a row in a large partition?	Whole partition will be scanned to find matching rows – potentially expensive reads.