Apache Cassandra has become one of the most broadly adopted distributed databases for large-scale, highly available applications since its launch as an open source project in 2008. The 5.0 release in September 2024 represents the most substantial advancement to the project since 4.0 was released in July 2021. Multiple customers (and our own internal Cassandra use case) have now been happily running on Cassandra 5 for up to 12 months, so we thought the time was right to explore the key features they are leveraging to power their modern applications.
An overview of new features in Apache Cassandra 5.0
Apache Cassandra 5.0 introduces core capabilities aimed at AI-driven systems, low-latency analytical workloads, and environments that blend operational and analytical processing.
Highlights include:
- The new vector data type and an Approximate Nearest Neighbor (ANN) index based on Hierarchical Navigable Small World (HNSW), which is integrated into the Storage-Attached Index (SAI) architecture
- Trie-based memtables and the Big Trie-Index (BTI) SSTable format, delivering better memory efficiency and more consistent write performance
- The Unified Compaction Strategy, a tunable density-based approach that can align with leveled or tiered compaction patterns
Additional enhancements include expanded mathematical CQL functions, dynamic data masking, and experimental support for Java 17.
At NetApp, Apache Cassandra 5.0 is fully supported, and we are actively assisting customers as they transition from 4.x.
A deeper look at Cassandra 5.0’s key features
Storage-Attached Indexes (SAI)
Storage-Attached Indexes bring a modern, storage-integrated approach to secondary indexing in Apache Cassandra, resolving many of the scalability and maintenance challenges associated with earlier index implementations. Legacy Secondary Indexes (2i) and SASI remain available, but SAI offers a more robust and predictable indexing model for a broad range of production workloads.
SAI operates per SSTable, so index lookups are resolved locally alongside the data rather than through the cluster-wide coordination required by other strategies. This model supports diverse CQL data types, enables efficient numeric and text range filters, and provides more consistent performance characteristics than 2i or SASI. The same storage-attached foundation is also used for Cassandra 5’s vector indexing mechanism, allowing ANN search to operate within the same storage and query framework.
SAI supports combining filters across multiple indexed columns and works seamlessly with token-aware routing to reduce unnecessary coordinator work. Public evaluations and community testing have shown faster index builds, more predictable read paths, and improved disk utilization compared with previous index formats.
Operationally, SAI functions as part of the storage engine itself: indexes are defined using standard CQL statements and are maintained automatically during flush and compaction, with no cluster-wide rebuilds required. This provides more flexible query options and can simplify application designs that previously relied on manual denormalization or external indexing systems.
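As a rough sketch of how this looks in CQL, the statements below define two SAI indexes and then filter on both columns in a single query. The `orders` table, its columns, and the index names are hypothetical and exist only for illustration; the example also assumes a keyspace has already been selected with USE.

```cql
-- Hypothetical table used only for illustration
CREATE TABLE IF NOT EXISTS orders (
    id UUID PRIMARY KEY,
    status TEXT,
    quantity INT
);

-- One SAI index per column that will appear in WHERE clauses
CREATE INDEX IF NOT EXISTS orders_status_idx ON orders (status) USING 'sai';
CREATE INDEX IF NOT EXISTS orders_quantity_idx ON orders (quantity) USING 'sai';

-- Equality on one indexed column combined with a numeric range on another
SELECT id, status, quantity
FROM orders
WHERE status = 'shipped' AND quantity > 10;
```

Because both restricted columns are indexed, a query of this shape should not need ALLOW FILTERING.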
Native Vector Search capabilities
Apache Cassandra 5.0 introduces native support for high-dimensional vector embeddings through the new vector data type. Embeddings represent semantic information in numerical form, enabling similarity search to be performed directly within the database. The vector type is integrated with the database’s storage-attached index architecture, which uses HNSW graphs to efficiently support ANN search across cosine, Euclidean, and dot-product similarity metrics.
Because vector search is implemented at the storage layer, Cassandra can support applications involving semantic matching, content discovery, and retrieval-oriented workflows while maintaining the system’s established scalability and fault-tolerance characteristics.
After upgrading to 5.0, existing schemas can add vector columns and store embeddings through standard write operations. For example, to write an embedding to a vector column that has already been added to a table:
    UPDATE products
    SET embedding = [0.1, 0.2, 0.3, 0.4, 0.5]
    WHERE id = <id>;
To create a new table with a vector type column:
    CREATE TABLE items (
        product_id UUID PRIMARY KEY,
        embedding VECTOR<FLOAT, 768>  // 768 denotes dimensionality
    );
Because vector indexes are attached to SSTables, they participate automatically in the compaction and repair processes and do not require an external indexing system. ANN queries can be combined with regular CQL filters, allowing similarity searches and metadata conditions to be evaluated within a unified distributed query workflow. This brings vector retrieval into Apache Cassandra’s native consistency, replication, and storage model.
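A minimal sketch of an ANN query combined with a metadata filter might look like the following. It assumes a hypothetical `products` table whose `embedding` column is declared as VECTOR<FLOAT, 5>, plus a hypothetical `category` text column; the index names and the choice of cosine similarity are also assumptions made for the example.

```cql
-- Assumes a hypothetical products table with columns:
--   id UUID PRIMARY KEY, category TEXT, embedding VECTOR<FLOAT, 5>

-- SAI index on the vector column, with the similarity metric set explicitly
CREATE INDEX IF NOT EXISTS products_embedding_idx
    ON products (embedding) USING 'sai'
    WITH OPTIONS = {'similarity_function': 'COSINE'};

-- SAI index on the metadata column used as a filter
CREATE INDEX IF NOT EXISTS products_category_idx
    ON products (category) USING 'sai';

-- Return the three nearest neighbours among rows in one category
SELECT id, category
FROM products
WHERE category = 'footwear'
ORDER BY embedding ANN OF [0.1, 0.2, 0.3, 0.4, 0.5]
LIMIT 3;
```

The query vector must have the same dimensionality as the column, and ANN queries require a LIMIT.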
Unified Compaction Strategy (UCS)
Unified Compaction Strategy in Apache Cassandra 5 introduces a density-aware approach to organizing SSTables that blends the strengths of Leveled Compaction Strategy (LCS) and Size Tiered Compaction Strategy (STCS). UCS aims to provide the predictable read amplification associated with LCS and the write efficiency of STCS, without many of the workload-specific drawbacks that previously made compaction selection difficult. Choosing an unsuitable compaction strategy in earlier releases could lead to operational complexity and long-term performance issues, which UCS is designed to mitigate.
UCS exposes a set of tunable parameters like density thresholds and per-level scaling that let operators adjust compaction behavior toward read-heavy, write-heavy, or time-series patterns. This flexibility also helps smooth the transition from existing strategies, as UCS can adopt and improve the current SSTable layout without requiring a full rewrite in most cases. The introduction of compaction shards further increases parallelism and reduces the impact of large compactions on cluster performance.
LCS and STCS remain available, and STCS is still the default strategy in Apache Cassandra 5.0, although UCS is the default on newly deployed NetApp Instaclustr managed Apache Cassandra 5 clusters. UCS supports a broader range of workloads, reduces the operational burden of compaction tuning, and aligns well with other storage engine improvements in Apache Cassandra 5 such as trie-based SSTables and Storage-Attached Indexes.
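As an illustration of switching a table to UCS, the statement below applies the strategy to the `items` table from the earlier example. The option values are illustrative rather than recommendations, and suitable settings depend on the workload.

```cql
-- Switch an existing table to UCS; values are illustrative only
ALTER TABLE items
WITH compaction = {
    'class': 'UnifiedCompactionStrategy',
    'scaling_parameters': 'T4',       -- T-prefixed values behave like tiered compaction
    'target_sstable_size': '1GiB',
    'base_shard_count': '4'
};
```

In `scaling_parameters`, an L-prefixed value such as `L10` moves the behavior toward leveled compaction for read-heavy tables, while T-prefixed values favor write-heavy patterns.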
Trie Memtables and Trie-Indexed SSTables
Trie Memtables and Trie-indexed SSTables (Big Trie-Index, BTI) are significant storage engine enhancements released in Apache Cassandra 5. They are designed to reduce memory overhead, improve lookup performance, and increase flush efficiency. A trie data structure stores keys by shared prefixes instead of repeatedly storing full keys, which lowers object count and improves CPU cache locality compared with the legacy skip-list memtable structure. These benefits are particularly visible in high-ingestion, IoT, and time-series workloads.
Skip-list memtables store full keys for every entry, which can lead to large heap usage and increased garbage collection activity under heavy write loads. Trie Memtables substantially reduce this overhead by compacting key storage and avoiding pointer-heavy layouts. On disk, the BTI SSTable format replaces the older BIG index with a trie-based partition index that removes redundant key material and reduces the number of key comparisons needed during partition lookups.
Taking full advantage of the trie structures requires enabling both the trie-based memtable implementation and the BTI SSTable format. Existing BIG SSTables are converted to BTI through normal compaction or by rebuilding data. On NetApp Instaclustr’s managed Apache Cassandra clusters, Trie Memtables and BTI are enabled by default; when upgrading from an earlier major version to 5.0, however, existing data must first be converted from BIG to BTI before it can use the trie structures.
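On a self-managed cluster, a table can be pointed at a trie memtable through the `memtable` table option. The sketch below assumes the node's cassandra.yaml already defines a memtable configuration named `trie` that uses the trie memtable class; that name is an assumption, not a built-in. The SSTable format is selected at the node level rather than per table, so it is not shown here.

```cql
-- Assumes cassandra.yaml defines a memtable configuration named 'trie'
ALTER TABLE items WITH memtable = 'trie';
```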
Other new features
Mathematical CQL functions
Apache Cassandra 5.0 added a set of math functions that allow developers to perform computations directly within queries. This reduces data transfer overhead and cuts down on client-side post-processing. From fundamental functions like ABS() and ROUND() to exponential and logarithmic operations like EXP(), LOG(), and LOG10(), these functions apply across many domains, including financial data, scientific measurements, and spatial data.
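A short sketch of these functions in a query is shown below. The `sensor_readings` table and its columns are hypothetical and exist only to give the functions something to operate on.

```cql
-- Hypothetical table used only for illustration
CREATE TABLE IF NOT EXISTS sensor_readings (
    sensor_id UUID PRIMARY KEY,
    delta DOUBLE,
    magnitude DOUBLE
);

-- Compute derived values server-side instead of in the client
SELECT sensor_id,
       abs(delta) AS abs_delta,
       round(magnitude) AS rounded_magnitude,
       log10(magnitude) AS log_magnitude
FROM sensor_readings;
```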
Dynamic Data Masking
Dynamic Data Masking (DDM) is a new feature that obscures sensitive column-level data, either at query time through masking functions or by permanently attaching a mask to a column so that its data is always returned obfuscated. Stored data values are not altered in this process. Administrators use role-based access control (RBAC) to determine which roles can see the unmasked data, and can also tune how much of the obscured data remains visible. This feature helps with adherence to data privacy regulations such as GDPR, HIPAA, and PCI DSS without needing external redaction systems.
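The sketch below shows both forms of masking along with a permission grant. The `patients` table, its columns, and the `auditor` role are hypothetical; the example also assumes that dynamic data masking and authentication/authorization are enabled on the cluster.

```cql
-- Hypothetical table; name and birth carry permanently attached masks
CREATE TABLE IF NOT EXISTS patients (
    id TIMEUUID PRIMARY KEY,
    name TEXT MASKED WITH mask_inner(1, null),
    birth DATE MASKED WITH mask_default(),
    phone TEXT
);

-- Query-time masking: expose only the last four characters of phone
SELECT id, mask_inner(phone, 0, 4) FROM patients;

-- Hypothetical role that is allowed to read the unmasked values
CREATE ROLE IF NOT EXISTS auditor;
GRANT SELECT ON TABLE patients TO auditor;
GRANT UNMASK ON TABLE patients TO auditor;
```

Roles without the UNMASK permission see the masked form of `name` and `birth`, while the stored values stay unchanged.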
Conclusion
Apache Cassandra 5.0 packs a punch with game-changing features that meet the needs of modern workloads and applications. Features like vector search and Storage-Attached Indexes stand out because they expand how data can be leveraged within the same database while maintaining speed, scale, and resilience.
When you deploy a managed cluster on NetApp Instaclustr’s Managed Platform, you get the benefits of all these amazing features without worrying about configuration and maintenance.
Ready to experience the power of Apache Cassandra 5.0 for yourself? Try it free for 30 days today!