Apache Cassandra® version 5.0 Beta 1.0 is now available in public preview on the Instaclustr Managed Platform!
Here at NetApp, we’re excited to see the new Apache Cassandra version 5.0 release and the extended functionality that has been introduced. We have been testing the newest features that come along with this version, and in particular, getting to understand the latest secondary indexing architecture with Storage Attached Indexing (SAI).
Cassandra has always offered a secondary index, which can extend your read options by querying the database with values that are not the primary key. However, the overhead of servicing these queries across a distributed system has always been a concern due to the time and resource costs involved.
Storage Attached Indexing addresses these challenges by reducing storage overhead and improving query performance through optimizations to index storage and access paths. So, you can see why we’re keen to work with Storage Attached Indexing and help our customers get the most out of what is already a pretty amazing big data database.
What is Secondary Indexing and why does it matter to my Cassandra cluster?
Cassandra is designed to manage writes and reads at a large scale, but read operations that query data on non-primary-key columns are resource-intensive and more complex than equivalent queries in a ‘traditional’ relational database. Indexing is a technique used to quickly access and retrieve entries in a database and is typically adopted by developers to optimize read performance.
With Cassandra’s secondary index, developers can query data by specific column values in addition to the primary key. In other words, secondary indexes solve the problem of querying columns that are not part of the primary key.
Essentially, developers choose to use a secondary index with Cassandra clusters to expand the flexibility of the queries they want to run across their data. Without a secondary index, Cassandra is very limited in what can be queried beyond the primary key. Instaclustr offers customers reliable and scalable data layer technologies to manage data at scale with the least hassle and lowest cost.
Improved data access and retrieval comes with additional considerations. Secondary indexes rely on local indexing, which means index data is co-located with the source data on the same nodes. When retrieving data using only an indexed column, Cassandra doesn’t have a way to determine which nodes hold the matching data. Consequently, it ends up querying all the nodes in the cluster.
This process can cause increased data transfer, high latency, and a potential increase in costs, especially if the cluster has many nodes. As with adopting any new feature, it’s recommended to test secondary indexes with your application to ensure cluster performance is acceptable. Read here to understand more about the cost and benefits of Cassandra Filtering and Partition Keys.
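To make the fan-out concrete, here is a minimal toy model in Python. It is not how Cassandra is implemented: the `Node` and `ToyCluster` classes, the byte-sum placement rule, and the sample rows are all assumptions made up for illustration. It only shows the shape of the problem: a read by partition key goes to one node, while a read by an indexed value has to ask every node.

```python
from collections import defaultdict


class Node:
    """A toy node that stores rows plus a local index over one column."""

    def __init__(self, name):
        self.name = name
        self.rows = {}                        # partition key -> row
        self.local_index = defaultdict(set)   # indexed value -> partition keys

    def write(self, key, row, indexed_column):
        self.rows[key] = row
        self.local_index[row[indexed_column]].add(key)

    def read_by_index(self, value):
        # Only rows stored on THIS node can be found through its local index.
        return [self.rows[k] for k in sorted(self.local_index.get(value, ()))]


class ToyCluster:
    def __init__(self, node_count, indexed_column):
        self.nodes = [Node(f"node{i}") for i in range(node_count)]
        self.indexed_column = indexed_column

    def _owner(self, key):
        # Stand-in for token-based placement: the partition key picks the node.
        return self.nodes[sum(key.encode()) % len(self.nodes)]

    def write(self, key, row):
        self._owner(key).write(key, row, self.indexed_column)

    def read_by_key(self, key):
        # The key tells us exactly which node to ask: one node contacted.
        return self._owner(key).rows.get(key), 1

    def read_by_indexed_value(self, value):
        # The indexed value says nothing about placement, so ask every node.
        results = []
        for node in self.nodes:
            results.extend(node.read_by_index(value))
        return results, len(self.nodes)


cluster = ToyCluster(node_count=6, indexed_column="city")
cluster.write("user1", {"id": "user1", "city": "Oslo"})
cluster.write("user2", {"id": "user2", "city": "Lima"})
cluster.write("user3", {"id": "user3", "city": "Oslo"})

row, contacted_for_key = cluster.read_by_key("user2")
matches, contacted_for_index = cluster.read_by_indexed_value("Oslo")
print(contacted_for_key, contacted_for_index, len(matches))
```

In this sketch the indexed read contacts all six nodes to return two rows, which is why the cost of index-only queries tends to grow with cluster size.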
What’s Different for Secondary Indexing in Cassandra 5.0?
Over the years, the Apache Cassandra community has been evolving functionality to get the most efficient and effective method for reading the vast quantities of data that the database can manage.
The Storage Attached Index (SAI) functionality in Cassandra 5.0 has evolved from the earlier Secondary Indexes (2i) and SSTable Attached Secondary Index (SASI). The SAI architecture has been designed to operate as a native database operation, reducing query latency and resource usage while maintaining the scalability, fault tolerance, and performance that Cassandra is known for.
Design improvements with SAI include:
- Vector search: enabled through the new storage-attached index (SAI) called “VectorMemtableIndex”.
- Lower disk usage: compared to SASI, SAI doesn’t create n-grams (contiguous subsequences of a longer sequence) for every term, saving a significant amount of disk space. Additionally, SAI eliminates the need to duplicate an entire table with a new partition key.
- Simplified queries: newly introduced query syntax eliminates the need to create a separate table for each query pattern. This means multiple indexes on one table, rather than multiple tables for multiple query patterns.
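As a rough sketch of the "multiple indexes on one table" idea, the Python snippet below builds the CQL statements you might send through a driver. The keyspace `store`, the table `orders`, its columns, and the index naming scheme are hypothetical. The `CREATE INDEX ... USING 'sai'` form follows the Cassandra 5.0 CQL syntax as we understand it, so check it against the documentation for your version before relying on it.

```python
# Hypothetical keyspace, table, and column names, used for illustration only.
KEYSPACE = "store"
TABLE = "orders"


def sai_index_statement(column: str) -> str:
    """Build a CQL statement that creates a storage-attached index on one column."""
    index_name = f"{TABLE}_{column}_sai_idx"   # naming scheme is an assumption
    return (
        f"CREATE INDEX IF NOT EXISTS {index_name} "
        f"ON {KEYSPACE}.{TABLE} ({column}) USING 'sai';"
    )


# One base table, several indexed columns: each statement adds an index
# instead of a denormalized copy of the table.
statements = [sai_index_statement(c) for c in ("customer_email", "status")]

# With both indexes in place, a single query can filter on both columns.
query = (
    f"SELECT * FROM {KEYSPACE}.{TABLE} "
    "WHERE customer_email = 'a@example.com' AND status = 'shipped';"
)

for statement in statements:
    print(statement)
print(query)
```

The snippet only produces strings. Executing them requires a running Cassandra 5.0 cluster and a client driver, which are left out here to keep the example self-contained.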
Storage Attached Indexing and Vector Search
Another exciting and significant enhancement delivered by Cassandra 5.0 is support for a Vector CQL data type. With this new data type, Cassandra can store and handle embedding vectors as arrays of floating-point numbers. When coupled with Storage Attached Indexing, the vector CQL data type introduced in Cassandra 5.0 enables Cassandra to support AI/ML workloads.
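If embedding vectors are new to you, the small Python sketch below shows what a similarity search over arrays of floats looks like. It is a brute-force illustration with made-up three-dimensional embeddings and hypothetical row ids. It compares the query against every stored vector, whereas Cassandra's vector search relies on an index to avoid that full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings keyed by row id (real embeddings have far more dimensions).
embeddings = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.9, 0.1, 0.0],
}


def nearest(query, k):
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        embeddings,
        key=lambda row_id: cosine_similarity(query, embeddings[row_id]),
        reverse=True,
    )
    return ranked[:k]


top_two = nearest([1.0, 0.0, 0.1], k=2)
print(top_two)
```

Here `doc_a` and `doc_c` rank highest because they point in nearly the same direction as the query vector, while `doc_b` is orthogonal to most of it.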
Cassandra natively manages scalability for large datasets and is the obvious choice to handle large volumes and accelerated growth of vector data as organizations explore and establish AI/ML capabilities.
We’re excited to work with our Instaclustr customers and support them as they get started on the AI/ML journey, making the right choices at the beginning of that journey to get the best value for their business.
Getting Started with Storage Attached Indexing
At Instaclustr, we make it easy to provision and manage the operations of your data infrastructure so you can focus on accelerating development of your business applications. Trial the new Storage Attached Index feature with Apache Cassandra today and set up a non-production cluster using the Instaclustr console. Click here for a guide on how to provision an Apache Cassandra 5.0 Beta 1.0 cluster.
If you have any issues or questions about provisioning your cluster, please contact Instaclustr Support at any time.