Recently I created a multidimensional Bloom filter index for Apache Cassandra®. My goal was simple: to create the index using off-the-shelf Bloom filter components in an Eclipse® environment and run a Cassandra index unit test within Eclipse. In reality, this was not as simple as it appeared, because minimal how-to documentation was available.
All development was performed on an Ubuntu® 20.04.3 LTS system. The following are the specific changes I made to the system to support the code development.
Cassandra will compile with Java 11, but it was difficult to get all the pieces in sync for the integration test. So I installed Java 8 and made it the default JVM on my system.
I am running Eclipse version 2021-09 (4.21.0). I installed Java 8 as a JRE in the Eclipse preferences (Window→Preferences→Java→Installed JREs) and made it the default JRE for Eclipse. Making it the default may not be required, but it is what I did.
Git is required to retrieve the Cassandra code. I have version 2.25.1 installed.
Ant is required to compile the Cassandra code and configure the Cassandra code for Eclipse usage. I have version 1.10.7 installed.
I targeted Cassandra v4, so I cloned the Apache Cassandra git repository and checked out the “cassandra-4.0” branch. I then ran
to set up Cassandra as an Eclipse project.
In Eclipse, import the Cassandra project as an existing Eclipse project. Locate the org.apache.cassandra.index.internal package under the src/java directory. The CassandraIndex class in that package will provide a basic framework and reference for your index.
Finally, create a project for your index.
Secondary indexes can use external index storage providers, like Apache Lucene®, or can use one or more internal SSTable structures to create the index. For internal SSTable structures, the CassandraIndex class provides a good reference. For external index storage, the Cassandra Lucene index provides a reasonable reference, though much of it is in Scala.
A Cassandra secondary index has a single class entry point. This class is specified in the CQL “CREATE INDEX” command. The index must extend
or one of its subclasses.
The Indexer is the interface that is called whenever rows are added to or deleted from the base table. The Index implementation should ensure that Indexer instances are only created for insert commands that contain the indexed field(s).
In the CassandraIndex class, the Indexer implementation is an inner class. I prefer to implement it as a standard class in the same package as the Index; this forces a separation of concerns and makes the interface much cleaner.
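The guard and the standalone-class layout described above can be sketched as follows. This is a minimal illustration in plain Java, not the Cassandra API: SketchIndex, SketchIndexer, the "filter" column name, and the choice to model a write as a set of column names are all hypothetical stand-ins.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the index entry point; NOT the Cassandra API.
class SketchIndex {
    private final String indexedColumn;

    SketchIndex(String indexedColumn) {
        this.indexedColumn = indexedColumn;
    }

    // Creates an indexer only when the write touches the indexed column.
    // Returns null otherwise, so unrelated writes never allocate an indexer.
    SketchIndexer indexerFor(Set<String> updatedColumns) {
        return updatedColumns.contains(indexedColumn)
                ? new SketchIndexer(indexedColumn)
                : null;
    }

    public static void main(String[] args) {
        SketchIndex index = new SketchIndex("filter");
        Set<String> withFilter = new HashSet<>(Arrays.asList("id", "filter"));
        Set<String> withoutFilter = new HashSet<>(Arrays.asList("id", "name"));
        System.out.println(index.indexerFor(withFilter) != null);    // prints true
        System.out.println(index.indexerFor(withoutFilter) != null); // prints false
    }
}

// Standalone (non-inner) indexer class living in the same package as the index.
class SketchIndexer {
    private final String column;

    SketchIndexer(String column) {
        this.column = column;
    }

    // Extracts the indexed value from an inserted row (modelled here as a map).
    Object valueFrom(Map<String, Object> row) {
        return row.get(column);
    }
}
```

Keeping the indexer as its own top-level class means it only sees what is passed to its constructor, which is the separation of concerns the inner-class layout does not enforce.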
The Searcher is the interface that is called whenever a search is performed. It is expected to return the keys of the rows in the base table that satisfy the search. As with the Indexer, the Index implementation should verify that the indexed parameter is present in the expression before instantiating a Searcher.
In the CassandraIndex class, the Searcher implementation is also an inner class. As before, I prefer to implement it as a standard class.
The CassandraIndex class does not use a separate serializer/deserializer (Serde) class to read/write the index table or external storage. Instead, all the index table manipulation functions sit directly in the CassandraIndex class. As with the Indexer and Searcher implementations, I prefer to create a Serde class to read/write the data from the underlying data store. This allows the Indexer and Searcher to prepare the requests for processing by converting from the Cassandra nomenclature to the underlying index nomenclature and then call the Serde class to read/write the underlying objects.
For example, in the Cassandra Bloom filter index implementation, each row in the base table has one column that needs to be processed. That processing splits the column into multiple pieces that have to be written to the SSTable underlying the index. The CassandraIndex implementation ensures that the column is in the inserted row or the expression. It then calls the Indexer or Searcher as appropriate. The Indexer and Searcher implementations extract the column data from the row or expression, break it into multiple pieces, and then call the Serde class to read, insert, delete, or update the index SSTable.
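The division of labor just described can be sketched in plain Java. Everything here is an assumption made for illustration: the class names (PieceSerde, PieceSplitter, PieceIndexer, PieceSearcher) are hypothetical, an in-memory map stands in for the index SSTable, and the split rule (one piece per non-zero byte, keyed by position) is invented and is not the scheme used by the actual Bloom filter index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Demo driver: index two rows, then search for rows matching an expression.
class PieceIndexDemo {
    public static void main(String[] args) {
        PieceSerde serde = new PieceSerde();
        PieceIndexer indexer = new PieceIndexer(serde);
        PieceSearcher searcher = new PieceSearcher(serde);
        indexer.insertRow("r1", new byte[] {1, 0, 2});
        indexer.insertRow("r2", new byte[] {1, 3, 0});
        // TreeSet only makes the printed order deterministic.
        System.out.println(new TreeSet<>(searcher.search(new byte[] {1, 0, 0}))); // prints [r1, r2]
        System.out.println(new TreeSet<>(searcher.search(new byte[] {1, 0, 2}))); // prints [r1]
    }
}

// Hypothetical Serde: the only class that touches the index storage.
class PieceSerde {
    private final Map<String, Set<String>> store = new HashMap<>();

    void insert(String piece, String rowKey) {
        store.computeIfAbsent(piece, k -> new HashSet<>()).add(rowKey);
    }

    void delete(String piece, String rowKey) {
        Set<String> keys = store.get(piece);
        if (keys != null) {
            keys.remove(rowKey);
        }
    }

    Set<String> read(String piece) {
        return store.getOrDefault(piece, new HashSet<>());
    }
}

// Shared translation step: break one column value into index pieces.
final class PieceSplitter {
    private PieceSplitter() {}

    // Invented rule: one "position:value" piece per non-zero byte.
    static List<String> split(byte[] columnValue) {
        List<String> pieces = new ArrayList<>();
        for (int i = 0; i < columnValue.length; i++) {
            if (columnValue[i] != 0) {
                pieces.add(i + ":" + (columnValue[i] & 0xFF));
            }
        }
        return pieces;
    }
}

// Indexer: converts a row's column into pieces and hands them to the Serde.
class PieceIndexer {
    private final PieceSerde serde;

    PieceIndexer(PieceSerde serde) {
        this.serde = serde;
    }

    void insertRow(String rowKey, byte[] columnValue) {
        for (String piece : PieceSplitter.split(columnValue)) {
            serde.insert(piece, rowKey);
        }
    }

    void removeRow(String rowKey, byte[] columnValue) {
        for (String piece : PieceSplitter.split(columnValue)) {
            serde.delete(piece, rowKey);
        }
    }
}

// Searcher: returns the keys of rows that contain every piece of the expression.
class PieceSearcher {
    private final PieceSerde serde;

    PieceSearcher(PieceSerde serde) {
        this.serde = serde;
    }

    Set<String> search(byte[] expressionValue) {
        Set<String> result = null;
        for (String piece : PieceSplitter.split(expressionValue)) {
            Set<String> keys = serde.read(piece);
            if (result == null) {
                result = new HashSet<>(keys); // copy so the store is never mutated
            } else {
                result.retainAll(keys);
            }
        }
        return result == null ? new HashSet<>() : result;
    }
}
```

Because only the Serde knows how pieces are stored, swapping the in-memory map for an SSTable or an external store would leave the Indexer and Searcher untouched.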
The hardest part of Cassandra index development is testing the code inside the Eclipse environment, where you can step through the debugger. There are a number of possible solutions involving containers running Cassandra and attaching remote debuggers. Still, I wanted to use the same tools that the Cassandra development community uses to test the secondary index implementations within the Cassandra codebase.
Since the Cassandra development community has done a lot of work to ensure that the testing environment is efficient, it makes sense to use their environment to execute CQL-based tests of the index.
Adjusting the Cassandra Project
Open the Cassandra project in Eclipse. Go to Project→Properties→Java Build Path→Projects and add your index project as well as any other projects your project is dependent upon. This will place your code in the Cassandra project class path.
Find the org.apache.cassandra.index.internal package in the Cassandra project under the test/unit source directory. Create your index unit test here using CassandraIndexTest as a model.
- Your class should extend CQLTester.
- Your class should use JUnit to denote the test methods.
- Each method should create the base table and index so that there are clean tables.
- Each method should use CQL commands to manipulate the data.
Running the Tests
While in the Cassandra project, right-click the test class and select Run As → JUnit Test. If all goes well, your tests will run as expected.
Breakpoints can be placed in your Index implementation class and any associated classes in your project. This will allow you to step through your code to ensure that it is functioning as you expect.
Adding the Test Class to the Index Code
The preceding instructions create a unit test class for your code in the Cassandra project. However, you cannot store the test code there. I copied the Cassandra-based unit test code into the src/test/resources directory of my project. This way, the test class is checked into my Git repository, and it avoids the problems it would cause in the src/test/java directory structure, where it would not find the Cassandra CQL test code.
Good luck and happy coding!