By Ben Bromhead Tuesday 7th June 2016

Apache Cassandra 3.x and Materialized Views



Apache Cassandra 3.0 introduces a new feature called materialized views. Materialized views behave as they do in other database systems: you create a table that is populated by the results of a query. Cassandra also keeps the materialized view up to date as you write data to the base table. Whilst the feature itself sounds very simple, it becomes very powerful when working with a denormalized schema, where you often end up writing the same data multiple times in a layout that fits future reads.

By leveraging materialized views you can have some of this logic live in the database and let Cassandra keep everything up to date (in a somewhat eventually consistent manner). Materialized views also allow you to replace some of the functionality provided by secondary indexes, which can be painful to manage and can create performance bottlenecks. Below is a basic example of how you can use materialized views to replace a secondary index for much better performance.

This example is based on the idea of a dating app in which users are matched with other users by an arbitrary algorithm (its details are not relevant here). The matches are stored in Cassandra and the users can accept or reject the matches as they see fit.


First, let's create our basic schema.

Our application's keyspace will have a replication factor of 3.
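
A minimal sketch of such a keyspace might look like the following. The keyspace name `dating` and the use of `SimpleStrategy` are assumptions made for illustration; a multi-datacenter cluster would more likely use `NetworkTopologyStrategy`.

```cql
-- Hypothetical keyspace name; only the replication factor of 3 comes from the text.
CREATE KEYSPACE IF NOT EXISTS dating
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

USE dating;
```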

Our base table will contain all our users and some basic information.
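
As a rough sketch, the users table could be as simple as the one below. The keyspace `dating` and the column names are hypothetical; the only requirement for the rest of the example is that each user has a unique `user_id`.

```cql
-- Hypothetical columns; "some basic information" is all the example needs.
CREATE TABLE IF NOT EXISTS dating.users (
    user_id    uuid PRIMARY KEY,
    first_name text,
    last_name  text
);
```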

This table will contain a set of matched user pairs. Normally this is generated by our magic machine learning pipeline built with Spark… but for this example we’ll just use some dummy data. We will also create our secondary index so we can query the user_matches table by user_id and state. This will allow our app to display a list of ACTIVE matches to the logged-in user.
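
One way this could be laid out is sketched below. The schema is an assumption: each matched pair is its own partition, keyed by `(user_id, matched_user_id)`, with `state` as a regular column, and the keyspace `dating`, the index names, the UUIDs and the `REJECTED` state value are all made up for illustration.

```cql
-- Assumed layout: one partition per matched pair, so user_id alone
-- cannot be queried without an index.
CREATE TABLE IF NOT EXISTS dating.user_matches (
    user_id         uuid,
    matched_user_id uuid,
    state           text,
    PRIMARY KEY ((user_id, matched_user_id))
);

-- Secondary indexes so the table can be queried by user_id and state.
CREATE INDEX IF NOT EXISTS user_matches_user_id_idx ON dating.user_matches (user_id);
CREATE INDEX IF NOT EXISTS user_matches_state_idx ON dating.user_matches (state);

-- Dummy data: Jordan (hypothetical user ...0001) and three matches.
INSERT INTO dating.user_matches (user_id, matched_user_id, state)
    VALUES (00000000-0000-4000-8000-000000000001, 00000000-0000-4000-8000-000000000002, 'ACTIVE');
INSERT INTO dating.user_matches (user_id, matched_user_id, state)
    VALUES (00000000-0000-4000-8000-000000000001, 00000000-0000-4000-8000-000000000003, 'ACTIVE');
INSERT INTO dating.user_matches (user_id, matched_user_id, state)
    VALUES (00000000-0000-4000-8000-000000000001, 00000000-0000-4000-8000-000000000004, 'REJECTED');
```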

Excellent, now that we have some dummy data let’s see what matches will show up when the user Jordan decides to log in.
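
In cqlsh, an indexed lookup of this kind can be traced roughly as follows. This is a sketch that assumes a `dating.user_matches` table with secondary indexes on both `user_id` and `state`, and a made-up UUID for Jordan. With two indexed columns restricted, Cassandra requires `ALLOW FILTERING`.

```cql
-- cqlsh command: print a trace of every step the cluster performs.
TRACING ON;

-- Hypothetical UUID for Jordan; both columns are served by secondary indexes.
SELECT matched_user_id
FROM dating.user_matches
WHERE user_id = 00000000-0000-4000-8000-000000000001
  AND state = 'ACTIVE'
ALLOW FILTERING;
```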

The actual trace result shows over 2500 different operations, but the key thing to note is that the coordinator started to enqueue requests to *all* other nodes in the cluster, despite the replication factor being only 3. This means that as our Netflix-and-chill app grows in popularity and we have to add more nodes, our queries will get slower, because the index is distributed over all nodes.

Enter the materialized view. It allows us to recreate this secondary index as a normal Cassandra table where we have total control over the partition key.
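
A view along these lines might be defined as in the sketch below. It assumes a base table `dating.user_matches` whose primary key is `((user_id, matched_user_id))` with `state` as a regular column; the keyspace name and that layout are illustrative. Cassandra requires every base primary key column to appear in the view's primary key, allows at most one additional non-key column there (here `state`), and requires each view key column to be restricted with `IS NOT NULL`.

```cql
-- Partition the view by (user_id, state) so one partition holds
-- all matches in a given state for a given user.
CREATE MATERIALIZED VIEW IF NOT EXISTS dating.user_match_states AS
    SELECT user_id, state, matched_user_id
    FROM dating.user_matches
    WHERE user_id IS NOT NULL
      AND state IS NOT NULL
      AND matched_user_id IS NOT NULL
    PRIMARY KEY ((user_id, state), matched_user_id);
```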

Here we have told Cassandra to create a new table with data from the user_matches table, using a new partition key format. We were even able to include new fields in our primary key. You can create a materialized view at any point, and Cassandra will kick off a background process to populate it with data from the existing table.

Let’s now run that same query against our new materialized view.
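
Against the view, the lookup could be written roughly like this, assuming `user_match_states` is partitioned by `(user_id, state)` in a hypothetical `dating` keyspace and reusing a made-up UUID for Jordan. Because the full partition key is supplied, neither an index nor `ALLOW FILTERING` is needed.

```cql
TRACING ON;

-- Both partition key columns are given, so this is a single-partition read.
SELECT matched_user_id
FROM dating.user_match_states
WHERE user_id = 00000000-0000-4000-8000-000000000001
  AND state = 'ACTIVE';
```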

Wow, that is a world of difference! The query only hit the primary replica, as the query now matches the primary key for that table.

Users like Jordan now get their matches much quicker!

Let’s see what happens when Jordan swipes left:
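
A rejection could be recorded with an ordinary update to the base table, as in this sketch. The keyspace `dating`, the UUIDs and the `REJECTED` state value are assumptions, as is a base table keyed by `(user_id, matched_user_id)`. Note that the write goes to user_matches only; the view is never written to directly.

```cql
-- Jordan (hypothetical user ...0001) rejects the match with user ...0002.
UPDATE dating.user_matches
SET state = 'REJECTED'
WHERE user_id = 00000000-0000-4000-8000-000000000001
  AND matched_user_id = 00000000-0000-4000-8000-000000000002;

-- Read the view again; the row should move out of the ACTIVE partition
-- once the view update has been applied.
SELECT matched_user_id
FROM dating.user_match_states
WHERE user_id = 00000000-0000-4000-8000-000000000001
  AND state = 'ACTIVE';
```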

The change to user_matches has flowed through to the materialized view user_match_states. Perfect!

There are many other tasks you’ll be able to accomplish with materialized views, and as support matures you’ll be able to mix in things like user-defined functions and aggregations for even more powerful materialized views.

For further reading, check out the following links:
