Getting to know Apache Cassandra!

February 20, 2015
By Instaclustr

Ben Slater is Instaclustr’s newly appointed Chief Product Officer. He has been charged with steering our development roadmap, overseeing the engineering of our products, and managing our production support team.

Ben has over 20 years of experience in systems development including previous stints with a software product company and, for the last 10 years, running large teams for Accenture, a leading global system integrator. He has extensive experience in managing development teams and implementing quality controlled engineering practices.

I am the first to admit that I have limited experience with Cassandra and NoSQL technologies. However, over the last 20 years I have had extensive experience designing and developing enterprise applications and Internet solutions that have relied on traditional RDBMS foundations, and I am more than curious about Cassandra and where it fits in.

I think my background puts me in a position, as someone joining the Cassandra community from an enterprise IT background, to offer some observations and insights along the way. Hopefully, there will be some information that’s useful to others trying to get their head around Cassandra and I’ll get some feedback that will help me learn and understand the strengths of Cassandra and related technologies.

The first thing I’ve been trying to understand is “what is Cassandra? What does it do and why would I want to use it?”

Wikipedia tells me:

Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients.

Planet Cassandra says:

Apache Cassandra, a top level Apache project born at Facebook and built on Amazon’s Dynamo and Google’s BigTable, is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers while providing highly available service with no single point of failure. … Cassandra does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format. The Cassandra system was designed to run on cheap commodity hardware and handle high write throughput while not sacrificing read efficiency, helping drive down costs of ownership while greatly increasing the value of a business’s big data environment.

I pretty quickly get the point that Cassandra offers a solution for relatively cheaply and easily deploying a highly scalable and reliable distributed database architecture. A little bit more digging and I now understand that Cassandra is much simpler and more capable than the traditional enterprise RDBMS systems I am familiar with when it comes to providing horizontal scalability, true high availability and replication across data centres for disaster recovery and other requirements.

However, Cassandra also has some limitations when compared to an RDBMS, which clearly means it can’t simply be considered a like-for-like replacement technology. One of the first such limitations that I see is that the query language for Cassandra (CQL) is very narrow when compared to SQL, with no joins or aggregations. Also, Cassandra’s version of transactions is limited compared to a traditional RDBMS.
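To make the "no joins or aggregations" point concrete, here is a minimal Python sketch of what the application ends up doing itself. It does not talk to Cassandra at all: the `users` and `orders` lists are hypothetical rows standing in for the results of two separate single-table queries, and `total_spend_by_name` is an illustrative helper name of my own, not part of any driver.

```python
from collections import defaultdict

# Hypothetical rows, shaped like the results of two separate
# single-table queries (no real database is involved here).
users = [
    {"user_id": 1, "name": "Ada"},
    {"user_id": 2, "name": "Grace"},
]
orders = [
    {"order_id": 10, "user_id": 1, "total": 25.0},
    {"order_id": 11, "user_id": 1, "total": 15.0},
    {"order_id": 12, "user_id": 2, "total": 40.0},
]


def total_spend_by_name(users, orders):
    """Join orders to users and sum order totals in application code."""
    # The "join": build a lookup from user_id to name.
    names = {u["user_id"]: u["name"] for u in users}
    # The "aggregation": sum order totals per user name.
    totals = defaultdict(float)
    for order in orders:
        totals[names[order["user_id"]]] += order["total"]
    return dict(totals)


if __name__ == "__main__":
    print(total_spend_by_name(users, orders))
```

In SQL this would be a single statement with a JOIN and a GROUP BY; here the same work is plain application logic.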

So, given these pluses and minuses, where does Cassandra fit into a solution architect’s toolbox? The conclusion I’ve come to is that Cassandra can be thought of as being analogous to an application state serialization file with associated access logic but for modern applications where you need virtually unlimited horizontal scalability, 100% availability and quick response times across massive data volumes. By this, I mean that the design and usefulness of Cassandra data stores is closely bound to the function of the applications that create and consume the stores. This contrasts with normalised relational database design where you are seeking to implement a “pure” logical representation of the real world structure of the data to allow it to be accessed and updated as flexibly as possible.

To me, this means solutions built around Cassandra need to be designed differently than those based on an RDBMS. Denormalization and the limited flexibility of the CQL language put more responsibility on the application layer than with a traditional RDBMS system. Interestingly, in many ways, I believe this can be a good thing as it means all the complex logic lives in your application where it can be maintained in a single organised structure and understood by developers with a single language skillset.
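As a rough illustration of what that extra application-layer responsibility can look like, the sketch below keeps one copy of each message per query the application needs to answer. It is an in-memory toy, not Cassandra code: `MessageStore`, its two dictionaries and the method names are all hypothetical, and each dictionary simply stands in for a table keyed by the value you would look it up by.

```python
from collections import defaultdict


class MessageStore:
    """Toy in-memory stand-in for two query-specific (denormalized) tables."""

    def __init__(self):
        # One "table" per access pattern, keyed by the lookup value.
        self.by_sender = defaultdict(list)
        self.by_recipient = defaultdict(list)

    def save(self, sender, recipient, sent_at, body):
        """Write the same message to both tables; the application owns this duplication."""
        row = {"sender": sender, "recipient": recipient,
               "sent_at": sent_at, "body": body}
        self.by_sender[sender].append(row)
        self.by_recipient[recipient].append(dict(row))  # separate copy per table

    def sent_by(self, sender):
        """Answer 'what did this user send?' from a single lookup, oldest first."""
        return sorted(self.by_sender.get(sender, []), key=lambda r: r["sent_at"])

    def inbox_of(self, recipient):
        """Answer 'what did this user receive?' from a single lookup, oldest first."""
        return sorted(self.by_recipient.get(recipient, []), key=lambda r: r["sent_at"])


if __name__ == "__main__":
    store = MessageStore()
    store.save("ada", "grace", 2, "second")
    store.save("ada", "grace", 1, "first")
    print([m["body"] for m in store.inbox_of("grace")])
```

Each read is a single keyed lookup, at the cost of the application writing the data twice and keeping the copies consistent.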

In any event, I suspect most RDBMS-based systems with highly demanding performance and availability requirements end up moving away from a pure logical data model to something fairly tightly bound to the critical functions that must be supported.

The other good news with Cassandra is that, to some extent, you can have your cake and eat it too by using analytic tools such as Hive and Spark to get back some of the querying functionality you lose when compared to an RDBMS (and more, of course!). One way of thinking about this is that the architecture allows you to hook in your application at a lower level in the stack than an RDBMS would, which gets you the speed and reliability you need. The complex query engines sit as peers to your application rather than in between you and the core storage and retrieval engine.

So, if your application needs great reliability, scalability and performance, then it is clear that Cassandra offers compelling advantages over a traditional RDBMS. However, as with most technologies, it will be important to understand it thoroughly and architect your application to work according to the expected patterns of the technology.

As I said in the introduction, these are my thoughts as I start to get my head around Cassandra and I’d love to get some feedback as I take this journey with my team here at Instaclustr.

Our aim is to be so good at running Cassandra that you’d be nuts to do it yourself and we believe that contributing to the Cassandra community is an important part of achieving that aim.

Ben Slater

Chief Product Officer at Instaclustr

Twitter: https://twitter.com/slater_ben