Apache Cassandra

The database of choice for scalable, highly available, reliable and high performance applications.

Instaclustr Managed Apache Cassandra details

What is Apache Cassandra®?

Apache Cassandra® is an open source non-relational, or NoSQL, database that enables continuous availability, tremendous scale and data distribution across multiple data centers and cloud availability zones.

Simply put, Cassandra provides a highly reliable data storage engine for applications requiring immense scale.

Get to know Cassandra better in this blog.

Open Source Apache Cassandra®

The Cassandra database was originally developed at Facebook, which released it as an open-source project on Google Code in 2008. In 2010, it became a top-level Apache project.

The open source version of the Cassandra database is used by some of the largest technology companies in the world to run mission-critical applications. It is widely known that the largest deployment of the open source version of the Cassandra database is at Apple. Netflix is also a very large user of open source Apache Cassandra – the foundation for big data. It is estimated that the Cassandra database is deployed in over 50% of the Fortune 500 companies.

Learn more about the health of the Apache Cassandra community.

To learn more about open source technologies and the benefits of open source Cassandra, view our webinar “Power of the Open Source”. The webinar is a great resource for understanding the pitfalls of proprietary technologies.

Cassandra’s place in the NoSQL world

NoSQL includes a diverse range of technologies, with specific NoSQL products suited to different use cases. NoSQL database technology was designed to overcome the limitations of RDBMS technology on data size, transaction throughput, scalability, reliability and manageability, flexibility of data schema, and/or cost of hardware.

Our CPO Ben Slater provides an understanding of where Cassandra fits in the NoSQL world as well as Cassandra’s ecosystem.

In the blog post “Surveying the Cassandra-compatible database landscape”, Ben Slater, CPO, Instaclustr shares details on a range of Cassandra-compatible offerings available in the market.

Cassandra and Availability at Scale

The Cassandra database has a number of core features and benefits that allow it to scale massively while maintaining continuous, high availability and without compromising performance.

Watch the YouTube video Cassandra Serving Netflix @ Scale – Vinay Chella, Netflix to see how Cassandra serves Netflix at several million operations per second with multiple nines of availability, across a deployment of 250+ clusters, 10,000+ nodes and 3+ PB of data.

Download our white paper “How to maximize availability with Apache Cassandra” to learn various strategies you could apply to your Cassandra deployment. In this white paper you will learn architectural, infrastructure and application-level strategies.

Apache Cassandra Architecture

The Cassandra database has been designed with scale, performance and continuous availability as its foundational architectural principles. Cassandra operates using a masterless ring architecture: it does not rely on a master-slave relationship.

In Cassandra, all nodes play an identical role; there is no concept of a master node, with all nodes communicating with each other via a distributed, scalable protocol. Writes are distributed among nodes using a hash function and reads are channeled onto specific nodes.

Cassandra stores data by dividing the data evenly around its cluster of nodes. Each node is responsible for part of the data. The act of distributing data across nodes is referred to as data partitioning.
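To make the idea of hash-based data partitioning more concrete, here is a minimal Python sketch of a ring of nodes. It is an illustration under stated assumptions rather than Cassandra's implementation: the node names and the `token` and `Ring` helpers are hypothetical, and MD5 from the standard library stands in for the partitioner's hash function (Cassandra's default partitioner uses Murmur3 and also spreads each node over many virtual nodes).

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key to a position on the ring (MD5 is used purely for illustration)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class Ring:
    """A ring of nodes; each node owns the token range that ends at its own token."""

    def __init__(self, nodes):
        self._entries = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._entries]

    def owner(self, key: str) -> str:
        # Pick the first node whose token is >= the key's token,
        # wrapping around to the first node at the end of the ring.
        i = bisect.bisect_left(self._tokens, token(key)) % len(self._entries)
        return self._entries[i][1]


ring = Ring(["node-a", "node-b", "node-c"])
placement = {k: ring.owner(k) for k in ("user:1", "user:2", "user:3", "user:4")}
```

Because placement depends only on the hash of the key, any node can work out which node is responsible for a given key without consulting a master.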

Cassandra’s built-for-scale architecture means that it is capable of handling large amounts of data and millions of concurrent users or operations per second, even across multiple data centers, as easily as it can manage much smaller amounts of data and user traffic. To add more capacity, you simply add new nodes to an existing cluster without having to take it down first. Unlike master-slave or sharded systems, Cassandra has no single point of failure and is therefore capable of offering true continuous availability and uptime.

The key components of the Cassandra architecture include the following terms and concepts:

  • Node − The specific instance where data is stored.
  • Rack − A set of nodes with a correlated chance of failure.
  • Data center − A collection of related nodes with a complete set of data.
  • Cluster − A component that contains one or more data centers.
  • Commit log − A crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
  • Mem-table − A memory-resident data structure that acts as a write-back cache, holding writes that have not yet been flushed to disk.
  • SSTable − A Sorted String Table (SSTable) is an ordered, immutable key-value map. It is an efficient way of storing large sorted data segments in a file. You may also be interested to read Instaclustr Open Sources Cassandra sstable analysis tools and view our help page.
  • Bloom filter − An extremely fast way to test whether an item is a member of a set. A bloom filter can tell if an item might exist in a set or definitely does not exist in the set. Bloom filters are a good way of avoiding expensive I/O operations.
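The way the commit log, mem-table, SSTable and bloom filter fit together can be sketched in a few lines of Python. This is a simplified toy model, not Cassandra's storage engine: the class names, the two-entry flush threshold and the SHA-256 based bloom filter are all assumptions made for illustration, and everything lives in memory rather than on disk.

```python
import hashlib


class BloomFilter:
    """Answers "might contain" or "definitely does not contain" for a set of keys."""

    def __init__(self, size_bits=256, hashes=3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))


class SSTable:
    """Immutable, sorted key-value segment with its own bloom filter."""

    def __init__(self, items):
        self.rows = tuple(sorted(items.items()))
        self._index = dict(self.rows)
        self.bloom = BloomFilter()
        for key in self._index:
            self.bloom.add(key)

    def get(self, key):
        return self._index.get(key)


class Node:
    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory writes not yet flushed
        self.sstables = []     # flushed segments, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # logged first, for crash recovery
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.sstables.append(SSTable(self.memtable))  # flush
            self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest segment first
            if not table.bloom.might_contain(key):
                continue  # definitely absent, so skip the lookup entirely
            value = table.get(key)
            if value is not None:
                return value
        return None


node = Node()
for k, v in [("a", 1), ("b", 2), ("c", 3)]:
    node.write(k, v)
```

After the three writes above, the first two keys have been flushed into one sorted segment while the third is still in the mem-table, and all three writes appear in the commit log. A real node would also discard commit log segments once their data is safely flushed, which this sketch leaves out.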

Cassandra Data Modelling

Cassandra is a wide-column store database. Its data model is a partitioned row store with tunable consistency. Rows are organized into tables; the first component of a table’s primary key is the partition key; within a partition, rows are clustered by the remaining columns of the key. Other columns may be indexed separately from the primary key. Tables may be created, dropped, and altered at run-time without blocking updates and queries.
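The role of the partition key and the clustering columns can be illustrated with a small Python model. This is a sketch only: the `Table` class, the sensor names and the date strings are hypothetical, and a real table would of course be defined and queried through CQL rather than Python dictionaries.

```python
import bisect
from collections import defaultdict


class Table:
    """Rows grouped by partition key and kept sorted by clustering key."""

    def __init__(self):
        # partition key -> list of (clustering key, row), sorted by clustering key
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, row):
        rows = self._partitions[partition_key]
        keys = [ck for ck, _ in rows]
        i = bisect.bisect_left(keys, clustering_key)
        if i < len(rows) and rows[i][0] == clustering_key:
            rows[i] = (clustering_key, row)  # same primary key: overwrite
        else:
            rows.insert(i, (clustering_key, row))

    def select(self, partition_key, start=None, end=None):
        """Return rows of one partition whose clustering key lies in [start, end]."""
        return [
            row
            for ck, row in self._partitions.get(partition_key, [])
            if (start is None or ck >= start) and (end is None or ck <= end)
        ]


readings = Table()
readings.insert("sensor-1", "2024-01-02", {"temp": 21.5})
readings.insert("sensor-1", "2024-01-01", {"temp": 20.0})
readings.insert("sensor-2", "2024-01-01", {"temp": 18.2})
```

Reading one partition returns its rows already ordered by the clustering key, which is why queries that name the partition key and a range of clustering values tend to suit this model well.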

Cassandra cannot do joins or subqueries. Rather, Cassandra emphasizes denormalization through features like collections. A column family (called “table” since CQL3) resembles a table in an RDBMS. Column families contain rows and columns. Each row is uniquely identified by a row key. Each row has multiple columns, each of which has a name, a value, and a timestamp. Unlike a table in an RDBMS, different rows in the same column family do not have to share the same set of columns, and a column may be added to one or multiple rows at any time.
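As a rough illustration of rows that carry different sets of columns, each with a name, value and timestamp, consider the Python sketch below. The row contents and the `merge_rows` helper are hypothetical; keeping the cell with the newest timestamp is a simplified take on how Cassandra reconciles writes to the same column.

```python
def merge_rows(*versions):
    """Merge versions of a row; for each column keep the cell with the newest timestamp."""
    merged = {}
    for version in versions:
        for name, (value, ts) in version.items():
            if name not in merged or ts > merged[name][1]:
                merged[name] = (value, ts)
    return merged


# Two rows in the same column family with different sets of columns.
# Each cell is stored as column name -> (value, timestamp).
alice = {"email": ("alice@example.com", 100), "city": ("Sydney", 100)}
bob = {"email": ("bob@example.com", 105)}

# A later write touches one existing column of alice's row and adds a new one.
alice_update = {"city": ("Canberra", 200), "phone": ("555-0100", 200)}
alice_now = merge_rows(alice, alice_update)
```

Only the columns named in the later write change; the untouched `email` cell keeps its original value and timestamp, and `bob` never gains a `phone` column.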

Our white paper 6 Step Guide to Apache Cassandra Data Modelling sets out a methodical approach that we use to define a data model for our customers deploying open source Cassandra. You can read more about Data Modelling recommended practices on our support portal.

Apache Cassandra vs DynamoDB

One database that Cassandra is often compared with is AWS DynamoDB. Both Cassandra and DynamoDB offer incredible scale and availability. Both can serve tens of millions of reads and writes and offer a level of resilience in the face of failure. The two technologies share a similar underlying architecture (Dynamo), but that is where the similarities end; they differ in many other ways.

In the blog post “Surveying the Cassandra-compatible database landscape”, Ben Slater, CPO, Instaclustr shares the range of Cassandra-compatible offerings available in the market.

Download our white paper Apache Cassandra vs DynamoDB to understand the differences and identify the technology you should adopt for your unique use case.

Configuring and Operating Cassandra

The following are a number of blogs and good references that relate to configuring and operating the Apache Cassandra database.

Cassandra Broadcast Address

When configuring Cassandra to work in a new environment or with a new application or service, we sometimes find ourselves asking about the difference between broadcast_address and broadcast_rpc_address. Our blog attempts to demystify the Cassandra broadcast address.

Apache Cassandra Compaction Strategies

It is equally important to understand the Cassandra Compaction Strategies. While regular compactions are an integral part of any healthy Cassandra cluster, the way that they are configured can vary significantly depending on the way a particular table is being used. Our technical article gives you an in-depth look at Cassandra Compaction Strategies.

Apache Cassandra Tombstones

Multi-value data types are a powerful feature of Cassandra. However, some of Cassandra’s behaviour when handling these data types is not always as expected and can cause issues. In particular, there can be hidden surprises when you update the value of a collection type column. Our blog, Cassandra collections: hidden tombstones and how to avoid them, digs deeper into this topic.

Apache Cassandra Best Practices

The right deployment strategies and best practices for Apache Cassandra can mean the difference between on-time deployment of applications that scale massively, are always available, and perform blazingly fast, and those that bring your applications to a crawl.

Download our white paper on Avoiding the Pitfall and Challenges of Cassandra Implementation to identify common mistakes made while implementing the Cassandra database for big data technology.

We have tons of resources on our support portal to help you with creating your cluster.

Apache Cassandra CQL

Users can access Cassandra through its nodes using Cassandra Query Language (CQL). CQL treats the database (keyspace) as a container of tables. CQL is a typed language and supports a rich set of data types, including native types, collection types, user-defined types, tuple types and custom types.
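As a hedged illustration of what CQL looks like from application code, the Python sketch below holds a few CQL statements as strings and runs them through a session object. The `demo` keyspace, the `sensor_readings` table and the `setup_and_insert` helper are hypothetical names. The `execute(query, parameters)` call and the `%s` placeholders are modelled on the DataStax Python driver's `Session`, but here a small recording stand-in is used so the example runs without a cluster.

```python
from datetime import datetime, timezone

# A keyspace is the container of tables.
CREATE_KEYSPACE = """
CREATE KEYSPACE IF NOT EXISTS demo
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
"""

# Native types (text, timestamp, double) alongside a collection type (set<text>).
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS demo.sensor_readings (
    sensor_id text,
    reading_time timestamp,
    temperature double,
    tags set<text>,
    PRIMARY KEY ((sensor_id), reading_time)
)
"""

INSERT_READING = """
INSERT INTO demo.sensor_readings (sensor_id, reading_time, temperature, tags)
VALUES (%s, %s, %s, %s)
"""


def setup_and_insert(session, sensor_id, reading_time, temperature, tags):
    """Run the statements through any object exposing execute(query, parameters)."""
    session.execute(CREATE_KEYSPACE)
    session.execute(CREATE_TABLE)
    session.execute(INSERT_READING, (sensor_id, reading_time, temperature, tags))


class RecordingSession:
    """Stand-in that records statements instead of sending them to a cluster."""

    def __init__(self):
        self.calls = []

    def execute(self, query, parameters=None):
        # Collapse whitespace so the recorded statement is easy to inspect.
        self.calls.append((" ".join(query.split()), parameters))


session = RecordingSession()
setup_and_insert(
    session,
    "sensor-1",
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    21.5,
    {"indoor"},
)
```

Against a real cluster, the recording stand-in would be replaced by a session obtained from a language driver, and the same statements could equally be typed into cqlsh.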

Programmers work with CQL either through cqlsh, a command-line prompt, or through separate application language drivers. Read our support article to understand how cqlsh can be used to connect to clusters in Instaclustr, and the blog Consulting Cassandra: Second Contact with the Monolith (CQLSH).

Migrating to Cassandra

If you are planning to migrate to Cassandra, you need to keep a few things in mind: knowing when to consider migration, preparing your application, and understanding the available migration approaches.

The presentation by our CPO, Ben Slater, on migrating to Apache Cassandra is a great resource if you are considering migrating your cluster to Cassandra.

Deploying Apache Cassandra

The following are some key resources and information related to deploying Cassandra in the cloud or within your own private data center.

Cassandra and Multi-Datacenter Clusters

One of the strongest features of Cassandra is its native support for the concept of multiple logical data centers within a cluster. Multi-datacenter clusters allow Cassandra to support several different scenarios. While creating additional data centers in Cassandra is fairly straightforward at a high level, cross-region and cross-provider scenarios require you to dig deeper. Our CPO, Ben Slater, explains how Instaclustr has made multi-datacenter clusters easy.

We conducted benchmarking for multi-data-center Apache Spark and Apache Cassandra. The aim of this benchmark study was to compare performance between a one-data-center setting, where Spark and Cassandra are co-located, and a two-data-center setting, where Spark runs in the second data center.

Download the presentation, Introduction to Managing Apache Cassandra. The presentation by Brooke Thorley, VP Technical Operations & Customer Services, Instaclustr, provides an introduction to managing Apache Cassandra. If you are new to Cassandra, this presentation will help clear any doubts as you learn tricks used by experts in managing Cassandra.

If you are already using Cassandra but dealing with high severity incidents in unknown environments in a Cassandra cluster, you may find the presentation Apache Cassandra consulting and firefighting useful.

Cassandra and Spark

Apache Spark originated as a research project at UC Berkeley’s AMPLab and has been a full-blown Apache project for many years now. Apache Spark is a high-performing engine for large-scale analytics and data processing. While Apache Spark provides advanced analytics capabilities, it requires a fast, distributed backend data store. Apache Cassandra is the most modern, reliable and scalable choice for that data store. Spark, when fully integrated with the key components of Cassandra, provides the resilience and scale required for big data analytics. Spark supports a rich set of higher-level tools including Spark SQL, MLlib, GraphX and Spark Streaming.

One of the advantages of deploying Spark with Instaclustr is that it is a co-located data engine: it sits right where your operational database resides, with no need for extracting, transforming and loading data into a new environment. Spark and Cassandra clusters are deployed to the same set of machines. Cassandra stores the data; Spark worker nodes are co-located with Cassandra and do the data processing. Spark is a batch-processing system, designed to deal with large amounts of data. When a job arrives, the Spark workers load data into memory, spilling to disk if necessary.

A blog post by our CPO Ben Slater outlines some of the solution patterns where it makes sense to use Spark Streaming alongside Cassandra.

Ben Bromhead, CTO, Instaclustr takes a deep dive into how Spark and Cassandra can be used together in his presentation “Processing 200K Transactions per Second with Apache Spark and Apache Cassandra”.

Our tutorial on getting started with Instaclustr Spark and Cassandra is a good starting point for learning to provision a cluster using Spark, Cassandra and more.

Our technology evangelist, Paul Brebner, wrote an introductory “2001 Space Odyssey themed” series on using Cassandra, Spark and Zeppelin for Big Data Predictive Analytics (Machine Learning over Instaclustr’s Instametrics Cassandra cluster monitoring data).

The final blog in the series covers Spark Streaming: Apache Spark Structured Streaming with DataFrames

In the early days after we released our Cassandra + Spark managed service offering, we had opportunities to dig deeper into using the Cassandra connector for Spark, both with our own Instametrics application and while assisting customers with developing and troubleshooting. During this process, we learnt a few key lessons about how to get the best out of the Cassandra connector for Spark. Check out the 5 easy tips.

Apache Cassandra on AWS

Cassandra on AWS EBS Infrastructure

Traditionally, it was believed that Cassandra and AWS EBS don’t mix. However, with the release of the latest generation of EBS-optimized instances this belief has changed, and we now know people have success using these nodes to run Cassandra. In his blog post, Ben answers many questions around Cassandra on AWS EBS infrastructure and Cost of Cassandra on AWS.

AWS Lambda and Cassandra

AWS Lambda is a simple way to execute a small portion of stateless code, on demand, without the need to provision any servers. AWS Lambda is often combined with AWS API Gateway to provide the front end and execution layer of a REST API.

Check out our presentation Cassandra + Lambdascale POC, which walks through a POC that combines AWS Lambda, API Gateway and Instaclustr Apache Cassandra Managed Service to power a simple REST API.

VPC Peering

A VPC peering connection is a networking connection between two VPCs that enables you to route traffic between them privately. Instaclustr supports VPC peering as a mechanism for connecting directly to your Instaclustr managed cluster. VPC Peering allows you to access your cluster via private IP and makes for a much more secure network setup. View our support page on using VPC Peering.

AWS R4 Instance Type

R4 instances are the next generation of Amazon EC2 Memory Optimized instances. R4 instances are well-suited for memory-intensive, latency-sensitive workloads like business intelligence (BI), data mining & analysis, in-memory databases, distributed web scale in-memory caching, and applications performing real-time processing of unstructured big data. We conducted Cassandra benchmarking of the R4 type against our existing M4 offerings and found significant performance improvements when running a fairly IO-intensive mixed workload. Know more.

Apache Cassandra Kubernetes

Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. It has a large and rapidly growing ecosystem and the services, support, and tools are also widely available.

In his presentation, Ben Bromhead, CTO, Instaclustr, introduces the Cassandra Kubernetes Operator, a Cassandra controller that provides robust, managed Cassandra deployments on Kubernetes.

Cassandra and Multi-Cloud

Today, companies want to be cloud agnostic and not be locked in to a single vendor, and this has made multi-cloud deployment highly desirable. But simply moving to the cloud is hard enough. Our white paper “Why choosing Apache Cassandra is planning for a multi cloud future” outlines 5 reasons that make Apache Cassandra an enabler for true multi-cloud deployments.

Cassandra and the Data Layer

Cassandra is a database technology, providing the data store for an application or solution. While the data storage mechanism forms an incredibly important part of the data layer, there are other relevant technologies that can be integrated and used.

Cassandra forms only one part of the data layer; a range of other core open source technologies can be effectively integrated with it to provide a more complete data layer solution. DBaaS offerings are moving beyond the database alone and now include the data layer components that interact with the database, such as integrated data software and related infrastructure.

The Instaclustr Managed Platform provides an integrated data layer with the following complementary open source technologies.


The “Pick‘n’Mix: Cassandra, Spark, Zeppelin, Elassandra, Kibana, & Kafka” blog looks at possible ways of using these technologies together.

Managed Cassandra

Cassandra is the database of choice for scalable, highly available, reliable and high performance applications. Instaclustr Managed Service for Apache Cassandra gets you up and running quickly and is the most reliable way to run Cassandra for your application. We are so confident in the performance of our clusters that we include latency and performance guarantees in our contracted SLAs. You can enjoy our hosted and fully managed Apache Cassandra on AWS, Azure, GCP, IBM cloud or in your own private data center with 24/7 support.

Our Managed Cassandra comes with the following add-ons:

Apache Lucene: The Cassandra Lucene Index plugin extends Cassandra’s native secondary index to provide comprehensive search functionality through Multivariable, Geospatial and Bi-temporal Search capabilities. Cassandra Lucene Index resides right where your operational database resides, so there is no need for extracting, transforming and loading data into a new environment.

Apache Zeppelin: Apache Zeppelin provides a notebook user interface to allow interactive development and execution of code against both Cassandra and Spark along with data visualization capabilities. Zeppelin gives you an interactive analytics environment to start querying data in your Cassandra database or running complex analytics using Apache Spark as soon as your cluster is provisioned.

Cassandra-as-a-service

If you are exploring Cassandra-as-a-Service, download our white paper “Managing Reliability at Scale”. It gives the big picture on engaging a managed service provider (MSP) to manage your database and will help you understand why an MSP is more than just having someone to manage your database.

Our second white paper, “The Unmatchable ROI of Managed Cassandra Service”, takes you through 3 key points you need to consider when deciding between building your own Cassandra competency center and outsourcing to an expert Cassandra service provider.

Cassandra Consulting

We have extensive experience in Apache Cassandra Consulting, helping our customers develop and deploy high performance and continually available solutions.

We offer a wide range of Consulting Service Packages that will help you take advantage of our expertise in open source and be guided by our team of experts.

Talk to our consultant
