What is Apache Cassandra®?
Apache Cassandra is an open source non-relational, or NoSQL, database that enables continuous availability, tremendous scale, and data distribution across multiple data centers and cloud availability zones.
Simply put, Cassandra provides a highly reliable data storage engine for applications requiring immense scale.
Get to know Apache Cassandra better in this blog.
Open Source Apache Cassandra®
Apache Cassandra was originally developed at Facebook, and in 2008 it was released as an open source project on Google Code by the company. In 2010, it became a top-level Apache project.
The open source version of the Cassandra database is used by some of the largest technology companies in the world to run mission-critical applications. It is widely known that the largest deployment of the open source version of the Cassandra database is at Apple. Netflix is also a very large user of open source Apache Cassandra—the foundation for big data. It is estimated that Cassandra is deployed by over 50% of the Fortune 500 companies.
Cassandra’s Place in the NoSQL World
NoSQL includes a diverse range of technologies, with specific NoSQL products suited to different use cases. NoSQL database technology was designed to overcome the limitations of RDBMS technology in data size, transaction throughput, scalability, reliability, manageability, flexibility of data schema, and/or cost of hardware.
In the blog post “Surveying the Cassandra-compatible database landscape”, Ben Slater, CPO, Instaclustr shares details on a range of Cassandra-compatible offerings available in the market.
Cassandra and Availability at Scale
Cassandra has a number of core features and benefits that deliver the capability to massively scale, while still maintaining continuous and high availability without compromising performance.
Watch the YouTube video Cassandra Serving Netflix @ Scale – Vinay Chella, Netflix to see how Cassandra serves Netflix at several million operations per second and multiple nines of availability, across a deployment of 250+ clusters, 10,000+ nodes, and 3+ PB of data. Download our white paper “How to Maximize Availability With Apache Cassandra” to learn the architectural, infrastructure, and application-level strategies you can apply to your Cassandra deployment.
How Cassandra Works
Cassandra has been designed with scale, performance, and continuous availability as its foundational architectural principles. Cassandra operates using a masterless ring architecture; it does not rely on a master-slave relationship.
In Cassandra, all nodes play an identical role; there is no concept of a master node, with all nodes communicating with each other via a distributed, scalable protocol. Writes are distributed among nodes using a hash function and reads are channelled onto specific nodes.
Cassandra stores data by dividing the data evenly around its cluster of nodes. Each node is responsible for part of the data. The act of distributing data across nodes is referred to as data partitioning.
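The idea of hashing a key to decide which node owns it can be sketched in a few lines of Python. This is a minimal illustration, not Cassandra's implementation: MD5 stands in for the partitioner hash, each node gets a single token derived from its name, and the `TokenRing` class, `token_for` function, and node names are all hypothetical. Production clusters typically assign many tokens per node to even out the distribution.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Map a key to a position (token) on the ring.

    Assumption: MD5 stands in for the cluster's partitioner hash.
    """
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class TokenRing:
    """A toy ring in which each node owns the token range ending at its token."""

    def __init__(self, nodes):
        # One token per node, derived from the node name (hypothetical scheme).
        self._ring = sorted((token_for(name), name) for name in nodes)
        self._tokens = [token for token, _ in self._ring]

    def owner(self, partition_key: str) -> str:
        """Return the node responsible for the given partition key."""
        index = bisect.bisect_left(self._tokens, token_for(partition_key))
        # Wrap around to the first node when the token is past the last one.
        return self._ring[index % len(self._ring)][1]


ring = TokenRing(["node-a", "node-b", "node-c"])
owners = {key: ring.owner(key) for key in ("user:1", "user:2", "user:3")}
```

Because ownership depends only on the hash, any node can work out where a key lives without asking a master. Adding a node to this toy ring moves only the keys that fall into the new node's range, which is the property that lets capacity be added without reshuffling everything.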
Cassandra is a built-for-scale architecture, meaning that it is capable of handling large amounts of data and millions of concurrent users or operations per second—even across multiple data centers—as easily as it can manage much smaller amounts of data and user traffic. To add more capacity, you simply add new nodes to an existing cluster without having to take it down first. Unlike other master-slave or sharded systems, Cassandra has no single point of failure and therefore is capable of offering true continuous availability and uptime.
The key components of the Cassandra architecture include the following terms and concepts:
- Node: the specific instance where data is stored.
- Rack: a set of nodes with a correlated chance of failure.
- Datacenter: a collection of related nodes with a complete set of data.
- Cluster: a component that contains one or more data centers.
- Commit log: a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
- Mem-table: a memory-resident data structure. It acts as a write-back cache, holding writes that have not yet been flushed to disk.
- SSTable: a Sorted String Table (SSTable) is an ordered, immutable key-value map. It is basically an efficient way of storing large sorted data segments in a file. You may also be interested to read Instaclustr Open Sources Cassandra sstable analysis tools and view our help page.
- Bloom filter: an extremely fast way to test whether an item exists in a set. A bloom filter can tell if an item might exist in a set or definitely does not exist in the set. Bloom filters are a good way of avoiding expensive I/O operations.
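How the commit log, mem-table, SSTable, and bloom filter fit together on the write and read paths can be sketched as a toy in-memory store. Everything here is a simplified assumption for illustration: the `ToyStore` and `BloomFilter` classes, the flush threshold, and the use of SHA-256 are hypothetical, and "disk" is just a Python tuple.

```python
import hashlib


class BloomFilter:
    """Answers "definitely absent" or "possibly present" for a key."""

    def __init__(self, size_bits=1024, hashes=3):
        self.size_bits = size_bits
        self.hashes = hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits |= 1 << position

    def might_contain(self, key):
        return all((self.bits >> position) & 1 for position in self._positions(key))


class ToyStore:
    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory writes not yet flushed
        self.sstables = []     # (sorted immutable rows, bloom filter) pairs
        self.memtable_limit = memtable_limit
        self.sstable_reads = 0  # counts simulated disk lookups

    def write(self, key, value):
        self.commit_log.append((key, value))  # logged first for crash recovery
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Turn the mem-table into a sorted, immutable SSTable."""
        rows = tuple(sorted(self.memtable.items()))
        bloom = BloomFilter()
        for key, _ in rows:
            bloom.add(key)
        self.sstables.append((rows, bloom))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for rows, bloom in reversed(self.sstables):  # newest SSTable first
            if not bloom.might_contain(key):
                continue  # definitely absent: skip the expensive lookup
            self.sstable_reads += 1
            found = dict(rows).get(key)
            if found is not None:
                return found
        return None


store = ToyStore(memtable_limit=2)
store.write("a", 1)
store.write("b", 2)  # reaches the limit and triggers a flush
store.write("c", 3)  # stays in the mem-table
value_a = store.read("a")
```

The sketch keeps every commit log entry forever; a real engine would discard log segments once the corresponding mem-table has been flushed.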
Cassandra Data Modeling
Cassandra is a wide-column store database. Its data model is a partitioned row store with tunable consistency. Rows are organized into tables; the first component of a table’s primary key is the partition key; within a partition, rows are clustered by the remaining columns of the key. Other columns may be indexed separately from the primary key. Tables may be created, dropped, and altered at run-time without blocking updates and queries.
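The relationship between the partition key and the clustering columns can be illustrated with a small Python model. This is only a sketch of the layout idea; the `ToyTable` class and the sensor data are hypothetical, and a single string stands in for the clustering columns.

```python
import bisect
from collections import defaultdict


class ToyTable:
    """Rows grouped by partition key and kept sorted by clustering key."""

    def __init__(self):
        self._partitions = defaultdict(list)  # partition key -> sorted rows

    def insert(self, partition_key, clustering_key, values):
        rows = self._partitions[partition_key]
        keys = [key for key, _ in rows]
        index = bisect.bisect_left(keys, clustering_key)
        if index < len(rows) and rows[index][0] == clustering_key:
            rows[index] = (clustering_key, values)  # same primary key: overwrite
        else:
            rows.insert(index, (clustering_key, values))

    def select(self, partition_key, start=None, end=None):
        """Return rows of one partition with start <= clustering key < end."""
        rows = self._partitions.get(partition_key, [])
        return [
            (key, values)
            for key, values in rows
            if (start is None or key >= start) and (end is None or key < end)
        ]


table = ToyTable()
table.insert("sensor-1", "2024-01-02", {"temp": 21.5})
table.insert("sensor-1", "2024-01-01", {"temp": 20.0})
table.insert("sensor-2", "2024-01-01", {"temp": 18.2})
january_first = table.select("sensor-1", start="2024-01-01", end="2024-01-02")
```

Queries that name a partition key and a range of clustering values read one contiguous, already-sorted slice, which is why data models for this kind of store are usually designed around the queries they must serve.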
Cassandra cannot do joins or subqueries. Rather, Cassandra emphasizes denormalization through features like collections. A column family (called “table” since CQL3) resembles a table in an RDBMS. Column families contain rows and columns. Each row is uniquely identified by a row key. Each row has multiple columns, each of which has a name, value, and timestamp. Unlike a table in an RDBMS, different rows in the same column family do not have to share the same set of columns, and a column may be added to one or multiple rows at any time. Our white paper 6 Step Guide to Apache Cassandra Data Modeling sets out a methodical approach that we use to define a data model for our customers deploying open source Cassandra. You can read more about Data Modeling recommended practices on our support portal.
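The description of columns as name, value, and timestamp triples, with rows free to carry different sets of columns, can be sketched as follows. The `Column` type and `merge_row` function are hypothetical names for illustration, and the rule shown (the newest timestamp wins for each column) is a simplification of how multiple versions of a row are reconciled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    value: object
    timestamp: int  # write time, e.g. microseconds since the epoch


def merge_row(*versions):
    """Merge several versions of a row, keeping the newest value per column."""
    merged = {}
    for version in versions:
        for column in version:
            current = merged.get(column.name)
            if current is None or column.timestamp > current.timestamp:
                merged[column.name] = column
    return merged


row_a = [Column("name", "Ada", 100), Column("email", "ada@example.com", 100)]
row_a_update = [
    Column("email", "ada@new.example.com", 200),
    Column("phone", "555-0100", 200),  # a column added to this row only
]
row_b = [Column("name", "Grace", 150)]  # no email or phone column at all

latest_a = merge_row(row_a, row_a_update)
```

Row A ends up with three columns while row B has one, and neither row stores placeholders for the columns it lacks.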
Apache Cassandra vs DynamoDB
One database with which Cassandra is often compared is AWS DynamoDB. Both Cassandra and DynamoDB offer incredible scale and availability. They both can serve tens of millions of reads and writes and offer a level of resilience in the face of failure. Both technologies share a similar underlying architecture (Dynamo), but that is where the similarities end. They are different in so many ways.
In the blog post “Surveying the Cassandra-compatible database landscape”, Ben Slater, CPO, Instaclustr shares the range of Cassandra-compatible offerings available in the market. Download our white paper Apache Cassandra vs DynamoDB to understand the differences and identify the technology you should adopt for your unique use case.
Configuring and Operating Cassandra
The following are a number of blogs and good references that relate to configuring and operating Apache Cassandra.
Cassandra Broadcast Address
When configuring Cassandra to work in a new environment or with a new application or service, we sometimes find ourselves asking about the difference between broadcast_address and broadcast_rpc_address. Our blog attempts to demystify the Cassandra broadcast address.
Apache Cassandra Compaction Strategies
It is equally important to understand Cassandra Compaction Strategies. While regular compactions are an integral part of any healthy Cassandra cluster, the way that they are configured can vary significantly depending on the way a particular table is being used. Our technical article gives you an in-depth look into Cassandra Compaction Strategies.
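At its core, a compaction merges several sorted SSTables into one and drops superseded versions of each key. The sketch below shows only that merge step, under the assumption that each SSTable is a list of (key, timestamp, value) tuples sorted by key; the `compact` function is hypothetical and does not model any particular strategy. The strategies differ mainly in which SSTables they choose to merge and when.

```python
import heapq


def compact(*sstables):
    """Merge sorted SSTables into one, keeping the newest version of each key.

    Each SSTable is a list of (key, timestamp, value) tuples sorted by key.
    """
    merged = []
    ordered = heapq.merge(*sstables, key=lambda row: (row[0], row[1]))
    for key, timestamp, value in ordered:
        if merged and merged[-1][0] == key:
            # Equal keys arrive in ascending timestamp order, so a later
            # entry for the same key is the newer one and replaces the older.
            merged[-1] = (key, timestamp, value)
        else:
            merged.append((key, timestamp, value))
    return merged


older = [("a", 1, "v1"), ("b", 1, "v1")]
newer = [("b", 2, "v2"), ("c", 2, "v1")]
compacted = compact(older, newer)
```

After the merge, key "b" appears once with its newer value, so a later read touches one file instead of two.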
Apache Cassandra Tombstones
Multi-value data types are a powerful feature of Cassandra. However, some of Cassandra’s behaviour when handling these data types is not always as expected and can cause issues. In particular, there can be hidden surprises when you update the value of a collection type column. Our blog, “Cassandra collections: hidden tombstones and how to avoid them”, digs deeper into this space.
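One such surprise can be modelled in a few lines: replacing the whole value of a non-frozen collection first deletes whatever was there, which writes a tombstone, whereas appending to the collection writes only the new elements. The functions and tuple layout below are a hypothetical, heavily simplified model of that behaviour, not the storage engine's actual format.

```python
def overwrite_set(column, elements, timestamp):
    """Model `SET column = {...}`: old contents are deleted, then cells written."""
    # The deletion marker is placed just before the new cells so that it
    # hides older data without hiding the values written alongside it.
    tombstone = ("TOMBSTONE", column, timestamp - 1)
    cells = [("CELL", column, element, timestamp) for element in sorted(elements)]
    return [tombstone] + cells


def append_to_set(column, elements, timestamp):
    """Model `SET column = column + {...}`: only new cells are written."""
    return [("CELL", column, element, timestamp) for element in sorted(elements)]


def count_tombstones(mutations):
    return sum(1 for mutation in mutations if mutation[0] == "TOMBSTONE")


replaced = overwrite_set("tags", {"a", "b"}, timestamp=10)
appended = append_to_set("tags", {"a", "b"}, timestamp=10)
```

In this model the overwrite produces a tombstone even when the collection was previously empty, which is why workloads that repeatedly rewrite whole collections can accumulate tombstones without ever issuing an explicit delete.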
Apache Cassandra Best Practices
The right deployment strategies and best practices for Apache Cassandra can mean the difference between on-time deployment of applications that scale massively, are always available, and perform blazingly fast, and those that bring your applications to a crawl.
We have an abundance of resources on our support portal to help you with creating your cluster.
Download our white paper on Avoiding the Pitfall and Challenges of Cassandra Implementation to identify common mistakes made when implementing Cassandra for Big Data technology.
Users can access Cassandra through its nodes using Cassandra Query Language (CQL). CQL treats the database (keyspace) as a container of tables. CQL is a typed language and supports a rich set of data types, including native types, collection types, user-defined types, tuple types, and custom types.
Programmers work with CQL either through cqlsh, a command-line prompt, or through separate application language drivers. Read our support article to understand how cqlsh can be used to connect to clusters in Instaclustr, and the blog Consulting Cassandra: Second Contact with the Monolith (CQLSH).
Planning to migrate to Cassandra? You need to keep a few things in mind, including when to consider migration, how to prepare your application, and which migration approaches are available. The presentation by our CPO, Ben Slater, on migrating to Apache Cassandra is a great resource if you are considering migrating your cluster to Cassandra.
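As a rough illustration of what CQL looks like from application code, the sketch below keeps a few statements as Python strings and sends them through any object that exposes an `execute(statement, parameters)` method. The keyspace `demo`, the table `readings`, and the `RecordingSession` stand-in are hypothetical; with the DataStax Python driver, a session obtained from `Cluster([...]).connect()` would be expected to fill the same role, but that is an assumption and is not exercised here.

```python
SCHEMA_STATEMENTS = [
    """
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """,
    """
    CREATE TABLE IF NOT EXISTS demo.readings (
        sensor_id text,
        reading_time timestamp,
        value double,
        tags set<text>,
        PRIMARY KEY ((sensor_id), reading_time)
    ) WITH CLUSTERING ORDER BY (reading_time DESC)
    """,
]

INSERT_READING = (
    "INSERT INTO demo.readings (sensor_id, reading_time, value, tags) "
    "VALUES (%s, %s, %s, %s)"
)


def create_schema(session):
    """Run each schema statement through any object exposing execute()."""
    for statement in SCHEMA_STATEMENTS:
        session.execute(statement)


def record_reading(session, sensor_id, reading_time, value, tags):
    """Insert one row; values are passed separately from the statement text."""
    session.execute(INSERT_READING, (sensor_id, reading_time, value, tags))


class RecordingSession:
    """Stand-in session that records statements instead of contacting a cluster."""

    def __init__(self):
        self.calls = []

    def execute(self, statement, parameters=None):
        # Collapse whitespace so multi-line statements are easy to inspect.
        self.calls.append((" ".join(statement.split()), parameters))


session = RecordingSession()
create_schema(session)
record_reading(session, "sensor-1", "2024-01-01T00:00:00Z", 20.0, {"indoor"})
```

The table definition shows the typed columns, a collection type, and a primary key split into a partition key (`sensor_id`) and a clustering column (`reading_time`).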
Deploying Apache Cassandra
The following are some key resources and information related to deploying Cassandra in the cloud or within your own private data center.
Cassandra and Multi-Data Center Clusters
One of the strongest features of Cassandra is its native support for the concept of multiple logical data centers within a cluster. Multi-data center clusters allow Cassandra to support several different scenarios. At a high level, creating additional data centers in Cassandra is fairly straightforward, but in cross-region and cross-provider scenarios you need to dig deeper. Our CPO, Ben Slater, helps you to learn how Instaclustr has made multi-data center clusters easy.
We conducted benchmarking for multi-data center Apache Spark and Apache Cassandra. The aim of this benchmark study was to compare performance between one-data-center settings, where Spark and Cassandra are collocated, and two-data-center settings, where Spark runs in the second data center.
Cassandra and Spark
Apache Spark originated at UC Berkeley's AMPLab and has been a full-blown Apache project for many years now. Apache Spark is a high-performing engine for large-scale analytics and data processing. While Apache Spark provides advanced analytics capabilities, it requires a fast, distributed backend data store. Apache Cassandra is the most modern, reliable, and scalable choice for that data store. Spark, when fully integrated with the key components of Cassandra, provides the resilience and scale required for big data analytics. Spark supports a rich set of higher-level tools including Spark SQL, MLlib, GraphX, and Spark Streaming.
One of the advantages of deploying Spark with Instaclustr is that it is a collocated data engine—it is right where your operational database resides, with no need for extracting, transforming, and loading into a new environment. Spark and Cassandra clusters are deployed to the same set of machines. Cassandra stores the data; Spark worker nodes are co-located with Cassandra and do the data processing. Spark is a batch-processing system, designed to deal with large amounts of data. When a job arrives, the Spark workers load data into memory, spilling to disk if necessary.
A blog post by our CPO Ben Slater outlines some of the solution patterns where it makes sense to use Spark Streaming alongside Cassandra.
Our tutorial on getting started with Instaclustr Spark and Cassandra is a good starting point to learn how to provision a cluster using Spark, Cassandra, and more.
Since we first released the Cassandra + Spark managed service offering, we have had opportunities to dig deeper into using the Cassandra connector for Spark, both with our own Instametrics application and while assisting customers with developing and troubleshooting. During this process, we’ve learnt a few key lessons about how to get the best out of the Cassandra connector for Spark. Check out the 5 easy tips.
Cassandra on AWS
Cassandra on AWS EBS Infrastructure
Traditionally it was believed that Cassandra and AWS EBS don’t mix. However, with the release of the latest generation EBS-optimized instances, this belief has changed, and we now know people have had success using these nodes to run Cassandra. In his blog post, Ben answers many questions around Cassandra on AWS EBS infrastructure and the Cost of Cassandra on AWS.
AWS Lambda and Cassandra
AWS Lambda is a simple way to execute a small portion of stateless code, on demand, without the need to provision any servers. AWS Lambda is often combined with Amazon API Gateway to provide the front end and execution layer of a REST API. Check out our presentation Cassandra + Lambdascale POC to walk through a POC that combines AWS Lambda, API Gateway, and Instaclustr Apache Cassandra Managed Service to power a simple REST API.
A VPC peering connection is a networking connection between two VPCs that enables you to route traffic between them privately. Instaclustr supports VPC peering as a mechanism for connecting directly to your Instaclustr managed cluster. VPC Peering allows you to access your cluster via private IP and results in a much more secure network setup. View our support page on using VPC Peering.
AWS R4 Instance Type
R4 instances are the next generation of Amazon EC2 Memory Optimized instances. R4 instances are well-suited for memory-intensive, latency-sensitive workloads like business intelligence (BI), data mining and analysis, in-memory databases, distributed web scale in-memory caching, and applications performing real-time processing of unstructured big data. We conducted Cassandra benchmarking of the R4 type against our existing M4 offerings and found significant performance improvements running fairly IO-intensive mixed workloads.
Cassandra on Azure
Apache Cassandra Kubernetes
Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. It has a large and rapidly growing ecosystem and the services, support, and tools are also widely available.
Ben Bromhead, CTO, Instaclustr, in his presentation introduces Cassandra Kubernetes Operator, a Cassandra controller that provides robust, managed Cassandra deployments on Kubernetes.
Cassandra and Multi-Cloud
Today companies want to be capable of being cloud agnostic and not beholden to one single vendor, so this has made multi-cloud deployment highly desirable. But, simply moving to the cloud is hard enough.
Cassandra and Data Layer Technologies
Cassandra is a database technology, providing the data store for an application or solution. While the data storage mechanism forms an incredibly important part of the data layer, there are other relevant technologies that can be integrated and used.
However, Cassandra forms only one part of the data layer, and a range of other core open source technologies can be effectively integrated with it to provide a more complete data layer solution. DBaaS is moving beyond the database itself to include the data layer components that interact with the database, such as integrated data software and related infrastructure.
The Instaclustr Managed Platform provides an integrated data layer with the following complementary open source technologies.
Managed Cassandra Database
Cassandra is the database of choice for scalable, highly available, reliable, and high-performance applications. Instaclustr Managed Service for Apache Cassandra gets you up and running quickly, and is the most reliable way to run Cassandra for your application. We are so confident in the performance of our clusters that we include latency and performance guarantees in our contracted SLAs. You can enjoy our hosted and fully managed Apache Cassandra on AWS, Azure, GCP, IBM cloud, or in your own private data center with 24×7 support.
Our Managed Cassandra comes with add-ons:
Apache Lucene: The Cassandra Lucene Index plugin expands Cassandra’s native secondary index to perform comprehensive search functionality through multivariable, geospatial, and bi-temporal search capabilities. Cassandra Lucene Index resides right where your operational database resides, thus, no need for extracting, transforming, and loading into a new environment.
Cassandra Database as a Service
Exploring Cassandra as a Service? Our white paper “The Unmatchable ROI of Managed Cassandra Service” will take you through the 3 key points you need to consider when deciding between building your own Cassandra competency center or outsourcing to an expert Cassandra service provider.
We have extensive experience in Apache Cassandra Consulting, helping our customers develop and deploy high-performance and continually available solutions.
We offer a wide range of Consulting Service Packages that help you take advantage of our expertise in open source, guided by our team of experts.
Apache Cassandra Support
At Instaclustr, a dedicated team of technology and operational experts delivers support for Apache Cassandra 24×7.
We provide support for all Cassandra database use cases as well as complementary open source technologies across various industries. We have gained a wealth of experience helping new companies to disrupt and mature companies to transform their business.
Watch a short video on Instaclustr support.
Deploy Production Ready Certified Cassandra
Following a certification process across several critical variables, enterprises can build applications with even greater confidence.
The Certification framework provides increased assurance that specific releases of Apache Cassandra have been tested for a range of functional, performance, and integration properties prior to being enabled on the Instaclustr Managed Platform.