In this post, we’ll look at some of the key differences between Apache Cassandra (hereafter just Cassandra) and DynamoDB.
Both are distributed databases and have similar architecture, and both offer incredible scalability, reliability, and resilience. However, there are also differences, and understanding the differences and cost benefits can help you determine the right solution for your application.
Apache Cassandra is an open source database available at no cost from The Apache Foundation. Installing and configuring Cassandra can be challenging and there is more than one pitfall along the way. However, Cassandra can be installed on any cloud service or at a physical location you choose.
The typical Cassandra installation is a cluster which is a collection of nodes (a node is a single instance of Cassandra installed on a computer or in a Docker container). Nodes can then be grouped in racks and data centers which can be in different locations (cloud zones and regions or physical collocations). You must scale Cassandra as your demand grows and are responsible for the ongoing management tasks such as backups, replacing bad nodes, or adding new nodes to meet demand.
Amazon DynamoDB is a fully managed database as a service. All implementation details are hidden and from the user viewpoint, DynamoDB is serverless. DynamoDB automatically scales throughput capacity to meet workload demands, and partitions and repartitions your data as your table size grows and distributes data across multiple availability zones. However, the service is available only through Amazon Web Services (AWS).
Replica Configuration and Placement
NoSQL data stores like Cassandra and DynamoDB use multiple replicas (copies of data) to ensure high availability and durability. The number of replicas and their placement determines the availability of your data.
With Cassandra, the number of replicas to have per cluster—the replication factor—and their placement is configurable. A cluster can be subdivided into two or more data centers which can be located in different cloud regions or physical collocations. The nodes in a data center can be assigned to different racks that can be assigned to different zones or to different physical racks.
In contrast, with DynamoDB, Amazon makes these decisions for you. By default, data is located in a single region and is replicated to three (3) availability zones in that region. Replication to different AWS regions is available as an option with Global Tables. DynamoDB Streams must be enabled for multi-region replication.
The top level data structure in Cassandra is the keyspace which is analogous to a relational database. The keyspace is the container for the tables and it is where you configure the replica count and placement. Keyspaces contain tables (formerly called column families) composed of rows and columns. A table schema must be defined at the time of table creation.
The top level structure for DynamoDB is the table which has the same functionality as the Cassandra table. Rows are items, and cells are attributes. In DynamoDB, it’s possible to define a schema for each item, rather than for the whole table.
Both tables store data in sparse rows—for a given row, they store only the columns present in that row. Each table must have a primary key that uniquely identifies rows or items. Every table must have a primary key which has two components:
- A partition key that determines the placement of the data by subsetting the table rows into partitions. This key is required.
- A key that sorts the rows within a partition. In Cassandra, this is called the clustering key while DynamoDB calls it the sort key. This key is optional.
Taken together, the primary key ensures that each row in a table is unique. However, neither enforces unique constraints. Inserting a row with a duplicate primary key overwrites the previous row.
- DynamoDB limits the number of tables in an AWS region to 256. If you need more tables, you must contact AWS support. There are no hard limits in Cassandra, but the practical limit is around 500 tables.
- DynamoDB is schemaless. Only the primary key attributes need to be defined at table creation.
- DynamoDB charges for read and write throughput and requires you to manage capacity for each table. Read and write throughput and associated costs must be considered when designing tables and applications.
- The maximum size of an item in DynamoDB is 400KB. With Cassandra, the hard limit is 2GB; the practical limit is a few megabytes.
- In DynamoDB, the primary key can have only one attribute as the primary key and one attribute as the sort key. Cassandra allows composite partition keys and multiple clustering columns.
- Cassandra supports counter, time, timestamp, uuid, and timeuuid data types not found in DynamoDB.
Allocating Table Capacity
Both Cassandra and DynamoDB require capacity planning before setting up a cluster. However, the approaches are different.
To create a performant Cassandra cluster, you must first make reasonably accurate estimates of your future workloads. Capacity is allocated by creating a good data model, choosing the right hardware, and properly sizing the cluster. Increasing workloads are met by adding nodes.
With DynamoDB, capacity planning is determined by the type of read/write capacity modes you choose. On demand capacity requires no capacity planning other than setting an upper limit on each table. You pay only for the read and write requests on the table. Capacity is measured in Read Resource Units and Write Resource Units. On demand mode is best when you have an unknown workload, unpredictable application traffic, or you prefer the ease of paying for only what you use.
With provisioned capacity, you must specify the number of reads and write throughput limits for each table at the time of creation. If you exceed these limits for a table or tables, DynamoDB will throttle queries until usage is below defined capacity. Auto-scaling will adjust your table’s provisioned capacity automatically in response to traffic changes although there is a lag between the time throttling starts and increased capacity is applied.
The throughput limits are provisioned in units called Read Capacity Units (RCU) and Write Capacity Units (WCU); queries are throttled whenever these limits are exceeded. One read capacity unit represents one strongly consistent read per second, or two eventually consistent reads per second, for an item up to 4 KB in size. Transactional read requests require two read capacity units to perform one read per second for items up to 4 KB. If you need to read an item that is larger than 4 KB, DynamoDB must consume additional read capacity units. One write capacity unit represents one write per second for an item up to 1 KB in size. If you need to write an item that is larger than 1 KB, DynamoDB must consume additional write capacity units. Transactional write requests require 2 write capacity units to perform one write per second for items up to 1 KB. For more information, see Managing Settings on DynamoDB Provisioned Capacity Tables.
Provisioned mode is a good option if any of the following are true:
- You have predictable application traffic.
- Application traffic is consistent or ramps gradually.
- You can forecast capacity requirements to control costs.
Both Cassandra and DynamoDB group and distribute data based on the hashed value of the partition key. Both call these grouping partitions but they have very different definitions.
In Dynamo the partition is a storage unit that has a maximum size of 10 GB. When a partition fills, DynamoDB creates a new partition and the user has no control over the process. A partition can contain items with different partition key values. When a sort key is used, all the items with the same partition key value physically close together, ordered by sort key value.
DynamoDB partitions have capacity limits of 3,000 RCU or 1,000 WCU even for on-demand tables. Furthermore, these limits cannot be increased. If you exceed the partition limits, your queries will be throttled even if you have not exceeded the capacity of the table. See Throttling and Hot Keys (below) for more information.
A Cassandra partition is a set of rows that share the same hashed partition key value. Rows with the same partition key are stored on the same node. Rows within the partition are sorted by the clustering columns. If no clustering column was specified, the partition holds a single row. While it would not be desirable, it would be possible for an application to drive tens of thousands of reads/writes to a single partition.
See Cassandra Data Partitioning.
Cassandra provides a SQL-like language called Cassandra Query Language (CQL) to access data. DynamoDB uses JSON syntax. The following table shows the syntax for the query “return all information from the Music table for the song title ‘Lullaby of Broadway’ and the artist ‘Tommy Dorsey’”
|Request all information for the song ‘Lullaby of Broadway‘ played by Tommy Dorsey|
WHERE Artist=’Tommy Dorsey’ AND SongTitle = ‘Lullaby of Broadway‘
“Artist”: “Tommy Dorsey”,
“SongTitle”: “Lullaby of Broadway“
By default, Cassandra and DynamoDB queries can use only the primary key columns in the search condition which must include all partition key columns. Non-key columns can be used in a search by creating an index on that column.
Cassandra supports creating an index on most columns including a clustering column of a compound primary key or on the partition key itself. Creating an index on a collection or the key of a collection map is also supported. However, when used incorrectly a secondary index can hurt performance. A general rule of thumb is to index a column with low cardinality of few values and to use only with the partition key in the search clause. Because the index table is stored on each node in a cluster, a query using a secondary index can degrade performance if multiple nodes are accessed.
DynamoDB has local secondary indexes. This index uses the same partition key as the base table but has a different sort key. Scoped to the base table partition that has the same partition key value. Local secondary indexes must be created at the same time the table is created. A maximum of 5 local secondary indexes may be created per table.
Materialized Views versus Global Secondary Indexes
In Cassandra, a Materialized View (MV) is a table built from the results of a query from another table but with a new primary key and new properties. Queries are optimized by the primary key definition. The purpose of a materialized view is to provide multiple queries for a single table. It is an alternative to the standard practice of creating a new table with the same data if a different query is needed. Data in the materialized view is updated automatically by changes to the source table. However, the materialized view is an experimental feature and should not be used in production.
A similar object is DynamoDB is the Global Secondary Index (GSI) which creates an eventually consistent replica of a table. The GSI are created at table creation time and each table has a limit of 20. The GSI must be provisioned for reads and writes; there are storage costs as well.
Time To Live
Time To Live (TTL) is a feature that automatically removes items from your table after a period of time has elapsed.
- Cassandra specifies TTL as the number of seconds from the time of creating or updating a row, after which the row expires.
- In DynamoDB, TTL is a timestamp value representing the date and time at which the item expires.
- DynamoDB applies TTL at item level. Cassandra applies it to the column.