Technical Technical — Cassandra Monday 24th August 2020

Apache Cassandra vs DynamoDB

By Lewis DiFelice

In this post, we’ll look at some of the key differences between Apache Cassandra (hereafter just Cassandra) and DynamoDB.

Both are distributed databases and have similar architecture, and both offer incredible scalability, reliability, and resilience. However, there are also differences,  and understanding the differences and cost benefits can help you determine the right solution for your application.

Apache Cassandra is an open source database available at no cost from the Apache Foundation. Installing and configuring Cassandra can be challenging and there is more than one pitfall along the way. However, Cassandra can be installed on any cloud service or at a physical location you choose.

The typical Cassandra installation is a cluster which is a collection of nodes (a node is a single instance of Cassandra installed on a computer or in a Docker container). Nodes can then be grouped in racks and data centers which can be in different locations (cloud zones and regions or physical collocations). You must scale Cassandra as your demand grows and are responsible for the ongoing management tasks such as backups, replacing bad nodes, or adding new nodes to meet demand.

Amazon DynamoDB is a fully managed database as a service. All implementation details are hidden and from the user viewpoint DynamoDB is serverless. DynamoDB automatically scales throughput capacity to meet workload demands, and partitions and repartitions your data as your table size grows, and distributes data across multiple availability zones. However, the service is available only through Amazon Web Services (AWS).

Replica Configuration and Placement

NoSQL data stores like Cassandra and DynamoDB use multiple replicas (copies of data) to ensure high availability and durability. The number of replicas and their placement determines the availability of your data.

With Cassandra, the number of replicas to have per cluster—the replication factor—and their placement is configurable. A cluster can be subdivided into two or more data centers which can be located in different cloud regions or physical collocations. The nodes in a data center can be assigned to different racks that can be assigned to different zones or to different physical racks.

In contrast, with DynamoDB, Amazon makes these decisions for you. By default, data is located in a single region and is replicated to three (3) availability zones in that region. Replication to different AWS regions is available as an option. Amazon streams must be enabled for multi-region replication.

Data Model

The top level data structure in Cassandra is the keyspace which is analogous to a relational database. The keyspace is the container for the tables and it is where you configure the replica count and placement. Keyspaces contain tables (formerly called column families) composed of rows and columns. A table schema must be defined at the time of table creation.

The top level structure for DynamoDB is the table which has the same functionality as the Cassandra table. Rows are items, and cells are attributes. In DynamoDB, it’s possible to define a schema for each item, rather than for the whole table.

Both tables store data in sparse rows—for a given row, they store only the columns present in that row. Each table must have a primary key that uniquely identifies rows or items. Every table must have a primary key which has two components: 

  • A partition key that determines the placement of the data by subsetting the table rows into partitions. This key is required.
  • A key that sorts the rows within a partition. In Cassandra, this is called the clustering key while DynamoDB calls it the sort key. This key is optional.

Taken together, the primary key ensures that each row in a table is unique. 

 Differences 

  • DynamoDB limits the number of tables in an AWS region to 256.  If you need more tables, you must contact AWS support. There are no hard limits in Cassandra. The practical limit is around 500 tables.
  • DynamoDB is schemaless. Only the primary key attributes need to be defined at table creation. 
  • DynamoDB charges for read and write throughput and requires you to manage capacity for each table. Read and write throughput and associated costs must be considered when designing tables and applications. 
  • The maximum size of an item in DynamoDB is 400KB. With Cassandra, the hard limit is 2GB; the practical limit is a few megabytes.
  • In DynamoDB, the primary key can have only one attribute as the primary key and one attribute as the sort key. Cassandra allows composite partition keys and multiple clustering columns.
  • Cassandra supports counter, time, timestamp, uuid, and timeuuid data types not found in DynamoDB.

Allocating Table Capacity

Both Cassandra and DynamoDB require capacity planning before setting up a cluster. However, the approaches are different. 

To create a performant Cassandra cluster, you must first make reasonably accurate estimates of your future workloads. Capacity is allocated by creating a good data model, choosing the right hardware, and properly sizing the cluster. Increasing workloads are met by adding nodes.

With DynamoDB, capacity planning is determined by the type of the read/write capacity modes you choose. On demand capacity requires no capacity planning other than setting an upper limit on each table. You pay only for the read and write requests on the table. Capacity is measured in Read Resource Units and Write Resource Units. On demand mode is best when you have an unknown workload, unpredictable application traffic, or you prefer the ease of paying for only what you use.

With provisioned capacity, you must specify the number of reads and write throughput limits for each table at the time of creation. If you exceed these limits for a table or tables, DynamoDB will throttle queries until usage is below defined capacity. Auto-scaling will adjust your table’s provisioned capacity automatically in response to traffic changes although there is a lag between the time throttling starts and increased capacity is applied.   

The throughput limits are provisioned in units called Read Capacity Units (RCU) and Write Capacity Units (WCU); queries are throttled whenever these limits are exceeded. One read capacity unit represents one strongly consistent read per second, or two eventually consistent reads per second, for an item up to 4 KB in size. Transactional read requests require two read capacity units to perform one read per second for items up to 4 KB. If you need to read an item that is larger than 4 KB, DynamoDB must consume additional read capacity units. One write capacity unit represents one write per second for an item up to 1 KB in size. If you need to write an item that is larger than 1 KB, DynamoDB must consume additional write capacity units. Transactional write requests require 2 write capacity units to perform one write per second for items up to 1 KB. For more information, see Managing Settings on DynamoDB Provisioned Capacity Tables.

Provisioned mode is a good option if any of the following are true: 

  • You have predictable application traffic.
  • Application traffic is consistent or ramps gradually.
  • You can forecast capacity requirements to control costs.

Partitions

Both Cassandra and DynamoDB group and distribute data based on the hashed value of the partition key. Both call these grouping partitions but they have very different definitions.

In Dynamo the partition is a storage unit that has a maximum size of 10 GB. When a partition fills, DynamoDB creates a new partition and the user has no control over the process. A partition can contain items with different partition key values. When a sort key is used, all the items with the same partition key value physically close together, ordered by sort key value.

DynamoDB partitions have capacity limits of 3,000 RCU or 1,000 WCU even for on-demand tables. Furthermore, these limits cannot be increased. If you exceed the partition limits, your queries will be throttled even if you have not exceeded the capacity of the table. See Throttling and Hot Keys (below) for more information.

A Cassandra partition is a set of rows that share the same hashed partition key value.  Rows with the same partition key are stored on the same node. Rows within the partition are sorted by the clustering columns. If no clustering column was specified, the partition holds a single row. While it would not be desirable, it would be possible for an application to drive tens of thousands of reads/writes to a single partition. 

See Cassandra Data Partitioning.

Query Language 

Cassandra provides a SQL-like language called Cassandra Query Language (CQL) to access data. DynamoDB uses JSON syntax. The following table shows the syntax for the query “return all information from the Music table for the song title ‘Lullaby of Broadway’ and the artist ‘Tommy Dorsey’”

CQLDynamoDB
Request all information for the song  ‘Lullaby of Broadway‘ played by Tommy Dorsey

SELECT *

FROM Music

WHERE Artist=’Tommy Dorsey’ AND SongTitle = ‘Lullaby of Broadway

get-item {

    TableName: “Music”,

    Key: {

        “Artist”: “Tommy Dorsey”,

        “SongTitle”: “Lullaby of Broadway

    }

} 


Secondary Indexes

By default, Cassandra and DynamoDB queries can use only the primary key columns in the search condition which must include all partition key columns. Non-key columns can be used in a search by creating an index on that column.

Cassandra supports creating an index on most columns including a clustering column of a compound primary key or on the partition key itself. Creating an index on a collection or the key of a collection map is also supported. However, when used incorrectly a secondary index can hurt performance. A general rule of thumb is to index a column with low cardinality of few values and to use only with the partition key in the search clause. Because the index table is stored on each node in a cluster, a query using a secondary index can degrade performance if multiple nodes are accessed. 

DynamoDB has local secondary indexes. This index uses the same partition key as the base table but has a different sort key. Scoped to the base table partition that has the same partition key value. Local secondary indexes must be created at the same time the table is created. A maximum of 5 local secondary indexes may be created per table. 

Materialized Views versus Global Secondary Indexes 

In Cassandra, a Materialized View (MV) is a table built from the results of a query from another table but with a new primary key and new properties. Queries are optimized by the primary key definition. The purpose of a materialized view is to provide multiple queries for a single table. It is an alternative to the standard practice of creating a new table with the same data if a different query is needed. Data in the materialized view is updated automatically by changes to the source table. However, the materialized view is an experimental feature and should not be used in production.

A similar object is DynamoDB is the Global Secondary Index (GSI) which creates an eventually consistent replica of a table. The GSI are created at table creation time and each table has a limit of 20. The GSI must be provisioned for reads and writes; there are storage costs as well.

Time To Live

Time To Live (TTL) is a feature that automatically removes items from your table after a period of time has elapsed.

  • Cassandra specifies TTL as the number of seconds from the time of creating or updating a row, after which the row expires.
  • In DynamoDB, TTL is a timestamp value representing the date and time at which the item expires.
  • DynamoDB applies TTL at item level. Cassandra applies it to the column.

Consistency

Both Cassandra and DynamoDB are distributed data stores.  In a distributed system there is a tradeoff between consistency—every read receives the most recent write or an error, and availability—every request receives a (non-error) response but without the guarantee that it contains the most recent write. In such a system there are two levels possible levels of consistency:

  • Eventual consistency. This implies that all updates reach all replicas eventually. A read with eventual consistency may return stale data until all replicas are reconciled to a consistent state.
  • Strong consistency returns up-to-date data for all prior successful writes but at the cost of slower response time and decreased availability.

DynamoDB supports eventually consistent and strongly consistent reads on a per query basis. The default is eventual consistency. How it is done is hidden from the user.

 Strongly consistent reads in DynamoDB have the following issues: 

  • The data might not be available if there is a network delay or outage.
  • The operation may have higher latency.
  • Strongly consistent reads are not supported on global secondary indexes.
  • Strongly consistent reads use more throughput capacity than eventually consistent reads and therefore is more expensive. See Throttling and Hot Keys (below).

Cassandra offers tunable consistency for any given read or write operation that is configurable by adjusting consistency levels. The consistency level is defined as the minimum number of Cassandra nodes that must acknowledge a read or write operation before the operation can be considered successful. You are able to configure strong consistency for both reads and writes at a tradeoff of increasing latency.

Conflicts Resolution

In a distributed system, there is the possibility that a query may return inconsistent data from the replicas. Both Cassandra and DynamoDB resolve any inconsistencies with a “last write wins” solution but with Cassandra, every time a piece of data is written to the cluster, a timestamp is attached. Then, when Cassandra has to deal with conflicting data, it simply chooses the data with the most recent timestamp.

For DynamoDB, “last write wins” applies only to global tables and strongly consistent reads.

Security Features

Cassandra and DynamoDB provide methods for user authentication and authorization and data access permissions  Both use encryption for client and inter-node communication. DynamoDB also offers encryption at rest. Instaclustr offers encryption at rest for its Cassandra Managed Services.

Performance Issues

Consistency and Read Speed

Choosing strong consistency requires more nodes to respond to a request which increases latency.

Scans

Scans are expensive for both systems. Scans in Cassandra are slow because Cassandra has to scan all nodes in the cluster. Scans are faster in DynamoDB but are expensive because resource use is based on the amount of data read not returned to the client. If the scan exceeds your provisioned read capacity, DynamoDB will generate errors.

DynamoDB’s Issues

Auto Scaling

Amazon DynamoDB auto scaling uses the AWS Application Auto Scaling Service to dynamically adjust provisioned throughput capacity on your behalf, in response to actual traffic patterns. This enables a table or a global secondary index to increase its provisioned read and write capacity to handle sudden increases in traffic, without throttling. When the workload decreases, application auto scaling decreases the throughput so that you don’t pay for unused provisioned capacity.

  • Auto scaling does not work well with varying and bursty workloads. The table will scale up only based on consumption, triggering these alarms time after time until it reaches the desired level.
  • There can be a lag (5-15 minutes) between the time capacity is exceeded and autoscaling takes effect.
  • Capacity decreases are limited. The maximum is 27 per day (A day is defined according to the GMT time zone).
  • Tables scale based on the number of requests made, not the number of successful requests.
  • Autoscaling cannot exceed hard I/O limits for tables.

Throttling and Hot Keys

In DynamoDB, the provisioned capacity of a table is shared evenly across all the partitions in a table with the consequence that each partition has a capacity less than the table capacity. For example, a table that has been provisioned with 400 WCU and 100 RCU and had 4 partitions, each partition would have a write capacity of 100 WCU and 25 RCU.  

So if you had a query send most read/write requests to a single partition—a hot partition— and that throughput exceeded the shared capacity of that partition, your queries would be throttled even though you had unused capacity in the table. If your application creates a hot partition, your only recourse would be either to redesign the data model or to over allocate capacity to the table.

Adaptive capacity can provide up to 5 minutes of grace time by allocating unused capacity from other partitions to the “hot” one provided unused capacity is available and hard limits are not reached. The hard limits on a partition are 3,000 RCU or 1,000 WCU.

Cross-Region Replication

If you want a DynamoDB table to span regions, you’ll need to use global tables that require careful planning and implementation. The tables in all regions must use on-demand allocation or have auto scaling enabled. The tables in each region must be the same: time to live, the number of global, and local secondary indexes, and so on.

Global tables support only eventual consistency that can lead to data inconsistency. In such a conflict, DynamoDB applies a last writer wins (lww) policy. The data with the most recent timestamp is used.

Migrating an existing local table to a global table requires additional steps. First, DynamoDB Streams is enabled on the original local table to create a changelog. Then an AWS Lambda function is configured to copy the corresponding change from the original table to the new global table.

Cost Explosion 

As highlighted in Managed Cassandra Versus DynamoDB, DynamoDB’s pricing model can easily make it expensive for a fast growing company:

  • The pricing model is based on throughput, therefore costs increase as IO increases. 
  • A hot partition may require you to overprovision a table.
  • Small reads and writes are penalized because measured throughput is rounded up to the nearest 1 KB boundary for writes and 4 KB boundary for reads.
  • Writes are four times more expensive than reads.
  • Strongly consistent reads are twice the cost of eventually consistent reads. 
  • Workloads performing scans or queries can be costly because the read capacity units are calculated on the number of bytes read rather than the amount of data returned.
  • Read heavy workloads may require DynamoDB Accelerator (DAX) which is priced per node-hour consumed. A partial node-hour consumed is billed as a full hour.
  • Cross region replication incurs additional charges for the amount of data replicated by the Global Secondary Index and the data read from DynamoDB Streams.
  • DynamoDB does not distinguish between a customer-facing, production table versus tables used for development, testing, and staging.

Summary 

DynamoDB’s advantages are an easy start; absence of the database management burden; sufficient flexibility, availability, and auto scaling; in-built metrics for monitoring; encryption of data at rest.

Cassandra’s main advantages are: fast speed of writes and reads; constant availability; a SQL-like Cassandra Query Language instead of DynamoDB’s complex API; reliable cross data center replication; linear scalability and high performance. 

However, the mere technical details of the two databases shouldn’t be the only aspect to analyze before making a choice.

Cassandra versus DynamoDB costs must be compared carefully. If you have an application that mostly writes or one that relies on strong read consistency, Cassandra may be a better choice.

You will also need to know which technologies you need to supplement the database? If you need the open source like ElasticSearch, Apache Spark, or Apache Kafka, Cassandra is your choice. If you plan to make extensive use of AWS products, then it’s DynamoDB.

If you use a cloud provider other than AWS or run your own data centers, then you would need to choose Cassandra.