By Ben Slater Friday 27th May 2016

Managed Cassandra versus DynamoDB


How does Cassandra compare to DynamoDB?

A common question we hear from potential customers is “How does Cassandra compare to DynamoDB?” We’re big fans of AWS (in fact, most of our business runs on EC2) and certainly believe DynamoDB is a great solution for some use cases. Of course, we also believe that Cassandra is an outstanding solution in a lot of cases.

Differences between Cassandra and DynamoDB

Generally, the datastore you choose depends on the problem you are trying to solve. Both Cassandra and DynamoDB offer incredible scale and availability. Both can serve tens of millions of reads and writes, both offer a level of resilience in the face of failure, and both share a similar underlying architecture (Dynamo), but that is where the similarities end.

There are a number of differences in terms of:

  • Data Structure
  • Resource Allocation
  • Partitioning Model
  • Cost benefits
  • Other Differentiating Factors

Let us look at these in detail:

Data Structure of Cassandra vs DynamoDB

Cassandra and DynamoDB differ in several tangible ways when it comes to data structure:

  • Cassandra is implemented as a wide column store (you can loosely think of it as a key -> key -> value store) and DynamoDB is a pure key-value store.
  • Cassandra’s data model makes it simple to cluster related values within a partition key, making time-series data models a cinch. DynamoDB makes time-series datasets slightly more challenging. A global secondary index together with downsampled timestamps is one possible solution with DynamoDB.
  • DynamoDB is generally simpler from an operations perspective, though with managed service providers like Instaclustr, Cassandra can be just as simple to manage.
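
One loose way to picture the "key -> key -> value" description is a nested map: the outer key plays the role of the partition key and the inner key is a clustering key kept in sorted order. The Python sketch below is only an analogy under that assumption. The class `WideColumnTable` and the sample keys are hypothetical and say nothing about how Cassandra stores data internally. It shows why reading a time range inside one partition is straightforward.

```python
from bisect import bisect_left, bisect_right, insort
from collections import defaultdict


class WideColumnTable:
    """Toy model: partition key -> sorted clustering key -> value."""

    def __init__(self):
        self._keys = defaultdict(list)  # partition key -> sorted clustering keys
        self._values = {}               # (partition key, clustering key) -> value

    def put(self, partition_key, clustering_key, value):
        # Keep clustering keys sorted so range reads are a simple slice.
        if (partition_key, clustering_key) not in self._values:
            insort(self._keys[partition_key], clustering_key)
        self._values[(partition_key, clustering_key)] = value

    def range(self, partition_key, start, end):
        """Return (clustering key, value) pairs with start <= key <= end, in order."""
        keys = self._keys.get(partition_key, [])
        lo = bisect_left(keys, start)
        hi = bisect_right(keys, end)
        return [(k, self._values[(partition_key, k)]) for k in keys[lo:hi]]


table = WideColumnTable()
for ts, cpu in [(1000, 0.41), (1060, 0.47), (1120, 0.52), (1180, 0.39)]:
    table.put("node-1", ts, cpu)
table.put("node-2", 1060, 0.10)

# All readings for node-1 between two timestamps, read from a single partition.
window = table.range("node-1", 1060, 1120)
print(window)
```

In a pure key-value model the same range read would need either one lookup per timestamp or a composite key scheme designed up front.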

Resource Allocation and Partitioning Model

Cassandra runs on instances/machines and can be configured to take advantage of native and OS-level page caching, meaning hot partitions will generally be served from memory. If needed, the entire resources of a single machine (and its replicas) can be dedicated to serving a single partition (we would argue you would likely want to revisit your data model first, but it can be done).

DynamoDB, on the other hand, allocates partitions according to roughly {# of Partitions = Max(# of Partitions (for size), # of Partitions (for throughput))}, with each partition limited to roughly 5000 writes/reads per second for a given 10GB shard. This means that hot partitions are limited in both size and throughput, whereas with Cassandra they are generally limited purely on a size basis. Read this blog post by segment.io, who spent a long time troubleshooting a hot partition issue.
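
The partition rule above can be sketched in a few lines of Python. This is a rough illustration only: the two constants are the approximate figures quoted in this article (10GB and about 5000 operations per second per partition), not an official AWS formula, and the function names are hypothetical.

```python
import math

# Assumed limits, taken from the rough figures quoted in the text.
MAX_GB_PER_PARTITION = 10
MAX_OPS_PER_PARTITION = 5000


def estimate_partitions(table_size_gb, provisioned_ops_per_sec):
    """Partitions = max(partitions needed for size, partitions needed for throughput)."""
    by_size = math.ceil(table_size_gb / MAX_GB_PER_PARTITION)
    by_throughput = math.ceil(provisioned_ops_per_sec / MAX_OPS_PER_PARTITION)
    return max(by_size, by_throughput, 1)


def ops_per_partition(table_size_gb, provisioned_ops_per_sec):
    """Provisioned throughput is spread evenly, so each partition gets a share."""
    partitions = estimate_partitions(table_size_gb, provisioned_ops_per_sec)
    return provisioned_ops_per_sec / partitions


# A 500 GB table provisioned for 10,000 ops/sec is sized by storage, not throughput.
print(estimate_partitions(500, 10_000))
print(ops_per_partition(500, 10_000))
```

Under these assumptions the 500 GB table ends up with 50 partitions, so a single hot key can only draw on 200 operations per second even though the table as a whole is provisioned for 10,000.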

This can create all sorts of headaches with your most frequently used keys, and hot partitions/keys generally correlate with your largest, heaviest and most important end customers. Not a great situation to get into.

Understanding Cost Benefits of Cassandra and DynamoDB

A good way to contrast the strengths and weaknesses of both solutions, and to understand the cost benefits, is to look at a specific use case such as our own Instametrics capability, which we discussed in detail on our Instaclustr blog. To summarize, Instametrics allows us to store and analyze monitoring data from the 1500+ Apache Cassandra nodes that we manage.

The key stats are:

  • 12 x m4.xl-balanced (800GB) nodes
  • Replication Factor 3
  • > 40,000 writes/sec (24x7 consistent load)
  • ~500 reads/sec consistent load (peaks at 1-2k reads/sec) at consistency level ONE
  • Small data per read/write

This is our current running load – the cluster is pretty well utilised but we believe we can still push it a bit harder and in particular we plan to increase the levels of read operations. 

For this use case, we compare the cost structures, which ultimately affect the business bottom line.

The Instaclustr price is very straightforward to calculate: 12 m3.xl-balanced nodes at $727 each per month on-demand gives a total, all-inclusive cost of $8,724 per month. As you scale out, particularly if running in your own account, you can expect to see the average cost per node drop.

 

DynamoDB costs would consist of a number of components:

  • Network at $0.09 / GB network out (free in): 500 small reads/sec adds up to something like 12GB per month and you get one GB free. So, for our use case this comes to a whopping $0.99 / month. However, our use case currently has an extremely low level of reads, and a higher level of reads could quickly make this a significant cost.
  • Read Throughput at $0.00013 per hour for every unit of Read Capacity. According to AWS, this corresponds to 7,200 reads per hour (eventual consistency), so that’s 2 reads/sec, and our 500 reads/sec need 250 units. Monthly read capacity cost (ignoring bursts) would therefore be 250 x 0.00013 x 720 (hours in the month) = $23.40.
  • Write Throughput at $0.00065 per hour for every unit of Write Capacity. According to AWS, this corresponds to 3,600 writes per hour, so that’s 1 write/sec. Monthly write capacity cost (ignoring bursts) would therefore be 40000 x 0.00065 x 720 = $18,720.
  • Storage at $0.25 / GB: this includes replication across 3 availability zones (irrespective of whether you can actually get 3 AZs in a region on EC2). To run Cassandra, you either need to leverage EBS on EC2 or use ephemeral storage. A Cassandra node can generally be run to about 70% full, so 70% x (12 nodes / replication factor 3) x 800GB = 2,240 GB. Subtract the 25 free GB and multiply by $0.25 / GB, which gives you $553/month storage cost.

The total on-demand DynamoDB cost for our use case would, therefore, be $19,298 – quite a difference! 
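
The arithmetic behind these totals can be reproduced with a short Python sketch. All prices and workload figures are the ones quoted in this article (2016 on-demand numbers, so check current AWS pricing before reusing them); the variable names are illustrative only.

```python
HOURS_PER_MONTH = 720

# Workload figures from the Instametrics use case.
reads_per_sec = 500
writes_per_sec = 40_000
network_out_gb = 12
stored_gb = 0.70 * (12 / 3) * 800    # 70% full, 12 nodes, RF 3, 800 GB per node

# Prices as quoted in the text.
network = (network_out_gb - 1) * 0.09                  # first GB out is free
read_units = reads_per_sec / 2                         # 7,200 reads/hour = 2 reads/sec per unit
reads = read_units * 0.00013 * HOURS_PER_MONTH
writes = writes_per_sec * 0.00065 * HOURS_PER_MONTH    # 3,600 writes/hour = 1 write/sec per unit
storage = (stored_gb - 25) * 0.25                      # 25 GB free tier

dynamodb_total = network + reads + writes + storage
instaclustr_total = 12 * 727

print(f"DynamoDB:    ${dynamodb_total:,.2f} per month")
print(f"Instaclustr: ${instaclustr_total:,.2f} per month")
```

Write throughput dominates the DynamoDB figure, which is why a write-heavy, constant workload like this one favours a fixed-size cluster.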

Of course, this reflects the specific use case we have with Instametrics – different balances of read, writes and storage space can have quite different results. However, for most examples, we find that Managed Cassandra will be cheaper than DynamoDB for any significant, consistent workload level. 

So, we’ve shown that for this use case, there are significant savings to be had running Cassandra vs DynamoDB.

The other cost to consider with DynamoDB is support. The support included in the Instaclustr price for a cluster like Instametrics is fairly equivalent to AWS Enterprise-level support. The minimum charge for this level of support from AWS is $15,000 for less than $150k of total monthly usage, then 7% of monthly usage costs from $150k to $500k, going down from there. Let’s assume you’re a big, but not a massive, customer and call this 7%, taking our total AWS bill up to $20,649.
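
As a quick check of that figure, the sketch below applies the assumed flat 7% support rate to the DynamoDB total from the use case and compares the result with the Instaclustr price. The constant names are illustrative and the 7% rate is the simplifying assumption made in the text, not a quote of AWS's support pricing tiers.

```python
DYNAMODB_MONTHLY = 19_298     # on-demand DynamoDB total from the use case
INSTACLUSTR_MONTHLY = 8_724   # 12 nodes at $727 each
SUPPORT_RATE = 0.07           # assumed flat 7% support charge

aws_total = DYNAMODB_MONTHLY * (1 + SUPPORT_RATE)
ratio = aws_total / INSTACLUSTR_MONTHLY

print(f"AWS total with support: ${aws_total:,.0f}")
print(f"Ratio to Instaclustr:   {ratio:.1f}x")
```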

Additionally, DynamoDB and Cassandra have different cost profiles, especially when you limit Cassandra to just running on AWS. The use case above shows the cost difference you would enjoy with Cassandra. As already mentioned, to run Cassandra you either need to leverage EBS on EC2 or use ephemeral storage. For EBS you need 3 instances in different AZs, resulting in a replicated storage cost of 30 cents (3 x 10c) per GB per month for gp2-based storage; for ephemeral storage such as i3s you end up with a replicated storage cost of 71c per GB per month! Mind you, EBS comes with a “free” 3 IOPS per GB, not including compute/memory, and ephemeral storage comes with a “free” 6 reads per second per GB of storage, including compute/memory.

Given the baseline requirement within Cassandra to have a certain amount of compute/memory attached to storage, there is a floor below which it stops being cost-effective: under that level you are paying for unused compute/memory. With DynamoDB you could have 500TB with only 10 provisioned WCU (write capacity units) and RCU (read capacity units) and serve traffic across the entire dataset. This makes Cassandra more cost-effective than DynamoDB only when you have a read/write workload per gigabyte above a certain threshold.

Other Differentiating Factors between Cassandra and DynamoDB 

There is a range of functional differences between the two technologies apart from the ones we have already discussed above. Some of the main differentiators are:

  • Open Source vs Proprietary  

Apache Cassandra is fully open source so you know you can always bring management in house or move to a different managed service provider, but with DynamoDB, you’re 100% locked into AWS.

  • Latency

Cassandra typically provides significantly lower latency than DynamoDB.

  • Data Querying

With Cassandra you can query your data with a SQL-like language (CQL) rather than a proprietary API. This means a lower learning curve and more readable code in many circumstances.
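
To give a feel for the difference, the sketch below writes the same read twice: once as a CQL statement and once as the parameters of a DynamoDB Query request. The `metrics` table, its `host`, `ts` and `cpu` attributes and the sample values are hypothetical, and nothing is sent to either database; the snippet only builds the two request shapes side by side.

```python
# Hypothetical schema: a "metrics" table keyed by host (partition key)
# and ts (clustering key in Cassandra, sort key in DynamoDB).
cql_query = (
    "SELECT ts, cpu FROM metrics "
    "WHERE host = 'node-1' AND ts >= 1060 AND ts <= 1120;"
)

# The same read expressed as DynamoDB Query request parameters.
# Placeholders (#name, :value) keep attribute names and values out of the expression text.
dynamodb_query = {
    "TableName": "metrics",
    "KeyConditionExpression": "#h = :host AND #t BETWEEN :start AND :end",
    "ProjectionExpression": "#t, #c",
    "ExpressionAttributeNames": {"#h": "host", "#t": "ts", "#c": "cpu"},
    "ExpressionAttributeValues": {
        ":host": {"S": "node-1"},
        ":start": {"N": "1060"},
        ":end": {"N": "1120"},
    },
}

print(cql_query)
print(dynamodb_query)
```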

  • Data Modelling

Cassandra provides more sophisticated data modelling options such as user-defined types, JSON support and (in the latest versions) materialized views.

  • Analytic Workloads

Cassandra provides a native ability to isolate analytic workloads (e.g. Spark) from OLTP while transparently maintaining data replication.

  • Serialization Process

Cassandra uses a binary protocol to communicate with the database, making the serialisation process more efficient compared to DynamoDB’s JSON/HTTP interface. With DynamoDB, the user has to marshal data to and from JSON, which impacts performance.
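
The size side of that trade-off can be illustrated with Python's standard library. This is a sketch under stated assumptions: the JSON form mimics DynamoDB's typed attribute encoding, while the `struct` layout is just a generic fixed binary layout chosen for the example and is not Cassandra's actual native protocol. The `reading` record is hypothetical.

```python
import json
import struct

reading = {"host_id": 42, "ts": 1464307200, "cpu": 0.47}

# DynamoDB-style JSON: each attribute is wrapped in a type tag and numbers travel as strings.
as_json = json.dumps({
    "host_id": {"N": str(reading["host_id"])},
    "ts": {"N": str(reading["ts"])},
    "cpu": {"N": str(reading["cpu"])},
}).encode("utf-8")

# Illustrative binary layout: big-endian 4-byte int, 8-byte int, 8-byte double.
as_binary = struct.pack(">iqd", reading["host_id"], reading["ts"], reading["cpu"])

print(len(as_json), "bytes as JSON")
print(len(as_binary), "bytes as binary")

# Decoding the binary form needs no text parsing or string-to-number conversion.
host_id, ts, cpu = struct.unpack(">iqd", as_binary)
```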

  • Time-To-Live

Cassandra offers finer control as it supports Time-to-live (TTL) on columns, which makes it possible to expire only certain fields of an item. DynamoDB, by contrast, supports TTL only at the item level, which means that when the TTL expires the whole item is deleted.
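
The difference between the two TTL granularities can be shown with a small in-memory model. This is a toy sketch, not either database's implementation: the `Row` class, the `read_item` helper and the column names are hypothetical, and time is passed in explicitly as a number of seconds to keep the example deterministic.

```python
class Row:
    """Toy row where each column carries its own optional expiry time."""

    def __init__(self):
        self._cells = {}  # column -> (value, expires_at or None)

    def set(self, column, value, now, ttl=None):
        expires_at = None if ttl is None else now + ttl
        self._cells[column] = (value, expires_at)

    def read(self, now):
        # Column-level TTL: expired columns vanish, the rest of the row stays.
        return {
            column: value
            for column, (value, expires_at) in self._cells.items()
            if expires_at is None or now < expires_at
        }


def read_item(item, expires_at, now):
    """Item-level TTL: the whole item is either visible or gone."""
    return dict(item) if now < expires_at else None


row = Row()
row.set("email", "a@example.com", now=0)
row.set("session_token", "abc123", now=0, ttl=60)

before = row.read(now=30)   # both columns still present
after = row.read(now=61)    # only the column without a TTL remains

item = {"email": "a@example.com", "session_token": "abc123"}
item_after = read_item(item, expires_at=60, now=61)  # whole item has expired
```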

  • Data Partitioning 

DynamoDB doesn’t provide an easy way to bulk-load data (it is possible through Data Pipeline) and this has some unfortunate consequences. Since you need to use the regular service APIs to update existing rows or create new ones, it is common to temporarily turn up a destination table’s write throughput to speed up the import. But when the table’s write capacity is increased, DynamoDB may do an irreversible split of the partitions underlying the table, spreading the total table capacity evenly across the new generation of partitions. Later, if the capacity is reduced, the capacity for each partition is also reduced but the total number of partitions is not, leaving less capacity for each partition. This leaves the table in a state where it is much easier for hotspots to overwhelm individual partitions.
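
The dilution effect described here can be modelled in a few lines. The sketch assumes the rough 5000 operations per second per partition figure quoted earlier in this article and ignores size-based splits; the `Table` class is a hypothetical model, not DynamoDB's actual partitioning logic.

```python
import math

MAX_OPS_PER_PARTITION = 5000  # assumed per-partition ceiling (rough figure from the text)


class Table:
    """Toy model of a table whose partitions split on scale-up and never merge."""

    def __init__(self, provisioned_ops):
        self.provisioned_ops = provisioned_ops
        self.partitions = self._needed(provisioned_ops)

    @staticmethod
    def _needed(provisioned_ops):
        return max(1, math.ceil(provisioned_ops / MAX_OPS_PER_PARTITION))

    def set_capacity(self, provisioned_ops):
        self.provisioned_ops = provisioned_ops
        # Splits are irreversible: the partition count can grow but never shrinks.
        self.partitions = max(self.partitions, self._needed(provisioned_ops))

    def ops_per_partition(self):
        return self.provisioned_ops / self.partitions


table = Table(provisioned_ops=5_000)
initial = table.ops_per_partition()      # one partition with the full 5,000 ops/sec

table.set_capacity(50_000)               # temporary boost for a bulk import
table.set_capacity(5_000)                # back to the normal level afterwards
diluted = table.ops_per_partition()      # same total capacity, now spread over 10 partitions
```

In this model the table ends up with the same provisioned capacity as before the import, but each partition can only serve a tenth of it.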

  • Data Replication

Data replication is one of the key differentiating factors between the two technologies.

While Cassandra is fully tuneable and lets the user configure every aspect of data replication, DynamoDB gives no control over the number of replicas involved, as everything is performed automatically by AWS.

DynamoDB did, however, recently introduce an offering called Global Tables. Global Tables allow DynamoDB to replicate data across multiple regions; under the hood this simply automates the work customers previously had to do with DynamoDB Streams. This makes DynamoDB global only in the sense that you have local tables with a queue of updates being streamed around the world, with no mechanisms for resolving inconsistencies between regions, no guarantees for replication latency and no guarantees for global consistency at operation time. Cassandra, on the other hand, provides a set of consistency resolution mechanisms, tunable consistency at query time, and capabilities like lightweight transactions (LWT) and batch operations to manage consistency across tables.
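
Tunable consistency at query time comes down to how many replicas must acknowledge each read and write. A commonly used rule of thumb is that a read is guaranteed to overlap with the latest acknowledged write when the replicas read plus the replicas written exceed the replication factor. The sketch below encodes that rule for three of Cassandra's consistency levels within a single data centre; the function names are illustrative.

```python
def replicas_required(level, replication_factor):
    """Number of replicas that must respond for a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {level}")


def read_overlaps_latest_write(write_level, read_level, replication_factor):
    """True when every read set must share at least one replica with every write set."""
    written = replicas_required(write_level, replication_factor)
    read = replicas_required(read_level, replication_factor)
    return read + written > replication_factor


# With replication factor 3, as in the use case discussed earlier:
print(read_overlaps_latest_write("QUORUM", "QUORUM", 3))  # 2 + 2 > 3
print(read_overlaps_latest_write("ONE", "ONE", 3))        # 1 + 1 is not > 3
print(read_overlaps_latest_write("ALL", "ONE", 3))        # 3 + 1 > 3
```

Choosing ONE for both reads and writes trades that guarantee for lower latency, which is a reasonable choice for monitoring-style data where a slightly stale read is acceptable.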

  • Empty Strings

DynamoDB does not allow an empty string as a valid attribute value. The most common workaround is to use a substitute value instead of leaving the field empty.

  • Backup

DynamoDB currently only supports snapshot-style backups. Cassandra supports full commitlog backups, a feature exposed through Instaclustr’s Continuous Backups capability.

  • Use cases

Cassandra’s read and write performance, coupled with low latency and linear scalability, makes it a good fit for use cases such as IoT, fraud detection, messaging systems, etc. DynamoDB (as per AWS) works well for gaming, bidding platforms, IoT, etc. due to its high availability and low latency.

Also, it is important to make sure that DynamoDB’s resource limits are compatible with your dataset and workload. For example, the maximum size of an item that can be added to a DynamoDB table is 400 KB (larger items can be stored in S3 and a URL stored in DynamoDB).

A Quick Glance at the High Level Differences between Managed Cassandra and DynamoDB

| Description | Managed Cassandra | Amazon DynamoDB |
|---|---|---|
| Licence | Apache Open Source | AWS Proprietary |
| Deployable Cloud | Any Cloud | Only on AWS |
| Developer’s learning curve | Lower. Querying data with SQL-like language | Querying data with a proprietary API |
| Data Modelling | JSON Support, Materialized view | |
| TTL (Time-to-live) | Support on columns | Supports at the item level |
| Multi-Region, Active-Active Data Replication | Native, tunable with full reconciliation and repair capability | Supported but no native reconciliation and repair |
| Capacity Sharding Unit | Instance level (many partitions, up to 2TB) | Partition level (up to 10GB) |
| Backups | Continuous (zero data loss), daily or on-demand backups | On-demand backup |

When would you use DynamoDB instead of Cassandra? 

Of course, we’re a bit biased toward Apache Cassandra so what are some of the reasons you’d want to use DynamoDB? 

For us, the clearest use case is where you need to rapidly scale capacity up and down. DynamoDB clearly has some sophisticated magic behind the scenes to allow this, and it can change the economics dramatically in DynamoDB’s favour if you have a highly variable capacity requirement that you can predict in advance (for example, running large analytics batch jobs). However, Instaclustr’s Dynamic Scaling capability closes the gap here substantially for many use cases.

If you are interested in exploring your use case in more detail then please get in touch via info@instaclustr.com.
