The R4 is AWS's newest instance type. It belongs to the "Memory Optimized" family of AWS offerings, meaning it provides a relatively high ratio of memory to CPU (for example, 30.5GB of RAM for the 4-core r4.xl versus 16GB of RAM for an m4.xl). This additional RAM can be very useful when running Cassandra and Spark together, and can provide a significant performance increase when running straight Cassandra due to increased IO caching. We had previously benchmarked the R3 instance type but did not introduce it, as its relatively low EBS bandwidth made it unsuitable for Cassandra.
We benchmarked Cassandra on the R4 type against our existing M4 offerings and found significant performance improvements when running a fairly IO-intensive mixed workload.
Our testing procedure is:
- Insert data to fill disks to ~30% full.
- Wait for compactions to complete and EBS burst credits to regenerate.
- Run the test defined in this gist: https://gist.github.com/slater-ben/4e8aa0668aa70f502b2848bc8ebce8dc
As with any generic benchmarking, results for different data models or applications may vary significantly from the benchmark. However, we have found it to be a good test for comparing relative performance, and one that reflects well in many use cases.
Our most recent (Cassandra 3.7) benchmarking with a three-node cluster of m4.xl instances and 800GB of EBS produced results of 930 ops/second. The average result using r4.xl instances with 1600GB of disk was 1,838 ops/second, close to a 100% performance improvement. The cost increase of an r4.xl-based node over an m4.xl node is only 30%, so this is a significant increase in price/performance.
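To make the price/performance claim concrete, the figures above can be worked through in a few lines. This is only an illustrative sketch: the function and variable names are ours, and it assumes the 30% cost increase applies to the whole node as benchmarked.

```python
# Figures quoted in the post; all names here are illustrative only.
M4_OPS_PER_SEC = 930      # three-node m4.xl cluster, 800GB EBS
R4_OPS_PER_SEC = 1838     # average for r4.xl nodes, 1600GB disk
R4_COST_INCREASE = 0.30   # assumed: r4.xl node costs 30% more than an m4.xl node


def relative_gain(baseline: float, candidate: float) -> float:
    """Fractional improvement of candidate over baseline."""
    return candidate / baseline - 1.0


def price_performance_gain(baseline_ops: float, candidate_ops: float,
                           cost_increase: float) -> float:
    """Fractional change in ops/second delivered per unit of cost."""
    throughput_ratio = candidate_ops / baseline_ops
    cost_ratio = 1.0 + cost_increase
    return throughput_ratio / cost_ratio - 1.0


throughput_gain = relative_gain(M4_OPS_PER_SEC, R4_OPS_PER_SEC)
value_gain = price_performance_gain(M4_OPS_PER_SEC, R4_OPS_PER_SEC,
                                    R4_COST_INCREASE)

print(f"Throughput gain: {throughput_gain:.0%}")
print(f"Price/performance gain: {value_gain:.0%}")
```

With these inputs the throughput gain comes out at roughly 98% and the gain in throughput per unit of cost at roughly 52%.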
Once again, I'd stress that results may differ significantly for any specific use case. In particular, we ran some standard Spark and Cassandra benchmarking expecting to see a significant performance improvement but did not actually see any significant change. We need to investigate this further and still think R4s will be very useful for Spark use cases, but this illustrates the need to understand performance for your own use case.
Instaclustr’s complete family of AWS offerings is summarised in the table below:
| Node size | Description |
|---|---|
| m4.l – tiny | Smallest available production node. Use this when getting started. We recommend scaling up to m4.xl rather than scaling out with more m4.large instances. |
| m4.xl – small | Step up from m3.xlarge when more disk is required. Starting point for smaller users not ready for the m4 balanced offering (lower performance, as smaller disks provide fewer IOPS). Upgrade as you grow. |
| m4.xl – balanced | Best balance of space and performance. Suggested standard building block for most clusters. |
| m4.xl – bulk | Lowest-cost bulk storage for use cases with low read ops. |
| r4.xl – himem – balanced | Newest generation, with improved price/performance over m4.xl for many use cases. Best balance of space and performance for many use cases. |
| r4.xl – himem – bulk | A disk space step-up from himem – balanced for those who need more storage but have spare processing capacity. |
| | Proven performer with a good balance of space and performance. Basis of most of our largest production clusters. Will provide better performance than m4-based nodes for very read-heavy use cases. |
| | Lowest-cost read performance with relatively small data volumes. Build a cluster with these for extremely high performance-to-data ratios. |
| | May provide a low-cost entry point for some use cases. Has higher throughput than an m4.l – tiny and lower cost (but much smaller disk) than an m4.xl – small. |
If you would like more information about running Cassandra with EBS, check out this earlier blog post: Cassandra on AWS EBS Infrastructure.