Apache Spark™ Streaming and Cassandra

November 17, 2015
By Ben Slater

We have just posted a step-by-step tutorial on our support portal showing how to use our Managed Apache Spark and Cassandra offering to set up Spark Streaming to receive and summarise data from Apache Kafka and then save it to Cassandra.

Developing this example got me thinking: what are the solution patterns where it makes sense to use Spark Streaming alongside Cassandra? This post sets out some of the patterns where I think this makes sense. I’m sure there are other examples as well, but even this short list helps to illustrate why we think Spark Streaming and Cassandra are such a great fit.

Summarising data before saving.	This is the scenario we illustrate in our tutorial. Where you have a large stream of data but don’t need all the detail to be persisted, you can use Spark Streaming to perform the summarisation on the live stream before saving to Cassandra.
Maintenance of summary tables.	A common solution pattern (the lambda architecture) is to maintain summary tables as well as storing the raw data. For example, in a system receiving monitoring data you might store the raw data with a TTL of a day or two, 5-minute averages with a TTL of a week, etc. Using Spark Streaming to calculate the 5-minute averages will often be more efficient than running batch processes every 5 minutes to calculate the roll-up.
Enriching data before saving.	Often, the raw data you receive from a stream is not well suited to the questions you are most likely to ask of the data. For example, you might receive total operation time and number of operations from a sensor when what you mostly care about is the rate per second and average operation time over the period. Spark Streaming could be used to add these values to the stream before saving.
Using Cassandra as a source of reference data.	Quite often when processing a data stream, you will need to look up a reference value to make a decision. For example, you might be processing a stream of IoT data and have an alerting threshold per sensor (for example, the 95% value over the last 48 hours). Storing the threshold in Cassandra for lookup as needed while the stream is processed would be a great solution. This can be combined with the patterns above, using Spark Streaming to maintain the threshold as well.
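
To make the roll-up, enrichment and threshold patterns above a little more concrete, here is a minimal sketch of the per-batch logic in plain Python. It is not the Spark Streaming or Cassandra connector API, and it is not taken from our tutorial: the Reading fields, the function names and the thresholds dictionary (standing in for a per-sensor lookup table in Cassandra) are all assumptions made for illustration. In a real job, logic of this kind would typically sit inside the map and reduce steps applied to each micro-batch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple

WINDOW_SECONDS = 300  # 5-minute roll-up window


class Reading(NamedTuple):
    """Hypothetical raw record received from the stream."""
    sensor_id: str
    timestamp: int        # epoch seconds
    total_op_time: float  # total operation time in the reporting period
    op_count: int         # number of operations in the reporting period
    period_seconds: int   # length of the reporting period


def enrich(r: Reading) -> Dict[str, object]:
    """Enrichment: derive rate per second and average operation time."""
    avg_op_time = r.total_op_time / r.op_count if r.op_count else 0.0
    return {
        "sensor_id": r.sensor_id,
        "timestamp": r.timestamp,
        "ops_per_second": r.op_count / r.period_seconds,
        "avg_op_time": avg_op_time,
    }


def five_minute_rollup(
    readings: Iterable[Reading],
) -> Dict[Tuple[str, int], float]:
    """Summary table: average operation time per (sensor, 5-minute window)."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [time sum, op count sum]
    for r in readings:
        window_start = r.timestamp - r.timestamp % WINDOW_SECONDS
        acc = totals[(r.sensor_id, window_start)]
        acc[0] += r.total_op_time
        acc[1] += r.op_count
    # Skip windows with no operations to avoid dividing by zero.
    return {k: t / n for k, (t, n) in totals.items() if n}


def over_threshold(
    rollup: Dict[Tuple[str, int], float],
    thresholds: Dict[str, float],
) -> List[Tuple[str, int]]:
    """Reference data: flag windows whose average exceeds the sensor's threshold.

    `thresholds` is a stand-in for values looked up in Cassandra.
    """
    return [
        key
        for key, avg in sorted(rollup.items())
        if key[0] in thresholds and avg > thresholds[key[0]]
    ]
```

The roll-up sums total time and operation counts per window before dividing, rather than averaging the per-reading averages, so that readings covering more operations carry proportionally more weight.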
