I watched the classic movie “2001: A Space Odyssey” for the nth time on the weekend. My previous favourite quote from HAL (the eventually paranoid and murderous ship AI) was:
Dave: Open the pod bay doors, HAL.
HAL: I’m sorry, Dave. I’m afraid I can’t do that.
But my new favourite quote is now another “I’m sorry, Dave” line:
Dave: Do you know what happened?
HAL: I’m sorry, Dave. I don’t have enough information. (HAL was lying).
Sensors, combined with the massively scalable open source NoSQL database Cassandra, are one way of solving the problem of a lack of information. Sensors provide streams (or floods!) of real-time data to better understand, predict, and control the world. Sensors can be data sources but also actuators which can control things. They can be centralised (e.g. the LHC, which has several large detectors; see my last blog) or distributed (e.g. the Square Kilometre Array, or SKA, will be a distributed radio telescope based in South Africa and Australia), fixed or mobile, and can measure data at their own location or even remotely. The Voyager spacecraft (launched to the outer planets in 1977 and still going) is a good example of a mobile sensor platform: it takes measurements of remote objects (e.g. planets) and returns them to Earth, and it is also an actuator (i.e. it can be controlled remotely).
Sensor data and networks can be complex, with distributed topologies, and need to handle (receive, process, correlate, store, etc.) large volumes of disparate data. When I worked with CSIRO I spent several years evaluating the OGC Sensor Web standards and technology by building a prototype implementation of a continental-scale (Australia-wide) environmental monitoring platform. Thousands of sensors were distributed all around Australia; data was collected and processed at different locations (distributed brokers), which client applications in multiple locations could subscribe to for processing, storage, and visualisation. It used a complex pub-sub streaming broker architecture. Brokers could subscribe to events from each other, which led to the interesting problem of “infinite event loops”: situations could arise where brokers subscribed (indirectly, via other brokers) to their own events!
As events moved from sensors through the system, the data size also tended to increase substantially as events were enriched (e.g. with location information). Raw sensors typically produced 10 bytes per event (and often produced multiple different event types per second), increasing to 100 bytes at edge collectors and 1,000 bytes at the brokers. Performance was an issue, particularly for real-time event stream processing where historical data had to be accessed in conjunction with timed window events (caching was key to this, and we invented a proto-Kafka architecture).
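As an illustration of this enrichment effect, here is a hypothetical sketch (the field names, values, and sizes are my own inventions, not the actual CSIRO formats):

```python
import json

def enrich(raw_reading: bytes) -> dict:
    """Hypothetical edge-collector enrichment: wrap a raw sensor
    reading with identity, location, and timestamp metadata."""
    return {
        "sensor_id": "rain-gauge-0042",        # added at the edge collector
        "lat": -35.28, "lon": 149.13,          # location enrichment
        "timestamp": "2018-01-15T03:21:07Z",
        "type": "rainfall_mm",
        "value": raw_reading.decode(),
    }

raw = b"3.2,17"               # ~10 bytes from the sensor itself
payload = json.dumps(enrich(raw))
print(len(raw), len(payload))  # the enriched event is an order of magnitude larger
```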
Write scalability was difficult to achieve when a flood of events had to be persisted in a database while supporting low-latency reads (read-only replicas were tried). We also explored ways of reducing the volume of sensor data through “demand-based” subscriptions, which allowed subscribers to request data only at a certain frequency (e.g. once a minute) or when certain conditions were met (e.g. > 5% change in value). This also allowed brokers to optimise subscriptions and publications, and even ignore events completely when no active subscriptions required them.
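A minimal sketch of the demand-based idea (my own simplification, not the actual broker code): each subscription carries a minimum interval and a change threshold, and the broker drops events that satisfy neither.

```python
class DemandSubscription:
    """Hypothetical demand-based subscription: deliver an event only if
    enough time has passed OR the value changed by more than a threshold."""
    def __init__(self, min_interval_s: float = 60.0, min_change_pct: float = 5.0):
        self.min_interval_s = min_interval_s
        self.min_change_pct = min_change_pct
        self.last_time = None
        self.last_value = None

    def wants(self, now: float, value: float) -> bool:
        if self.last_value is None:
            deliver = True                      # always deliver the first event
        else:
            elapsed = now - self.last_time
            change_pct = abs(value - self.last_value) / abs(self.last_value) * 100
            deliver = elapsed >= self.min_interval_s or change_pct > self.min_change_pct
        if deliver:
            self.last_time, self.last_value = now, value
        return deliver

sub = DemandSubscription(min_interval_s=60, min_change_pct=5)
events = [(0, 100.0), (10, 101.0), (20, 110.0), (70, 110.5)]
delivered = [(t, v) for t, v in events if sub.wants(t, v)]
print(delivered)  # → [(0, 100.0), (20, 110.0)]
```

The 10-second event changes by only 1% and the 70-second event arrives within a minute of the last delivery, so both are dropped; the broker only does work for the two events a subscriber actually wants.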
More recently I was working on modelling large-scale system performance from monitoring data. One APM vendor we worked with collected performance data in real time from thousands of distributed applications and hosts and correlated it to produce structured per-transaction traces. Each transaction could be 1 MB+ of data, and with a monitored workload throughput of 1,000 TPS could easily produce 1 GB of data per second (3.6 TB/hour). The APM tool could handle this; the challenge for us was getting the data out fast enough (maybe they were using Cassandra?), so we used an incremental sampling technique.
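The back-of-envelope arithmetic, plus a sketch of the sampling idea (my own illustration of pulling every Nth trace; the actual APM technique was more sophisticated):

```python
# Data rate: 1 MB per trace at 1,000 transactions per second.
trace_bytes = 1_000_000
tps = 1_000
rate_bytes_per_s = trace_bytes * tps               # 1 GB/s
per_hour_tb = rate_bytes_per_s * 3600 / 1e12       # 3.6 TB/hour
print(per_hour_tb)

# Hypothetical incremental sampling: extract only every Nth trace so the
# extraction rate stays within what we could actually get out of the tool.
def sample(traces, every_n: int = 100):
    return traces[::every_n]

traces = list(range(10_000))        # stand-ins for trace IDs
pulled = sample(traces, every_n=100)
print(len(pulled))                  # 100 traces instead of 10,000
```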
As I mentioned in my last blog, my first mission with Instaclustr is to explore a typical Sensor/Data Analytics use case of Cassandra, by using the Instametrics monitoring data that we collect from our managed Cassandra clusters and (naturally) store in a Cassandra cluster. I decided to do a dummy run this week by creating my own Cassandra trial cluster, writing a test Java client to connect to it, creating some sample time series data, getting it back out, and graphing it. I’ll cover Creation here, and Connection and the client example in the next blog.
Cassandra Cluster Creation in under 10 minutes (1st contact with the monolith)
By far the easiest way to create a Cassandra cluster is to use the Instaclustr free trial. No credit card details are needed and you can create a small cluster (three nodes) in less than 10 minutes. Here are the sign up instructions. Once signed up you can create a trial cluster.
Step 1: Creating your first cluster
The first thing you have to enter is a cluster Name (trivial) and Network (trickier). For Network I just picked the first option at random. The main factors to consider are:
- If you are using a private network address range already then you don’t want to overlap it; and
- How many hosts are available to you for a given address range. (See documentation: Network Address Allocation)
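The host-count trade-off can be checked with a quick sketch using Python’s standard ipaddress module (the example ranges are my own, not Instaclustr’s defaults):

```python
import ipaddress

# Candidate private ranges for the cluster network. Smaller prefixes
# (bigger blocks) give more usable host addresses, but are also more
# likely to overlap an existing private network.
for cidr in ["10.0.0.0/16", "172.16.0.0/20", "192.168.1.0/24"]:
    net = ipaddress.ip_network(cidr)
    # Subtract the network and broadcast addresses.
    print(cidr, net.num_addresses - 2)
```

A /16 gives tens of thousands of usable addresses, a /24 only a couple of hundred; pick the smallest block that comfortably covers your expected node count without clashing with existing ranges.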
Step 2: What service do you want?
Next, select the service you want (Cassandra or Elassandra) and the version. I picked Apache Cassandra 3.11. For development and production clusters you can also pick add-ons from: Apache Spark, Apache Zeppelin, and Lucene (Cassandra Lucene Index Plugin). These aren’t available for the trial cluster (the instances are too small).
Step 3: Choose your cloud provider
Under the Data Centre section, you choose which public cloud provider will host your cluster, and options including regions, names, sizes etc. Amazon (Developer, Starter) and Google (Developer, Professional) are the two free trial options available:
- AWS: Developer, Starter, t2.small, 5GB SSD, 2GB RAM, 1 core, $20-$50/month/node (depending on region). Note: AWS throttles t2 instances to the base performance once credits have been expended and until additional credits have been earned.
- Google: Developer, Professional, n1-standard-1, 30GB SSD, 3.75GB RAM, 1 core, $81/month/node.
The Google free trial looks like great value for money, but it really depends on your use case (e.g. where are your customers, applications, and data?). I picked AWS in Sydney. Post-trial options for the AWS Sydney region are varied. There is one other Developer node size, Professional, on t2.medium instances with 30GB SSD, 4GB RAM, 2 cores, for $124/month/node. There are also nine production alternatives providing a wide range of options from 80-2,000GB SSD, 8-61GB RAM, and 2-8 CPU cores.
For a production cluster, how do you choose the right starting size, and can you resize? Statically? Dynamically? This article on maximising availability covers some of the issues to consider. Permanent resizing (adding more nodes or data centres) is possible. Nodes can also be resized (for CPU and memory) if your cluster uses resizeable instances (read more here on dynamic resizing). Instaclustr has several articles on benchmarking Cassandra on AWS which help with selecting instance sizes.
EBS encryption for data at rest can be selected. In my case the message “No encryption key was found for this AWS region” appeared, so I didn’t worry about this option (see Setting up a Datacentre with EBS Encryption for more information). Encryption can be applied afterwards but requires a support request.
Step 4: How many nodes do I need?
Nodes is an important magic-number setting. Try changing the number and you’ll get helpful messages:
- “0” is impossible.
- “1” is too small to meet availability requirements.
- “2” is possible, but with a warning that nodes are not evenly distributed (as there are 3 racks/AZs in the Sydney AWS data centre), which brings us back to 3 nodes, with 1 node allocated to each rack/AZ.
- Multiples of 3 are also possible (as they result in an even distribution), but for the free trial 3 is also the maximum number. So 3 it is!
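The even-distribution rule can be sketched as a round-robin allocation across the three racks (a hypothetical illustration of the constraint, not Instaclustr’s actual allocator):

```python
def allocate(nodes: int, racks: int = 3):
    """Round-robin nodes across racks; the distribution is even
    only when the node count is a multiple of the rack count."""
    counts = [0] * racks
    for n in range(nodes):
        counts[n % racks] += 1
    return counts, len(set(counts)) == 1   # (per-rack counts, is it even?)

for nodes in (2, 3, 6):
    print(nodes, allocate(nodes))
# 2 → ([1, 1, 0], False): one rack empty, so losing an AZ can cost you both replicas
# 3 → ([1, 1, 1], True)
# 6 → ([2, 2, 2], True)
```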
Step 5: Secure your cluster
Under Cassandra Options there are a few final things to configure for Network and Security. “Use private IP addresses for node discovery” – should this be ticked or not? At the time I couldn’t work this out, but guessed that it shouldn’t be ticked given that I wanted to connect via public IP addresses from my laptop. I’ve since found two relevant documents.
Note that this also pre-empts the question I had at this point: what happens after the cluster is created? How, and what, connects to it? (Next blog.)
If you want to try the cluster out using a client on your premises (I did), then under Security leave the default as “Add IP_Address to cluster firewall allowed addresses”. More can be added later.
What did this give me? A cluster of 3 × Developer Starter nodes with 15GB total storage (number of nodes × disk size, 3 × 5GB = 15GB), in AWS VPC Asia Pacific (Sydney) with Apache Cassandra 3.11, at a cost of $120.00/month (once the trial expires).
The confirmation summary:
- Cluster: 3 × Developer Starter · 15 GB total storage
- Amazon Web Services (VPC) · Asia Pacific (Sydney)
- Apache Cassandra 3.11
Step 6: Spin up your cluster!
That’s it. Click on “Create Cluster” and you’re off and running. It took under 10 minutes for my trial cluster to go from nothing to running, and you get updates of node states along the way, including Provisioning, Joining, and Running. Some nodes may take longer than others.
Given that I’m not a Cassandra expert (yet) and there may be some over-simplifications in these blogs (which I will aim to rectify subsequently), I leave the last word to HAL (which ironically was correct):