By Paul Brebner Thursday 17th August 2017

Paul Brebner (the Petabyte Person) joins Instaclustr (the Petabyte Company)


1. Hello, World!

Hi, I’m Paul Brebner and this is a “Hello, World!” blog to introduce myself and test everything out. I’m very excited to have started at Instaclustr last week as a Technology Evangelist.   

One of the cool things to happen in my first week was that Instaclustr celebrated a significant milestone when it exceeded 1 Petabyte (1PB) of data under management.  Is this a lot of data? Well, yes! A Petabyte is 10^15 bytes, 1 Quadrillion bytes (which sounds more impressive!), or 1,000,000,000,000,000 bytes.

Also, I’m a Petabyte Person. Apart from my initials being PB, the human brain was recently estimated to have 1PB of storage capacity, not bad for a 1.4kg meat computer! So in this post, I’m going to introduce myself and some of what I’ve learned about Instaclustr by exploring what that Petabyte means.

Folklore has it that the first “Hello, World!” program was written in BCPL (the programming language that led to C), which I used to program a 6809 microprocessor I built in the early 1980s. In those days the only persistent storage I had was 8-inch floppy disks. Each disk could hold 1MB of data, so you’d need 10^9 (1,000,000,000) 8-inch floppy disks to hold 1PB of data. That’s a LOT of disks, and they would weigh a whopping 31,000 tonnes, as much as the German WWII Pocket Battleship the Admiral Graf Spee.

Image: the German WWII Pocket Battleship the Admiral Graf Spee (Source: Wikipedia)
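The floppy disk arithmetic can be checked with a few lines of Python. This is only a sketch using the figures quoted in this post (1MB per disk, 31,000 tonnes in total); the per-disk weight it prints is simply what those figures imply, not a measured value, and the function and constant names are made up for illustration.

```python
PETABYTE_BYTES = 10**15        # 1PB, as defined earlier in the post
FLOPPY_BYTES = 10**6           # 1MB per 8-inch floppy disk (figure from the post)
TOTAL_WEIGHT_TONNES = 31_000   # total weight quoted in the post


def disks_needed(total_bytes: int, disk_bytes: int) -> int:
    """Number of disks needed to hold total_bytes (rounded up)."""
    return -(-total_bytes // disk_bytes)  # ceiling division on integers


disks = disks_needed(PETABYTE_BYTES, FLOPPY_BYTES)

# 1 tonne = 1,000,000 grams, so this is the weight per disk implied by the post
grams_per_disk = TOTAL_WEIGHT_TONNES * 1_000_000 / disks

print(f"{disks:,} disks, about {grams_per_disk:.0f}g each")
```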

1PB of data is also a lot in terms of my experience. I worked in R&D for CSIRO and UCL in the areas of middleware, distributed systems, grid computing, and sensor networks between 1996 and 2007. During 2003 I was the software architect for a CSIRO Grid cluster computing project with Big Astronomy Data (all 2TB of it). 1PB is 500 times more data than that.

What do Instaclustr and the Large Hadron Collider have in common?

Image: the Large Hadron Collider (Source: cms.cern)

Apart from the resemblance between the Instaclustr logo and an aerial shot of the LHC (the circular pattern)?

Image: the Instaclustr logo and the Large Hadron Collider from above (Source: Sequim Science)

The LHC “Data Centre processes about one petabyte of data every day”. Even more astonishing, the LHC has a huge number of detectors which spit out data at a rate of 1PB/s, most of which they don’t (and can’t) store.

2. Not just Data Size

But there’s more to data than just the dimension of size. There are large amounts of “dead” data lying around on archival storage media which is more or less useless. But live dynamic data is extremely useful and is of enormous business value. So apart from being able to store 1PB of data (and more), you need to be able to: write data fast enough, keep track of what data you have, find the data and use it, have 24×7 availability, and have sufficient processing and storage capacity on-demand to do useful things as fast as possible (scalability and elasticity).  How is this possible? Instaclustr!

The Instaclustr value proposition is: “Reliability at scale.”

Instaclustr solves the problem of “reliability at scale” via a combination of Scalable Open Source Big Data technologies, Public Cloud hosting, and Instaclustr’s own Managed Services technologies (including cloud provisioning, cluster monitoring and management, automated repairs and backups, performance optimization, scaling and elasticity).

How does this enable reliability at Petabyte data scale? Here’s just one example. If you have 1PB+ of data on multiple SSDs in your own data centre, an obvious problem is data durability. 1PB is a lot of data, and SSDs are NOT 100% error free (the main problem is “quantum tunneling”, which “drills holes” in SSDs and imposes a maximum number of writes). One endurance experiment measured how many writes it took before different SSDs died, and in some cases the answer was less than 1PB. So SSDs will fail and you will lose data unless you have taken adequate precautions and continue to do so. The whole system must be designed and operated assuming failure is normal.

The Cassandra NoSQL database itself is designed for high durability by replicating data on as many different nodes as you specify (it’s basically a shared-nothing P2P distributed system).  Instaclustr enhances this durability by providing automatic incremental backups (to S3 on AWS), multiple region deployments for AWS, automated repairs, and 24×7 monitoring and management.
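To get a rough feel for why replication helps durability, here is a back-of-envelope sketch. It assumes each node holding a replica is lost independently with the same probability and that nothing is repaired in the meantime, which is a simplification of how a real cluster behaves. The keyspace name `demo`, the data centre name `dc1`, the 1% failure probability and the function name are all hypothetical, chosen only for illustration; the CQL string just shows where the replication factor is specified.

```python
# Hypothetical keyspace and data centre names. The number (3 here) is the
# replication factor, i.e. how many nodes hold a copy of each piece of data.
CREATE_KEYSPACE_CQL = (
    "CREATE KEYSPACE IF NOT EXISTS demo "
    "WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};"
)


def prob_all_replicas_lost(p_node_loss: float, replication_factor: int) -> float:
    """Chance that every replica of a piece of data is lost.

    Assumes nodes fail independently with probability p_node_loss
    and that no repair happens in between (a deliberate simplification).
    """
    if not 0.0 <= p_node_loss <= 1.0:
        raise ValueError("p_node_loss must be between 0 and 1")
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return p_node_loss ** replication_factor


# Made-up 1% chance of losing any one node: each extra replica
# cuts the chance of losing all copies by a factor of 100.
for rf in (1, 2, 3):
    print(rf, prob_all_replicas_lost(0.01, rf))
```

Real failures are not fully independent (a whole rack or region can go at once), which is one reason backups, multi-region deployments and automated repairs matter on top of replication.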

3. The Data Network Effect

1PB of data under management by Instaclustr is a noteworthy milestone, but that’s not the end of the story. The amount of data will GROW! The network effect is what happens when the value of a product or service increases substantially with the number of other people using it. The classic example is the telephone system. A single telephone is useless, but 100 telephones connected together (even manually by an operator!) are really useful, and so the network grows rapidly. Data Network effects are related to the amount and interconnectedness of data in a system.

For Instaclustr customers the Data Network effect is likely to be felt in two significant ways. Firstly, the amount of data customers have in their Cassandra clusters will increase, and as it increases over time its value will increase significantly. Velocity will increase too: the growth of data, services, sensors and applications in the related ecosystems (e.g. in the AWS ecosystem) will accelerate. Secondly, the continued growth in clusters and data under management provides considerably more experience, learned knowledge and capabilities for managing your data and clusters.

Here’s a prediction. I’ve only got 3 data points to go on. When Instaclustr started (in 2014) it was managing 0PB of data. At the end of 2016 this had increased to 0.5PB, and in July 2017 we reached 1PB. Assuming the amount of data continues to double at the same rate in the future, the total data under management by Instaclustr may increase massively over the next 3 years.

(Graph of Months vs PB of data).
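The doubling assumption can be turned into a small projection. This is a sketch only: it treats the gap between the end of 2016 (0.5PB) and July 2017 (1PB) as roughly 7 months, which is my reading of the dates rather than an exact figure, and it ignores the 0PB starting point because a pure doubling curve never passes through zero. The function names are made up for illustration.

```python
import math


def doubling_time_months(pb_start: float, pb_end: float, months_elapsed: float) -> float:
    """Months needed to double, given growth from pb_start to pb_end."""
    return months_elapsed / math.log2(pb_end / pb_start)


def projected_pb(pb_now: float, doubling_months: float, months_ahead: float) -> float:
    """Data volume after months_ahead, if doubling continues at the same rate."""
    return pb_now * 2 ** (months_ahead / doubling_months)


# Assumption: end of 2016 to July 2017 is treated as 7 months.
t_double = doubling_time_months(0.5, 1.0, 7)

for months in (12, 24, 36):
    print(f"+{months} months: {projected_pb(1.0, t_double, months):.1f}PB")
```

Under those assumptions the 36-month figure comes out at roughly 35PB. With only two usable data points this is an illustration of what "doubling at the same rate" implies, not a forecast.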

4. What next?

I started as Technology Evangelist at Instaclustr last week. For the last 10 years I have been a senior research scientist at NICTA (National ICT Australia, now merged with CSIRO as Data61), and for the last few years CTO/consultant with a NICTA start-up specializing in performance engineering (via Performance Data Analytics and Modelling). I’ve invented some performance analytics techniques, commercialized them in a tool, and applied them to client problems, often in the context of Big Data Analytics. The tool we developed actually used Cassandra, but in a rather odd way: we needed to store very large volumes of performance data from a simulation engine for later analysis. Hours of data could be simulated and had to be persisted within a few seconds, and Cassandra was ideal for that use case. I’m also an AWS Associate Solution Architect. Before all this, I was a Computer Scientist, Machine Learning researcher, Software Engineer, Software Architect, UNIX systems programmer, etc. Some of this may come in useful for getting up to speed with Cassandra.

Over the next few months, the plan is for me to understand the Instaclustr technologies from “end user” and developer perspectives, develop and try out some sample applications, and blog about them. I’m going to start with the internal Instaclustr use case for Cassandra, which is the heart of the performance data monitoring tool (Instametrics). Currently there’s lots of performance data stored in Cassandra, but only a limited amount of analysis and prediction being done (just aggregation). It should be feasible, and an interesting learning exercise, to do some performance analytics on the Cassandra performance data to predict interesting events in advance and take remedial action.

I’m planning to start out with some basic use cases and programming examples for Cassandra: how do you connect to Cassandra, how do you find out what’s in it, how do you get data out of it, how do you do simple queries such as finding “outliers”, and how do you do more complex Data Analytics (e.g. regression analysis, ML, etc.)? I’ll probably start out very simple (e.g. Cassandra and Java on my laptop, having a look at some sample Cassandra data, and trying some simple data analytics), then move to more complex examples as the need arises (e.g. Cassandra on Instaclustr, Java on AWS, Spark + MLlib, etc.). I expect to make some mistakes and revise things as I go. Most of all I hope to make it interesting.

Watch this space 🙂

 
