By Ben Bromhead Tuesday 31st January 2017

Apache Cassandra: The Big Data Foundation

Introduction

To state the obvious, we here at Instaclustr are massive fans of Apache Cassandra. We have built our company and our managed service around this database technology and the awesomeness of its capability.

There are a lot of well-documented use cases and amazing companies providing examples of how they are using open source Apache Cassandra, but as a managed service we get to see first-hand the power of this amazing technology and what it can do for our diverse customer base.

Suffice it to say that our experience over the last two years has us even more convinced that Apache Cassandra is the foundation technology for the next wave of global-scale applications and solutions.

Apache Cassandra use cases

It is fair to say that we have probably seen it all with the diverse range of deployments: the good, the bad and sometimes the ugly. Here is our take on the most common deployments:

  • Security. The fraud and threat detection use case is very active in our environment. In most cases the application uses data mining and deep analytics to spot anomalies and surface security-related events of interest.
  • Messaging. Several of our customers have social media and data sharing applications with messaging services at their core.
  • IoT. This is probably the most common of our customers’ use cases. We have many customers, representing a wide range of industries, using Cassandra as an IoT solution. We also work with a number of customers who are providing IoT platforms to their own customer base.
  • Recommendation & Personalisation. Many of our customers are using the power of personalisation. This is very common within the AdTech industry, but some of our customers are also building unique learning platforms that are personalised to an individual student.
  • Catalogues & Playlists. We haven’t seen this use case as much as the others, but the data models and usage patterns typically seen within catalogues and playlists usually form a small part of a much larger application.

The most popular industries? We have a large customer base within the AdTech space, where the key metrics of performance and scalability are important. We have core customers in the FinTech industry, where personalisation, high availability and security are all important. We also have several customers in the EdTech space developing specialised and personalised learning platforms.

Another interesting insight is that we have an amazingly diverse client base that ranges from personal projects to early-stage start-ups all the way through to 140-year-old, billion-dollar companies looking to transform and enhance their business. We can see first-hand that you don’t have to be a large company to be working with large datasets.

We have been with several of our original customers on the journey from an initial 3-node cluster through to large production clusters with separate staging and testing environments.

Diverse use cases help us improve every day

The beauty of having grown our customer base so rapidly and widely is that we have benefited from gaining insight and understanding into the wide application of this technology and the details of specific use cases.  This provides us with a unique perspective of the deployment of Apache Cassandra. We see its adaptability, but we also see its complexity and its temperament when it is not handled well.

We see the specific nuances associated with operating an efficient production-grade environment and cluster for all of the different use cases. Having such a wide range of different deployments under our care is giving us an ever-increasing richness in our own data, which we are now analysing through our Instametrics monitoring environment. This is helping us to continually improve our capability and to continue to automate and refine our service offering.

We have also been in the unique position of growing with our customers and helping them scale, in some cases rapidly.  This also provides us with insight into how to build out a cluster or environment efficiently when an application goes viral, or the application has to ingest vast amounts of data.

With great power comes great responsibility

There is no doubt that Apache Cassandra provides great power, but the trade-off is that this also comes with a certain level of complexity. You can’t expect a database technology like Apache Cassandra to scale rapidly, provide high-throughput performance and be continuously on without some work on your part.

Continual monitoring, maintenance and performance tuning are important activities that must go with any database and associated technology environment to keep it operating efficiently and effectively.  But probably just as important is good design and planning up front.

When the data layer is an afterthought

In many cases we see that the data layer follows the lead from the application. That is, the time, effort and focus at first for many start-ups goes into the application and what the customer is building on the front-end. This is often necessary to demonstrate a concept to an investor or to simply get things up and running quickly while finding market fit. This approach means that the data is often an afterthought.

When the data is an afterthought we often see that the application and database will work okay at the beginning, but it is when they try to scale that things get ugly with Cassandra. If you don’t treat the data layer with a certain amount of mechanical sympathy, and you don’t plan effectively from the start, then there can be consequences down the track.

Plan to be big from the start

When Apache Cassandra has been selected as the database by an engineer at the beginning of a project, we know that they mean business and that they are planning on their application or solution being global. What we often find is that while they are thinking big from the start, the design and planning sometimes aren’t equally big.

We see that when the data architecture and infrastructure are planned and designed effectively from the start, our customers tend to prosper, and scaling and performance are rarely an issue. When the team is only thinking big, and not designing and planning big, then problems can arise down the track.

And yes, we can speak from experience. Bringing your infrastructure and database back from the brink can be a difficult and painful experience.

You are much better off doing the work up front. Even if your environment works well initially, it is when you have to scale that you will start to see issues.
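
As a rough sketch of what that up-front work can look like, the Python below estimates how large a single partition would grow for an IoT-style time-series table. The function names, the one-reading-per-second rate, the 100-byte reading size and the 100 MB partition target are all illustrative assumptions rather than figures from our customers; treat it as back-of-envelope arithmetic, not a sizing tool.

```python
# Back-of-envelope partition sizing for a time-series table.
# Every number used with these helpers is an illustrative assumption.

SECONDS_PER_DAY = 24 * 60 * 60


def partition_bytes(readings_per_second, bytes_per_reading, bucket_days):
    """Approximate bytes in one partition that holds `bucket_days`
    of readings from a single device (ignores storage overhead)."""
    return readings_per_second * SECONDS_PER_DAY * bucket_days * bytes_per_reading


def largest_bucket_days(readings_per_second, bytes_per_reading, target_bytes):
    """Longest time bucket, in days, that keeps one partition
    at or under `target_bytes`."""
    bytes_per_day = readings_per_second * SECONDS_PER_DAY * bytes_per_reading
    return target_bytes / bytes_per_day


if __name__ == "__main__":
    rate = 1            # assumed: one reading per second per device
    size = 100          # assumed: roughly 100 bytes per reading
    target = 100 * 10**6  # assumed partition size target of 100 MB

    one_day = partition_bytes(rate, size, bucket_days=1)
    one_year = partition_bytes(rate, size, bucket_days=365)
    print(f"partition per device per day:  {one_day / 10**6:.2f} MB")
    print(f"partition per device per year: {one_year / 10**9:.2f} GB")
    print(f"largest bucket under target:   {largest_bucket_days(rate, size, target):.1f} days")
```

Under these assumptions, a table keyed only by device grows by about 8.64 MB per device per day, which is roughly 3.15 GB per partition after a year, while bucketing the partition key by day keeps each partition small. The point is not the specific numbers but that this kind of arithmetic is cheap to do before the data arrives and expensive to act on afterwards.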

Conclusion

If you are thinking big, then plan to be big. If you design the architecture and infrastructure yourself, get an independent expert with some experience to validate your work.  Check and check again.

Doing it right the first time will set you up for efficient scaling, high performance and a continuously-on environment, and will save you weeks of pain when you have terabytes of data structured the wrong way. Your application might work okay at the start, but it is at the point of scaling that we see most of the issues arise for our customers.
