SLAs - Watch Out for the Shades of Gray

A Service Level Agreement (SLA) is an integral part of a Managed Service Provider’s (MSP) business. Its purpose is to define the scope of services that the MSP offers, including:

- guarantees on metrics relevant to their business and technology
- customer responsibilities
- issue management
- compensation commitment when MSPs fail to deliver on the SLA.

An SLA should set the customer’s expectations, be realistic and be crystal clear, with no scope for misinterpretation. Well, that’s how SLAs are meant to be. Unfortunately, in the quest for more sales, some MSPs commit themselves to unrealistic SLAs. It’s tempting to buy into a service when an MSP offers you 100% availability. It is even more tempting when you see a compensation clause that gives you confidence in going ahead with that MSP. But hold on! Have you checked out the exclusion clauses? Have you checked which responsibilities you must meet in order to get what you are entitled to in the SLA? Just as it is the MSP’s responsibility to define a crystal-clear SLA, it is the customer’s responsibility to thoroughly understand the SLA and be aware of every clause in it. That is how you will notice the shades of gray!

We have put together a list of things to look for in an SLA so that customers are aware of the nuances involved and avoid unpleasant surprises after signing on.

Priced SLAs

Some MSPs provide a baseline SLA for their service, and customers wishing to receive higher levels of commitment may need to fork out extra money.

At Instaclustr, we have a different take on this. We have four tiers of SLA, not so that customers pay more, but because our SLA approach is based on the capabilities and limitations of the underlying technology we are servicing. The SLA tiers are based on the number of nodes in a cluster and a set of responsibilities that customers are prepared to commit to. Our customers do not pay extra for a higher level of SLA. They pay for the number of nodes in the cluster, and higher levels of SLA come with more nodes.

Compensation

While MSPs do their best to deliver on the SLA commitment, sometimes things go south. When an SLA metric is not met, MSPs provide compensation. Typically, it is paid as a credit or a reduction in the monthly bill. Customers should look into how the compensation is calculated. Some MSPs compensate based on the length of downtime, while others compensate with a flat reduction in the monthly bill irrespective of the length of downtime.

Instaclustr compensates its customers fairly, with a flat reduction in their monthly bill no matter how short the downtime is. The exact rate varies based on the SLA tier the customer’s cluster falls in. For example, a customer with a cluster of 12 production nodes (Critical tier) is entitled to credits of up to 100% of their monthly bill, with each violation of the availability SLA capped at 30% and each violation of the latency SLA capped at 10%. Every time we fail to deliver on an SLA metric, there is a big impact on Instaclustr’s business, but we like the challenge. Continuously improving uptime and performance is at the core of our business.
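To make the arithmetic concrete, here is a minimal Python sketch of a flat-rate credit calculation with caps. It assumes that each availability violation earns a credit of 30% of the monthly bill, each latency violation earns 10%, and the total for a month never exceeds 100%. That reading of the Critical tier example, along with the names `PER_VIOLATION_PCT`, `TOTAL_CAP_PCT` and `monthly_credit`, is an illustrative assumption and not actual billing logic; the SLA document remains the authority.

```python
# Assumed per-violation credit percentages, read from the Critical tier example.
PER_VIOLATION_PCT = {"availability": 30, "latency": 10}
# Assumed overall cap: credits never exceed 100% of the monthly bill.
TOTAL_CAP_PCT = 100


def monthly_credit(monthly_bill: float, violations: list[str]) -> float:
    """Return the credit owed for one month, given the kinds of SLA violations.

    Each violation earns a flat percentage regardless of how long it lasted,
    and the summed percentage is capped at TOTAL_CAP_PCT.
    """
    total_pct = sum(PER_VIOLATION_PCT[kind] for kind in violations)
    return monthly_bill * min(total_pct, TOTAL_CAP_PCT) / 100


if __name__ == "__main__":
    # One availability violation and one latency violation on a 10,000 bill.
    print(monthly_credit(10_000, ["availability", "latency"]))
```

Under these assumptions, a duration-based scheme would differ only in how `total_pct` is computed, which is exactly the detail worth checking in your own SLA.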

Issue Management

Another important factor customers have to look at in an SLA is how the MSP handles issues. Issue management commonly includes communication touchpoints, named contacts, escalation procedure, issue impact and severity levels, first-response time and, in some cases, resolution time. Customers should familiarize themselves with each of these aspects and make their internal teams aware of them.

Instaclustr customers can get this information in our Support Policy document, which gives clear details on issue management and sets the right expectations.

Customer Responsibilities

Although the primary purpose of an SLA is for MSPs to provide guarantees on key metrics such as availability and latency, MSPs usually add a “help us help you” clause. It basically means: in order for MSPs to uphold those SLA guarantees, customers have certain responsibilities that they have to commit to. If your SLA doesn’t have customer responsibilities recorded, talk to your MSP and get it clarified first thing.

The Instaclustr SLA has a list of customer responsibilities that must be met in order for us to deliver on those SLA guarantees. Each SLA tier, which is based on the number of nodes in a cluster, has different levels of responsibilities: customers take on more responsibilities to receive a higher level of SLA guarantees. For example, the “Small” tier for Kafka requires customers to maintain a minimum of 3 replicas per Kafka topic to get the SLA guarantees that the tier promises, while the “Critical” tier for Kafka requires customers not only to maintain 3 replicas per topic but also to maintain separate testing and production clusters. This upfront transparency avoids uncertainty and unpleasant surprises if an issue arises.
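Customers may find it useful to check these conditions themselves before relying on a guarantee. The Python sketch below encodes only the two Kafka examples given above (3 replicas per topic for “Small”, plus separate testing and production clusters for “Critical”). The `TIER_REQUIREMENTS` table, the function name and the input shapes are hypothetical; a real SLA lists more conditions than these.

```python
# Hypothetical encoding of the two Kafka tier examples described in the text.
TIER_REQUIREMENTS = {
    "Small": {"min_replicas": 3, "separate_test_cluster": False},
    "Critical": {"min_replicas": 3, "separate_test_cluster": True},
}


def unmet_responsibilities(tier, topic_replicas, has_separate_test_cluster):
    """Return a list of reasons why the tier's conditions are not met.

    topic_replicas maps each topic name to its replication factor.
    An empty list means every condition modelled here is satisfied.
    """
    req = TIER_REQUIREMENTS[tier]
    problems = [
        f"topic '{name}' has {count} replicas, needs {req['min_replicas']}"
        for name, count in topic_replicas.items()
        if count < req["min_replicas"]
    ]
    if req["separate_test_cluster"] and not has_separate_test_cluster:
        problems.append("no separate testing cluster")
    return problems


if __name__ == "__main__":
    print(unmet_responsibilities("Small", {"orders": 3, "logs": 2}, False))
```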

Service/Technology Dependent SLA

Instaclustr’s SLA guarantees are defined on the basis of the technology under management and its cluster configuration. For instance, a customer with a Kafka cluster comprising 5 production nodes will be guaranteed 99.95% availability for writes but is given no guarantee for latency. However, if the same customer has another Kafka cluster with 12 production nodes (and meets other documented conditions), they will be guaranteed 99.999% availability and will also receive a guarantee on 99th percentile latency. Similarly, Cassandra clusters of different sizes come with different tiers of SLA: the larger the cluster (more nodes), the higher the availability of data. This guarantee is backed by our experience in providing massive-scale technology as a fully managed service. Instaclustr simply offers the best SLA realistically possible for the technology and the cluster configuration.

If your MSP is promising a 100% availability SLA irrespective of the technology under management, its size and its configuration, that is simply not realistic. Be sure to check for exclusions and also review the compensation clause to make sure it is substantial.
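It helps to translate availability percentages into the downtime they actually allow. The short Python sketch below does that conversion for a 30-day month; the function name and the 30-day default are assumptions made for illustration, and a real SLA defines its own measurement period.

```python
def allowed_downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime permitted in a period at a given availability target."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - availability_pct) / 100


if __name__ == "__main__":
    for target in (99.9, 99.95, 99.99, 99.999, 100.0):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.2f} minutes per 30 days")
```

Over 30 days, 99.95% leaves about 21.6 minutes of downtime and 99.999% leaves roughly 26 seconds, while 100% leaves none at all, which is why an unconditional 100% promise deserves a close look at its exclusions.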

Hold Harmless Clauses

This is probably the most important section to watch out for in an SLA. MSPs need to protect themselves from situations where an SLA guarantee isn’t met because of conditions outside their control. MSPs operate in several autonomous environments with several technologies and integrations. Even with best practices in place, sometimes something will go wrong through no fault of the MSP.

For example, Instaclustr has this clause: “All service levels exclude outages caused by non-availability of service at the underlying cloud provider region level or availability zone level in regions which only support two availability zones”. Clauses like this are critical from an MSP’s perspective because they protect the MSP from conditions outside its control; we simply can’t do anything about such large-scale disruptions in the underlying cloud infrastructure. However, an unreasonable exclusion would be to exclude failures of virtual machines (VMs) in the cloud. With hundreds or thousands of VMs running on a cloud, it is normal to expect VM failures. The large-scale technologies we operate can easily be designed to handle VM (node) failures without hurting availability. If your MSP has a VM failure exclusion clause, it is time to have a conversation.

Another necessary exclusion is significant, unexpected changes in application behavior. An MSP has limited visibility into changes in your application environment (either business-level or technical) that may impact the delivered services. However, we do share with our customers the objective of maximising their cluster’s performance and availability. Communicating, and in some cases testing, significant changes before deploying to production is necessary to ensure the service can cope with the change. Instaclustr SLAs for latency exclude “unusual/unplanned workload on cluster caused by customer’s application”, and generally exclude “issues caused by customer actions including but not limited to attempting to operate a cluster beyond available processing or storage capacity”. Customers can avoid being impacted by these clauses by contacting our technical operations team ahead of significant changes to manage the risks.

It is common for MSPs to add a third-party exclusion clause. This means that if an issue is found to be in a third-party application they integrate with, and over which they have no control, they are safeguarded. Instaclustr, however, does not have any exclusion related to third-party applications. Given that we operate in the open source technology space, a clause like this would shield us from any issue in Kafka or Cassandra, which are exactly what our customers are paying us to manage. If your MSP has exclusions related to third-party applications, it is time to closely review them if you haven’t done so already.

Customers should understand these exclusions in depth and make sure they are not unreasonable. This will need a thorough review by both technical and legal teams.

It’s Not Just About Uptime

Over the years, uptime or availability has become the default term when discussing an SLA with MSPs. It is definitely the most important metric you would want covered. However, there are many other metrics that could be very valuable to a customer’s business. It is important to make sure that the SLA covers the right metrics—the ones that matter to the customer based on the application and the technology under management.

Instaclustr has always included availability and latency in SLAs, as they are two key metrics for the technologies we manage. We are glad to announce the recent addition of the Recovery Point Objective (RPO) metric to our SLA. We take daily backups, with an option of continuous backup (5-minute intervals), and our technical operations team has always treated data recovery as a top-priority incident because it affects the customer’s business. So it just made sense to add an RPO guarantee to our SLA.
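An RPO is easiest to reason about as a bound on the gap between the last usable backup and the moment of failure. The Python sketch below checks that gap against a target, assuming the target matches the backup interval: less than 24 hours for daily backups and less than 5 minutes for continuous backups. The labels `standard` and `continuous` and the function names are hypothetical, chosen only for this illustration.

```python
from datetime import datetime, timedelta

# Assumed RPO targets, matching the daily and 5-minute backup intervals.
RPO_TARGETS = {
    "standard": timedelta(hours=24),
    "continuous": timedelta(minutes=5),
}


def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Span of writes that cannot be restored: failure time minus last backup."""
    return failure - last_backup


def meets_rpo(last_backup: datetime, failure: datetime, option: str = "standard") -> bool:
    """True if the unrecoverable window is strictly shorter than the RPO target."""
    return data_loss_window(last_backup, failure) < RPO_TARGETS[option]


if __name__ == "__main__":
    backup = datetime(2024, 1, 2, 0, 0)
    print(meets_rpo(backup, datetime(2024, 1, 2, 23, 0), "standard"))
    print(meets_rpo(backup, datetime(2024, 1, 2, 0, 7), "continuous"))
```

Running a restore test and comparing the measured window with the target is a simple way to confirm that the guarantee holds for your own cluster.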

SLAs Are Negotiable

SLAs are a mutual contract; in fact, they are a legal agreement between two parties. Although it is primarily the MSP’s responsibility to draft the initial SLA, customers have the right to negotiate. Customers should review it thoroughly and, before signing, negotiate if something important to their business is missing. The Instaclustr Sales and Customer Success teams handle most of these discussions. We welcome any SLA requirements from customers and are open to negotiation.

SLAs Aren’t Everything

SLAs are a quantitative, contractual promise with financial penalties for failures. They are important for covering the most important aspects of service delivery. However, SLAs can never capture every aspect of delivery of a high-quality service that meets the customer’s needs. At Instaclustr, meeting our SLAs is a core focus but just the first step in providing a service that meets and exceeds our customers’ expectations. For example, ticket response time guarantees tell you how quickly you will get a first response to a ticket but not the quality of response. At Instaclustr we have expert engineers as the first responders to all tickets to provide the best quality response possible.

Instaclustr SLA Update

As part of our commitment to keep our SLAs relevant and realistic, we regularly review them on the basis of quantitative evidence. We have recently updated our SLAs to strengthen our commitments. Here is a summary of the changes.

Cassandra
- Recovery Point Objective (RPO) is added as a new metric, stating: “We will maintain backups to allow a restoration of data with less than 24 hours data loss for standard backups and less than 5 minutes data loss for our Continuous Back-ups option. Should we fail to meet this Recovery Point Objective, you will be eligible for SLA credits of 100% of monthly fees for the relevant cluster. If you have undertaken restore testing of your cluster in the last 6 months (using our automated restore functionality) and can demonstrate that data loss during an emergency restore is outside target RPO and your verification testing, then you will be eligible for SLA credits of 500% of monthly fees”.
- Starter plan: best-effort availability target increased from 99.9% to 99.95%.
- Small plan: 99.9% availability for consistency ONE has been increased to 99.95% availability for consistency LOCAL_QUORUM.
- Enterprise plan: availability increased from 99.95% to 99.99%.

Kafka
- Starter plan: best-effort availability target increased from 99.9% to 99.95%.
- Small plan: availability increased from 99.9% to 99.95%.
- Enterprise plan: availability increased from 99.95% to 99.99%.
- Critical plan: availability increased from 99.99% to 99.999%.

More details and the complete SLA can be found on the Policy documentation page on the Instaclustr website.
