A question that we sometimes get from potential customers is “why would I need a managed service when I could just use a Kubernetes operator?”

While some of the answers are reasonably obvious (for example, someone else deals with the alerts in the middle of the night), others are more subtle. These same answers also raise considerations about whether Kubernetes and a Kubernetes operator are the right choice at all when planning how to manage open source data technologies such as Apache Kafka, Apache Cassandra, and OpenSearch.

Managing open source data technologies is a topic that we are very familiar with at NetApp Instaclustr. This blog sets out a range of things to consider when thinking about using a Kubernetes operator to manage your open source data technology, and when comparing that approach against using a managed service.

The clear additional benefits of a managed service

At the core of most modern managed services (and certainly the NetApp Instaclustr managed service) is an automated provisioning and management control plane which in many ways performs a function similar to a Kubernetes operator. However, using NetApp Instaclustr as an example, there are some key additional benefits provided by a managed service vs. running the technology yourself with Kubernetes:

  1. Built-in support team and SLAs: Most modern applications need to run 24×7. This means you need a support/SRE/ops team ready and equipped to respond to issues 24×7. With a managed service, this team is built-in and driven to achieve agreed SLAs, relieving you of the need to recruit, pay for and retain the scarce and expert staff needed for these roles.
  2. Ready to go: A Kubernetes operator for your chosen technology will likely come packaged with a lot of the tools you need for an operational environment. However, you still need to work out how to configure your environment, add tools for monitoring, and develop operational procedures, as well as to test and train your team. Even then, if this is a new environment, you’ll inevitably encounter some “learnings” that come in the form of early production issues. By contrast, if you choose an established managed service, you’re getting a proven operational environment, an experienced team, and procedures ready to go with effectively no lead time.
  3. Security management: In any modern environment, configuring and managing security is a major piece of work and requires specific attention for each technology that you run. Most managed services will have pre-existing security certifications (for example NetApp Instaclustr has SOC 2, PCI, and ISO27001) that mean you don’t need to do the analysis and configuration yourself to achieve a secure and compliant configuration. In addition, the managed service should take on the load of regular patching as well as reviewing new CVEs as they are reported, while determining and executing an appropriate plan of action.
  4. Shared learnings: While you will learn a lot in the early days of running one of these technologies, there are always things that come up much less frequently and that you’ll end up learning along the way. If you are using a managed service, then chances are that an experienced team will have already seen the issue before you encounter it. They will know how to deal with it (or have already taken steps to prevent it) as opposed to you having to figure it out on your own. Even if it’s an issue that you’ve seen before, maintaining institutional knowledge about how to deal with an issue that you see every year or two is hard, whereas a managed service provider may see this type of issue weekly or monthly and will be well practiced at dealing with or preventing it. To give one example: (getting close to) running out of disk space on an instance store-based node is a situation that can be thorny to deal with if you don’t have plans and tools in place. NetApp Instaclustr has developed, through experience, tools and techniques to seamlessly and transparently deal with issues like this.
  5. Core software support: Finally, the examples above are all related to how you operate the software, but what if you are impacted by issues with the core open source technology itself? You need a plan for diagnosing and fixing the bug. This is included in the NetApp Instaclustr managed service, and any managed service provider you choose should include it as part of their service as well, reducing the risk to you. However, be careful to read the fine print as not all managed services include this scope.

The (potentially) hidden costs of a Kubernetes operator

In addition to the clear benefits you receive when using a managed service, there are hidden costs of relying on a Kubernetes operator and running it yourself that you should consider, including:

  1. Additional complexity: Kubernetes operators used for complex, distributed software are complex applications in their own right. For example, the popular Strimzi operator for Kafka is over 480,000 lines of code and its documentation is over 500 pages long. For reliable and non-trivial use of the operator, you will need to become familiar with much of this detail to avoid unintended side effects.
  2. Edge case management: It is a well-established axiom that automating an activity will often make it harder to undertake a slight variation of that activity than if the task were manual. For example, a common operation in Apache Cassandra® is adding nodes to a cluster. This requires adding each node, waiting for the data to be copied to the new node and then running an operation called “clean up” on all the other nodes in the cluster to remove data that they no longer need. The k8ssandra operator for Cassandra automates these steps. However, what if I wanted to add three nodes but just add one node per night to minimize impact in peak times and then run cleanups (which is a fairly time-consuming and resource-intensive operation) once all three were done? There doesn’t appear to be any way to do this with the operator, so you’d be back to figuring out how to do it manually and then (carefully!) trying to get the operator config back in sync with the running cluster. While these edge case issues still exist with a managed service control plane, it’s the support team’s job to know how to deal with them and shield users from the impact.
  3. Limited commonality between operators: Most operators are completely separate open source projects. So, while they generally follow standard Kubernetes patterns, each group of authors brings its own assumptions and working models when designing the functionality of the operator. This limits the extent to which you can take learnings from an operator for one technology and apply them to the operator for a different technology. By contrast, a key design goal for NetApp Instaclustr (and I imagine other managed services offering multiple technologies) is to maximize the similarity of experience from one technology to the next. To provide one example of this: using the Strimzi operator for Apache Kafka, you can add nodes to your cluster by increasing the value of the “Kafka.spec.kafka.replicas” property (and applying the change) or by using the kubectl scale command. The k8ssandra operator for Apache Cassandra, on the other hand, has a property called “size” and does not appear to support the kubectl scale command. While this is one relatively trivial example, there are many such examples, including more complex ones such as the behavior of individual operations.
  4. Still need (even more) support: Using a Kubernetes operator to automate operations for your open source data technology doesn’t take away the need for support of the core technology. If you’re impacted by a bug (or just don’t know how to do something), will you maintain the expertise to dive into the code and figure it out yourself, rely on community support, or pay for commercial support? This also extends to having answers to the same questions for the operator itself. The NetApp Instaclustr managed service by contrast includes not only complete support for the control plane functionality but also for the core open source software.
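
To make the edge case in point 2 more concrete, here is a minimal sketch of the plan you would have to work out by hand: one new node per night, with cleanup deferred until the last node has joined. It only produces an ordered list of steps and does not talk to Cassandra or to any operator. The function name, the step labels, and the node names are all hypothetical, and two choices are assumptions on our part: that cleanup runs on a separate night after the final node is added, and that it runs on every node except the last one added.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    night: int   # 1-based night on which the step runs
    action: str  # "add_node" or "cleanup" (hypothetical labels)
    node: str    # node the action targets


def plan_staged_expansion(existing_nodes, new_nodes):
    """Return ordered steps: one node added per night, cleanup deferred to the end."""
    steps = []
    # Add one new node per night so data streaming stays out of peak hours.
    for night, node in enumerate(new_nodes, start=1):
        steps.append(Step(night, "add_node", node))
    if not new_nodes:
        return steps
    # Assumption: cleanup runs once, on the night after the final node joins,
    # on every node except the last one added.
    cleanup_night = len(new_nodes) + 1
    for node in list(existing_nodes) + list(new_nodes[:-1]):
        steps.append(Step(cleanup_night, "cleanup", node))
    return steps


if __name__ == "__main__":
    for step in plan_staged_expansion(["c1", "c2", "c3"], ["c4", "c5", "c6"]):
        print(f"night {step.night}: {step.action} {step.node}")
```

Even in this toy form, the plan carries state across several nights that somebody has to own. Executing each step, and then bringing the operator's declared configuration back in line with the cluster, is the manual work described above.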

It’s also worth noting, while not directly related to the use of an operator, that our performance testing has shown that the overhead of Kubernetes can have a performance impact, particularly on lightweight operations such as inserts in Apache Cassandra.

Conclusion

Modern data technology is complex, and while using a Kubernetes operator can mask much of that complexity for typical day-to-day operations, it does not remove the need to deal with that complexity as soon as things inevitably move away from the happy path. A managed service, such as NetApp Instaclustr, ensures that the core control plane automation is backed by a team with the experience, expertise, and motivation to take care of all eventualities and ensure the ongoing reliable operation of your data layer technology.

Request a free demo today and experience the advantages of a managed open source platform.