NetApp recently started using Kubernetes with our fleet automation tooling on the NetApp Instaclustr Platform. It took us a few attempts to get the right setup for our purposes. This blog is about how we started, what didn’t work for us, and how we fixed it.

The choice to introduce Kubernetes into our architecture was driven largely by the Red Hat open source Ansible AWX project announcing that the future of Ansible AWX is Kubernetes. NetApp Instaclustr Ansible AWX services orchestrate the automation of some of our fleet management, including responding to an urgent vulnerability or performing some routine software patching. Our Ansible AWX setup has been running very smoothly in Docker containers on AWS EC2 instances for 7+ years. Without the architectural change from the AWX Project, we wouldn’t have moved away from Docker. If it ain’t broke, don’t fix it!

Our product and development teams explored options for our AWX deployment to run with Kubernetes. Kubernetes can be powerful, but it adds a lot of operational overhead. We initially chose AWS Fargate with EKS, aiming to reduce that operational overhead of managing Kubernetes infrastructure.

This decision was informed by resources such as My Learning: AWS Fargate with EKS, the AWS EKS User Guide, and Getting Started with EKS. These guides suggested that combining Fargate with EKS could simplify Kubernetes management.

Although the benefits of AWS Fargate with EKS were promising, our trial did not succeed. We reverted to running EKS workers on self-managed AWS EC2 instances instead. Here’s why Fargate didn’t meet our requirements:

  1. Fargate provisioned worker nodes without any CPU and memory resource requirement inputs, resulting in under-provisioned nodes that crippled our larger AWX jobs.
  2. Fargate's on-demand node provisioning added a significant start delay to AWX jobs.
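The first point comes down to resource requests. Fargate sizes the node it provisions from the CPU and memory requests declared in the pod spec; if none are declared, it falls back to a minimal default allocation. A sketch of the kind of explicit requests that would have been needed (the pod name, image tag, and sizes here are illustrative assumptions, not our production spec):

```yaml
# Hypothetical AWX task pod with explicit resource requests.
# Fargate derives node size from these values; omitting them is what
# left our larger AWX jobs on under-provisioned nodes.
apiVersion: v1
kind: Pod
metadata:
  name: awx-task-example    # illustrative name
  labels:
    app: awx-task
spec:
  containers:
    - name: worker
      image: quay.io/ansible/awx-ee:latest
      resources:
        requests:
          cpu: "2"          # assumed sizing for a large AWX job
          memory: 4Gi
        limits:
          cpu: "2"
          memory: 4Gi
```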

The observations below are specific to our internal automation use cases, infrastructure choices, and configuration decisions. They should not be interpreted as universal performance findings about any of the services described, and results may vary for different workloads.

Provisioning opacity

Fargate and EKS had no visibility into the resource needs of individual AWX jobs, so the on-demand model could not allocate CPU and memory to match them. Fargate's on-demand node provisioning often resulted in under-provisioned resources, especially RAM.

AWX tasks that targeted many instances ran over five times slower on Fargate with EKS compared to our Docker setup, and some resource-heavy Ansible AWX jobs failed entirely due to Fargate’s strict ulimit settings.

Shifting the provisioning overhead to AWS eliminated our operational control and made job failures highly unpredictable for Support Engineers.

Frequent refresh of inventory metadata

Those who are familiar with the AWX setup know that a shorter cache timeout means AWX will need to gather inventory metadata more often, potentially increasing job run times. Conversely, a longer timeout improves performance but might lead to using outdated information. It's a balancing act depending on the needs of the playbook use case.

For most of our use cases we don't want the cached inventory to be too old: up-to-date inventory metadata is needed to understand a large amount of context about the targets of our AWX jobs. Ideally the cached inventory would be no more than about 30 minutes old. Bear in mind that this refresh of inventory metadata can, at times, cover over 20,000 targets.
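The timeout trade-off described above surfaces as a single knob in an Ansible inventory plugin configuration. A hedged sketch using the `amazon.aws.aws_ec2` inventory plugin (assuming an EC2-backed inventory source; the region and cache path are illustrative):

```yaml
# aws_ec2.yml — illustrative inventory source config showing the
# cache trade-off: shorter timeout = fresher metadata, slower job starts.
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1               # assumed region
cache: true
cache_plugin: jsonfile
cache_connection: /tmp/awx_inventory_cache   # illustrative path
# Under ~30 minutes, per our freshness requirement; a longer timeout
# would speed up job starts but risks stale metadata when the
# inventory covers 20,000+ targets.
cache_timeout: 1800
```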

With our previous setup on Docker containers, we could take the time hit and refresh the cache where we needed an extremely up-to-date inventory; other playbooks could run with the cached version so that the job started with the least possible delay.

So, in addition to the provisioning opacity, the on-demand model broke our operations because inventory metadata couldn't be cached between runs. The metadata had to be refreshed more often, so we were constantly refreshing, even when a slightly out-of-date version would have been fine.

Refresh time became another problem. Previously we could tolerate a short delay and refresh before running the playbook. With Fargate that refresh became about 8 times slower, and in the worst case up to 11 times slower.

Job start delay for short operations

We also found that the shift to Fargate with EKS introduced a job start delay caused by the provisioning latency of Fargate resources. AWX task activity can increase suddenly when we respond to issues, and every Fargate scaling operation or pod replacement paid a provisioning overhead: acquiring capacity, pulling the AWX container image, and attaching network interfaces. This meant a notable delay before a playbook would even begin to run.

With Docker containers on EC2 instances, a typical job needed no time for provisioning; the infrastructure was already there. On Fargate, certain AWX jobs experienced an additional start delay before any Ansible was executed, which added significant overhead to our operations.

AWX playbook | Execution duration for AWX on Fargate/EKS
Apache Cassandra keyspace command—add CDC | Runtime became up to 8 times slower
Gather fact—all gateway servers | Runtime became up to 11 times slower

What we saw was a simple, short-running operation go from taking seconds to complete to taking multiple minutes just to start. The additional wait time was ultimately unacceptable for our operational engineers running the AWX playbooks that maintain our fleet, particularly at scale.

At this point the negative impact on operation run times outweighed the benefit of the serverless architecture, and we accepted that the serverless setup wasn't suitable for this operational use case. We discovered that we did need control of resource planning, particularly for Ansible AWX automation. We're still considering where serverless instances would benefit other components of our control plane and operational tooling.

Reversing inventory refresh delay

When we moved the workload back onto our own AWS EC2 instances, the inventory refresh delay immediately returned to a short and tolerable duration. Because these EC2 hosts are always running, AWX no longer needed to wait for new instances to be provisioned before performing the inventory sync, which effectively removed the provisioning time from the workflow.

However, this improvement comes with a trade‑off. By relying on EC2 instances that remain online continuously, we shift to an “always‑on” cost model for the supporting infrastructure. This means we gain faster execution and predictable performance, but at the expense of paying for the compute resources even when they are idle.

We use our internal AMIs to provision, monitor, and patch the self-managed worker nodes. New AWX task pods are scheduled on existing running nodes, and this shift immediately addressed the bottlenecks we had identified: the job start delay was cut from minutes down to mere seconds.

Reversing job start delay

Our product and development teams have now implemented a combination of compute-optimized, self-managed AWS EC2 Kubernetes worker instances with AWS EKS managing the Kubernetes cluster, which is a better fit for our Ansible jobs.
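AWX can direct its task pods at a specific set of workers through a container group, which lets you override the default pod spec. A hedged sketch of pinning task pods to the self-managed, compute-optimized node group (the namespace, node label value, and resource sizes are assumptions for illustration):

```yaml
# Hypothetical AWX container group pod spec override: schedule task
# pods onto the self-managed, compute-optimized EC2 worker nodes.
apiVersion: v1
kind: Pod
metadata:
  namespace: awx            # assumed namespace
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: c5.2xlarge   # assumed instance type
  serviceAccountName: default
  containers:
    - name: worker
      image: quay.io/ansible/awx-ee:latest
      args: ["ansible-runner", "worker", "--private-data-dir=/runner"]
      resources:
        requests:
          cpu: "1"          # illustrative sizing
          memory: 2Gi
```

Because these nodes are always running, pods scheduled this way skip the capacity-acquisition step entirely.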

Here is the same scenario we outlined above, with data on execution duration for AWX on EC2/EKS:

AWX playbook | Execution duration for AWX on Fargate/EKS | Execution duration for AWX on EC2/EKS
Apache Cassandra keyspace command—add CDC | Runtime became up to 8 times slower | Runtime returned to original timing
Gather fact—all gateway servers | Runtime became up to 11 times slower | Runtime was reduced and is now up to 3 times quicker

The way we work today

How do things look since we moved away from the Fargate approach that had introduced operational delays, inconsistent performance, and additional overhead into the AWX workflow?

We are now running Ansible AWX on Kubernetes, meeting the new technical requirement introduced by Ansible, which was the core success criterion of this work. By returning to a self-managed EC2 fleet, downtime and manual intervention have become far less frequent, giving us a more stable platform overall. Moving away from Fargate removed these constraints and returned us to an execution model that better supports the requirements of AWX, particularly predictable run times and steady controller availability. This setup gives our operations team a robust, predictable, and high-performance AWX environment and has made our workflows faster and less error-prone.

The NetApp Instaclustr Platform delivers you reliability at scale and frees your teams from the overhead of IT operations. Read Top 10 Reasons Why Enterprises Choose NetApp Instaclustr, or try it for yourself—start a free trial today on the NetApp Instaclustr Managed Platform and discover reliability at scale.