Implementing Amazon Managed Workflows for Apache Airflow (MWAA) at Instaclustr

In this blog post I’m going to detail the setup of Amazon Managed Workflows for Apache Airflow (MWAA) within an existing network. Yes, you read that right: even a managed open source software services company sometimes uses managed open source software services!

While there are many tutorials available for setting up MWAA in a brand new VPC and network, including ones that use CloudFormation or Terraform, you, like me, may not be starting from scratch. Instead, you may want to use MWAA with your existing network, possibly with other layers of security in place. To create this setup I largely followed AWS’ own quick start tutorial for MWAA, and I will detail a few aspects of the implementation that weren’t as simple as described, in the hope that you can avoid the same problems.

Network Access for Airflow Workers

Part of our internal security includes an egress firewall with a whitelist that defines, by IP, which endpoints individual infrastructure resources in our cloud may access. Because MWAA dynamically creates workers in a subnet, individual IPs can’t be used to identify them. Fortunately, the tutorial points us to VPC endpoints, which can be used to give a whole security group access to specific endpoints within AWS, such as the MWAA API endpoint for provisioning, or the Key Management Service (KMS) so that Airflow can use our keys. No problem, right? Set up VPC endpoints as per the tutorial, add the Airflow security group to them, and everything will work!

Figure 1. Firewall whitelist by IP and by endpoint

Well, we did, and deployed to our staging environment to soak over the weekend. The environment spun up correctly, and I could interact with the front end (Figure 2). Success! However, over the weekend more and more alarms went off for various other services. It turns out that when you create a VPC endpoint for a particular AWS service, everything in the VPC will attempt to use that endpoint, and will not fall back to the service’s standard public IPs if the endpoint policies block access to a required resource. So, for a weekend, our other systems were unable to access various endpoints they needed.

Figure 2. All traffic routed through the VPC endpoint
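
To make the failure mode concrete, here is a toy model of the behaviour described above. It is only a sketch: the class, function, and caller names are hypothetical, and it does not call any AWS API. It simply encodes the rule that once an endpoint exists for a service, a caller the endpoint rejects is blocked rather than rerouted through the egress firewall.

```python
from dataclasses import dataclass, field


@dataclass
class VpcEndpoint:
    """Toy stand-in for a VPC endpoint serving one AWS service (hypothetical)."""
    service: str
    allowed_callers: set = field(default_factory=set)


def route_request(caller: str, service: str, endpoints: dict) -> str:
    """Return where a request from `caller` to `service` ends up."""
    endpoint = endpoints.get(service)
    if endpoint is None:
        # No endpoint for this service: traffic leaves via the egress firewall.
        return "egress-firewall"
    if caller in endpoint.allowed_callers:
        return "vpc-endpoint"
    # The endpoint exists but rejects this caller. There is no fallback
    # to the public route, so the request fails.
    return "blocked"


# Assumed setup: a KMS endpoint that only admits the MWAA workers.
endpoints = {"kms": VpcEndpoint("kms", allowed_callers={"mwaa-worker"})}

for caller, service in [("mwaa-worker", "kms"),
                        ("logging-service", "kms"),
                        ("logging-service", "s3")]:
    print(caller, service, route_request(caller, service, endpoints))
```

In this model the hypothetical logging service loses KMS access the moment the endpoint is created, even though its firewall rule has not changed, which matches what our alarms were telling us.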

Our solution was to whitelist whole subnets in our firewall for the key AWS services that our internal Instaclustr systems also rely on (such as logging and KMS), so that any MWAA worker in those subnets has access, and to use VPC endpoints only for MWAA-specific access, such as the MWAA environment itself (Figure 3). This way, MWAA and all our existing services reach the shared AWS services through the firewall, and nothing else in the VPC is forced through an endpoint it is not allowed to use.

Figure 3. VPC endpoint with MWAA-only access
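
The idea of whitelisting by subnet rather than by individual IP can be sketched with Python’s standard ipaddress module. The CIDR ranges and the helper name below are made up for illustration and are not our real network layout; the point is only that a dynamically created worker matches the rule wherever it lands inside the subnet.

```python
import ipaddress
from typing import Sequence

# Hypothetical subnets that MWAA is allowed to create workers in.
MWAA_SUBNETS = ["10.0.10.0/24", "10.0.11.0/24"]


def is_whitelisted(source_ip: str, allowed_subnets: Sequence[str]) -> bool:
    """Return True if source_ip falls inside any whitelisted subnet."""
    ip = ipaddress.ip_address(source_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in allowed_subnets)


# A worker with an unpredictable address inside a whitelisted subnet passes.
print(is_whitelisted("10.0.10.57", MWAA_SUBNETS))
# An address outside those subnets does not.
print(is_whitelisted("10.0.12.5", MWAA_SUBNETS))
```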

What Else Needs Network Access?

Internally we use a CI/CD tool to deploy our infrastructure via Terraform. This includes AWS VPC endpoints and security groups. The AWS Airflow API for MWAA is whitelisted through our egress gateway. Can you guess what happened when our CD tool created the VPC endpoint? It tried to route its own traffic through that endpoint, found it was blocked, and just waited, forever, hoping that the environment would eventually report that it was provisioned. That never happened, because the tool’s traffic was blocked at the VPC endpoint and would not fall back to the egress firewall.

I could see that the environment had spun up correctly, but without access to the API endpoint the CD tool would never know the correct state. This meant that if we ever wanted to update the MWAA environment, we would have to destroy nearly everything and start over. As it turns out, we did do this a couple of times to get the environment into a happy state with everything reported correctly.
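
One general guard against a tool that waits forever is to bound the wait. The sketch below is not how our CD tool or Terraform works internally; it is a generic polling loop with a deadline. The function name, the get_status callable, and the "AVAILABLE" and "CREATING" status strings are all assumptions made for the example.

```python
import time


def wait_for_status(get_status, wanted="AVAILABLE", timeout_s=1800.0,
                    interval_s=30.0, sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns `wanted` or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == wanted:
            return status
        if clock() >= deadline:
            # Fail loudly instead of hanging when the status never arrives,
            # for example because the API endpoint is unreachable.
            raise TimeoutError(f"last status was {status!r}, wanted {wanted!r}")
        sleep(interval_s)


# Demo with a fake status source, so nothing here talks to AWS or sleeps.
fake_statuses = iter(["CREATING", "CREATING", "AVAILABLE"])
print(wait_for_status(lambda: next(fake_statuses),
                      timeout_s=5, interval_s=0, sleep=lambda s: None))
```

A timeout would not have fixed the blocked route, but it would have turned a silent hang into an error pointing at the unreachable endpoint.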

Once created, the MWAA environment needed access to two different databases. The first was our internal database, which was easy to set up with a security group and a login that restricts the tables the Airflow runners can access. Now we were moving smoothly through the implementation. The other was our reporting database, running as an Instaclustr managed PostgreSQL® server. This was simple to set up using a VPC peering connection from the Instaclustr VPC to our internal VPC where MWAA sits, all done with simple Terraform structures.

Figure 4. Using VPC Peering
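
One precondition worth checking before peering two VPCs is that their IPv4 CIDR blocks do not overlap, since AWS does not allow peering between VPCs with matching or overlapping ranges. The check below is a small sketch using the standard ipaddress module; the two CIDR values and the helper name are placeholders, not our actual ranges.

```python
import ipaddress

# Placeholder CIDR blocks for the two VPCs being peered.
INTERNAL_VPC_CIDR = "10.0.0.0/16"    # VPC where MWAA sits (assumed value)
REPORTING_VPC_CIDR = "10.1.0.0/16"   # managed PostgreSQL VPC (assumed value)


def cidrs_overlap(cidr_a: str, cidr_b: str) -> bool:
    """Return True if the two CIDR blocks share any addresses."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


if cidrs_overlap(INTERNAL_VPC_CIDR, REPORTING_VPC_CIDR):
    print("CIDR blocks overlap: peering these VPCs will not work")
else:
    print("CIDR blocks are disjoint: peering is possible")
```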

I hope I have given you a taste of some of the challenges that can come with using tutorials to integrate new software into what you already have. There is a common joke that software developers just copy and paste code from Stack Overflow or tutorials. In practice, it is important to take some time to understand the implementation details specific to your own system and configuration, as real-life implementations are rarely exactly the same as what you find online.
