The workflow management ecosystem is growing rapidly in popularity, and developers are starting to ask which solution is the best fit for them.
In this article we’ll take a high-level look at two of these technologies, Apache Airflow and Cadence, explore what makes them tick, and perhaps help you work out which one might be right for you.
What Is Apache Airflow?
Apache Airflow is a workflow management solution that grew out of Airbnb. Its creators were managing increasingly complex and diverse workflows and searched the market for a solution that would simplify their lives; when nothing appropriate turned up, they built one themselves.
Airflow has been an open source project from the outset; it entered the Apache Incubator in 2016 and has since evolved into what it is today: a mature, widely used workflow management solution.
Apache Airflow allows users to develop workflow definitions as DAGs (Directed Acyclic Graphs), which are written as Python files.
Airflow is written in Python.
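To give a feel for what that looks like, here is a minimal sketch of a DAG file, assuming a recent Airflow 2.x release; the DAG and task names are made up for illustration:

```python
# Minimal DAG sketch, assuming Airflow 2.x; names are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_airflow",            # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,            # only runs when triggered manually
    catchup=False,
) as dag:
    say_hello = BashOperator(
        task_id="say_hello",
        bash_command="echo 'Hello from Airflow'",
    )
```

Dropping a file like this into the configured DAGs folder is enough for the scheduler to pick it up.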
What Is Cadence?
Cadence is a fault-oblivious, stateful code platform and workflow engine with similar origins to Apache Airflow: it grew out of the Uber engineering team to help service the enormous growth the company has experienced over the years.
Cadence is completely open source, the Cadence team at Uber has announced long-term support for the project, and its developer community is growing.
Cadence workflows are developed as imperative code using the Cadence client SDK, which supports Java and Go natively, with Python and Ruby supported via third-party projects.
Cadence is written in Go.
Instaclustr offers a Managed Cadence service, which takes the hassle out of setting up and maintaining a cluster.
Airflow Architecture
An Apache Airflow cluster can be configured in several ways, but large-scale deployments most commonly run as a collection of cooperating applications:
1. Apache Airflow
Airflow itself comprises several component services:
a. Scheduler service – Polls the metastore for registered DAGs and queues their tasks for execution when they are due.
b. Web server – An HTTP user interface that gives users the ability to interact with their workflows.
c. Executor workers – Perform the work defined in the DAGs: they execute the task logic and return the results.
2. Metastore
A SQL database that stores DAG and task metadata, including the state of workflow runs. PostgreSQL, MySQL, and SQLite are supported.
3. Queuing service
When using the distributed Celery executor, Airflow relies on the Celery asynchronous task queue to farm tasks out to workers, and Celery requires a message broker to be configured; RabbitMQ and Redis are supported.
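As a rough illustration of how these pieces are wired together, the airflow.cfg fragment below sketches a Celery-based deployment with PostgreSQL as the metastore and Redis as the broker. The hostnames and credentials are placeholders, and the exact section names can vary between Airflow versions:

```ini
; Sketch only: placeholder hosts and credentials, section names as in recent Airflow 2.x releases.
[core]
executor = CeleryExecutor

[database]
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@metastore-host:5432/airflow

[celery]
broker_url = redis://redis-host:6379/0
result_backend = db+postgresql://airflow:airflow@metastore-host:5432/airflow
```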
Control Flow
Airflow workflow definitions, or DAGs, are expressed as a collection of tasks. Tasks are configured with explicit dependencies and can be set up to execute in parallel or to obey a strict order (a short sketch follows at the end of this section).
DAGs are submitted to Airflow and registered in the metastore.
It is the job of the scheduler to check all the known DAGs and queue the workflow tasks that are due to be executed.
The subscribed executor workers then pick up whatever tasks they are assigned, run the Python code in each task, and return the results.
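As a sketch of how those explicit dependencies look in practice (assuming a recent Airflow 2.x release; the task names are made up), the DAG below runs two tasks in parallel after an initial task, then waits for both before the final one:

```python
# Dependency wiring sketch, assuming Airflow 2.x; task names are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="dependency_example",       # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = EmptyOperator(task_id="extract")
    clean = EmptyOperator(task_id="clean")
    enrich = EmptyOperator(task_id="enrich")
    load = EmptyOperator(task_id="load")

    # clean and enrich both depend on extract and can run in parallel;
    # load runs only once both of them have finished.
    extract >> [clean, enrich] >> load
```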
User Interface
Airflow offers a web user interface where users can view DAG definitions, visualise workflow tasks, see running workflows, and drill into the details of a particular workflow.
Airflow Use Cases
Airflow is commonly deployed as a data pipelining service. Typical use cases include ETL (Extract, Transform, Load) jobs and data consolidation jobs that run on a periodic schedule.
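To make the ETL use case a little more concrete, here is a hedged sketch of a daily pipeline using Airflow’s TaskFlow API (Airflow 2.x); the DAG name, data, and totals are invented for illustration:

```python
# ETL sketch using the TaskFlow API, assuming Airflow 2.x; the data is invented.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule_interval="@daily", start_date=datetime(2023, 1, 1), catchup=False)
def daily_sales_etl():
    @task
    def extract() -> list:
        # A real pipeline would pull rows from a source system here.
        return [{"order_id": 1, "amount": 42.0}, {"order_id": 2, "amount": 13.5}]

    @task
    def transform(rows: list) -> float:
        # Consolidate the extracted rows into a single daily total.
        return sum(row["amount"] for row in rows)

    @task
    def load(total: float) -> None:
        # A real task would write to a warehouse; here we just log the value.
        print(f"Daily sales total: {total}")

    load(transform(extract()))


daily_sales_etl()
```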
Airflow is also a popular tool in machine learning, where it is used to build and validate training data for AI models.
Airflow Pros and Cons
Airflow is a widely used and mature application with a healthy ecosystem of third-party provider integrations.
As DAGs are written in Python, users have access to any Python libraries they need to implement their workflow logic.
Airflow doesn’t inherently handle failures and retries; they need to be handled by the code written into the DAG file, which can add a lot of boilerplate to workflow definitions and make them difficult to maintain.
Airflow DAGs must be written in Python, which may limit its appeal in some environments.
Clusters require multiple applications and pieces of infrastructure running together, which can make them difficult to set up, configure, and manage.