Stackable Operator for Apache Airflow

The Stackable Operator for Apache Airflow manages Apache Airflow instances on Kubernetes. Apache Airflow is an open-source application for creating, scheduling, and monitoring workflows. Workflows are defined as code, with tasks that can be run on a variety of platforms, including Hadoop, Spark, and Kubernetes itself. Airflow is a popular choice to orchestrate ETL workflows and data pipelines.

Getting started

Get started using Airflow with the Stackable Operator by following the Getting started guide. It guides you through installing the Operator alongside a PostgreSQL database and Redis instance, connecting to your Airflow instance and running your first workflow.

Resources

The Operator manages two custom resources: the AirflowCluster and the AirflowDB. Based on these custom resources, it creates a number of Kubernetes resources.

Custom resources

The AirflowCluster is the main resource for configuring the Airflow instance. The resource defines three roles: webserver, worker and scheduler. The various configuration options are explained in the Usage guide, which helps you tune your cluster to your needs by configuring resource usage, security, logging and more.

When an AirflowCluster is first deployed, an AirflowDB resource is created. The AirflowDB resource is a wrapper resource for the metadata SQL database that Airflow uses to store information on users and permissions as well as workflows, task instances and their execution. The resource contains some configuration and also keeps track of whether the database has been initialized. The AirflowDB is not deleted automatically when an AirflowCluster is deleted, so it can be reused.

Kubernetes resources

Based on the custom resources you define, the Operator creates ConfigMaps, StatefulSets and Services.

A diagram depicting the Kubernetes resources created by the operator

The diagram above depicts the Kubernetes resources created by the Operator and how they relate to each other. The Job created for the AirflowDB is not shown.

For every role group you define, the Operator creates a StatefulSet with the number of replicas defined in the RoleGroup. Every Pod in the StatefulSet has two containers: the main container running Airflow and a sidecar container gathering metrics for Monitoring. The Operator creates a Service per role group as well as a single Service for the whole webserver role called <clustername>-webserver.

The Operator creates one ConfigMap per RoleGroup and one for the AirflowDB. Each ConfigMap contains two files, log_config.py and webserver_config.py, which hold the logging configuration and the general Airflow configuration respectively.
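As a rough illustration of this mapping, the following Python sketch lists the Kubernetes objects one would expect for a given set of role groups. It is not part of the Operator. The function name expected_resources and the role group naming scheme <clustername>-<role>-<rolegroup> are assumptions made for this example; only the <clustername>-webserver Service name comes from the description above. The ConfigMap and the Job belonging to the AirflowDB are left out.

```python
from collections import Counter


def expected_resources(cluster_name, role_groups):
    """Return (kind, name) pairs expected for the given role groups.

    role_groups maps a role name to a list of role group names.
    """
    resources = []
    for role, groups in role_groups.items():
        for group in groups:
            # Assumed naming scheme, used here for illustration only.
            name = f"{cluster_name}-{role}-{group}"
            # One StatefulSet, one Service and one ConfigMap per role group.
            resources.append(("StatefulSet", name))
            resources.append(("Service", name))
            resources.append(("ConfigMap", name))
    # One additional Service for the whole webserver role.
    if "webserver" in role_groups:
        resources.append(("Service", f"{cluster_name}-webserver"))
    return resources


resources = expected_resources(
    "airflow",
    {"webserver": ["default"], "worker": ["default"], "scheduler": ["default"]},
)
counts = Counter(kind for kind, _ in resources)
for kind, count in sorted(counts.items()):
    print(kind, count)
```

With one role group per role, this yields three StatefulSets, three ConfigMaps and four Services, the fourth being the role-wide webserver Service.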

Dependencies

Airflow requires an SQL database in which to store its metadata. The Stackable platform does not have its own Operator for an SQL database, but the Getting started guide walks you through installing an example database alongside an Airflow instance that you can use to get started.

Using custom workflows/DAGs

Directed acyclic graphs (DAGs) of tasks are the core entities you will use in Airflow. Have a look at the page on Mounting DAGs to learn about the different ways of loading your custom DAGs into Airflow.
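To make the term concrete: a DAG is a set of tasks plus "runs before" dependencies that never form a cycle, which guarantees that a valid execution order exists. The following sketch illustrates this with Python's standard graphlib module (Python 3.9 or later). It does not use the Airflow API, and the task names are invented for the example.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task; its value is the set of tasks that must finish first.
# The task names are invented for this example.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# A topological sort gives an order in which every task runs after
# the tasks it depends on.
order = list(TopologicalSorter(dependencies).static_order())
print(order)

# Making "extract" depend on "report" closes a loop, so the graph is
# no longer acyclic and no valid execution order exists.
cyclic = dict(dependencies, extract={"report"})
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

An Airflow DAG file expresses the same idea with Airflow's own operators and dependency syntax; see the Mounting DAGs page for how such files are loaded.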

Demo

You can install the airflow-scheduled-job demo and explore an Airflow installation, as well as how it interacts with Apache Spark.

Supported Versions

The Stackable Operator for Apache Airflow currently supports the following versions of Airflow:

  • 2.2.3

  • 2.2.4

  • 2.2.5

  • 2.4.1