Stackable Operator for Apache Airflow

The Stackable Operator for Apache Airflow manages Apache Airflow instances on Kubernetes. Apache Airflow is an open-source application for creating, scheduling, and monitoring workflows. Workflows are defined as code, with tasks that can be run on a variety of platforms, including Hadoop, Spark, and Kubernetes itself. Airflow is a popular choice to orchestrate ETL workflows and data pipelines.

Getting started

Get started using Airflow with the Stackable Operator by following the Getting started guide. It guides you through installing the Operator alongside a PostgreSQL database and Redis instance, connecting to your Airflow instance and running your first workflow.

Custom resources

The AirflowCluster custom resource configures the Airflow instance. It defines three roles: webserver, worker and scheduler. The worker role is embedded within spec.celeryExecutors, as described in the next section. The Usage guide explains the available configuration options and helps you tune your cluster to your needs by configuring resource usage, security, logging and more.

Executors

The worker role is deployed when spec.celeryExecutors is specified. The alternative is spec.kubernetesExecutors, whereby Pods are created dynamically as needed, without jobs being routed through a Redis queue to the workers. This means that for kubernetesExecutors there is a single implicit role that does not appear in the resource definition. Both variants are illustrated below:

celeryExecutors

spec:
  ...
  celeryExecutors:
    roleGroups:
      default:
        envOverrides:
          ...
        configOverrides:
          ...
        replicas: 2
    config:
      logging:
        ...

kubernetesExecutors

spec:
  ...
  kubernetesExecutors:
    config:
      logging:
        ...
      resources:
        ...
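
The relationship between the executor setting and the roles that appear in the resource definition can be sketched in a few lines of Python. This is only an illustration of the rule described above, not the Operator's implementation: the function name explicit_roles and the plain-dict representation of spec are hypothetical, and the check that exactly one executor key is set is an assumption drawn from the two being described as alternatives.

```python
def explicit_roles(spec: dict) -> list[str]:
    """Return the roles that appear explicitly in the resource definition.

    `spec` is a hypothetical plain-dict stand-in for the `spec` section
    of an AirflowCluster resource.
    """
    uses_celery = "celeryExecutors" in spec
    uses_kubernetes = "kubernetesExecutors" in spec

    # Assumption: the two executor settings are alternatives, so exactly
    # one of them is expected to be present.
    if uses_celery == uses_kubernetes:
        raise ValueError(
            "expected exactly one of celeryExecutors or kubernetesExecutors"
        )

    roles = ["webserver", "scheduler"]
    if uses_celery:
        # The worker role only exists as an explicit role with Celery.
        roles.append("worker")
    return roles
```

Calling explicit_roles({"kubernetesExecutors": {}}) yields only the webserver and scheduler roles, since executor Pods are created on demand and are not declared as a role.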

Kubernetes resources

Based on the custom resources you define, the Operator creates ConfigMaps, StatefulSets and Services.

A diagram depicting the Kubernetes resources created by the operator

The diagram above depicts all the Kubernetes resources created by the operator, and how they relate to each other.

For every role group you define, the Operator creates a StatefulSet with the number of replicas defined in the RoleGroup. Every Pod in the StatefulSet has two containers: the main container running Airflow and a sidecar container gathering metrics for Monitoring. The Operator creates a Service per role group, as well as a single Service for the whole webserver role called <clustername>-webserver.

Additionally, a ConfigMap is created for each RoleGroup. These ConfigMaps contain two files, log_config.py and webserver_config.py, which contain logging and general Airflow configuration respectively.
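
As a rough way to reason about what to expect in the cluster, the per-role-group rule above can be written as a short Python sketch. The function expected_resources and its input shape (a mapping from role name to role group names) are hypothetical, and the sketch deliberately does not guess the names of the per-role-group objects; only the <clustername>-webserver Service name comes from the text above.

```python
def expected_resources(
    cluster_name: str, role_groups: dict[str, list[str]]
) -> list[tuple[str, str]]:
    """List (kind, scope) pairs for the resources described above.

    `role_groups` maps a role name to its role group names, for example
    {"webserver": ["default"]}. This is an illustrative input shape,
    not the Operator's data model.
    """
    resources = []
    for role, groups in role_groups.items():
        for group in groups:
            scope = f"role group {role}/{group}"
            # One StatefulSet, one Service and one ConfigMap per role group.
            resources.append(("StatefulSet", scope))
            resources.append(("Service", scope))
            resources.append(("ConfigMap", scope))

    if "webserver" in role_groups:
        # A single Service for the whole webserver role.
        resources.append(("Service", f"{cluster_name}-webserver"))
    return resources
```

For a cluster named airflow with one default role group each for the webserver and scheduler roles, this lists six per-role-group resources plus the airflow-webserver Service.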

Required external components

Airflow requires an SQL database in which to store its metadata, as well as Redis for job execution. The required external components page lists all supported databases and Redis versions to use in production. You need to provide these components yourself for production use, but the Getting started guide walks you through installing an example database and Redis instance alongside an Airflow instance that you can use to get started.

Redis is only needed if the executors have been set to spec.celeryExecutors, because jobs are queued via Redis before being assigned to a worker Pod. When using spec.kubernetesExecutors, the scheduler takes direct responsibility for this.
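
The dependency rule above is small enough to express as a Python sketch, which may be useful as a checklist when planning a deployment. The function name required_external_components and the component labels "sql-database" and "redis" are hypothetical; only the rule itself comes from the text.

```python
def required_external_components(executor: str) -> set[str]:
    """Return the external components needed for the given executor setting.

    `executor` is the name of the key used under `spec`:
    "celeryExecutors" or "kubernetesExecutors".
    """
    if executor not in ("celeryExecutors", "kubernetesExecutors"):
        raise ValueError(f"unknown executor setting: {executor}")

    # The SQL database for Airflow metadata is always required.
    components = {"sql-database"}
    if executor == "celeryExecutors":
        # Jobs are queued via Redis before reaching a worker Pod.
        components.add("redis")
    return components
```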

Using custom workflows/DAGs

Directed acyclic graphs (DAGs) of tasks are the core entities you will use in Airflow. Have a look at the page on Mounting DAGs to learn about the different ways of loading your custom DAGs into Airflow.
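
To make the term concrete, the following Python sketch models a small graph of tasks using only the standard library. It does not use the Airflow API and is not how DAGs are written for Airflow; the task names extract, transform and load are hypothetical. It only shows the two properties the name implies: dependencies determine a valid execution order, and cycles are not allowed.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def execution_order(deps: dict[str, set[str]]) -> list[str]:
    """Return one valid order in which the tasks can run."""
    return list(TopologicalSorter(deps).static_order())


def has_cycle(deps: dict[str, set[str]]) -> bool:
    """Return True if the dependencies do not form an acyclic graph."""
    try:
        execution_order(deps)
    except CycleError:
        return True
    return False


order = execution_order(dependencies)
```

Because the three tasks form a simple chain, the only valid order is extract, then transform, then load. A graph in which two tasks depend on each other would be rejected by has_cycle.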

Demo

You can install the airflow-scheduled-job demo and explore an Airflow installation, as well as how it interacts with Apache Spark.

Supported Versions

The Stackable Operator for Apache Airflow currently supports the following versions of Airflow:

  • 2.7.2

  • 2.6.3

  • 2.6.1 (deprecated)