Stackable Operator for Apache Airflow
The Stackable Operator for Apache Airflow manages Apache Airflow instances on Kubernetes. Apache Airflow is an open-source application for creating, scheduling, and monitoring workflows. Workflows are defined as code, with tasks that can be run on a variety of platforms, including Hadoop, Spark, and Kubernetes itself. Airflow is a popular choice to orchestrate ETL workflows and data pipelines.
Get started using Airflow with the Stackable Operator by following the Getting started guide. It guides you through installing the Operator alongside a PostgreSQL database and Redis instance, connecting to your Airflow instance and running your first workflow.
The AirflowCluster is the resource for the configuration of the Airflow instance. The resource defines three roles: webserver, worker and scheduler (the worker role is embedded within spec.celeryExecutors: this is described in the next section). The various configuration options are explained in the Usage guide. It helps you tune your cluster to your needs by configuring resource usage, security, logging and more.
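As a rough sketch of how the three roles might appear in a single AirflowCluster resource, consider the fragment below. The apiVersion, the cluster name and the role keys webservers and schedulers are assumptions not stated on this page; only celeryExecutors, roleGroups and replicas are taken from the text. Required settings such as the image and database credentials are elided, so this is not a complete or deployable definition.

```yaml
# Illustrative sketch only; field names other than celeryExecutors are assumed.
apiVersion: airflow.stackable.tech/v1alpha1  # assumed API group and version
kind: AirflowCluster
metadata:
  name: airflow            # hypothetical cluster name
spec:
  # image and cluster-wide settings elided
  webservers:              # assumed key for the webserver role
    roleGroups:
      default:
        replicas: 1
  schedulers:              # assumed key for the scheduler role
    roleGroups:
      default:
        replicas: 1
  celeryExecutors:         # worker role, embedded as described above
    roleGroups:
      default:
        replicas: 2
```

Consult the Usage guide for the authoritative field names and the full set of options.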
The worker role is deployed when spec.celeryExecutors is specified (the alternative is spec.kubernetesExecutors, whereby Pods are created dynamically as needed without jobs being routed through a Redis queue to the workers). This means that for kubernetesExecutors there exists an implicit single role which does not appear in the resource definition. This is illustrated below:
spec:
  ...
  celeryExecutors:
    roleGroups:
      default:
        envOverrides:
          ...
        configOverrides:
          ...
        replicas: 2
    config:
      logging:
        ...
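For comparison, a kubernetesExecutors definition might look like the following sketch. Because the role is implicit, the fragment assumes that there are no roleGroups and no replicas, and that settings such as config sit directly under kubernetesExecutors. That layout is an assumption based on the description above, not something this page confirms.

```yaml
# Illustrative sketch only; the exact layout under kubernetesExecutors is assumed.
spec:
  ...
  kubernetesExecutors:
    # no roleGroups or replicas: Pods are created dynamically as needed
    envOverrides:
      ...
    configOverrides:
      ...
    config:
      logging:
        ...
```

The two executor settings are described as alternatives, so a cluster would normally specify one of them, not both.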
Based on the custom resources you define, the Operator creates ConfigMaps, StatefulSets and Services.
The diagram above depicts all the Kubernetes resources created by the operator, and how they relate to each other.
For every role group you define, the Operator creates a StatefulSet with the number of replicas defined in the RoleGroup. Every Pod in the StatefulSet has two containers: the main container running Airflow and a sidecar container gathering metrics for Monitoring. The Operator creates a Service per role group as well as a single Service for the whole webserver role.
Additionally, a ConfigMap is created for each RoleGroup. These ConfigMaps contain two files: a logging configuration file and webserver_config.py, which contain the logging and the general Airflow configuration respectively.
Airflow requires an SQL database in which to store its metadata as well as Redis for job execution. The required external components page lists all supported databases and Redis versions to use in production. You need to provide these components yourself for production use, but the Getting started guide walks you through installing an example database and Redis instance alongside an Airflow instance so that you can try things out.
Redis is only needed if the executors have been set to