This guide assumes you already have the demo
This demo will
Install the required Stackable operators
Spin up the follow data products
Postgresql: An open-source database used for Airflow cluster- and job-metadata.
Redis: An in-memory data structure store used for queuing Airflow jobs
Airflow: An open-source workflow management platform for data engineering pipelines.
Mount two Airflow jobs (referred to as Directed Acyclic Graphs, or DAGs) for the cluster to use
Enable and schedule the jobs
Verify the job status with the Airflow Webserver UI
You can see the deployed products as well as their relationship in the following diagram:
To list the installed Stackable services run the following command:
$ stackablectl services list PRODUCT NAME NAMESPACE ENDPOINTS EXTRA INFOS airflow airflow default webserver-airflow http://172.18.0.2:31979 Admin user: airflow, password: airflow
When a product instance has not finished starting yet, the service will have no endpoint. Starting all the product instances might take a considerable amount of time depending on your internet connectivity. In case the product is not ready yet a warning might be shown.
Superset gives the ability to execute SQL queries and build dashboards.
webserver-airflow in your browser (
http://172.18.0.2:31979 in this case).
Log in with the username
airflow and password
airflow. The overview screen shows the DAGs that have been mounted during the demo set-up (
There are two things to notice here. Both DAGs have been enabled, as shown by the slider to the left of the DAG name (DAGs are initially all
paused and can be activated either manually in the UI or via a REST call, as done in the setup for this demo):
date_demo job has been busy, with several runs already logged! The
sparkapp_dag has only been run once. This is because they have been defined with different schedules.
Clicking on the number under
Runs will display the individual job runs:
demo_date job is running every minute. With Airflow, DAGs can be started manually or scheduled to run when certain conditions are fulfilled: in this case the DAG has been set up to run using a cron table, which is part of the DAG definition.
Let’s drill down a bit deeper into this DAG. Click on one of the individual job runs shown in the previous step to display the job details. The DAG is displayed in the form of a graph (this job is so simple that it only has one step, called
In the top right-hand corner there is some scheduling information, which tells us that this job will run every minute continuously:
Click on the
run_every_minute box in the centre of the page and then select
This will navigate to the worker where this job was run (with multiple workers the jobs will be queued and distributed to the next free worker) and display the log. In this case the output is a simple printout of the timestamp:
To look at the actual DAG code click on
Code. Here we can see the crontab information used to schedule the job as well the
bash command that provides the output:
Go back to DAG overview screen. The
sparkapp_dag job has a scheduled entry of
None and a last-execution time (
2022-09-19, 07:36:55). This allows a DAG to be executed exactly once, with neither schedule-based runs nor any backfill. The DAG can always be triggered manually again via REST or from within the Webserver UI.
By navigating to the graphical overview of the job we can see that DAG has two steps, one to start the job - which runs asynchronously - and another to poll the running job to report on its status.
The logs for the first task -
spark-pi-submit - indicate that it has been started, at which point the task exits without any further information:
The second task -
spark-pi-monitor - polls this job and waits for a final result (in this case:
Success). In this case, the actual result of the job (a value of
pi) is logged by Spark in its driver pod, but more sophisticated jobs would persist this in a sink (e.g. a Kafka topic or HBase row) or use the result to trigger subsequent actions.