First steps

Once you have followed the steps in the Installation section to install the operator and its dependencies, you can create a Spark job. Afterwards, verify that it works by looking at the logs from the driver pod.

Starting a Spark job

A Spark application is made up of three components:

Job: builds a spark-submit command from the resource and passes it to internal Spark code, together with templates for building the driver and executor pods.
Driver: starts the designated number of executors and removes them when the job is completed.
Executor(s): responsible for executing the job itself.

Create a Spark application by running:

kubectl apply -f application.yaml

The application manifest file points to the application file to be started, as well as its configuration and the resources it needs.

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: pyspark-pi (1)
  namespace: default
spec:
  sparkImage: (2)
    productVersion: 3.5.8
  mode: cluster (3)
  mainApplicationFile: local:///stackable/spark/examples/src/main/python/pi.py (4)
  job: (5)
    config:
      resources:
        cpu:
          min: "1"
          max: "2"
        memory:
          limit: "1Gi"
  driver: (6)
    config:
      resources:
        cpu:
          min: "1"
          max: "2"
        memory:
          limit: "1Gi"
  executor: (7)
    replicas: 1
    config:
      resources:
        cpu:
          min: "1"
          max: "2"
        memory:
          limit: "1Gi"

1	`metadata.name` contains the name of the SparkApplication
2	`spec.sparkImage`: the image used by the job, driver and executor pods. This can be a custom image built by the user or an official Stackable image. Available official images are stored in the Stackable image registry. Information on how to browse the registry can be found here.
3	`spec.mode`: only `cluster` is currently supported
4	`spec.mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job. This path is relative to the image, so in this case an example Python script (which calculates the value of pi) is run. It is bundled with the Spark code and is therefore already present in the job image.
5	`spec.job`: settings specific to the spark-submit command.
6	`spec.driver`: driver-specific settings.
7	`spec.executor`: executor-specific settings.
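After the manifest has been applied, it can be useful to confirm that the cluster accepted the resource. The following is a minimal sketch, not part of the original guide: the helper name `check_spark_app` is hypothetical, and it assumes the custom resource can be queried under the name `sparkapplication` (the lower-cased `kind` from the manifest).

```shell
# Sketch: confirm that a SparkApplication resource exists.
# $1 = resource name (metadata.name), $2 = namespace (defaults to "default").
# Assumption: kubectl resolves "sparkapplication" to the kind used in the manifest.
check_spark_app() {
  if kubectl get sparkapplication "$1" --namespace "${2:-default}" >/dev/null 2>&1; then
    echo "SparkApplication $1 found in namespace ${2:-default}"
  else
    echo "SparkApplication $1 not found in namespace ${2:-default}" >&2
    return 1
  fi
}
```

For the manifest above, the call would be `check_spark_app pyspark-pi default`.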

Verify that it works

As mentioned above, the SparkApplication that has just been created builds a spark-submit command and passes it to the driver Pod, which in turn creates executor Pods that run for the duration of the job before being cleaned up. A running process looks like this:

pyspark-pi-xxxx: this is the initializing job that creates the spark-submit command (named as metadata.name with a unique suffix)
pyspark-pi-xxxxxxx-driver: the driver pod that drives the execution
pythonpi-xxxxxxxxx-exec-x: the set of executors started by the driver (the number of executors is set by spec.executor.replicas, which is 1 in the manifest above)
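The naming scheme above can be used to tell the pods apart when listing them. The sketch below is illustrative only: `classify_spark_pods` is a hypothetical helper that reads pod names from standard input and labels them by suffix. It assumes the `-driver` and `-exec-` patterns shown in the list, and anything else (including the initial job pod) is reported as `other`.

```shell
# Sketch: label each pod name read from stdin using the naming scheme above.
# The helper name classify_spark_pods is hypothetical.
classify_spark_pods() {
  while IFS= read -r pod; do
    case "$pod" in
      *-driver) echo "driver: $pod" ;;
      *-exec-*) echo "executor: $pod" ;;
      *)        echo "other: $pod" ;;   # includes the initial job pod
    esac
  done
}

# Example pipeline (requires access to the cluster), shown as a comment:
# kubectl get pods --namespace default --no-headers \
#   -o custom-columns=NAME:.metadata.name | classify_spark_pods
```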

Job progress can be followed by issuing this command:

if kubectl wait pods -l 'job-name=pyspark-pi' \
  --for jsonpath='{.status.phase}'=Succeeded \
  --timeout 300s; then
  echo "job succeeded"
else
  echo "job failed"
  exit 1
fi
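Once the job has succeeded, the driver log can be inspected, as suggested at the start of this page. The following is a rough sketch rather than an official procedure: `driver_logs` is a hypothetical helper that locates the driver pod by the `-driver` name suffix described above and assumes there is only one such pod in the namespace.

```shell
# Sketch: print the log of the driver pod, located by its "-driver" name suffix.
# $1 = namespace (defaults to "default"). Picks the first match if several exist.
driver_logs() {
  namespace="${1:-default}"
  # "kubectl get pods -o name" prints entries of the form pod/<name>
  driver_pod=$(kubectl get pods --namespace "$namespace" -o name | grep -- '-driver$' | head -n 1)
  if [ -z "$driver_pod" ]; then
    echo "no driver pod found in namespace $namespace" >&2
    return 1
  fi
  kubectl logs --namespace "$namespace" "$driver_pod"
}
```

For the example script, running `driver_logs default | grep "Pi is roughly"` should show the computed approximation, assuming the bundled `pi.py` prints its result in that form.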

When the job completes, the driver cleans up the executors. The initial job is persisted for several minutes before being removed. The completed state looks like this: