Job Dependencies

Overview

With the platform release 23.4.1 and Apache Spark 3.3.x (and all previous releases), dynamic provisioning of dependencies using the Spark packages field doesn't work. This is a known problem with Spark and is tracked upstream.

The container images provided by Stackable include Apache Spark and PySpark applications and libraries. They also include commonly used libraries for connecting to storage systems that support the hdfs://, s3a:// and abfs:// protocols, which are typically used to store the data processed by Spark applications.

Sometimes the applications need to integrate with additional systems or use processing algorithms not included in the Apache Spark distribution. This guide explains how you can provision your Spark jobs with additional dependencies to support these requirements.

Dependency provisioning

There are multiple ways to submit Apache Spark jobs with external dependencies. Each has its own advantages and disadvantages, and the choice between them depends on existing technical and managerial constraints.

To provision job dependencies in Spark workloads, you construct the SparkApplication with one of the following dependency specifications:

  • Custom Spark images

  • Dependency volumes

  • Maven/Java packages

  • Python packages

The following table provides a high-level overview of the relevant aspects of each method.

Dependency specification   Job image size   Reproducibility   Dev-op cost
Custom Spark images        Large            Guaranteed       Medium to High
Dependency volumes         Small            Guaranteed       Small to Medium
Maven/Java packages        Small            Not guaranteed   Small
Python packages            Small            Not guaranteed   Small

Custom Spark images

With this method, you submit a SparkApplication for which the sparkImage refers to the full custom image name. It is recommended to start the custom image from one of the Stackable images to ensure compatibility with the Stackable operator.

Below is an example of a custom image that includes a JDBC driver:

FROM docker.stackable.tech/stackable/spark-k8s:3.5.1-stackable24.3.0 (1)

RUN curl --fail -o /stackable/spark/jars/postgresql-42.6.0.jar "https://jdbc.postgresql.org/download/postgresql-42.6.0.jar"
1 Start from an existing Stackable image.

After building the image and pushing it to a registry that your cluster can pull from, reference it in the SparkApplication as shown in the following snippet:

apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: spark-jdbc
spec:
  sparkImage:
    custom: "docker.stackable.tech/sandbox/spark-k8s:3.5.1-stackable0.0.0-dev" (1)
    productVersion: "3.5.1" (2)
    pullPolicy: IfNotPresent (3)
...
1 Name of the custom image.
2 Apache Spark version. Needed for the operator to take the correct actions.
3 Optional. Defaults to Always.
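
For illustration, below is a minimal sketch of PySpark application code that relies on the JDBC driver baked into the custom image. The connection URL, credentials and table name are placeholders, not values taken from the example above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-jdbc").getOrCreate()

# The PostgreSQL driver provisioned in the custom image sits in
# /stackable/spark/jars and is therefore already on the Spark class path.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://postgres.default.svc:5432/mydb")  # placeholder
    .option("driver", "org.postgresql.Driver")
    .option("dbtable", "public.my_table")  # placeholder
    .option("user", "myuser")              # placeholder
    .option("password", "mypassword")      # placeholder
    .load()
)

df.show()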

Dependency volumes

With this method, the job dependencies are provisioned from a PersistentVolume as shown in this example:

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-pvc
  namespace: default
spec:
  sparkImage:
    productVersion: 3.5.1
  mode: cluster
  mainApplicationFile: s3a://stackable-spark-k8s-jars/jobs/ny-tlc-report-1.0-SNAPSHOT.jar (1)
  mainClass: org.example.App (2)
  args:
    - "'s3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv'"
  sparkConf: (3)
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
    "spark.driver.extraClassPath": "/dependencies/jars/*"
    "spark.executor.extraClassPath": "/dependencies/jars/*"
  volumes:
    - name: job-deps (4)
      persistentVolumeClaim:
        claimName: pvc-ksv
  driver:
    config:
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies (5)
  executor:
    replicas: 3
    config:
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies (5)
1 Job artifact located on S3.
2 Job main class.
3 Spark configuration: the S3 credentials provider (anonymous access in this example) and the extra class path entries that make the dependencies on the mounted volume visible to the driver and executors.
4 The name of the volume, backed by a PersistentVolumeClaim that must already exist.
5 The mount path of the volume; it is referenced in the sparkConf section where the extra class path is defined for the driver and executors.
The Spark operator has no control over the contents of the dependency volume. It is your responsibility to make sure all required dependencies are installed in the correct versions.

A PersistentVolumeClaim, the associated PersistentVolume and a Job that populates the volume with dependencies can be defined like this (the container image and the downloaded artifact in the Job are only illustrative):

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-ksv (1)
spec:
  storageClassName: standard
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 2Gi
  hostPath:
    path: /some-host-location
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-ksv (2)
spec:
  volumeName: pv-ksv (1)
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: batch/v1
kind: Job
metadata:
  name: aws-deps
spec:
  template:
    spec:
      restartPolicy: Never
      volumes:
        - name: job-deps (3)
          persistentVolumeClaim:
            claimName: pvc-ksv (2)
      containers:
        - name: aws-deps
          # Illustrative only: any image that can download files will do.
          image: curlimages/curl:latest
          command: ["sh", "-c"]
          args:
            - >-
              mkdir -p /stackable/spark/dependencies/jars &&
              curl --fail -L -o /stackable/spark/dependencies/jars/aws-java-sdk-bundle-1.12.262.jar
              https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar
          volumeMounts:
            - name: job-deps (4)
              mountPath: /stackable/spark/dependencies
1 Reference to a PersistentVolume, defining some cluster-reachable storage.
2 The name of the PersistentVolumeClaim that references the PV.
3 Defines a Volume backed by the PVC, local to the Job.
4 Defines the VolumeMount used by the Job's container to populate the volume.

Maven packages

The most flexible way to provision dependencies is to use the built-in spark-submit support for Maven package coordinates.

The snippet below showcases how to add Apache Iceberg support to a Spark (version 3.4.x) application.

apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: spark-iceberg
spec:
  sparkConf:
    spark.sql.extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
    spark.sql.catalog.spark_catalog: org.apache.iceberg.spark.SparkSessionCatalog
    spark.sql.catalog.spark_catalog.type: hive
    spark.sql.catalog.local: org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.local.type: hadoop
    spark.sql.catalog.local.warehouse: /tmp/warehouse
  deps:
    packages:
      - org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1 (1)
...
1 Maven package coordinates for Apache Iceberg. The package is downloaded from the Maven repository and made available to the Spark application.
Currently it’s not possible to provision dependencies that are loaded by the JVM’s system class loader. Such dependencies include JDBC drivers. If you need access to JDBC sources from your Spark application, consider building your own custom Spark image as shown above.
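
To show how the configuration above is exercised at runtime, here is a minimal PySpark sketch that writes to and reads from the local Iceberg catalog defined in sparkConf; the database and table names are arbitrary examples.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-iceberg").getOrCreate()

# "local" is the Hadoop-backed Iceberg catalog configured in sparkConf above;
# the database and table names are made up for this example.
spark.sql("CREATE TABLE IF NOT EXISTS local.db.trips (id BIGINT, distance DOUBLE) USING iceberg")
spark.sql("INSERT INTO local.db.trips VALUES (1, 2.5)")
spark.sql("SELECT * FROM local.db.trips").show()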

Python packages

When submitting PySpark jobs, you can specify additional Python requirements that are installed before the driver and executor pods are created.

Here is an example:

apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: pyspark-report
spec:
  mainApplicationFile: /app/run.py (1)
  deps:
    requirements:
      - tabulate==0.8.9  (2)
...
1 The main application file. In this example it is assumed that the file is part of a custom image.
2 A Python package that is used by the application and installed when the application is submitted.
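
For illustration, a minimal sketch of what /app/run.py might look like; the data is made up and only serves to show that the provisioned tabulate package can be imported at runtime.

from pyspark.sql import SparkSession
from tabulate import tabulate  # provisioned via deps.requirements

spark = SparkSession.builder.appName("pyspark-report").getOrCreate()

# Made-up example data; a real job would read from an actual source.
rows = spark.createDataFrame(
    [("2021-07", 42), ("2021-08", 37)], ["month", "trips"]
).collect()

print(tabulate([(r.month, r.trips) for r in rows], headers=["month", "trips"]))

spark.stop()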