Job Dependencies
Overview
With the platform release 23.4.1 and Apache Spark 3.3.x (and all previous releases), dynamic provisioning of dependencies using the Spark packages field doesn’t work.
This is a known problem with Spark and is tracked here.
|
The container images provided by Stackable include Apache Spark and PySpark applications and libraries.
In addition, they include commonly used libraries to connect to storage systems supporting the hdfs://
, s3a://
and abfs://
protocols. These systems are commonly used to store data processed by Spark applications.
Sometimes the applications need to integrate with additional systems or use processing algorithms not included in the Apache Spark distribution. This guide explains how you can provision your Spark jobs with additional dependencies to support these requirements.
Dependency provisioning
There are multiple ways to submit Apache Spark jobs with external dependencies. Each has its own advantages and disadvantages and the choice of one over the other depends on existing technical and managerial constraints.
To provision job dependencies in Spark workloads, you construct the SparkApplication
with one of the following dependency specifications:
-
Custom Spark images
-
Dependency volumes
-
Maven/Java packages
-
Python packages
The following table provides a high level overview of the relevant aspects of each method.
Dependency specification | Job image size | Reproduciblity | Dev-op cost |
---|---|---|---|
Custom Spark images |
Large |
Guaranteed |
Medium to High |
Dependency volumes |
Small |
Guaranteed |
Small to Medium |
Maven/Java packages |
Small |
Not guaranteed |
Small |
Python packages |
Small |
Not guaranteed |
Small |
Custom Spark images
With this method, you submit a SparkApplication
for which the sparkImage
refers to the full custom image name. It is recommended to start the custom image from one of the Stackable images to ensure compatibility with the Stackable operator.
Below is an example of a custom image that includes a JDBC driver:
FROM docker.stackable.tech/stackable/spark-k8s:3.5.1-stackable24.3.0 (1)
RUN curl --fail -o /stackable/spark/jars/postgresql-42.6.0.jar "https://jdbc.postgresql.org/download/postgresql-42.6.0.jar"
1 | Start from an existing Stackable image. |
And the following snippet showcases an application that uses the custom image:
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
name: spark-jdbc
spec:
sparkImage:
custom: "docker.stackable.tech/sandbox/spark-k8s:3.5.1-stackable0.0.0-dev" (1)
productVersion: "3.5.1" (2)
pullPolicy: IfNotPresent (3)
...
1 | Name of the custom image. |
2 | Apache Spark version. Needed for the operator to take the correct actions. |
3 | Optional. Defaults to Always . |
Dependency volumes
With this method, the job dependencies are provisioned from a PersistentVolume
as shown in this example:
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
name: example-sparkapp-pvc
namespace: default
spec:
sparkImage:
productVersion: 3.5.1
mode: cluster
mainApplicationFile: s3a://stackable-spark-k8s-jars/jobs/ny-tlc-report-1.0-SNAPSHOT.jar (1)
mainClass: org.example.App (2)
args:
- "'s3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv'"
sparkConf: (3)
"spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
"spark.driver.extraClassPath": "/dependencies/jars/*"
"spark.executor.extraClassPath": "/dependencies/jars/*"
volumes:
- name: job-deps (4)
persistentVolumeClaim:
claimName: pvc-ksv
driver:
config:
volumeMounts:
- name: job-deps
mountPath: /dependencies (5)
executor:
replicas: 3
config:
volumeMounts:
- name: job-deps
mountPath: /dependencies (5)
1 | Job artifact located on S3. |
2 | Job main class |
3 | Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in s3, accessed without credentials) |
4 | the name of the volume mount backed by a PersistentVolumeClaim that must be pre-existing |
5 | the path on the volume mount: this is referenced in the sparkConf section where the extra class path is defined for the driver and executors |
The Spark operator has no control over the contents of the dependency volume. It is your responsibility to make sure all required dependencies are installed in the correct versions. |
A PersistentVolumeClaim
and the associated PersistentVolume
can be defined like this:
---
apiVersion: v1
kind: PersistentVolume
metadata:
name: pv-ksv (1)
spec:
storageClassName: standard
accessModes:
- ReadWriteOnce
capacity:
storage: 2Gi
hostPath:
path: /some-host-location
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: pvc-ksv (2)
spec:
volumeName: pv-ksv (1)
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1Gi
---
apiVersion: batch/v1
kind: Job
metadata:
name: aws-deps
spec:
template:
spec:
restartPolicy: Never
volumes:
- name: job-deps (3)
persistentVolumeClaim:
claimName: pvc-ksv (2)
containers:
- name: aws-deps
volumeMounts:
- name: job-deps (4)
mountPath: /stackable/spark/dependencies
1 | Reference to a PersistentVolume , defining some cluster-reachable storage |
2 | The name of the PersistentVolumeClaim that references the PV |
3 | Defines a Volume backed by the PVC, local to the Custom Resource |
4 | Defines the VolumeMount that is used by the Custom Resource |
Maven packages
The last and most flexible way to provision dependencies is to use the built-in spark-submit
support for Maven package coordinates.
The snippet below showcases how to add Apache Iceberg support to a Spark (version 3.4.x) application.
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
name: spark-iceberg
spec:
sparkConf:
spark.sql.extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog: org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type: hive
spark.sql.catalog.local: org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.local.type: hadoop
spark.sql.catalog.local.warehouse: /tmp/warehouse
deps:
packages:
- org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1 (1)
...
1 | Maven package coordinates for Apache Iceberg. This is downloaded from the Manven repository and made available to the Spark application. |
Currently it’s not possible to provision dependencies that are loaded by the JVM’s system class loader. Such dependencies include JDBC drivers. If you need access to JDBC sources from your Spark application, consider building your own custom Spark image as shown above. |
Python packages
When submitting PySpark jobs, users can specify additional Python requirements that are installed before the driver and executor pods are created.
Here is an example:
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
name: pyspark-report
spec:
mainApplicationFile: /app/run.py (1)
deps:
requirements:
- tabulate==0.8.9 (2)
...
1 | The main application file. In this example it is assumed that the file is part of a custom image. |
2 | A Python package that is used by the application and installed when the application is submitted. |