Examples

The following examples have the following spec fields in common:

version: the current version is "1.0"
sparkImage: the docker image that is used by job, driver and executor pods. This can be provided by the user.
mode: only cluster is currently supported
mainApplicationFile: the artifact (Java, Scala or Python) that forms the basis of the Spark job.
args: these are the arguments passed directly to the application. In the examples below it is e.g. the input path for part of the public New York taxi dataset.
sparkConf: these list spark configuration settings that are passed directly to spark-submit and which are best defined explicitly by the user. Since the SparkApplication "knows" that there is an external dependency (the s3 bucket where the data and/or the application is located) and how that dependency should be treated (i.e. what type of credential checks are required, if any), it is better to have these things declared together.
volumes: refers to any volumes needed by the SparkApplication, in this case an underlying PersistentVolumeClaim.
driver: driver-specific settings, including any volume mounts.
executor: executor-specific settings, including any volume mounts.

Job-specific settings are annotated below.

Pyspark: externally located dataset, artifact available via PVC/volume mount

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-image
  namespace: default
spec:
  image: oci.stackable.tech/stackable/ny-tlc-report:0.2.0 (1)
  sparkImage:
    productVersion: 3.5.6
  mode: cluster
  mainApplicationFile: local:///stackable/spark/jobs/ny_tlc_report.py (2)
  args:
    - "--input 's3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv'" (3)
  deps:
    requirements:
      - tabulate==0.8.9 (4)
  sparkConf: (5)
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
  job:
    config:
      resources:
        cpu:
          min: "1"
          max: "1"
        memory:
          limit: "1Gi"
  driver:
    config:
      resources:
        cpu:
          min: "1"
          max: "1500m"
        memory:
          limit: "1Gi"
  executor:
    replicas: 3
    config:
      resources:
        cpu:
          min: "1"
          max: "4"
        memory:
          limit: "2Gi"

1	Job image: this contains the job artifact that is retrieved from the volume mount backed by the PVC
2	Job python artifact (local)
3	Job argument (external)
4	List of python job requirements: these are installed in the Pods via `pip`.
5	Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store)

JVM (Scala): externally located artifact and dataset

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-pvc
  namespace: default
spec:
  sparkImage:
    productVersion: 3.5.6
  mode: cluster
  mainApplicationFile: s3a://stackable-spark-k8s-jars/jobs/ny-tlc-report-1.0-SNAPSHOT.jar (1)
  mainClass: org.example.App (2)
  args:
    - "'s3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv'"
  sparkConf: (3)
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
    "spark.driver.extraClassPath": "/dependencies/jars/*"
    "spark.executor.extraClassPath": "/dependencies/jars/*"
  volumes:
    - name: job-deps (4)
      persistentVolumeClaim:
        claimName: pvc-ksv
  driver:
    config:
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies (5)
  executor:
    replicas: 3
    config:
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies (5)

1	Job artifact located on S3.
2	Job main class
3	Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store, accessed without credentials)
4	the name of the volume mount backed by a `PersistentVolumeClaim` that must be pre-existing
5	the path on the volume mount: this is referenced in the `sparkConf` section where the extra class path is defined for the driver and executors

JVM (Scala): externally located artifact accessed with credentials

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-s3-private
spec:
  sparkImage:
    productVersion: 3.5.6
  mode: cluster
  mainApplicationFile: s3a://my-bucket/spark-examples.jar (1)
  mainClass: org.apache.spark.examples.SparkPi (2)
  s3connection: (3)
    inline:
      host: test-minio
      port: 9000
      accessStyle: Path
      credentials: (4)
        secretClass: s3-credentials-class
  sparkConf: (5)
    spark.hadoop.fs.s3a.aws.credentials.provider: "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider" (6)
    spark.driver.extraClassPath: "/dependencies/jars/hadoop-aws-3.2.0.jar:/dependencies/jars/aws-java-sdk-bundle-1.11.375.jar"
    spark.executor.extraClassPath: "/dependencies/jars/hadoop-aws-3.2.0.jar:/dependencies/jars/aws-java-sdk-bundle-1.11.375.jar"
  executor:
    replicas: 3

1	Job python artifact (located in an S3 store)
2	Artifact class
3	S3 section, specifying the existing secret and S3 end-point (in this case, MinIO)
4	Credentials referencing a secretClass (not shown in is example)
5	Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources…
6	…in this case, in an S3 store, accessed with the credentials defined in the secret

JVM (Scala): externally located artifact accessed with job arguments provided via configuration map

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cm-job-arguments (1)
data:
  job-args.txt: |
    s3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv (2)

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: ny-tlc-report-configmap
  namespace: default
spec:
  sparkImage:
    productVersion: 3.5.6
  mode: cluster
  mainApplicationFile: s3a://stackable-spark-k8s-jars/jobs/ny-tlc-report-1.1.0.jar (3)
  mainClass: tech.stackable.demo.spark.NYTLCReport
  volumes:
    - name: cm-job-arguments
      configMap:
        name: cm-job-arguments (4)
  args:
    - "--input /arguments/job-args.txt" (5)
  sparkConf:
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
  driver:
    config:
      volumeMounts:
        - name: cm-job-arguments (6)
          mountPath: /arguments  (7)
  executor:
    replicas: 3
    config:
      volumeMounts:
        - name: cm-job-arguments (6)
          mountPath: /arguments (7)

1	Name of the configuration map
2	Argument required by the job
3	Job scala artifact that requires an input argument
4	The volume backed by the configuration map
5	The expected job argument, accessed via the mounted configuration map file
6	The name of the volume backed by the configuration map that is mounted to the driver/executor
7	The mount location of the volume (this contains a file `/arguments/job-args.txt`)