Examples

The examples below share the following spec fields:

  • version: the current version is "1.0"

  • sparkImage: the Docker image used by the job, driver and executor pods. This can be provided by the user.

  • mode: only cluster is currently supported

  • mainApplicationFile: the artifact (Java, Scala or Python) that forms the basis of the Spark job.

  • args: the arguments passed directly to the application. In the examples below, this is the input path for part of the public New York taxi dataset.

  • sparkConf: Spark configuration settings that are passed directly to spark-submit and are best defined explicitly by the user. The SparkApplication "knows" that there is an external dependency (the S3 bucket where the data and/or the application is located) and how that dependency should be treated (i.e. what type of credential checks are required, if any), so it is better to declare these things together.

  • volumes: refers to any volumes needed by the SparkApplication, in this case an underlying PersistentVolumeClaim.

  • driver: driver-specific settings, including any volume mounts.

  • executor: executor-specific settings, including any volume mounts.

Job-specific settings are annotated below.

PySpark: externally located dataset, artifact available via PVC/volume mount

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-image
  namespace: default
spec:
  image: docker.stackable.tech/stackable/ny-tlc-report:0.2.0 (1)
  sparkImage:
    productVersion: 3.5.2
  mode: cluster
  mainApplicationFile: local:///stackable/spark/jobs/ny_tlc_report.py (2)
  args:
    - "--input 's3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv'" (3)
  deps:
    requirements:
      - tabulate==0.8.9 (4)
  sparkConf: (5)
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
  job:
    config:
      resources:
        cpu:
          min: "1"
          max: "1"
        memory:
          limit: "1Gi"
  driver:
    config:
      resources:
        cpu:
          min: "1"
          max: "1500m"
        memory:
          limit: "1Gi"
  executor:
    replicas: 3
    config:
      resources:
        cpu:
          min: "1"
          max: "4"
        memory:
          limit: "2Gi"
1 Job image: this contains the job artifact that is retrieved from the volume mount backed by the PVC
2 Job Python artifact (local)
3 Job argument (external)
4 List of Python job requirements: these are installed in the Pods via pip.
5 Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store)

JVM (Scala): externally located artifact and dataset

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-pvc
  namespace: default
spec:
  sparkImage:
    productVersion: 3.5.2
  mode: cluster
  mainApplicationFile: s3a://stackable-spark-k8s-jars/jobs/ny-tlc-report-1.0-SNAPSHOT.jar (1)
  mainClass: org.example.App (2)
  args:
    - "'s3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv'"
  sparkConf: (3)
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
    "spark.driver.extraClassPath": "/dependencies/jars/*"
    "spark.executor.extraClassPath": "/dependencies/jars/*"
  volumes:
    - name: job-deps (4)
      persistentVolumeClaim:
        claimName: pvc-ksv
  driver:
    config:
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies (5)
  executor:
    replicas: 3
    config:
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies (5)
1 Job artifact located on S3.
2 Job main class
3 Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store, accessed without credentials)
4 The name of the volume, backed by a PersistentVolumeClaim that must already exist
5 The path at which the volume is mounted: this is referenced in the sparkConf section, where the extra class path is defined for the driver and executors
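
Because the PersistentVolumeClaim above must exist before the SparkApplication is submitted, a minimal sketch of such a claim is given here. Only the claim name pvc-ksv comes from the example; the access mode and storage size are assumptions and should be adapted to the cluster's storage class and to the size of the dependency jars.

```yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-ksv              # must match claimName in the SparkApplication
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce          # assumption: choose a mode your storage class supports
  resources:
    requests:
      storage: 1Gi           # assumption: sized for a handful of dependency jars
```

Since the volume is mounted at /dependencies and the class path is /dependencies/jars/*, the jars would need to be placed in a jars directory at the root of the volume before the job runs.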

JVM (Scala): externally located artifact accessed with credentials

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-s3-private
spec:
  sparkImage:
    productVersion: 3.5.2
  mode: cluster
  mainApplicationFile: s3a://my-bucket/spark-examples.jar (1)
  mainClass: org.apache.spark.examples.SparkPi (2)
  s3connection: (3)
    inline:
      host: test-minio
      port: 9000
      accessStyle: Path
      credentials: (4)
        secretClass: s3-credentials-class
  sparkConf: (5)
    spark.hadoop.fs.s3a.aws.credentials.provider: "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider" (6)
    spark.driver.extraClassPath: "/dependencies/jars/hadoop-aws-3.2.0.jar:/dependencies/jars/aws-java-sdk-bundle-1.11.375.jar"
    spark.executor.extraClassPath: "/dependencies/jars/hadoop-aws-3.2.0.jar:/dependencies/jars/aws-java-sdk-bundle-1.11.375.jar"
  executor:
    replicas: 3
1 Job artifact (a jar located in an S3 store)
2 Artifact class
3 S3 section, specifying the existing secret and S3 endpoint (in this case, MinIO)
4 Credentials referencing a secretClass (not shown in this example)
5 Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources…
6 …in this case, in an S3 store, accessed with the credentials defined in the secret
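
The secretClass referenced in callout 4 is not part of the example. The following is a hedged sketch of what it might look like, assuming the Stackable secret-operator with its k8sSearch backend is installed. Only the class name s3-credentials-class comes from the example; the Secret name and the credential values are hypothetical placeholders, and the manifest should be checked against the secret-operator documentation for the version in use.

```yaml
---
apiVersion: secrets.stackable.tech/v1alpha1
kind: SecretClass
metadata:
  name: s3-credentials-class        # referenced by s3connection.inline.credentials.secretClass
spec:
  backend:
    k8sSearch:
      searchNamespace:
        pod: {}                     # look for the Secret in the namespace of the requesting pod
---
apiVersion: v1
kind: Secret
metadata:
  name: s3-credentials              # hypothetical name
  labels:
    secrets.stackable.tech/class: s3-credentials-class   # ties the Secret to the class
stringData:
  accessKey: YOUR_ACCESS_KEY        # placeholder
  secretKey: YOUR_SECRET_KEY        # placeholder
```

With this layout, the Secret would have to live in the same namespace as the SparkApplication pods.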

JVM (Scala): externally located artifact accessed with job arguments provided via configuration map

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cm-job-arguments (1)
data:
  job-args.txt: |
    s3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv (2)
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: ny-tlc-report-configmap
  namespace: default
spec:
  sparkImage:
    productVersion: 3.5.2
  mode: cluster
  mainApplicationFile: s3a://stackable-spark-k8s-jars/jobs/ny-tlc-report-1.1.0.jar (3)
  mainClass: tech.stackable.demo.spark.NYTLCReport
  volumes:
    - name: cm-job-arguments
      configMap:
        name: cm-job-arguments (4)
  args:
    - "--input /arguments/job-args.txt" (5)
  sparkConf:
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
  driver:
    config:
      volumeMounts:
        - name: cm-job-arguments (6)
          mountPath: /arguments (7)
  executor:
    replicas: 3
    config:
      volumeMounts:
        - name: cm-job-arguments (6)
          mountPath: /arguments (7)
1 Name of the configuration map
2 Argument required by the job
3 Job Scala artifact that requires an input argument
4 The volume backed by the configuration map
5 The expected job argument, accessed via the mounted configuration map file
6 The name of the volume backed by the configuration map that is mounted to the driver/executor
7 The mount location of the volume (this contains a file /arguments/job-args.txt)