Spark History Server

The Stackable Spark-on-Kubernetes operator runs Apache Spark workloads in a Kubernetes cluster, whereby driver- and executor-pods are created for the duration of the job and then terminated. One or more Spark History Server instances can be deployed independently of SparkApplication jobs and used as an endpoint for Spark logging, so that job information can be viewed once the job pods are no longer available.

Deployment

The example below demonstrates how to set up the history server running in one Pod with scheduled cleanups of the event logs. The event logs are loaded from an S3 bucket named spark-logs and the folder eventlogs/. The credentials for this bucket are provided by the secret class s3-credentials-class. For more details on how the Stackable Data Platform manages S3 resources see the S3 resources page.

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkHistoryServer
metadata:
  name: spark-history
spec:
  image:
    productVersion: 3.5.8
  logFileDirectory: (1)
    s3:
      prefix: eventlogs/ (2)
      bucket: (3)
        inline:
          bucketName: spark-logs
          connection:
            inline:
              host: test-minio
              port: 9000
              accessStyle: Path
              credentials:
                secretClass: history-credentials-class
  sparkConf: (4)
  nodes:
    roleGroups:
      default:
        replicas: 1 (5)
        config:
          cleaner: true (6)

1	The location of the event logs, see Supported file systems for storing log events for other options.
2	Directory within the S3 bucket where the log files are located. This directory is required and must exist before setting up the history server.
3	The S3 bucket definition, here provided in-line.
4	Additional history server configuration properties can be provided here as a map. For possible properties see: https://spark.apache.org/docs/latest/monitoring.html#spark-history-server-configuration-options
5	This deployment has only one Pod. Multiple history servers can be started, all reading the same event logs by increasing the replica count.
6	This history server automatically cleans up old log files by using default properties. Change any of these by using the `sparkConf` map.

Only one role group can have scheduled cleanups enabled (cleaner: true) and this role group cannot have more than 1 replica.

The secret with S3 credentials must contain at least the following two keys:

accessKey — the access key of a user with read and write access to the event log bucket.
secretKey — the secret key of a user with read and write access to the event log bucket.

Any other entries of the Secret are ignored by the operator.

Spark application configuration

The example below demonstrates how to configure Spark applications to write log events to an S3 bucket.

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: spark-pi-s3-1
spec:
  sparkImage:
    productVersion: 3.5.8
    pullPolicy: IfNotPresent
  mode: cluster
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: s3a://my-bucket/spark-examples.jar
  s3connection: (1)
    inline:
      host: test-minio
      port: 9000
      accessStyle: Path
      credentials:
        secretClass: s3-credentials-class (2)
  logFileDirectory: (3)
    s3:
      prefix: eventlogs/ (4)
      bucket:
        inline:
          bucketName: spark-logs (5)
          connection:
            inline:
              host: test-minio
              port: 9000
              accessStyle: Path
              credentials:
                secretClass: history-credentials-class (6)
  executor:
    replicas: 1

1	Location of the data that is being processed by the application.
2	Credentials used to access the data above.
3	Instruct the operator to configure the application with logging enabled.
4	Folder to store logs. This must match the prefix used by the history server.
5	Bucket to store logs. This must match the bucket used by the history server.
6	Credentials used to write event logs. These can, of course, differ from the credentials used to process data.

Supported file systems for storing log events

S3

As already shown in the example above, the event logs can be stored in an S3 bucket:

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkHistoryServer
spec:
  logFileDirectory:
    s3:
      prefix: eventlogs/
      bucket:
        ...
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
spec:
  logFileDirectory:
    s3:
      prefix: eventlogs/
      bucket:
        ...

Custom log directory

If there is no structure provided for the desired file system, it can nevertheless be set with the property customLogDirectory. Additional configuration overrides may be necessary in this case.

For instance, to store the Spark event logs in HDFS, the following configuration could be used:

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkHistoryServer
spec:
  logFileDirectory:
    customLogDirectory: hdfs://simple-hdfs/eventlogs/  (1)
  nodes:
    envOverrides:
      HADOOP_CONF_DIR: /stackable/hdfs-config  (2)
    podOverrides:
      spec:
        containers:
        - name: spark-history
          volumeMounts:
          - name: hdfs-config
            mountPath: /stackable/hdfs-config
        volumes:
        - name: hdfs-config
          configMap:
            name: hdfs  (3)
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
spec:
  logFileDirectory:
    customLogDirectory: hdfs://simple-hdfs/eventlogs/  (4)
  sparkConf:
    spark.driver.extraClassPath: /stackable/hdfs-config  (5)
  driver:
    config:
      volumeMounts:
      - name: hdfs-config
        mountPath: /stackable/hdfs-config
  volumes:
  - name: hdfs-config
    configMap:
      name: hdfs

1	A custom log directory that is used for the Spark option `spark.history.fs.logDirectory`. The required dependencies must be on the class path. This is the case for HDFS.
2	The Spark History Server looks for the Hadoop configuration in the directory defined by the environment variable `HADOOP_CONF_DIR`.
3	The ConfigMap containing the Hadoop configuration files `core-site.xml` and `hdfs-site.xml`.
4	A custom log directory that is used for the Spark option `spark.eventLog.dir`. Additionally, the Spark option `spark.eventLog.enabled` is set to `true`.
5	The Spark driver looks for the Hadoop configuration on the class path.

History Web UI

To access the history server web UI, use one of the NodePort services created by the operator. For the example above, the operator created two services as shown:

$ kubectl get svc
NAME                         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
spark-history-node           NodePort    10.96.222.233   <none>        18080:30136/TCP     52m
spark-history-node-cleaner   NodePort    10.96.203.43    <none>        18080:32585/TCP     52m

By setting up port forwarding on 18080 the UI can be opened by pointing your browser to http://localhost:18080:

Monitoring

The operator creates a Kubernetes service dedicated specifically to collect metrics for Spark History instances with Prometheus. These metrics are exported via the JMX exporter as the history server doesn’t support the built in Spark prometheus servlet. The service name follows the convention <stacklet name>-history-metrics. Metrics can be scraped at the endpoint <service name>:18081/metrics.