Spark Connect

Support for Apache Spark Connect is considered experimental and is subject to change in future releases. Spark Connect is a young technology, and important questions remain open, mostly related to security and multi-tenancy.

Apache Spark Connect is a remote procedure call (RPC) server that allows clients to run Spark applications on a remote cluster. Clients can connect to the Spark Connect server using a variety of programming languages, editors and IDEs without needing to install Spark locally.

The Stackable Spark operator can set up Spark Connect servers backed by Kubernetes as a distributed execution engine.

Deployment

The example below demonstrates how to set up a Spark Connect server and apply some customizations.

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkConnectServer
metadata:
  name: spark-connect (1)
spec:
  image:
    productVersion: "3.5.5" (2)
    pullPolicy: IfNotPresent
  args:
    - "--packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1" (3)
  server:
    podOverrides:
      spec:
        containers:
          - name: spark
            env:
              - name: DEMO_GREETING (4)
                value: "Hello"
    jvmArgumentOverrides:
      add:
        - -Dmy.custom.jvm.arg=customValue (5)
    config:
      logging:
        enableVectorAgent: false
        containers:
          spark:
            custom:
              configMap: spark-connect-log-config (6)
    configOverrides:
      spark-defaults.conf:
        spark.driver.cores: "3" (7)
  executor:
    configOverrides:
      spark-defaults.conf:
        spark.executor.memoryOverhead: "1m" (8)
        spark.executor.instances: "3"
    config:
      logging:
        enableVectorAgent: false
        containers:
          spark:
            custom:
              configMap: spark-connect-log-config
1 The name of the Spark Connect server.
2 Version of the Spark Connect server.
3 Additional package to install when starting the Spark Connect server and executors.
4 Environment variable to be created via podOverrides. Alternatively, the environment variable can be set in the spec.server.envOverrides section (see the sketch after this list).
5 Additional argument passed to the JVM of the Spark Connect server. Do not use this to tweak heap settings; use spec.server.jvmOptions instead.
6 A custom log4j configuration file to be used by the Spark Connect server. The config map must have an entry called log4j.properties.
7 Customize the driver properties in the server role. The number of cores here is not related to Kubernetes cores!
8 Customize spark.executor.* and spark.kubernetes.executor.* in the executor role.
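
The envOverrides alternative mentioned in 4 above could look like the following sketch; the variable name and value are the same placeholders used in the example:

spec:
  server:
    envOverrides:
      DEMO_GREETING: "Hello" # role-level environment variable override (placeholder name and value)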

Metrics

The server pod exposes Prometheus metrics at the following endpoints:

  • /metrics/prometheus for driver instances.

  • /metrics/executors/prometheus for executor instances.

To customize the metrics configuration, use spec.server.configOverrides like this:

spec:
  server:
    configOverrides:
      metrics.properties:
        applications.sink.prometheusServlet.path: "/metrics/applications/prometheus"

The example above adds a new endpoint for application metrics.
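
How these endpoints are scraped depends on your monitoring setup. As an illustration only, a plain Prometheus scrape job could look roughly like the sketch below; it assumes a Service named spark-connect-server in the default namespace that exposes the Spark web UI port (4040 by default), and both of those details must be adapted to your deployment:

scrape_configs:
  - job_name: spark-connect            # arbitrary job name
    metrics_path: /metrics/prometheus  # driver metrics endpoint listed above
    static_configs:
      - targets:
          - spark-connect-server.default.svc.cluster.local:4040  # assumed Service name and port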

Notable Omissions

The following features are not yet supported by the Stackable Spark operator:

  • Integration with the Spark History Server.

  • Authorization and authentication. Currently, anyone with access to the Spark Connect service can run jobs.

  • Volumes and volume mounts can be added only with pod overrides (see the sketch after this list).

  • Job dependencies must be provisioned as custom images or via --packages or --jars arguments.
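
Because volumes are only supported through pod overrides, mounting for example a ConfigMap into the server pod can be sketched as follows; the volume name, ConfigMap name and mount path are placeholders:

spec:
  server:
    podOverrides:
      spec:
        containers:
          - name: spark
            volumeMounts:
              - name: extra-files                 # placeholder volume name
                mountPath: /stackable/extra-files # placeholder mount path
        volumes:
          - name: extra-files
            configMap:
              name: my-extra-files                # placeholder ConfigMap name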

Known Issues

  • Dynamically provisioning the Iceberg runtime leads to an "iceberg.SparkWrite$WriterFactory" ClassNotFoundException when clients attempt to use it. A possible workaround is sketched below.
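
Until this is resolved, one possible workaround is to skip dynamic provisioning and bake the Iceberg runtime into a custom image instead, as mentioned under Notable Omissions. A sketch, assuming a hypothetical image that adds the Iceberg jars on top of the Stackable Spark image:

spec:
  image:
    custom: my-registry.example.com/spark-connect-iceberg:3.5.5 # hypothetical image containing the Iceberg runtime
    productVersion: "3.5.5"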