Spark Connect

Support for Apache Spark Connect is considered experimental and is subject to change in future releases. Spark Connect is a young technology, and important questions remain open, mostly related to security and multi-tenancy.

Apache Spark Connect is a remote procedure call (RPC) server that allows clients to run Spark applications on a remote cluster. Clients can connect to the Spark Connect server using a variety of programming languages, editors and IDEs without needing to install Spark locally.

The Stackable Spark operator can set up Spark Connect servers backed by Kubernetes as a distributed execution engine.

Deployment

The example below demonstrates how to set up a Spark Connect server and apply some customizations.

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkConnectServer
metadata:
  name: spark-connect (1)
spec:
  image:
    productVersion: "3.5.5" (2)
    pullPolicy: IfNotPresent
  args:
    - "--packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1" (3)
  server:
    podOverrides:
      spec:
        containers:
          - name: spark
            env:
              - name: DEMO_GREETING (4)
                value: "Hello"
    jvmArgumentOverrides:
      add:
        - -Dmy.custom.jvm.arg=customValue (5)
    config:
      logging:
        enableVectorAgent: false
        containers:
          spark:
            custom:
              configMap: spark-connect-log-config (6)
    configOverrides:
      spark-defaults.conf:
        spark.driver.cores: "3" (7)
  executor:
    configOverrides:
      spark-defaults.conf:
        spark.executor.memoryOverhead: "1m" (8)
        spark.executor.instances: "3"
    config:
      logging:
        enableVectorAgent: false
        containers:
          spark:
            custom:
              configMap: spark-connect-log-config
1 The name of the Spark Connect server.
2 Version of the Spark Connect server.
3 Additional package to install when starting the Spark Connect server and executors.
4 Environment variable to be created via podOverrides. Alternatively, the environment variable can be set in the spec.server.envOverrides section (see the sketch after this list).
5 Additional argument passed to the JVM of the Spark Connect server. Do not use this to tweak heap settings; use spec.server.jvmOptions instead.
6 A custom log4j configuration file to be used by the Spark Connect server. The config map must have an entry called log4j.properties.
7 Customize the driver properties in the server role. The number of cores here is not related to Kubernetes cores!
8 Customize spark.executor.* and spark.kubernetes.executor.* in the executor role.
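
The envOverrides alternative mentioned in 4 above could look like the following sketch; the variable name and value are the same placeholders used in the example:

spec:
  server:
    envOverrides:
      DEMO_GREETING: "Hello" # role-level environment variable override (placeholder name and value)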

Metrics

The server pod exposes Prometheus metrics at the following endpoints:

  • /metrics/prometheus for driver instances.

  • /metrics/executors/prometheus for executor instances.

To customize the metrics configuration, use spec.server.configOverrides like this:

spec:
  server:
    configOverrides:
      metrics.properties:
        applications.sink.prometheusServlet.path: "/metrics/applications/prometheus"

The example above adds a new endpoint for application metrics.
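
How these endpoints are scraped depends on your monitoring setup. As an illustration only, a plain Prometheus scrape job could look roughly like the sketch below; it assumes a Service named spark-connect-server in the default namespace that exposes the Spark web UI port (4040 by default), and both of those details must be adapted to your deployment:

scrape_configs:
  - job_name: spark-connect            # arbitrary job name
    metrics_path: /metrics/prometheus  # driver metrics endpoint listed above
    static_configs:
      - targets:
          - spark-connect-server.default.svc.cluster.local:4040  # assumed Service name and port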

Notable Omissions

The following features are not yet supported by the Stackable Spark operator:

  • Integration with the Spark History Server.

  • Authorization and authentication. Currently, anyone with access to the Spark Connect service can run jobs.

  • Volumes and volume mounts can be added only with pod overrides (see the sketch after this list).

  • Job dependencies must be provisioned as custom images or via --packages or --jars arguments.
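
Because volumes are only supported through pod overrides, mounting for example a ConfigMap into the server pod can be sketched as follows; the volume name, ConfigMap name and mount path are placeholders:

spec:
  server:
    podOverrides:
      spec:
        containers:
          - name: spark
            volumeMounts:
              - name: extra-files                 # placeholder volume name
                mountPath: /stackable/extra-files # placeholder mount path
        volumes:
          - name: extra-files
            configMap:
              name: my-extra-files                # placeholder ConfigMap name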

Known Issues

  • Dynamically provisioning the Iceberg runtime leads to an "iceberg.SparkWrite$WriterFactory" ClassNotFoundException when clients attempt to use it. A possible workaround is sketched below.
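
Until this is resolved, one possible workaround is to skip dynamic provisioning and bake the Iceberg runtime into a custom image instead, as mentioned under Notable Omissions. A sketch, assuming a hypothetical image that adds the Iceberg jars on top of the Stackable Spark image:

spec:
  image:
    custom: my-registry.example.com/spark-connect-iceberg:3.5.5 # hypothetical image containing the Iceberg runtime
    productVersion: "3.5.5"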