Stackable Operator for Apache Spark

This operator manages Apache Spark on Kubernetes clusters. Apache Spark is an open-source big data processing framework for efficient and flexible distributed computing. Its in-memory processing and fault-tolerant architecture make it well suited to a variety of use cases, including batch processing, real-time streaming, machine learning, and graph processing.

Getting Started

Follow the Getting started guide to begin using Apache Spark with the Stackable Operator. The guide leads you through installing the Operator and running your first Spark application on Kubernetes.

How the Operator works

The Stackable Operator for Apache Spark reads a SparkApplication custom resource, which you use to define your Spark job or application. The Operator then creates the Kubernetes resources the job needs to run.

Custom resources

The Operator manages two custom resource kinds: the SparkApplication and the SparkHistoryServer.

The SparkApplication resource is the main point of interaction with the Operator. Unlike other Stackable Operator custom resources, the SparkApplication does not have roles. An exhaustive list of options is given on the CRD reference page.

The SparkHistoryServer, in contrast, has a single node role. It is used to deploy a Spark history server, which reads data from an S3 bucket that you configure. Your applications need to write their logs to the same bucket.
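
The requirement that applications and the history server share one log location can be sketched with the plain Spark properties involved. This is a minimal illustration, not the Operator's implementation: the helper names, the bucket name, and the prefix are hypothetical, and the Operator may set these properties for you based on the custom resources. The property names themselves (spark.eventLog.enabled, spark.eventLog.dir, spark.history.fs.logDirectory) are standard Spark configuration keys.

```python
def log_directory(bucket, prefix):
    """Build an s3a:// URI for the shared event log location (hypothetical helper)."""
    return "s3a://{}/{}/".format(bucket, prefix.strip("/"))


def application_event_log_conf(bucket, prefix):
    """Spark properties that make an application write its event logs to S3."""
    return {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": log_directory(bucket, prefix),
    }


def history_server_conf(bucket, prefix):
    """Spark property that tells the history server where to read event logs."""
    return {"spark.history.fs.logDirectory": log_directory(bucket, prefix)}


# Both sides must point at the same bucket and prefix (example values are assumptions).
app_conf = application_event_log_conf("spark-logs", "/eventlogs/")
server_conf = history_server_conf("spark-logs", "eventlogs")
print(app_conf["spark.eventLog.dir"] == server_conf["spark.history.fs.logDirectory"])
```

If the two locations differ, the history server starts normally but shows no applications, so comparing the two values is a quick first check.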

Kubernetes resources

For every SparkApplication deployed to the cluster, the Operator creates a Job, a ServiceAccount and a few ConfigMaps.

The Job runs spark-submit in a Pod, which then creates a Spark driver Pod. The driver creates its own executors based on the configuration in the SparkApplication. The Job, driver and executors all use the same image, which is configured in the SparkApplication resource.

The two main ConfigMaps are <name>-driver-pod-template and <name>-executor-pod-template, which define how the driver and executor Pods should be created.
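
The naming pattern above can be written out as a small helper, for example to look up the ConfigMaps of a given application when debugging. This is only a sketch: the function name is hypothetical, the application name is an example value, and it assumes that <name> is the name of the SparkApplication.

```python
def pod_template_config_maps(app_name):
    """Return the expected pod template ConfigMap names for an application.

    Assumes <name> in the pattern is the SparkApplication name.
    """
    return {
        "driver": "{}-driver-pod-template".format(app_name),
        "executor": "{}-executor-pod-template".format(app_name),
    }


# "pyspark-pi" is an example application name, not one defined in this document.
for role, config_map in pod_template_config_maps("pyspark-pi").items():
    print(role, config_map)
```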

The Spark history server deploys like other Stackable-supported applications: a StatefulSet is created for every role group. A role group can have multiple replicas (Pods). A ConfigMap supplies the necessary configuration, and a Service provides a stable endpoint to connect to.

RBAC

The Spark-Kubernetes RBAC documentation describes what is needed for spark-submit jobs to run successfully: minimally a role/cluster-role to allow the driver pod to create and manage executor pods.

For added security, each spark-submit job launched by the spark-k8s operator is assigned its own ServiceAccount.

When the spark-k8s operator is installed via Helm, a cluster role named spark-k8s-clusterrole is created with pre-defined permissions.

When a new Spark application is submitted, the operator creates a new service account with the same name as the application and binds this account to the cluster role spark-k8s-clusterrole created by Helm.
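
The objects described above can be sketched as plain Kubernetes manifests built in Python. This is an illustration under stated assumptions rather than the Operator's actual output: the function name is hypothetical, the use of a namespaced RoleBinding (instead of a ClusterRoleBinding) and the binding's name are assumptions, and only the ServiceAccount name and the cluster role name spark-k8s-clusterrole come from the description above.

```python
CLUSTER_ROLE = "spark-k8s-clusterrole"  # created by the Helm installation


def rbac_manifests(app_name, namespace):
    """Build the ServiceAccount and binding for one Spark application."""
    service_account = {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        # The service account carries the same name as the application.
        "metadata": {"name": app_name, "namespace": namespace},
    }
    role_binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",  # assumption: the binding is namespaced
        # Assumption: the binding reuses the application name.
        "metadata": {"name": app_name, "namespace": namespace},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": CLUSTER_ROLE,
        },
        "subjects": [
            {"kind": "ServiceAccount", "name": app_name, "namespace": namespace}
        ],
    }
    return service_account, role_binding


# Example values; neither the application name nor the namespace is prescribed.
sa, binding = rbac_manifests("pyspark-pi", "default")
print(sa["metadata"]["name"], "->", binding["roleRef"]["name"])
```

Because every application gets its own account, the permissions of one job can be revoked by removing its binding without affecting other jobs.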

Integrations

You can read and write data from S3 buckets and load custom job dependencies. Spark also integrates easily with Apache Kafka, which is likewise supported on the Stackable Data Platform. Have a look at the demos below to see it in action.

Demos

The data-lakehouse-iceberg-trino-spark demo connects multiple components and datasets into a data lakehouse. A Spark application with structured streaming is used to stream data from Apache Kafka into the lakehouse.

In the spark-k8s-anomaly-detection-taxi-data demo, Spark is used to read training data from S3 and train an anomaly detection model on the data. The model is then stored in a Trino table.

Supported Versions

The Stackable Operator for Apache Spark on Kubernetes currently supports the following versions of Spark:

3.2.1-hadoop3.2
3.2.1-hadoop3.2-python39
3.3.0-hadoop3
3.4.0-hadoop3