Stackable Operator for Apache Spark
This operator manages Apache Spark applications on Kubernetes clusters. Apache Spark is a powerful open-source big data processing framework that allows for efficient and flexible distributed computing. Its in-memory processing and fault-tolerant architecture make it ideal for a variety of use cases, including batch processing, real-time streaming, machine learning, and graph processing.
Getting started
Follow the Getting started guide to install the operator and run your first Spark application on Kubernetes.
How the operator works
This operator manages SparkApplication custom resources, which you use to define your applications. For each one, the operator creates the Kubernetes resources that the job needs to run.
Custom resources
The operator manages two custom resource kinds: the SparkApplication and the SparkHistoryServer.
The SparkApplication resource is the main point of interaction with the operator. Unlike other Stackable operator custom resources, the SparkApplication does not have roles. An exhaustive list of options is given in the SparkApplication CRD reference.
The SparkHistoryServer has a single node role.
It is used to deploy a Spark history server that displays application logs from S3 buckets.
For this to work, your applications must write their logs to the same buckets.
Kubernetes resources
For every SparkApplication deployed to the cluster, the operator creates a Job, a ServiceAccount and a few ConfigMaps.
The Job runs spark-submit in a Pod, which then creates a Spark driver Pod.
The driver creates its own executors based on the configuration in the SparkApplication.
The Job, driver and executors all use the same image, which is configured in the SparkApplication resource.
The two main ConfigMaps are the <name>-driver-pod-template and <name>-executor-pod-template, which define how the driver and executor Pods should be created.
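As a rough illustration of the naming scheme above, the following Python sketch derives the names of the two pod-template ConfigMaps from an application name. The helper `pod_template_configmaps` and the example name `my-spark-app` are hypothetical and not part of the operator.

```python
def pod_template_configmaps(app_name: str) -> dict[str, str]:
    """Return the pod-template ConfigMap names for a SparkApplication.

    Follows the <name>-driver-pod-template / <name>-executor-pod-template
    pattern described in the text. Hypothetical helper, for illustration only.
    """
    if not app_name:
        raise ValueError("app_name must not be empty")
    return {
        "driver": f"{app_name}-driver-pod-template",
        "executor": f"{app_name}-executor-pod-template",
    }


# Example: look up the ConfigMap names for an application called "my-spark-app".
names = pod_template_configmaps("my-spark-app")
print(names["driver"])
print(names["executor"])
```

Names computed this way can be handy when inspecting the generated ConfigMaps with your usual Kubernetes tooling.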
The Spark history server deploys like other Stackable-supported applications: a StatefulSet is created for every role group. A role group can have multiple replicas (Pods). A ConfigMap supplies the necessary configuration, and there is a Service to connect to.
RBAC
The Spark-Kubernetes RBAC documentation describes what is needed for spark-submit jobs to run successfully: at minimum, a Role or ClusterRole that allows the driver Pod to create and manage executor Pods.
For additional security, each spark-submit job launched by the operator is assigned its own ServiceAccount.
During the operator installation, a cluster role named spark-k8s-clusterrole is created with pre-defined permissions.
When a new Spark application is submitted, the operator creates a new service account with the same name as the application and binds this account to the cluster role spark-k8s-clusterrole.
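To make the binding concrete, the sketch below builds a manifest that ties an application's service account to spark-k8s-clusterrole using the standard Kubernetes RBAC API. It is an illustration only: the text does not say which kind of binding the operator creates or what it is called, so the use of a RoleBinding, the binding name, and the example values `my-spark-app` and `default` are assumptions.

```python
import json

CLUSTER_ROLE = "spark-k8s-clusterrole"  # cluster role name taken from the text


def role_binding_for(app_name: str, namespace: str) -> dict:
    """Build a binding between the application's ServiceAccount and the cluster role.

    Assumptions (not confirmed operator behaviour): the binding is a
    namespaced RoleBinding and is named after the application.
    """
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": app_name, "namespace": namespace},
        "subjects": [
            # The ServiceAccount carries the same name as the application.
            {"kind": "ServiceAccount", "name": app_name, "namespace": namespace}
        ],
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": CLUSTER_ROLE,
        },
    }


# Print the manifest as JSON, which Kubernetes tooling accepts alongside YAML.
print(json.dumps(role_binding_for("my-spark-app", "default"), indent=2))
```

Because every application gets its own service account, access can be audited or revoked per application without touching the shared cluster role.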
Integrations
You can read data from and write data to S3 buckets, and load custom job dependencies. Spark also integrates easily with Apache Kafka, which is supported on the Stackable Data Platform as well. Have a look at the demos below to see it in action.
Demos
The data-lakehouse-iceberg-trino-spark demo connects multiple components and datasets into a data lakehouse. A Spark application with structured streaming is used to stream data from Apache Kafka into the lakehouse.
In the spark-k8s-anomaly-detection-taxi-data demo, Spark is used to read training data from S3 and train an anomaly detection model on the data. The model is then stored in a Trino table.
Supported versions
The Stackable operator for Apache Spark on Kubernetes currently supports the Spark versions listed below. To use a specific Spark version in your SparkApplication, you have to specify an image; this is explained in the Product image selection documentation. The operator also supports running images from a custom registry and running entirely customized images; both cases are explained under Product image selection as well.
- 3.5.1 (Hadoop 3.3.4, Scala 2.12, Python 3.11, Java 17) (LTS)
- 3.4.3 (Hadoop 3.3.4, Scala 2.12, Python 3.11, Java 11) (deprecated)
- 3.4.2 (Hadoop 3.3.4, Scala 2.12, Python 3.11, Java 11) (deprecated)
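The version list above can be turned into a small pre-flight check before you write a SparkApplication. The following Python sketch is one possible way to do that; the helper `spark_image_fragment` is hypothetical, and the `sparkImage.productVersion` field name is an assumption that should be verified against the Product image selection documentation and the CRD reference.

```python
# Supported Spark versions and their status, copied from the list above.
SUPPORTED_VERSIONS = {
    "3.5.1": "LTS",
    "3.4.3": "deprecated",
    "3.4.2": "deprecated",
}


def spark_image_fragment(version: str) -> dict:
    """Return a spec fragment selecting the given Spark version.

    Raises ValueError for versions that are not in the supported list.
    The field names used here are assumptions; check the CRD reference.
    """
    if version not in SUPPORTED_VERSIONS:
        supported = ", ".join(sorted(SUPPORTED_VERSIONS))
        raise ValueError(f"unsupported Spark version {version!r}; supported: {supported}")
    if SUPPORTED_VERSIONS[version] == "deprecated":
        print(f"warning: Spark {version} is deprecated")
    return {"sparkImage": {"productVersion": version}}


fragment = spark_image_fragment("3.5.1")
print(fragment)
```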
Useful links
- The spark-k8s-operator GitHub repository
- The operator feature overview in the feature tracker
- The SparkApplication and SparkHistoryServer CRD documentation