Stackable Operator for Apache Spark
This is an operator manages Apache Spark on Kubernetes clusters. Apache Spark is a powerful open-source big data processing framework that allows for efficient and flexible distributed computing. Its in-memory processing and fault-tolerant architecture make it ideal for a variety of use cases, including batch processing, real-time streaming, machine learning, and graph processing.
Follow the Getting started guide to get started with Apache Spark using the Stackable Operator. The guide will lead you through the installation of the Operator and running your first Spark application on Kubernetes.
How the Operator works
The Stackable Operator for Apache Spark reads a SparkApplication custom resource which you use to define your spark job/application. The Operator creates the relevant Kubernetes resources for the job to run.
The Operator manages two custom resource kinds: The SparkApplication and the SparkHistoryServer.
The SparkApplication resource is the main point of interaction with the Operator. Unlike other Stackable Operator custom resources, the SparkApplication does not have roles. An exhaustive list of options is given on the CRD reference page.
The SparkHistoryServer does have a single
node role. It is used to deploy a Spark history server. It reads data from an S3 bucket that you configure. Your applications need to write their logs to the same bucket.
For every SparkApplication deployed to the cluster the Operator creates a Job, A ServiceAccout and a few ConfigMaps.
The Job runs
spark-submit in a Pod which then creates a Spark driver Pod. The driver creates its own Executors based on the configuration in the SparkApplication. The Job, driver and executors all use the same image, which is configured in the SparkApplication resource.
The two main ConfigMaps are the
<name>-executor-pod-template which define how the driver and executor Pods should be created.
The Spark history server deploys like other Stackable-supported applications: A Statefulset is created for every role group. A role group can have multiple replicas (Pods). A ConfigMap supplies the necessary configuration, and there is a service to connect to.
The Spark-Kubernetes RBAC documentation describes what is needed for
spark-submit jobs to run successfully: minimally a role/cluster-role to allow the driver pod to create and manage executor pods.
However, to add security, each
spark-submit job launched by the spark-k8s operator will be assigned its own ServiceAccount.
When the spark-k8s operator is installed via Helm, a cluster role named
spark-k8s-clusterrole is created with pre-defined permissions.
When a new Spark application is submitted, the operator creates a new service account with the same name as the application and binds this account to the cluster role
spark-k8s-clusterrole created by Helm.
You can read and write data from s3 buckets, load custom job dependencies. Spark also supports easy integration with Apache Kafka which is also supported on the Stackable Data Platform. Have a look at the demos below to see it in action.
The data-lakehouse-iceberg-trino-spark demo connects multiple components and datasets into a data Lakehouse. A Spark application with structured streaming is used to stream data from Apache Kafka into the Lakehouse.
In the spark-k8s-anomaly-detection-taxi-data demo Spark is used to read training data from S3 and train an anomaly detection model on the data. The model is then stored in a Trino table.