Operations
This section of the documentation is intended for the operations teams that maintain a Stackable Data Platform installation. It provides you with the necessary details to operate it in a production environment.
Service availability
Make sure to go through the following checklist to achieve the maximum level of availability for your services.
- Make setup highly available (HA): In case the product supports running in an HA fashion, our operators will automatically configure it for you. You only need to make sure that you deploy a sufficient number of replicas (a sketch follows this list). Please note that some products don't support HA.
- Reduce the number of simultaneous pod disruptions (unavailable replicas): The Stackable operators write defaults based on their knowledge about the fault tolerance of the product, which should cover most use cases. For details, have a look at Allowed Pod disruptions; a sketch of a custom setting follows this list.
- Reduce the impact of pod disruptions: Many HA-capable products offer a way to gracefully shut down the service running within the Pod. The flow is as follows: Kubernetes wants to shut down the Pod and calls a hook into the Pod, which in turn interacts with the product, signaling it to gracefully shut down. The final deletion of the Pod is blocked until the product has successfully migrated running workloads away from the Pod that is to be shut down. The graceful shutdown mechanism is described in Graceful shutdown as well as in the documentation of the respective operator (a sketch follows this list).
- Spread workload across multiple Kubernetes nodes, racks, datacenter rooms or datacenters to guarantee availability in the case of e.g. power outages or fire in parts of the datacenter. All of this is supported by configuring an antiAffinity, as documented in Pod placement (a sketch follows this list).
- Reduce the frequency of disruptions: Although we try our best to reduce the impact of disruptions, some tools simply don't support HA setups. One example is the Trino coordinator: if you restart it, all running queries will fail. Many products use temporary credentials (such as TLS certificates), which have a short lifetime by default for maximum security. The Temporary credentials lifetime page describes how you can increase the lifetime of these temporary credentials to avoid frequent restarts (a sketch follows this list).
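As an illustration of the replica part, here is a minimal sketch of an HA setup using a ZookeeperCluster; the cluster name, rolegroup name and product version are assumptions:

```yaml
# Sketch: a ZooKeeper ensemble with three replicas, so one Pod can be
# disrupted while the remaining two keep a quorum.
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperCluster
metadata:
  name: simple-zookeeper
spec:
  image:
    productVersion: "3.8.4"  # assumed version, pick one supported by your platform release
  servers:
    roleGroups:
      default:
        replicas: 3
```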
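If the disruption defaults do not match your requirements, Allowed Pod disruptions describes a podDisruptionBudget setting per role; a sketch, with values that are assumptions:

```yaml
# Sketch: allow at most one server Pod to be voluntarily disrupted at a time.
spec:
  servers:
    roleConfig:
      podDisruptionBudget:
        enabled: true
        maxUnavailable: 1
```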
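The graceful shutdown period can be tuned via the gracefulShutdownTimeout config property documented in Graceful shutdown; the role name and duration below are assumptions:

```yaml
# Sketch: give the product up to 30 minutes to migrate running workloads away
# before the Pod is finally deleted.
spec:
  servers:
    config:
      gracefulShutdownTimeout: 30m
```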
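A sketch of spreading replicas across nodes with an antiAffinity, following the pattern from Pod placement; the labels are assumptions matching a ZooKeeper cluster named simple-zookeeper:

```yaml
# Sketch: require that no two server Pods of this cluster share a node.
# Use topology.kubernetes.io/zone as the topologyKey to spread across
# zones (e.g. datacenter rooms) instead.
spec:
  servers:
    config:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app.kubernetes.io/name: zookeeper
                  app.kubernetes.io/instance: simple-zookeeper
              topologyKey: kubernetes.io/hostname
```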
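On platform versions that support it, the credential lifetime can be raised per role; treat the exact property name and value below as assumptions and consult the Temporary credentials lifetime page for the authoritative syntax:

```yaml
# Sketch: request certificates that live for 7 days instead of the short default.
spec:
  servers:
    config:
      requestedSecretLifetime: 7d
```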
Maintenance actions
Sometimes you want to quickly shut down a product or update the Stackable operators without all the managed products restarting at the same time. You can achieve this using the following methods:
- Quickly stop and start a whole product using stopped, as described in Cluster operations.
- Prevent any changes to your deployed product using reconciliationPaused, as described in Cluster operations (both fields are shown in the sketch after this list).
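Both switches live under spec.clusterOperation of the deployed product resource:

```yaml
spec:
  clusterOperation:
    stopped: true                # scale all Pods down to zero while keeping the data around
    reconciliationPaused: false  # set to true to freeze the cluster in its current state
```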
Performance
- Compute resources: You can configure the resources available to every product using Resource management. The defaults are very restrained, as you should be able to spin up multiple products on your laptop (a sketch follows this list).
- Autoscaling: Although not yet supported natively by the platform, you can use a HorizontalPodAutoscaler to dynamically scale the number of Pods running for a given rolegroup based upon resource usage. To achieve this, omit the number of replicas on the rolegroup to be scaled, which in turn results in the created StatefulSet not having any replicas set either. Afterwards you can deploy a HorizontalPodAutoscaler as usual (a sketch follows this list). Please note that not all product operators have implemented graceful shutdown, so the product might be disturbed during scale-down. Later platform versions will support autoscaling natively with sensible defaults and will deploy HorizontalPodAutoscaler objects for you.
- Co-location: You can use Pod placement not only to achieve more resilience, but also to co-locate products that communicate frequently with each other. One example is placing HBase regionservers on the same Kubernetes nodes as the HDFS datanodes. Our operators take this into account and co-locate connected services by default. If you are not satisfied with the automatically created affinities, you can use Pod placement to configure your own (a sketch follows this list).
- Dedicated nodes: If you want to have certain services running on dedicated nodes, you can also use Pod placement to force the Pods to be scheduled on certain nodes. This is especially helpful if you e.g. have Kubernetes nodes with 16 cores and 64 GB of memory, as you could allocate nearly 100% of these node resources to your Spark executors or Trino workers. In this case it is important that you taint your Kubernetes nodes and use podOverrides to add a toleration for the taint (a sketch follows this list).
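A sketch of the resource settings from Resource management, using an assumed workers role and assumed values:

```yaml
spec:
  workers:
    roleGroups:
      default:
        config:
          resources:
            cpu:
              min: "2"    # guaranteed CPU (request)
              max: "4"    # CPU limit
            memory:
              limit: 8Gi  # used as both request and limit for memory
```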
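A sketch of the autoscaling approach: leave replicas unset on the rolegroup and point a standard HorizontalPodAutoscaler at the StatefulSet the operator creates. The StatefulSet name below follows the usual cluster-role-rolegroup naming convention but is an assumption; check the actual name with kubectl get statefulsets:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: trino-worker-autoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: simple-trino-worker-default  # assumed name of the operator-created StatefulSet
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80  # scale out when average CPU usage exceeds 80%
```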
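A sketch of co-location using a podAffinity instead of a podAntiAffinity, here preferring to schedule HBase regionservers on the same nodes as the datanodes of an HDFS cluster assumed to be named simple-hdfs (the labels are assumptions following the usual app.kubernetes.io conventions):

```yaml
# Sketch: prefer (but do not require) nodes that already run a datanode Pod.
spec:
  regionServers:
    config:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 50
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app.kubernetes.io/name: hdfs
                    app.kubernetes.io/instance: simple-hdfs
                    app.kubernetes.io/component: datanode
                topologyKey: kubernetes.io/hostname
```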
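A sketch of the dedicated-nodes setup: taint the nodes first, then tolerate the taint via podOverrides. The taint key and value are assumptions:

```yaml
# Assumes the dedicated nodes were tainted beforehand, e.g.:
#   kubectl taint nodes <node-name> dedicated=trino:NoSchedule
spec:
  workers:
    podOverrides:
      spec:
        tolerations:
          - key: dedicated
            operator: Equal
            value: trino
            effect: NoSchedule
```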