Allowed Pod disruptions
Any downtime of our products is generally considered to be bad. Although downtime can’t be prevented 100% of the time - especially if the product does not support High Availability - we can try to do our best to reduce it to an absolute minimum.
Kubernetes has mechanisms to ensure minimal planned downtime. Please keep in mind, that this only affects planned (voluntary) downtime of Pods - unplanned Kubernetes node crashes can always occur.
Our product operator will always deploy so-called PodDisruptionBudget (PDB) resources alongside the products. For every role that you specify (e.g. HDFS namenodes or Trino workers) a PDB is created.
The defaults depend on the individual product and can be found below the "Operations" usage guide.
They are based on our knowledge of each product’s fault tolerance. In some cases they may be a little pessimistic, but they can be adjusted as documented in the following sections.
In general, product roles are split into the following two categories, which serve as guidelines for the default values we apply:
For these roles (e.g. ZooKeeper servers, HDFS journal + namenodes or HBase masters), only a single Pod is allowed to be unavailable. For example, imagine a cluster with 7 ZooKeeper servers, where 4 servers are required to form a quorum and healthy ensemble. By allowing 2 servers to be unavailable, there is no single point of failure (as there are at least 5 servers available). But there is only a single spare server left. The reason to choose 7 instead of e.g. 5 ZooKeeper servers might be, that there are always at least 2 spare servers. Increasing the number of allowed disruptions and increasing the number of replicas is not improving the general availability.
For these roles (e.g. HDFS datanodes, HBase regionservers or Trino workers), more than a single Pod is allowed to be unavailable. Otherwise, rolling re-deployments may take very long.
|The operators calculate the number of Pods for a given role by adding the number of replicas of every rolegroup that is part of that role.
In case there are no replicas defined on a rolegroup, one Pod will be assumed for this rolegroup, as the created Kubernetes objects (StatefulSets or Deployments) will default to a single replica as well. However, in case there are HorizontalPodAutoscaler in place, the number of replicas of a rolegroup can change dynamically. In this case the operators might falsely assume that rolegroups have fewer Pods than they actually have. This is a pessimistic approach, as the number of allowed disruptions normally stays the same or even increases when the number of Pods increases. This should be safe, but in some cases more Pods could have been allowed to be unavailable which may increase the duration of rolling re-deployments.
You can configure
Whether PDBs are written at all
maxUnavailablereplicas for this role PDB
The following example
maxUnavailablefor NameNodes to
maxUnavailablefor DataNodes to
10, which allows downtime of 10% of the total DataNodes.
Disables PDBs for JournalNodes
roleConfig: # optional, only supported on role level, *not* on rolegroup
podDisruptionBudget: # optional
enabled: true # optional, defaults to true
maxUnavailable: 1 # optional, defaults to our "smart" calculation
In case you are not satisfied with the PDBs that are written by the operators, you can deploy your own.
|In case you write custom PDBs, it is your responsibility to take care of the availability of the products
|It is important to disable the PDBs created by the Stackable operators as described above before creating your own PDBs, as this is a limitation of Kubernetes.
After disabling the Stackable PDBs, you can deploy you own PDB such as
- key: app.kubernetes.io/component
This PDB allows only one Pod out of all the Namenodes and Journalnodes to be down at one time.
Have a look at the ADR on Allowed Pod disruptions for the implementation details.