Allowed Pod disruptions

Any downtime of our products is generally considered bad. While downtime cannot be prevented entirely - especially if the product does not support High Availability - we do our best to reduce it to an absolute minimum.

Kubernetes has mechanisms to ensure minimal planned downtime. Please keep in mind that this only affects planned (voluntary) downtime of Pods - unplanned Kubernetes node crashes can always occur.

Our product operators always deploy so-called PodDisruptionBudget (PDB) resources alongside the products. For every role that you specify (e.g. HDFS namenodes or Trino workers), a PDB is created.

Default values

The defaults depend on the individual product and can be found under the "Operations" usage guide of each product.

They are based on our knowledge of each product’s fault tolerance. In some cases they may be a little pessimistic, but they can be adjusted as documented in the following sections.

In general, product roles are split into the following two categories, which serve as guidelines for the default values we apply:

Multiple replicas to increase availability

For these roles (e.g. ZooKeeper servers, HDFS journalnodes and namenodes or HBase masters), only a single Pod is allowed to be unavailable. For example, imagine a cluster with 7 ZooKeeper servers, where 4 servers are required to form a quorum and a healthy ensemble. If 2 servers were allowed to be unavailable, there would still be no single point of failure (at least 5 servers remain available), but only a single spare server would be left. The reason to choose 7 instead of e.g. 5 ZooKeeper servers is usually to always have at least 2 spare servers, so increasing the number of allowed disruptions alongside the number of replicas would not improve the overall availability.
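For illustration, a minimal sketch of such a cluster follows. It assumes a ZookeeperCluster that accepts the same roleConfig.podDisruptionBudget structure as the HdfsCluster example further below; the exact schema of your operator version may differ.

apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperCluster
metadata:
  name: zookeeper
spec:
  servers:
    roleConfig:
      podDisruptionBudget:
        maxUnavailable: 1 # 6 of 7 servers remain available: a quorum of 4 plus 2 spare servers
    roleGroups:
      default:
        replicas: 7 # 4 servers are required for quorum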

Multiple replicas to increase performance

For these roles (e.g. HDFS datanodes, HBase regionservers or Trino workers), more than a single Pod is allowed to be unavailable. Otherwise, rolling re-deployments could take a very long time.

The operators calculate the number of Pods for a given role by adding the number of replicas of every rolegroup that is part of that role.

If no replicas are defined on a rolegroup, one Pod is assumed for that rolegroup, as the created Kubernetes objects (StatefulSets or Deployments) default to a single replica as well. However, if a HorizontalPodAutoscaler is in place, the number of replicas of a rolegroup can change dynamically, and the operators might falsely assume that rolegroups have fewer Pods than they actually have. This is a pessimistic approach, as the number of allowed disruptions normally stays the same or even increases as the number of Pods increases. It is therefore safe, but in some cases more Pods could have been allowed to be unavailable, which would have shortened rolling re-deployments.
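For example, given the following (hypothetical) rolegroup layout for the DataNode role, the operators assume 4 + 1 = 5 Pods for the role, because the second rolegroup does not define replicas:

dataNodes:
  roleGroups:
    default:
      replicas: 4 # counted as 4 Pods
    secondary: {} # hypothetical rolegroup without replicas, counted as 1 Pod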

Influencing and disabling PDBs

You can configure

  1. Whether PDBs are written at all

  2. The maxUnavailable replicas for the role's PDB

The following example

  1. Sets maxUnavailable for NameNodes to 1

  2. Sets maxUnavailable for DataNodes to 10, which allows 10 of the 100 DataNodes (10%) to be unavailable at a time.

  3. Disables PDBs for JournalNodes

apiVersion: hdfs.stackable.tech/v1alpha1
kind: HdfsCluster
metadata:
  name: hdfs
spec:
  nameNodes:
    roleConfig: # optional, only supported on role level, *not* on rolegroup
      podDisruptionBudget: # optional
        enabled: true # optional, defaults to true
        maxUnavailable: 1 # optional, defaults to our "smart" calculation
    roleGroups:
      default:
        replicas: 3
  dataNodes:
    roleConfig:
      podDisruptionBudget:
        maxUnavailable: 10
    roleGroups:
      default:
        replicas: 100
  journalNodes:
    roleConfig:
      podDisruptionBudget:
        enabled: false
    roleGroups:
      default:
        replicas: 3
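
For illustration, the PDB written for the DataNode role in this example would look roughly like the following. The name and metadata shown here are illustrative; the selector follows the same labels as the custom PDB example in the next section.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: hdfs-datanode # illustrative name, the operator chooses the actual one
spec:
  maxUnavailable: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs
      app.kubernetes.io/instance: hdfs
      app.kubernetes.io/component: datanode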

Using your own custom PDBs

In case you are not satisfied with the PDBs that are written by the operators, you can deploy your own.

If you write custom PDBs, it is your responsibility to take care of the availability of the products.
It is important to disable the PDBs created by the Stackable operators as described above before creating your own: Kubernetes does not support overlapping PDBs, and evicting a Pod that is selected by multiple PDBs fails.

After disabling the Stackable PDBs, you can deploy your own PDB, such as:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: hdfs-journalnode-and-namenode
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs
      app.kubernetes.io/instance: hdfs
    matchExpressions:
      - key: app.kubernetes.io/component
        operator: In
        values:
          - journalnode
          - namenode

This PDB allows only a single Pod out of all the NameNodes and JournalNodes to be unavailable at a time.
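You can verify the PDBs that are in effect - both the ones written by the operators and your custom ones - with kubectl get poddisruptionbudgets, which also shows how many disruptions are currently allowed.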

Details

Have a look at the ADR on Allowed Pod disruptions for the implementation details.