HDFS Rack Awareness

Rack awareness is a feature in Apache Hadoop that allows users to define a cluster’s node topology. Hadoop uses that topology to distribute block replicas in a way that maximizes fault tolerance.

For example, when a new block is created, the default behavior is to place one replica on a different node within the same rack, and another on a node in a remote rack. To make these placement decisions, Hadoop needs to know where each DataNode sits in the underlying infrastructure. In a Kubernetes environment, that information comes from labels on the Pods or Nodes in the cluster.

Configuring rack awareness

To configure rack awareness, use the rackAwareness field in the cluster configuration:

apiVersion: hdfs.stackable.tech/v1alpha1
kind: HdfsCluster
metadata:
  name: simple-hdfs
spec:
  clusterConfig:
    rackAwareness:
      - nodeLabel: topology.kubernetes.io/zone
      - podLabel: app.kubernetes.io/role-group
  nameNodes:
    ...

This creates an internal topology label by joining, in the order listed, the value of the topology.kubernetes.io/zone Node label and the value of the app.kubernetes.io/role-group Pod label (for example, /eu-central-1/rg1).
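
As an illustration of where those two values come from, the following excerpts sketch a Node and a DataNode Pod whose labels would produce the example above. The object names (worker-1, simple-hdfs-datanode-rg1-0) are hypothetical and the manifests are trimmed to the relevant metadata; only the two label keys and the resulting value come from the configuration shown earlier.

```yaml
# Excerpt of a Node object (name is hypothetical)
apiVersion: v1
kind: Node
metadata:
  name: worker-1
  labels:
    topology.kubernetes.io/zone: eu-central-1   # first rackAwareness entry
---
# Excerpt of a DataNode Pod scheduled on that Node (name is hypothetical)
apiVersion: v1
kind: Pod
metadata:
  name: simple-hdfs-datanode-rg1-0
  labels:
    app.kubernetes.io/role-group: rg1           # second rackAwareness entry
# Resulting topology label for this DataNode: /eu-central-1/rg1
```

To check which values a given cluster would use, kubectl get nodes --show-labels and kubectl get pods --show-labels list the labels currently set on each object.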

How it works

To gather this information, the Hadoop images ship the hdfs-topology-provider on the classpath. It can be configured to read labels from Kubernetes objects.

The operator deploys ClusterRoles and ServiceAccounts with the relevant RBAC rules so that the Hadoop Pods can access the necessary Kubernetes objects.
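
The operator manages these objects itself, so there is normally nothing to create by hand. Purely as a sketch of what read access to Pod and Node labels can look like in Kubernetes RBAC, the manifests below grant a ServiceAccount permission to get and list both resource types. All names, the namespace, and the exact verbs are assumptions for illustration and may not match what the operator actually deploys.

```yaml
# Hypothetical sketch, not the operator's actual manifests
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: hdfs-topology-reader            # hypothetical name
rules:
  - apiGroups: [""]                     # core API group
    resources: ["pods", "nodes"]
    verbs: ["get", "list"]              # read-only access to objects and their labels
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: hdfs-topology                   # hypothetical name
  namespace: default
---
# Nodes are cluster-scoped, so the role is bound cluster-wide
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hdfs-topology-reader-binding    # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: hdfs-topology-reader
subjects:
  - kind: ServiceAccount
    name: hdfs-topology
    namespace: default
```

A ClusterRole rather than a namespaced Role is used in this sketch because Node objects do not belong to any namespace.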