First steps

With the operators installed, deploy a Druid cluster and its dependencies. Afterward you can verify that it works by ingesting example data and then querying it.

Setup

Four things need to be installed to have a Druid cluster:

  • A ZooKeeper instance for internal use by Druid

  • An HDFS instance to be used as a backend for deep storage

  • A PostgreSQL database to store the metadata of Druid

  • The Druid cluster itself

Create them in this order. ZooKeeper, HDFS and Druid are each created by applying a manifest file; the operators you just installed then create the resources according to the manifests. PostgreSQL is installed with Helm.

ZooKeeper

Create a file named zookeeper.yaml with the following content:

---
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperCluster
metadata:
  name: simple-zk
spec:
  image:
    productVersion: 3.9.2
  servers:
    roleGroups:
      default:
        replicas: 1
---
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperZnode
metadata:
  name: simple-druid-znode
spec:
  clusterRef:
    name: simple-zk
---
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperZnode
metadata:
  name: simple-hdfs-znode
spec:
  clusterRef:
    name: simple-zk

Then create the resources by applying the manifest file:

kubectl apply -f zookeeper.yaml
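
The HDFS and Druid manifests later in this guide refer to simple-hdfs-znode and simple-druid-znode through zookeeperConfigMapName. The following is a minimal sketch for checking that ConfigMaps with these names exist before moving on. It assumes that the ZooKeeper operator publishes a discovery ConfigMap named after each ZookeeperZnode; the helper name require_configmaps is hypothetical and not part of any Stackable tooling.

```shell
# Hypothetical helper: fail if any of the named ConfigMaps is missing
# in the current namespace.
require_configmaps() {
  for cm in "$@"; do
    if ! kubectl get configmap "$cm" >/dev/null 2>&1; then
      echo "missing ConfigMap: $cm" >&2
      return 1
    fi
  done
  echo "all ConfigMaps present: $*"
}
```

Call it as require_configmaps simple-druid-znode simple-hdfs-znode. The ConfigMaps may take a moment to appear after the manifest has been applied, so a failure right after kubectl apply is not necessarily an error.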

HDFS

Create hdfs.yaml with the following contents:

---
apiVersion: hdfs.stackable.tech/v1alpha1
kind: HdfsCluster
metadata:
  name: simple-hdfs
spec:
  image:
    productVersion: 3.3.6
  clusterConfig:
    dfsReplication: 1
    zookeeperConfigMapName: simple-hdfs-znode
  nameNodes:
    config:
      listenerClass: external-stable # This exposes your Stacklet outside of Kubernetes. Remove this configuration if this is not desired
    roleGroups:
      default:
        replicas: 2
  dataNodes:
    config:
      listenerClass: external-unstable # This exposes your Stacklet outside of Kubernetes. Remove this configuration if this is not desired
    roleGroups:
      default:
        replicas: 1
  journalNodes:
    roleGroups:
      default:
        replicas: 1

And apply it:

kubectl apply -f hdfs.yaml

PostgreSQL

Install a PostgreSQL database using Helm. If you already have a PostgreSQL instance, you can skip this step and use your own instance in the Druid manifest below.

helm install postgresql-druid \
--repo https://charts.bitnami.com/bitnami postgresql \
--version 16.1.2 \
--set auth.database=druid \
--set auth.username=druid \
--set auth.password=druid \
--wait

Druid

Create a file named druid.yaml with the following contents:

---
apiVersion: druid.stackable.tech/v1alpha1
kind: DruidCluster
metadata:
  name: simple-druid
spec:
  image:
    productVersion: 30.0.0
  clusterConfig:
    listenerClass: external-stable # This exposes your Stacklet outside of Kubernetes. Remove this configuration if this is not desired
    zookeeperConfigMapName: simple-druid-znode
    deepStorage:
      hdfs:
        configMapName: simple-hdfs
        directory: /druid
    metadataStorageDatabase:
      dbType: postgresql
      connString: jdbc:postgresql://postgresql-druid/druid
      host: postgresql-druid
      port: 5432
      credentialsSecret: druid-db-credentials
  brokers:
    roleGroups:
      default:
        replicas: 1
  coordinators:
    roleGroups:
      default:
        replicas: 1
  historicals:
    roleGroups:
      default:
        replicas: 1
  middleManagers:
    roleGroups:
      default:
        replicas: 1
  routers:
    roleGroups:
      default:
        replicas: 1
---
apiVersion: v1
kind: Secret
metadata:
  name: druid-db-credentials
stringData:
  username: druid
  password: druid

And apply it:

kubectl apply --server-side -f druid.yaml

This creates the actual Druid Stacklet.

This Druid instance uses the PostgreSQL database installed above (dbType: postgresql) as its metadata store, with the plain-text example credentials from the druid-db-credentials Secret. This setup is not suitable for production use! Consult the Druid documentation for a list of supported databases and setup instructions for production instances.

Verify that it works

Submit an ingestion job and then query the ingested data, either through the web interface or the API.

First, make sure that all the Pods in the StatefulSets are ready:

kubectl get statefulset

The output should show all pods ready:

NAME                                 READY   AGE
simple-druid-broker-default          1/1     5m
simple-druid-coordinator-default     1/1     5m
simple-druid-historical-default      1/1     5m
simple-druid-middlemanager-default   1/1     5m
simple-druid-router-default          1/1     5m
simple-hdfs-datanode-default         1/1     6m
simple-hdfs-journalnode-default      1/1     6m
simple-hdfs-namenode-default         2/2     6m
simple-zk-server-default             1/1     7m
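
Instead of re-running kubectl get statefulset until every row is ready, you can block until the rollouts have finished. This is a minimal sketch built on kubectl rollout status; the helper name wait_for_statefulsets and the ROLLOUT_TIMEOUT variable are hypothetical, and the default timeout of 300s is an arbitrary assumption.

```shell
# Hypothetical helper: block until each named StatefulSet has finished
# rolling out, or stop at the first one that exceeds the timeout.
wait_for_statefulsets() {
  rollout_timeout="${ROLLOUT_TIMEOUT:-300s}"
  for sts in "$@"; do
    kubectl rollout status "statefulset/$sts" --timeout="$rollout_timeout" || return 1
  done
}
```

Pass the StatefulSet names from the listing above, for example wait_for_statefulsets simple-zk-server-default simple-hdfs-namenode-default simple-druid-router-default.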

Ideally, use stackablectl stacklet list to find the address at which the Druid Router is reachable, and use that address.

As an alternative, you can create a port-forward for the Druid Router:

kubectl port-forward svc/simple-druid-router 9088 > /dev/null 2>&1 &
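
Because the port-forward runs in the background, it is not obvious when the Router is actually reachable. The sketch below polls Druid's /status/health endpoint, which is expected to answer with true once the service is up. The helper name wait_for_router, the number of attempts and the two-second pause are assumptions for illustration; -k is used as in the other curl commands in this guide.

```shell
# Hypothetical helper: poll the Router health endpoint until it answers
# "true" or the attempts run out.
wait_for_router() {
  url="${1:-https://localhost:9088}"
  attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -k skips certificate verification, as in the curl commands of this guide.
    if [ "$(curl -s -k "$url/status/health")" = "true" ]; then
      echo "router is healthy"
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  echo "router not reachable at $url" >&2
  return 1
}
```

Called without arguments, wait_for_router checks https://localhost:9088, the address of the port-forward above.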

Ingest example data

Next, ingest some example data using the web interface. If you prefer to use the command line instead, follow the instructions in the "Alternative: Using the command line" section below.

Alternative: Using the command line

If you prefer not to use the web interface and instead interact with the API, create a file named ingestion_spec.json with the following contents:

{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "http",
        "uris": [
          "https://druid.apache.org/data/wikipedia.json.gz"
        ]
      },
      "inputFormat": {
        "type": "json"
      }
    },
    "dataSchema": {
      "granularitySpec": {
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "rollup": false
      },
      "dataSource": "wikipedia",
      "timestampSpec": {
        "column": "timestamp",
        "format": "iso"
      },
      "dimensionsSpec": {
        "dimensions": [
          "isRobot",
          "channel",
          "flags",
          "isUnpatrolled",
          "page",
          "diffUrl",
          {
            "type": "long",
            "name": "added"
          },
          "comment",
          {
            "type": "long",
            "name": "commentLength"
          },
          "isNew",
          "isMinor",
          {
            "type": "long",
            "name": "delta"
          },
          "isAnonymous",
          "user",
          {
            "type": "long",
            "name": "deltaBucket"
          },
          {
            "type": "long",
            "name": "deleted"
          },
          "namespace",
          "cityName",
          "countryName",
          "regionIsoCode",
          "metroCode",
          "countryIsoCode",
          "regionName"
        ]
      }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "partitionsSpec": {
        "type": "dynamic"
      }
    }
  }
}

Submit the file with the following curl command:

curl -s -k -X 'POST' -H 'Content-Type:application/json' -d @ingestion_spec.json https://localhost:9088/druid/indexer/v1/task
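
The submit call returns immediately with a task ID, while the ingestion itself runs in the background. The sketch below submits the spec and then polls the task status endpoint of the Druid Tasks API until the task reports SUCCESS or FAILED. It assumes that jq is installed and that the responses have the shape {"task": "<id>"} and {"status": {"status": "<state>"}}; the helper name ingest_and_wait, the attempt limit and the five-second pause are hypothetical.

```shell
# Hypothetical helper: submit an ingestion spec, then poll the task status
# until it is SUCCESS or FAILED, or until the attempts run out.
ingest_and_wait() {
  spec="$1"
  base="${2:-https://localhost:9088}"
  attempts="${3:-60}"
  # Submit the spec and extract the task ID from the JSON response.
  task_id="$(curl -s -k -X POST -H 'Content-Type:application/json' \
    -d "@$spec" "$base/druid/indexer/v1/task" | jq -r '.task')"
  echo "submitted task: $task_id"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    state="$(curl -s -k "$base/druid/indexer/v1/task/$task_id/status" \
      | jq -r '.status.status')"
    case "$state" in
      SUCCESS) echo "task succeeded"; return 0 ;;
      FAILED) echo "task failed" >&2; return 1 ;;
    esac
    i=$((i + 1))
    sleep 5
  done
  echo "gave up waiting for task $task_id" >&2
  return 1
}
```

Call it as ingest_and_wait ingestion_spec.json. If you use this helper, you do not need to run the plain curl command above as well, since that would submit the task twice.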

Continue with the next section.

To open the web interface, navigate your browser to https://localhost:9088/ to find the dashboard:

[Figure: dashboard]

Now load the example data:

[Figure: load example]

Click through all pages of the load process. You can also follow the Druid Quickstart Guide.

Once you have finished the ingestion dialog, you should see the ingestion overview with the job, which eventually shows SUCCESS:

[Figure: load success]
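
A task that shows SUCCESS does not necessarily mean the data can be queried right away, because the segments still have to be loaded. As a rough check, the sketch below asks the Router for the list of queryable datasources and looks for a given name. It assumes the /druid/v2/datasources endpoint returns a JSON array of names such as ["wikipedia"]; the helper name datasource_exists is hypothetical, and the plain-text match with grep is a simplification rather than real JSON parsing.

```shell
# Hypothetical helper: succeed if the named datasource appears in the
# list of queryable datasources.
datasource_exists() {
  name="$1"
  base="${2:-https://localhost:9088}"
  # Look for the quoted name in the returned JSON array.
  curl -s -k "$base/druid/v2/datasources" | grep -q "\"$name\""
}
```

For example, datasource_exists wikipedia && echo "wikipedia is queryable" prints the message once the datasource is available.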

Query the data

To query from the user interface, navigate to the "Query" view in the menu and query the wikipedia table.

Alternative: Using the command line

To query from the command line, create a file called query.json with the query:

{
  "query": "SELECT page, COUNT(*) AS Edits FROM wikipedia GROUP BY page ORDER BY Edits DESC LIMIT 10"
}

and execute it:

curl -s -k -X 'POST' -H 'Content-Type:application/json' -d @query.json https://localhost:9088/druid/v2/sql

The result should be similar to:

[Figure: query]

Great! You’ve set up your first Druid cluster, ingested some data and queried it in the web interface.

What’s next

Have a look at the Usage guide page to find out more about the features of the operator, such as S3-backed deep storage (as opposed to the HDFS backend used in this guide) or OPA-based authorization.