ADR029: Standardize database connections

Status: accepted
Deciders:
- Felix Hennig
- Lukas Voetmand
- Malte Sander
- Razvan Mihai
- Sascha Lautenschläger
- Sebastian Bernauer
Date: 2022-12-08

Technical Story: https://github.com/stackabletech/issues/issues/238

We might want to incorporate changes to address https://github.com/stackabletech/issues/issues/681, maybe as V2?

Parts of this document might be out of date. The source of truth is in the finished implementation in operator-rs

Context and Problem Statement

Many products supported by the Stackable Data Platform require databases to store metadata. Currently there is no uniform, consistent way to define database connections. In addition, some Stackable operators define database credentials to be provided inline and in plain text in the cluster definitions.

A quick analysis of the status-quo regarding database connection definitions shows how different operators handle them:

Apache Hive: the cluster custom resource defined a field called "database" with access credentials in clear text.
Apache Airflow and Apache Superset: uses a field called "credentialSecret" that contains multiple different database connection definitions. Even worse, it contains credentials not related to a database, such as a secret to encrypt the cookies. In case of Airflow, this secret only supports the Celery executor.
Apache Druid: uses a field called "metadataStorageDatabase" where access credentials are expected to be inline and in plain text.

Decision Drivers

Here we attempt to standardize the way database connections are defined across the Stackable platform in such a way that:

Different database systems are supported.
Access credentials are defined in Kubernetes Secret` objects.
Product configuration only allows (product) supported databases …
But there is a generic way to configure additional database systems.
Misconfigured connections should be rejected as early as possible in the product lifecycle.
Generated CRD documentation is easy to follow by users.

Initially we thought that database connections should be implemented as stand-alone Kubernetes resources and should be referenced in cluster definitions. This idea was thrown away mostly because sharing database connections across products is not good practice and we shouldn’t encourage it.

Considered Options

(rejected) DatabaseConnection A generic resource definition.
(rejected) Database driver specific resource definition.
(accepted) Product supported and a generic DB specifications.

1. (rejected) `DatabaseConnection` A generic resource definition

The first idea was to introduce a new Kubernetes resource called DatabaseConnection with the following fields:

Field name

Description

credentials

A string with name of a Secret containing at least a user name and a password field. Additional fields are allowed.

driver

A string with the database driver named. This is a generic field that identifies the type of the database used.

protocol

The protocol prefix of the final connection string. Most Java based products will use jdbc:.

host

A string with the host name to connect to.

instance

A string with the database instance to connect to. Optional.

port

A positive integer with the TCP port used for the connection. Optional.

properties

A dictionary of additional properties for driver tuning like number of client threads, various buffer sizes and so on. Some drivers, like derby use this to define the database name and whether the DB should by automatically created or not. Optional

The Secret object referenced by credentials must contain two fields named USER_NAME and PASSWORD but can contain additional fields like first name, last name, email, user role and so on.

Examples

These examples showcase the spec change required from the current status:

The current Druid metadata database connection

---
metadataStorageDatabase:
    dbType: postgresql
    connString: jdbc:postgresql://druid-postgresql/druid
    host: druid-postgresql
    port: 5432
    user: druid
    password: druid

becomes

---
metadataStorageDatabase: druid-metadata-connection

where druid-metadata-connection is a standalone DatabaseConnection resource defined as follows

---
apiVersion: db.stackable.tech/v1alpha1
kind: DatabaseConnection
metadata:
    name: druid-metadata-connection
spec:
    driver: postgresql
    host: druid-postgresql
    port: 5432
    protocol: jdbc:postgresql
    instance: druid
    credentials: druid-metadata-credentials

and the credentials field contains the name of a Kubernetes Secret defined as:

---
apiVersion: v1
kind: Secret
metadata:
  name: druid-metadata-credentials
type: Opaque
data:
  USER_NAME: druid
  PASSWORD: druid

This idea was discarded because it didn’t satisfy all acceptance criteria. In particular it wouldn’t be possible to catch misconfigurations at cluster creation time.

(rejected) 2. Database driver specific resource definition.

In an attempt to address the issues of the first option above, a more detailed specification was necessary. Here, database generic configurations are possible that can be better validated, as in the example below.

---
apiVersion: databaseconnection.stackable.tech/v1alpha1
kind: DatabaseConnection
metadata:
    name: druid-metadata-connection
    namespace: default
spec:
  database:
    postgresql:
      host: druid-postgresql # mandatory
      port: 5432 # defaults to some port number - depending on wether tls is enabled
      schema: druid # defaults to druid
      credentials: druid-postgresql-credentials # mandatory. key username and password
      parameters: {} # optional
    redis:
      host: airflow-redis-master # mandatory
      port: 6379 # defaults to some port number - depending on wether tls is enabled
      schema: druid # defaults to druid
      credentials: airflow-redis-credentials # optional. key password
      parameters: {} # optional
    derby:
      location: /tmp/derby/ # optional, defaults to /tmp/derby-{metadata.name}/derby.db
      parameters: # optional
        create: "true"
    genericConnectionString:
      driver: postgresql
      format: postgresql://$SUPERSET_DB_USER:$SUPERSET_DB_PASS@postgres.default.svc.local:$SUPERSET_DB_PORT/superset&param1=value1&param2=value2
      secret: ... # optional
         SUPERSET_DB_USER: ...
         SUPERSET_DB_PASS: ...
         SUPERSET_DB_PORT: ...
    generic:
      driver: postgresql
      host: superset-postgresql.default.svc.cluster.local # optional
      port: 5432 # optional
      protocol: pgsql123 # optional
      instance: superset # optional
      credentials: name-of-secret-with-credentials #optional
      parameters: {...} # optional
      connectionStringFormat: "{protocol}://{credentials.user_name}:{credentials.credentials}@{host}:{port}/{instance}&[parameters,;]"
      tls: # optional
        verification:
          ca_cert:
            ...
In addition, a second generic DB type (`genericConnectionString`) is introduced. This specification allows templating connection URLs with variables defined in secrets and it's not restricted only to user credentials.

This proposal was rejected because for the same reason as the first proposal. In addition, it fails to make possible DB configurations product specific.

(accepted) Product supported and a generic DB specifications.

It seems that an unique, platform wide mechanism to describe database connections that also fulfills all acceptance criteria is not feasible. Database drivers and product configurations are too diverse and cannot be forced into a type safe specification.

Thus the single, global connection manifest needs to split into two different categories, each covering a subset of the acceptance criteria:

A database specific mechanism. This allows to catch misconfigurations early, it promotes good documentation and uniformity inside the platform.
An operator specific mechanism. This is a wildcard that can be used to configure database connections that are not officially supported by the products but that can still be partially validated early.

The first mechanism requires the operator framework to provide predefined structures and supporting functions for widely available database systems such as: PostgreSQL, MySQL, MariaDB, Oracle, SQLite, Derby, Redis and so on. This doesn’t mean that all products can be configured with all DB implementations. The product definitions will only allow the subset that is officially supported by the products. For that, every product operator defines a complex enum of exactly the databases it supports.

Database specific manifests

Support for the following database systems is planned. Additional systems may be added in the future.

PostgreSQL

postgresql:
  host: my-airflow.default.svc.cluster.local # mandatory
  database: my_database # mandatory
  port: 5432 # optional, default is 5432
  credentialsSecretName: airflow-postgresql-credentials # mandatory
  parameters:
    createDatabaseIfNotExist: true
    foo: bar

PostgreSQL supports multiple authentication mechanisms as described here.

2.) MySQL

mysql:
  host: my-airflow.default.svc.cluster.local
  database: my_database
  port: 3306 # optional, default is 3306
  credentialsSecretName: airflow-mysql-credentials # mandatory
  parameters:
    createDatabaseIfNotExist: true
    foo: bar

MySQL supports multiple authentication mechanisms as described here.

3.) Redis

We need Redis e.g. for celery brokers or result databases.

redis:
  host: my-redis # mandatory
  port: 6379 # optional, default is 6379
  databaseId: 13 # optional, defaults to 0
  credentialsSecretName: redis-credentials # mandatory

4.) Derby

Derby is used often as an embedded database for testing and prototyping ideas and implementations. It’s not recommended for production use-cases.

derby:
  location: /tmp/my-database/ # optional, defaults to /tmp/derby/{unique_database_name}/derby.db