S3 resources

Many of the tools on the Stackable platform integrate with S3 storage in some way. For example Druid can ingest data from S3 and also use S3 as a backend for deep storage, Spark can use an S3 bucket to store application files and data.

S3Connection and S3Bucket

Stackable uses S3Connection and S3Bucket objects to configure access to S3 storage. A S3Connection object contains information such as the host name of the S3 server, it’s port, TLS parameters and access credentials. A S3Bucket contains the name of the bucket and a reference to a S3Connection, the connection to the server where the bucket is located. A S3Connection can be referenced by multiple buckets.

Here’s an example of a simple S3Connection object and a S3Bucket referencing that connection:

---
apiVersion: s3.stackable.tech/v1alpha1
kind: S3Connection
metadata:
  name: my-connection-resource
spec:
  host: s3.example.com
  port: 4242
---
apiVersion: s3.stackable.tech/v1alpha1
kind: S3Bucket
metadata:
  name: my-bucket-resource
spec:
  bucketName: my-example-bucket
  connection:
    reference: my-connection-resource

Object Reference Structure

S3Bucket(s) reference S3Connection(s) objects. Both types of objects can be referenced by other resources. For example in a DruidCluster you can specify a bucket for deep storage and an S3Connection for data ingestion. S3Connection objects can be defined in a standalone fashion or they can be inlined into a bucket object. Similarly a bucket can be defined in a standalone object or inlined into an enclosing object.

A diagram showing four variations (A

The diagram above shows four examples of how the objects can be structured.

Variant A shows all S3 objects inlined in a DruidCluster resource. This is a very convenient way to quickly test something since the entire configuration is encapsulated in a single but potentially large manifest.

In variant B the S3 bucket has been split out into its own resource. It can now be referred to by multiple different tools as well.

In variant C the bucket is inlined in the cluster definition. This makes sense if you have a dedicated bucket for a specific purpose, if it is only used in this one cluster instance, in this single product, but they are still hosted in the same place, so they still share a connection.

In variant D all objects are separate from each other. This provides maximum re-usability because the same connection or bucket object can be referenced by multiple resources. It also allows for separation of concerns across team members. Cluster administrators can define S3 connection objects that developers reference in their applications.

Examples

To clarify the concept, a few examples will be given, using a DruidCluster resource as an example.

apiVersion: druid.stackable.tech/v1alpha1
kind: DruidCluster
metadata:
  name: my-druid-cluster
spec:
  deepStorage:
    # to be defined ...
  # more spec here ...

Inline definition

The inline definition is variant A in the figure above.

The DruidCluster encapsulates an S3Bucket

This variant has the advantage that everything is defined in a single file, right where it is going to be used:

apiVersion: druid.stackable.tech/v1alpha1
kind: DruidCluster
metadata:
  name: my-druid-cluster
spec:
  deepStorage:
      s3:
        inline: (1)
          bucketName: my-bucket
          connection:
            inline: (2)
              host: test-minio
              port: 9000
              credentials:
                secretClass: druid-s3-credentials (3)
              tls:
                verification:
                  server:
                    caCert:
                      secretClass: s3-certificate-class (4)
  # more spec here ...
1 The inline definition of the bucket. The bucket definition contains bucketName and connection.
2 The inline definition of the connection. It contains the host and port.
3 The credentials used by Druid to access the S3 bucket.
4 The TLS certificate used by Druid to verify the S3 server.

Stand-alone resources

To reuse S3Buckets across different applications, they can be defined as stand-alone resources. In a similar fashion, S3Connections can also be defined as stand-alone resources. In the example below, ony bucket is used by a DruidCluster and a second bucket is used by both a TrinoCluster and a SparkCluster. Both buckets reference the same S3Connection.

One S3Connection is referenced by two different S3Buckets. The first Bucket is referenced by a DruidCluster and the second bucket is referenced by a SparkCluster and TrinoCluster. No object is inlined.

The specification for the S3Connection named my-connection-resource is shown below:

---
apiVersion: s3.stackable.tech/v1alpha1
kind: S3Connection
metadata:
  name: my-connection-resource
spec:
  host: s3.example.com
  port: 4242

The S3Bucket referenced the connection above by setting the spec.connection.reference field to the name of the connection. In this case my-connection-resource.

---
apiVersion: s3.stackable.tech/v1alpha1
kind: S3Bucket
metadata:
  name: my-bucket-resource
spec:
  bucketName: my-example-bucket
  connection:
    reference: my-connection-resource

To use the bucket data in a Druid cluster, set the spec.deepStorage.s3.reference field to the name of the bucket as shown below:

apiVersion: druid.stackable.tech/v1alpha1
kind: DruidCluster
metadata:
  name: my-druid-cluster
spec:
  deepStorage:
      s3:
        reference: my-bucket-resource
  # more spec here ...

Credentials

No matter if a connection is specified inline or as a separate object, the credentials are always specified in the same way. You will need a Secret containing the access key ID and secret access key, a SecretClass and then a reference to this SecretClass where you want to specify the credentials.

The Secret:

apiVersion: v1
kind: Secret
metadata:
  name: s3-credentials
  labels:
    secrets.stackable.tech/class: s3-credentials-class  (1)
stringData:
  accessKey: YOUR_VALID_ACCESS_KEY_ID_HERE
  secretKey: YOUR_SECRET_ACCESS_KEY_THAT_BELONGS_TO_THE_KEY_ID_HERE
1 This label connects the Secret to the SecretClass.

The SecretClass:

apiVersion: secrets.stackable.tech/v1alpha1
kind: SecretClass
metadata:
  name: s3-credentials-class
spec:
  backend:
    k8sSearch:
      searchNamespace:
        pod: {}

Reference it from the connection object:

...
host: test-minio
port: 9000
credentials:
  secretClass: s3-credentials-class
...

TLS

TLS certificates are specified in a similar way. You will need a SecretClass that is referenced by the connection object. When using the k8sSearch backend, a Secret object containing the ca certificate is also needed.

The Secret must contain the CA certificate in the form of a ca.crt key, which contains the Certificate Authority that signed the server’s certificate. The values are expected to be base64 encoded PEM files.

In this case, the SecretClass would look like this:

apiVersion: secrets.stackable.tech/v1alpha1
kind: SecretClass
metadata:
  name: s3-certificate-class
spec:
  backend:
    k8sSearch:
      searchNamespace:
        pod: {}

And the Secret object defining the certificate/key pair:

apiVersion: v1
kind: Secret
metadata:
  name: s3-certificates
  labels:
    secrets.stackable.tech/class: s3-certificate-class
data:
  ca.crt: ...

Finally, reference the SecretClass from the connection object:

...
host: test-minio
port: 9000
tls:
  verification:
    server:
      caCert:
        secretClass: s3-certificate-class

What’s next