S3 resources

Many of the tools on the Stackable platform integrate with S3 storage in some way. For example Druid can ingest data from S3 and also use S3 as a backend for deep storage, Spark can use an S3 bucket to store application files and data.

S3Connection and S3Bucket

Stackable uses S3Connection and S3Bucket objects to configure access to S3 storage. An S3Connection object contains information such as the host name of the S3 server, it’s port, TLS parameters and access credentials. An S3Bucket contains the name of the bucket and a reference to an S3Connection, the connection to the server where the bucket is located. An S3Connection can be referenced by multiple buckets.

Here’s an example of a simple S3Connection object and an S3Bucket referencing that connection:

---
apiVersion: s3.stackable.tech/v1alpha1
kind: S3Connection
metadata:
  name: my-connection-resource
spec:
  host: s3.example.com
  port: 4242
---
apiVersion: s3.stackable.tech/v1alpha1
kind: S3Bucket
metadata:
  name: my-bucket-resource
spec:
  bucketName: my-example-bucket
  connection:
    reference: my-connection-resource

Object Reference Structure

S3Bucket(s) reference S3Connection(s) objects. Both types of objects can be referenced by other resources. For example in a DruidCluster you can specify a bucket for deep storage and an S3Connection for data ingestion. S3Connection objects can be defined in a standalone fashion or they can be inlined into a bucket object. Similarly a bucket can be defined in a standalone object or inlined into an enclosing object.

The diagram above shows four examples of how the objects can be structured.

Variant A shows all S3 objects inlined in a DruidCluster resource. This is a very convenient way to quickly test something since the entire configuration is encapsulated in a single but potentially large manifest.

In variant B the S3 bucket has been split out into its own resource. It can now be referred to by multiple different tools as well.

In variant C the bucket is inlined in the cluster definition. This makes sense if you have a dedicated bucket for a specific purpose, if it is only used in this one cluster instance, in this single product, but they are still hosted in the same place, so they still share a connection.

In variant D all objects are separate from each other. This provides maximum re-usability because the same connection or bucket object can be referenced by multiple resources. It also allows for separation of concerns across team members. Cluster administrators can define S3 connection objects that developers reference in their applications.

Examples

To clarify the concept, a few examples will be given, using a DruidCluster resource as an example.

apiVersion: druid.stackable.tech/v1alpha1
kind: DruidCluster
metadata:
  name: my-druid-cluster
spec:
  deepStorage:
    # to be defined ...
  # more spec here ...

Inline definition

The inline definition is variant A in the figure above.

The DruidCluster encapsulates an S3Bucket

This variant has the advantage that everything is defined in a single file, right where it is going to be used:

apiVersion: druid.stackable.tech/v1alpha1
kind: DruidCluster
metadata:
  name: my-druid-cluster
spec:
  deepStorage:
      s3:
        inline: (1)
          bucketName: my-bucket
          connection:
            inline: (2)
              host: test-minio
              port: 9000
  # more spec here ...

1	The inline definition of the bucket. The bucket definition contains `bucketName` and `connection`.
2	The inline definition of the connection. It contains the `host` and `port`.

Stand-alone resources

Often multiple buckets are used across a data pipeline, as well as buckets being used by different applications, so stand-alone resource definitions that can be referenced from multiple objects make sense.

One S3Connection is referenced by two different S3Buckets. The first Bucket is referenced by a DruidCluster and the second bucket is referenced by a SparkCluster and TrinoCluster. No object is inlined.

The DruidCluster references the S3Bucket, which in turn references the S3Connection. First the definition of the S3Connection:

---
apiVersion: s3.stackable.tech/v1alpha1
kind: S3Connection
metadata:
  name: my-connection-resource
spec:
  host: s3.example.com
  port: 4242

Then the bucket, which references the connection:

---
apiVersion: s3.stackable.tech/v1alpha1
kind: S3Bucket
metadata:
  name: my-bucket-resource
spec:
  bucketName: my-example-bucket
  connection:
    reference: my-connection-resource

You can then use this bucket, for example in Druid, as a deep storage:

apiVersion: druid.stackable.tech/v1alpha1
kind: DruidCluster
metadata:
  name: my-druid-cluster
spec:
  deepStorage:
      s3:
        reference: my-bucket-resource
  # more spec here ...

Credentials

No matter if a connection is specified inline or as a separate object, the credentials are always specified in the same way. You will need a Secret containing the access key ID and secret access key, a SecretClass and then a reference to this SecretClass where you want to specify the credentials.

The Secret:

apiVersion: v1
kind: Secret
metadata:
  name: s3-credentials
  labels:
    secrets.stackable.tech/class: s3-credentials-class  (1)
stringData:
  accessKey: YOUR_VALID_ACCESS_KEY_ID_HERE
  secretKey: YOUR_SECRET_ACCESS_KEY_THAT_BELONGS_TO_THE_KEY_ID_HERE

1	This label connects the `Secret` to the `SecretClass`.

The SecretClass:

apiVersion: secrets.stackable.tech/v1alpha1
kind: SecretClass
metadata:
  name: s3-credentials-class
spec:
  backend:
    k8sSearch:
      searchNamespace:
        pod: {}

Referencing it:

...
credentials:
  secretClass: s3-credentials-class
...

What’s next

Find details about the options of the S3 resource in the S3 resources reference.