end-to-end-security

This is a demo to showcase what can be done with Open Policy Agent around authorization in the Stackable Data Platform. It covers the following aspects of security:

This demo will:

  • Install the Stackable operators

  • Spin up the following data products

    • Trino: A fast distributed SQL query engine for big data analytics that helps you explore your data universe. This demo uses it to enable SQL access to the data.

    • Spark: A multi-language engine for executing data engineering, data science, and machine learning. This demo uses it to create a (rather simple) report and write the results back into the persistence.

    • HDFS: A distributed file system that is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

    • Hive metastore: A service that stores metadata related to Apache Hive and other services. This demo uses it as metadata storage for Trino and Spark.

    • Open policy agent (OPA): An open-source, general-purpose policy engine unifies policy enforcement across the stack. This demo uses it as the authorizer for Trino and HDFS, which decides which user can query which data.

    • Superset: A modern data exploration and visualization platform. This demo utilizes Superset to retrieve data from Trino via SQL queries and build dashboards on top of that data.

  • Configure security to showcase the following features

    • Column- and row-level filtering

    • OIDC support across the board

    • Kerberos on Kubernetes

    • Keycloak and flexible group lookup

    • Open Policy Agent for the utmost flexibility in building access rules

Install this demo on an existing Kubernetes cluster:

$ stackablectl demo install end-to-end-security

This demo should not be run alongside other demos.

System requirements

To run this demo, your system needs at least:

  • 9 cpu units (core/hyperthread)

  • 20GiB memory

  • 40GiB disk storage

Recording

On 2024-05-16 our colleague Sönke Liebau held a Stackable TechTalk - Mastering Data Platform Security. You can find the recording on Youtube.

Overview

You can see the deployed products and their relationship in the following diagram:

Architectural overview

Note the different types of arrows used to connect the technologies in here, which symbolize how authentication happens along that route and if impersonation is used for queries executed.

The Trino schema (with schemas, tables and views) is shown below.

Trino schema

User credentials

The following user accounts are configured in Keycloak:

Username Password Team member

sophia.clarke

sophia.clarke

Head of Compliance Analytics

william.lewis

william.lewis

Team member of Compliance Analytics

daniel.king

daniel.king

Team member of Compliance Analytics

pamela.scott

pamela.scott

Head of Customer Analytics

justin.martin

justin.martin

Team member of Customer Analytics

isla.williams

isla.williams

Team member of Customer Analytics

mark.ketting

mark.ketting

Head of Marketing

Ruleset

The rules that are configured in this demo show different options of giving full or restricted access to data with OPA.

General Access Control

At the highest level, everybody is allowed to see data from the schema of the department they are a member of. So in the following example, Justin Martin, who is a member of the Customer Service department will only be able to see tables from the Customer Service schema.

e2e justin

Column-based Access Control

Sophia Clarke from the Compliance department can see tables for the Compliance department, but has also been given restricted access to the customers table.

The following diagram shows which rules are in place, you can easily test these with a sql editor of your chice.

e2e sophia

Row-level Access Control

Access control at the row level has been implemented on the employee table, where everybody can see information about themselves, as well as people who report to them.

e2e sophia employee

Decision logging (aka audit log)

The OPA server logs every single request it receives along with the decision it took to STDOUT. This gives you an audit log across the whole Data Platform.

List deployed Stackable services

To list the installed Stackable services run the following command:

$ stackablectl stacklet list
┌───────────┬──────────────┬───────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────────────────────────────┐
│ PRODUCT   ┆ NAME         ┆ NAMESPACE ┆ ENDPOINTS                                                                                                    ┆ CONDITIONS                      │
╞═══════════╪══════════════╪═══════════╪══════════════════════════════════════════════════════════════════════════════════════════════════════════════╪═════════════════════════════════╡
│ hdfs      ┆ hdfs         ┆ default   ┆ datanode-default-0-listener-data     hdfs-datanode-default-0-listener.default.svc.cluster.local:9866         ┆ Available, Reconciling, Running │
│           ┆              ┆           ┆ datanode-default-0-listener-https    https://hdfs-datanode-default-0-listener.default.svc.cluster.local:9865 ┆                                 │
│           ┆              ┆           ┆ datanode-default-0-listener-ipc      hdfs-datanode-default-0-listener.default.svc.cluster.local:9867         ┆                                 │
│           ┆              ┆           ┆ datanode-default-0-listener-metrics  hdfs-datanode-default-0-listener.default.svc.cluster.local:8082         ┆                                 │
│           ┆              ┆           ┆ namenode-default-0-https             https://listener-hdfs-namenode-default-0.default.svc.cluster.local:9871 ┆                                 │
│           ┆              ┆           ┆ namenode-default-0-metrics           listener-hdfs-namenode-default-0.default.svc.cluster.local:8183         ┆                                 │
│           ┆              ┆           ┆ namenode-default-0-rpc               listener-hdfs-namenode-default-0.default.svc.cluster.local:8020         ┆                                 │
│           ┆              ┆           ┆ namenode-default-1-https             https://listener-hdfs-namenode-default-1.default.svc.cluster.local:9871 ┆                                 │
│           ┆              ┆           ┆ namenode-default-1-metrics           listener-hdfs-namenode-default-1.default.svc.cluster.local:8183         ┆                                 │
│           ┆              ┆           ┆ namenode-default-1-rpc               listener-hdfs-namenode-default-1.default.svc.cluster.local:8020         ┆                                 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hive      ┆ hive-iceberg ┆ default   ┆                                                                                                              ┆ Available, Reconciling, Running │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ opa       ┆ opa          ┆ default   ┆                                                                                                              ┆ Available, Reconciling, Running │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ superset  ┆ superset     ┆ default   ┆ external-http                        http://172.18.0.2:30443                                                 ┆ Available, Reconciling, Running │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ trino     ┆ trino        ┆ default   ┆ coordinator-metrics                  172.18.0.2:32156                                                        ┆ Available, Reconciling, Running │
│           ┆              ┆           ┆ coordinator-https                    https://172.18.0.2:31604                                                ┆                                 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ zookeeper ┆ zookeeper    ┆ default   ┆                                                                                                              ┆ Available, Reconciling, Running │
└───────────┴──────────────┴───────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────────────────────────┘

When a product instance has not finished starting yet, the service will have no endpoint. Depending on your internet connectivity, creating all the product instances might take considerable time. A warning might be shown if the product is not ready yet.

Explore the Data

Connect to Trino

Have a look at the trino-operator documentation on how to connect to Trino. It is recommended to use DBeaver, as Trino consists of many schemas and tables and they are easier to explore via a UI. The host and port for the connection can be found in the stackablectl stacklet list output used earlier using the coordinator-https endpoint. In this case it would be https://172.18.0.2:31604.

Because this demo uses OIDC for authentication there are slightly different settings required than what are described in trino-operator documentation linked above:

  • The Username and Password fields must be left blank.

  • An addition setting in the Driver Properties must be added: externalAuthentication: true.

dbeaver 1

When connecting to Trino you should be redirected to the Keycloak login screen, where you can use the credentials listed here.

keycloak 1

Access the data

As an example, logging in using the user justin.martin results in the following view of the data:

dbeaver 2

This user has only access to the employees and customer_analytics schemas and not to the compliance_analytics and marketing schemas as defined by the ruleset.

Security Features

Column- and row-level filtering

To demonstrate column-level filtering, this demo has different access configurations for different departments. justin.martin (Customer Analytics department) has full access to the customer table of the customer-analytics schema, while sophia.clarke (Compliance Analytics department) only sees a subset of the columns in the same table. This can be tested by running the following SQL query as justin.martin and sophia.clarke respectively:

select * from lakehouse.customer_analytics.customer;

For sophia.clarke this query will return an error, since the query is trying to fetch all columns of the table and sophia.clarke is not authorized for all of them. To show only the columns which are accessible using the same query, run the following session configuration query first:

set session hide_inaccessible_columns = true;
dbeaver 3

The ACLs configuring this behavior (written with the Rego language) on the customer table look like the following code snippet for the Compliance Analytics department. By setting the allow property of a column to false, the column becomes inaccessible for the group set in the group property. Additionally, in this example the contents of the c_customer_id and c_email_address columns are masked, one by encrypting the contents with sha256 and the other by replacing parts of the email address using a regular expression.

package trino_policies

import rego.v1

policies := {
    "tables": [
        {
            "group": "/Compliance and Regulation/Analytics",
            "catalog": "lakehouse",
            "schema": "customer_analytics",
            "table": "customer",
            "privileges": ["SELECT"],
            "columns" : [
            {"name": "c_first_name", "allow": false},
            {"name": "c_last_name", "allow": false},
            {"name": "c_birth_day", "allow": false},
            {"name": "c_birth_month", "allow": false},
            {
                "name": "c_customer_id",
                "mask": "'sha256:' || to_hex(sha256(to_utf8(c_customer_id)))",
            },
            {
                "name": "c_email_address",
                "mask": "regexp_replace(c_email_address, '([^@]{1,3})([^@]+)@', '$1---@')",
            },
            ]
        }
    ]
}

This is just a snippet from the full ACL definition that can be found in the trino-policies ConfigMap deployed by the end-to-end-security demo.

The demo also contains examples of row-level filtering. One of these is the restriction on the employees table, where employees can only see themselves and not the records of other employees (except if they are their supervisor).

Logging in as justin.martin results in these rows being shown:

dbeaver 4

The Rego rule for this behavior looks like this (again a snippet from the trino-policies ConfigMap):

package trino_policies

import rego.v1

policies := {
    "tables": [
        {
            "catalog": "lakehouse",
            "schema": "employees",
            "table": "employees",
            "privileges": ["SELECT"],
            "filter": "username = current_user or supervisor = current_user",
        }
    ]
}

Keycloak and flexible group lookup

To connect to the deployed Keycloak instance use a port-forward on the keycloak Service deployed in your cluster:

$ kubectl port-forward service/keycloak 8443:8443

And then open the url https://localhost:8443 to access the Keycloak UI. You can login using the credentials admin for username and adminadmin for password.

The demo has two realms configured, master and demo, shown in the screenshot below. The demo realm is the relevant one with the organizational configurations for the group memberships of the employees.

keycloak 2

Selecting Groups in the menu on the left, gives an overview of the groups defined in Keycloak.

keycloak 3

By navigating to the Customer Service group (clicking on it in the groups overview), then the Analytics child group and lastly selecting the Members tab, you can see the users belonging to that group.

keycloak 4

Click the Add member button, select sophia.clarke and press Add to add her to the Customer Service group.

keycloak 5

Using the Trino connection again logged in as sophia.clarke you can run the following command again, and compare the output to the one before the group change. sophia.clarke should now be able to see more columns than before.

select * from lakehouse.customer_analytics.customer;
dbeaver 5

OIDC support across the board

If you already logged in to Trino, you can try opening up the Superset UI. That is reachable with the external-http endpoint provided by stackablectl stacklet list. In this case the endpoint is http://172.18.0.2:30443. You should see the Sign in with Keycloak button on your screen.

superset 1

When clicking on the button, you are redirected to the Superset UI being already logged in without the need to enter any credentials. That’s because Superset can make use of Single Sign On thanks to already being logged in via Trino.

superset 2

Open Policy Agent for the utmost flexibility in building access rules

Some examples of access rules you can define for the Open Policy Agent authorization mechanism were already shown in the Column- and row-level filtering section. For more examples, check out the trino-policies ConfigMap of this demo and also the ACL section of the 'Mastering Data Platform Security' recording from above.