end-to-end-security
This demo showcases what can be done with Open Policy Agent around authorization in the Stackable Data Platform. It covers several aspects of security, which are listed among the demo steps below.
This demo will:
- Install the Stackable operators
- Spin up the following data products
  - Trino: A fast distributed SQL query engine for big data analytics that helps you explore your data universe. This demo uses it to enable SQL access to the data.
  - Spark: A multi-language engine for executing data engineering, data science, and machine learning workloads. This demo uses it to create a (rather simple) report and write the results back to persistent storage.
  - HDFS: A distributed file system that is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
  - Hive metastore: A service that stores metadata related to Apache Hive and other services. This demo uses it as metadata storage for Trino and Spark.
  - Open Policy Agent (OPA): An open-source, general-purpose policy engine that unifies policy enforcement across the stack. This demo uses it as the authorizer for Trino, which decides which user can query which data.
  - Superset: A modern data exploration and visualization platform. This demo uses Superset to retrieve data from Trino via SQL queries and build dashboards on top of that data.
- Configure security to showcase the following features
  - Column- and row-level filtering
  - OIDC support across the board
  - Kerberos on Kubernetes
  - Keycloak and flexible group lookup
  - Open Policy Agent for the utmost flexibility in building access rules
The following figure gives an overview of how the components interact with each other:
$ stackablectl demo install end-to-end-security
This demo should not be run alongside other demos.
System requirements
To run this demo, your system needs at least:
- 9 CPU units (cores/hyperthreads)
- 20 GiB memory
- 40 GiB disk storage
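As a quick sanity check before installing, the minimums above can be compared against what a cluster offers. The following is a minimal sketch: the function name `shortfalls` and the dictionary keys are hypothetical, and the available capacity has to be supplied by you (for example, from your cluster's node information).

```python
# Minimum resources stated in the system requirements above.
REQUIRED = {"cpu_units": 9, "memory_gib": 20, "disk_gib": 40}


def shortfalls(available: dict) -> dict:
    """Return how much of each resource is missing (empty dict if all are met).

    `available` uses the same (hypothetical) keys as REQUIRED;
    a missing key is treated as zero capacity.
    """
    missing = {}
    for name, needed in REQUIRED.items():
        have = available.get(name, 0)
        if have < needed:
            missing[name] = needed - have
    return missing


if __name__ == "__main__":
    # Example capacity values are made up for illustration.
    print(shortfalls({"cpu_units": 8, "memory_gib": 32, "disk_gib": 40}))
```

An empty result means the stated minimums are met; it does not account for resources already consumed by other workloads.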
Recording
On 2024-05-16, our colleague Sönke Liebau gave a Stackable TechTalk titled "Mastering Data Platform Security". You can find the recording on YouTube.
Overview
You can see the deployed products and their relationship in the following diagram:
Note the different types of arrows connecting the technologies: they indicate how authentication happens along each route and whether impersonation is used for the queries executed.
The Trino schema (with schemas, tables and views) is shown below.
User credentials
The following user accounts are configured in Keycloak:
Username | Password | Role |
---|---|---|
sophia.clarke | sophia.clarke | Head of Compliance Analytics |
william.lewis | william.lewis | Team member of Compliance Analytics |
daniel.king | daniel.king | Team member of Compliance Analytics |
pamela.scott | pamela.scott | Head of Customer Analytics |
justin.martin | justin.martin | Team member of Customer Analytics |
isla.williams | isla.williams | Team member of Customer Analytics |
mark.ketting | mark.ketting | Head of Marketing |
Ruleset
The rules configured in this demo show different ways of granting full or restricted access to data with OPA.
General Access Control
At the highest level, everybody is allowed to see data from the schema of the department they are a member of. So in the following example, Justin Martin, who is a member of the Customer Analytics department, will only be able to see tables from the Customer Analytics schema.
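The logic of this rule can be sketched in a few lines of Python. This is only an illustration of the decision, not the demo's actual policy, which is evaluated by OPA. The department assignments are copied from the user table above; the schema naming scheme (lowercase with underscores) and the function names are assumptions made for this example.

```python
# Department per user, copied from the Keycloak user table above.
USER_DEPARTMENT = {
    "sophia.clarke": "Compliance Analytics",
    "william.lewis": "Compliance Analytics",
    "daniel.king": "Compliance Analytics",
    "pamela.scott": "Customer Analytics",
    "justin.martin": "Customer Analytics",
    "isla.williams": "Customer Analytics",
    "mark.ketting": "Marketing",
}


def schema_for(department: str) -> str:
    """Assumption: each department maps to a lowercase, underscore-separated schema name."""
    return department.lower().replace(" ", "_")


def can_access_schema(user: str, schema: str) -> bool:
    """Allow access only to the schema of the user's own department."""
    department = USER_DEPARTMENT.get(user)
    if department is None:
        # Unknown users are denied by default.
        return False
    return schema_for(department) == schema


if __name__ == "__main__":
    print(can_access_schema("justin.martin", "customer_analytics"))
    print(can_access_schema("justin.martin", "marketing"))
```

Denying unknown users by default mirrors the usual practice of writing access rules as explicit allows.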