jupyterhub-keycloak
Installation
To install the demo on an existing Kubernetes cluster, use the following command:
$ stackablectl demo install jupyterhub-keycloak
Accessing the JupyterHub Interface
- Navigate to the JupyterHub web interface using the NodePort IP and port (e.g., http://<ip>:31095)
- Log in using the predefined user credentials (e.g., justin.martin or isla.williams, with the password matching the username)
- Select a notebook profile (provided by the Jupyter community) and start processing data using the provided notebook
System requirements
To run this demo, your system needs at least:
- 8 CPU units (core/hyperthread)
- 32 GiB memory
Additional resources may be required depending on the number of concurrent users and their selected notebook profiles.
Overview
The JupyterHub-Keycloak integration demo offers a comprehensive and secure multi-user data science environment on Kubernetes. This demo highlights several key features:
- Secure Authentication: Utilizes Keycloak for robust user authentication and identity management
- Dynamic Spark Integration: Demonstrates how to start a distributed Spark cluster directly from a Jupyter notebook, with dynamic resource allocation
- S3 Storage Interaction: Illustrates reading from and writing to an S3-compatible storage (MinIO) using Spark, with secure credential management
- Scalable and Flexible: Leverages Kubernetes for scalable resource management, allowing users to select from predefined resource profiles
- User-Friendly: Provides an intuitive interface for data scientists to perform common data operations with ease
This demo will:
- Install the required Stackable Data Platform operators
- Spin up the following data products:
  - JupyterHub: A multi-user server for Jupyter notebooks
  - Keycloak: An identity and access management product
  - S3: A MinIO instance for data storage
- Download a sample of gas sensor measurements* into S3
- Install the Jupyter notebook
- Demonstrate some basic data operations against S3
- Enable multi-user usage
Introduction to the Demo
The JupyterHub-Keycloak demo is designed to provide data scientists with a typical environment for data analysis and processing. This demo integrates JupyterHub with Keycloak for secure user management and utilizes Apache Spark for distributed data processing. The environment is deployed on a Kubernetes cluster, ensuring scalability and efficient resource utilization.
There are some security considerations to be aware of if using distributed Spark. Each Spark cluster runs using the same service account, and it is possible for an executor pod to mount any secret in the namespace. Restricting this with OPA Gatekeeper rules is planned for a later version of this demo; in the meantime, users' environments are kept separate but not private.
Showcased Features of the Demo
Secure User Authentication with Keycloak
- OAuthenticator: JupyterHub is configured to use Keycloak for user authentication, ensuring secure and manageable access control.
- Admin Users: Certain users (e.g. isla.williams in this demo) are configured as admin users with access to user management features in the JupyterHub admin console.
Dynamic Spark Configuration
- Client Mode: Spark is configured to run in client mode, with the notebook acting as the driver. This setup is ideal for interactive data processing.
- Executor Management: Spark executors are dynamically spawned as Kubernetes pods, with executor resources being defined by each user’s Spark session.
- Compatibility: Ensures compatibility between the driver and executors by matching Spark, Python, and Java versions.
S3 Storage Integration
- MinIO: Utilizes MinIO as an S3-compatible storage solution for storing and retrieving data.
- Secure Credential Management: MinIO credentials are managed using Kubernetes secrets, keeping them separate from notebook code.
- Data Operations: Demonstrates reading from and writing to S3 storage using Spark, with support for CSV and Parquet formats.
Configuration Settings Overview
Keycloak Configuration
- Deployment: Keycloak is deployed using a Kubernetes Deployment with a ConfigMap for realm configuration.
- Services: Keycloak and JupyterHub services use fixed NodePorts (31093 for Keycloak and 31095 for JupyterHub).
JupyterHub Configuration
- Authentication: Configured to use GenericOAuthenticator for authenticating against Keycloak.
- Certificates: Utilizes self-signed certificates for secure communication between JupyterHub and Keycloak.
- Endpoints: The endpoints for the OAuth callback, authorization, token and user data are set dynamically using environment variables and a ConfigMap.
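For illustration, a minimal sketch of the corresponding settings in jupyterhub_config.py terms is shown below. The realm name, client ID and addresses are placeholders, not the demo's actual values; the demo derives the real endpoints from environment variables and the keycloak-address ConfigMap:
import os
from oauthenticator.generic import GenericOAuthenticator

# Placeholders: the demo injects these via environment variables and a ConfigMap.
keycloak = os.environ.get("KEYCLOAK_ADDRESS", "<node-ip>:31093")
realm = "demo"  # hypothetical realm name

# "c" is the config object JupyterHub provides when loading jupyterhub_config.py.
c.JupyterHub.authenticator_class = GenericOAuthenticator
c.GenericOAuthenticator.client_id = "jupyterhub"  # hypothetical client ID
c.GenericOAuthenticator.client_secret = os.environ["OAUTH_CLIENT_SECRET"]
c.GenericOAuthenticator.oauth_callback_url = "http://<node-ip>:31095/hub/oauth_callback"
c.GenericOAuthenticator.authorize_url = f"https://{keycloak}/realms/{realm}/protocol/openid-connect/auth"
c.GenericOAuthenticator.token_url = f"https://{keycloak}/realms/{realm}/protocol/openid-connect/token"
c.GenericOAuthenticator.userdata_url = f"https://{keycloak}/realms/{realm}/protocol/openid-connect/userinfo"
c.GenericOAuthenticator.username_claim = "preferred_username"
c.Authenticator.allow_all = True  # don't filter here: delegate to Keycloak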
Spark Configuration
- Executor Image: Uses a custom image oci.stackable.tech/sandbox/spark:3.5.2-python311 (built on the standard Spark image) for the executors, matching the Python version of the notebook.
- Resource Allocation: Configures Spark executor instances, memory, and cores through settings defined in the notebook (see the sketch below).
- Hadoop and AWS Libraries: Includes the necessary Hadoop and AWS libraries for S3 operations, matching the notebook image version.
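To make this concrete, a hedged sketch of how such a session can be assembled in the notebook is shown below; the endpoint, credentials and resource values are illustrative placeholders rather than the demo's exact settings:
from pyspark.sql import SparkSession

# Placeholders: in the demo the credentials come from a mounted Kubernetes secret.
access_key = "<minio-access-key>"
secret_key = "<minio-secret-key>"

spark = (
    SparkSession.builder
    .appName("process-s3")
    # Client mode: the notebook pod acts as the driver and talks to the k8s API.
    # (Depending on the setup, spark.driver.host may also need to be set to the
    # notebook pod's address so that executors can reach the driver.)
    .master("k8s://https://kubernetes.default.svc:443")
    # Custom executor image so Spark, Python and Java versions match the notebook:
    .config("spark.kubernetes.container.image", "oci.stackable.tech/sandbox/spark:3.5.2-python311")
    .config("spark.executor.instances", "1")
    .config("spark.executor.memory", "1g")
    .config("spark.executor.cores", "1")
    # Hadoop/AWS libraries for S3A access (versions must match the Hadoop build):
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262")
    # MinIO as S3-compatible storage; the endpoint is a placeholder:
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.access.key", access_key)
    .config("spark.hadoop.fs.s3a.secret.key", secret_key)
    .getOrCreate()
)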
For more details, see the tutorial.
Detailed Demo/Notebook Walkthrough
The demo showcases an IPython notebook that begins by outputting the versions of Python, Java, and PySpark in use. It reads MinIO credentials from a mounted secret to access the S3 storage, which ensures that the environment is correctly set up and that the necessary credentials are available for S3 operations. The notebook configures Spark to interact with an S3 bucket hosted on MinIO, including the necessary Hadoop and AWS libraries to facilitate S3 operations. The Spark session is configured with various settings, including executor instances, memory, and cores, to ensure optimal performance.
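The opening cells might look something like the sketch below; the secret mount path is an assumption for illustration, not necessarily the demo's actual path:
import sys
import pyspark

# Output the runtime versions (Python must match the executor image at major.minor level):
print(f"Python:  {sys.version}")
print(f"PySpark: {pyspark.__version__}")

# Read MinIO credentials from a mounted secret; /etc/s3-credentials is a hypothetical path.
with open("/etc/s3-credentials/accessKey") as f:
    access_key = f.read().strip()
with open("/etc/s3-credentials/secretKey") as f:
    secret_key = f.read().strip()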
The demo then performs various data processing tasks, including:
- Creating an In-Memory DataFrame: Verifies compatibility between the driver and executor libraries.
- Inspecting S3 Buckets with PyArrow: Lists files in the S3 bucket using the PyArrow library.
- Read/Write Operations: Demonstrates reading CSV data from S3, performing basic transformations, and writing the results back to S3 in CSV and Parquet formats.
- Data Aggregation: Aggregates data by hour and writes the aggregated results back to S3.
- DataFrame Conversion: Shows how to convert between Spark and Pandas DataFrames for further analysis or visualization.
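Taken together, these steps follow a pattern like the sketch below. Bucket names, paths and column names are placeholders, and spark, access_key and secret_key refer to the session and credentials set up earlier:
from pyarrow import fs
from pyspark.sql import functions as F

# Inspect the bucket with PyArrow (endpoint is a placeholder, as above):
s3 = fs.S3FileSystem(endpoint_override="http://minio:9000", scheme="http",
                     access_key=access_key, secret_key=secret_key)
print(s3.get_file_info(fs.FileSelector("demo/", recursive=True)))

# Read the raw CSV data from S3:
df = spark.read.csv("s3a://demo/gas-sensor/raw/", header=True, inferSchema=True)

# Basic transformation, then write back in CSV and Parquet formats:
df = df.withColumn("ts", F.to_timestamp("timestamp"))
df.write.mode("overwrite").csv("s3a://demo/gas-sensor/csv/", header=True)
df.write.mode("overwrite").parquet("s3a://demo/gas-sensor/parquet/")

# Aggregate by hour and persist the results:
hourly = (df.groupBy(F.date_trunc("hour", "ts").alias("hour"))
            .agg(F.avg("reading").alias("avg_reading")))
hourly.write.mode("overwrite").parquet("s3a://demo/gas-sensor/aggregated/")

# Convert to a Pandas DataFrame for further analysis or visualization:
pdf = hourly.toPandas()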
Users
The same users as in the End-to-end-security demo are configured in Keycloak, and these will be used as examples.
JupyterHub
Have a look at the available Pods before logging in:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
hub-84f49ccbd7-29h7j 1/1 Running 0 56m
keycloak-544d757f57-f55kr 2/2 Running 0 57m
load-gas-data-m6z5p 0/1 Completed 0 54m
minio-5486d7584f-x2jn8 1/1 Running 0 57m
proxy-648bf7f45b-62vqg 1/1 Running 0 56m
The proxy Pod has an associated proxy-public service with a statically-defined port (31095), exposed with type NodePort. The keycloak Pod has a Service called keycloak, also with a fixed port (31093) of type NodePort.
To reach the JupyterHub web interface, navigate to this service. The node port IP can be found in the ConfigMap keycloak-address (written by the Keycloak Deployment as it starts up). On Kind this can be any node, not necessarily the one where the proxy Pod is running; this is due to the way Docker networking is used within the cluster. On other clusters it will be necessary to use the exact node on which the proxy is running. In the example below that would be 172.19.0.5:31095:
apiVersion: v1
data:
  keycloakAddress: 172.19.0.5:31093 # Keycloak itself
  keycloakNodeIp: 172.19.0.5        # can be used to access the proxy-public service
kind: ConfigMap
metadata:
  name: keycloak-address
  namespace: default
The hub Pod may show a CreateContainerConfigError for a few moments on start-up as it requires the ConfigMap written by the Keycloak deployment.
You should see the JupyterHub login page, which will indicate a re-direct to the OAuth service (Keycloak):

Click on the sign-in button.
You will be redirected to the Keycloak login, where you can enter one of the aforementioned users (e.g. justin.martin or isla.williams; the password is the same as the username):

A successful login will redirect you back to JupyterHub where different profiles are listed (the drop-down options are visible when you click on the respective fields):

The explorer window on the left includes a notebook that is already mounted.
Double-click on the file notebook/process-s3.ipynb:

Run the notebook by selecting "Run All Cells" from the menu:

The notebook includes some comments regarding image compatibility and uses a custom image built off the official Spark image that matches the Spark version used in the notebook. The Java versions also match exactly. Python versions need to match at the major.minor level, which is why Python 3.11 is used in the custom image.
Once the Spark executor has been started (we have specified spark.executor.instances = 1) it will spin up as an extra pod. We have named the Spark job to incorporate the current user (justin-martin), as sketched below.
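JupyterHub exposes the logged-in user to the notebook through the JUPYTERHUB_USER environment variable, so a per-user job name can be derived along these lines (a sketch; the demo's exact naming scheme may differ):
import os

# Derive a Kubernetes-friendly application name from the JupyterHub user.
user = os.environ.get("JUPYTERHUB_USER", "unknown").replace(".", "-")
app_name = f"process-s3-jupyter-{user}"  # e.g. process-s3-jupyter-justin-martin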
JupyterHub has started a pod for the user’s notebook instance (jupyter-justin-martin---bdd3b4a1) and another one for the Spark executor (process-s3-jupyter-justin-martin-bdd3b4a1-9e9da995473f481f-exec-1):
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
...
jupyter-justin-martin---bdd3b4a1 1/1 Running 0 17m
process-s3-jupyter-justin-martin-... 1/1 Running 0 2m9s
...
Stop the kernel in the notebook (which will shut down the Spark session and thus the executor) and log out as the current user. Log in now as daniel.king and then again as isla.williams (you may need to do this in a clean browser session so that existing login cookies are removed). The latter user has been defined as an admin user in the JupyterHub configuration:
...
hub:
  config:
    Authenticator:
      # don't filter here: delegate to Keycloak
      allow_all: True
      admin_users:
        - isla.williams
...
You should now see user-specific pods for all three users:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
...
jupyter-daniel-king---181a80ce 1/1 Running 0 6m17s
jupyter-isla-williams---14730816 1/1 Running 0 4m50s
jupyter-justin-martin---bdd3b4a1 1/1 Running 0 3h47m
...
The admin user (isla.williams) will also have an extra Admin tab in the JupyterHub console where current users can be managed. You can find this in the JupyterHub UI at http://<ip>:31095/hub/admin, e.g. http://172.19.0.5:31095/hub/admin:

You can inspect the S3 buckets by using stackablectl stacklet list to return the MinIO endpoint and logging in there with admin/adminadmin:
$ stackablectl stacklet list
┌─────────┬───────────────┬───────────┬───────────────────────────────┬────────────┐
│ PRODUCT ┆ NAME ┆ NAMESPACE ┆ ENDPOINTS ┆ CONDITIONS │
╞═════════╪═══════════════╪═══════════╪═══════════════════════════════╪════════════╡
│ minio ┆ minio-console ┆ default ┆ http http://172.19.0.5:32470 ┆ │
└─────────┴───────────────┴───────────┴───────────────────────────────┴────────────┘

If you attempt to re-run the notebook, you will need to first remove the _temporary folders from the S3 buckets. These are created by Spark jobs and are not removed from the bucket when the job has completed.
Where to go from here
Add your own data
You can augment the demo dataset with your own data by creating new buckets and folders and uploading your own data via the MinIO UI.
Scale up and out
There are several possibilities here (all of which will depend to some degree on resources available to the cluster):
- Allocate more CPU and memory resources to the JupyterHub notebooks or change notebook profiles by modifying the singleuser.profileList in the Helm chart values
- Add concurrent users
- Alter Spark session settings by changing spark.executor.instances, spark.executor.memory or spark.executor.cores (see the sketch after this list)
- Integrate other data sources, for example HDFS (see the JupyterHub-Pyspark demo)
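For example, to re-run the notebook with more executor capacity you can stop the current session and create a new one with adjusted settings; the values below are purely illustrative:
from pyspark.sql import SparkSession

spark.stop()  # shuts down the current session and its executors

spark = (
    SparkSession.builder
    .appName("process-s3")
    .config("spark.executor.instances", "4")  # more parallelism
    .config("spark.executor.memory", "2g")    # larger executors
    .config("spark.executor.cores", "2")
    .getOrCreate()
)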
Conclusion
The JupyterHub-Keycloak integration demo, with its dynamic Spark integration and S3 storage interaction, is a great starting point for data scientists to begin building complex data operations.
For further details and customization options, refer to the demo notebook and configuration files provided in the repository. This environment is ideal for data scientists with a platform engineering background, offering a template solution for secure and efficient data processing.
*See: Burgués, Javier, Juan Manuel Jiménez-Soto, and Santiago Marco. "Estimation of the limit of detection in semiconductor gas sensors through linearized calibration models." Analytica Chimica Acta 1013 (2018): 13-25; and Burgués, Javier, and Santiago Marco. "Multivariate estimation of the limit of detection by orthogonal partial least squares in temperature-modulated MOX sensors." Analytica Chimica Acta 1019 (2018): 49-64.