jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data

This demo showcases the integration between JupyterLab, Spark Connect and Apache Hadoop deployed with the Stackable Data Platform (SDP) on a Kubernetes cluster. The SDP makes this integration easy by publishing a discovery ConfigMap for the HDFS cluster and a Spark Connect service. This ConfigMap is mounted into all Pods running PySpark so that they have access to the HDFS data. The Jupyter notebook is a lightweight client that delegates the model training to the Spark Connect service. For this demo, the HDFS cluster is provisioned with a small sample of the NYC taxi trip dataset, which is analyzed with a notebook that is provisioned automatically in the JupyterLab interface.

Install this demo on an existing Kubernetes cluster:

$ stackablectl demo install jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data
This demo should not be run alongside other demos.

System requirements

To run this demo, your system needs at least:

  • 8 cpu units (core/hyperthread)

  • 32GiB memory

  • 22GiB disk storage

Aim / Context

This demo uses Stackable operators to deploy a Spark Connect server and an HDFS cluster. The intention is to demonstrate how clients, in this case a JupyterLab notebook, can interact with SDP components. The notebook creates a SparkSession that delegates the data analysis and model training to the Spark Connect service, thus offloading the heavy lifting to the Kubernetes cluster. The notebook can therefore be used as a sandbox for writing, testing and benchmarking Spark jobs before they are moved into production.
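
As a minimal sketch of this pattern (the service address below is an assumption; port 15002 is the Spark Connect default, and the actual notebook may use a different hostname), a remote SparkSession is created like this:

from pyspark.sql import SparkSession

# Create a remote SparkSession against the Spark Connect service.
# The address is an assumption; 15002 is the default Spark Connect port.
spark = (
    SparkSession.builder
    .remote("sc://spark-connect-server:15002")
    .getOrCreate()
)
print(spark.version)

From this point on, DataFrame operations issued in the notebook are executed by the Spark Connect server and its executors, not in the notebook Pod itself.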

Overview

This demo will:

  • Install the required Stackable Data Platform operators.

  • Spin up the following data products:

    • JupyterLab: A web-based interactive development environment for notebooks.

  • Apache HDFS: A distributed file system used to store the taxi dataset.

  • Download a sample of the NY taxi dataset into HDFS.

  • Provision a Jupyter notebook in the JupyterLab interface.

  • Train an anomaly detection model using PySpark on the data available in HDFS.

  • Perform some predictions and visualize anomalies.

HDFS

The Stackable Operator for Apache HDFS will spin up an HDFS cluster to store the taxi dataset in Apache Parquet format. This dataset will be read and processed via PySpark.

Before trying out the notebook example in Jupyter, check if the taxi data was loaded to HDFS successfully:

$ kubectl exec -c namenode -it hdfs-namenode-default-0 -- /bin/bash -c "./bin/hdfs dfs -ls /ny-taxi-data/raw"
Found 1 items
-rw-r--r--   3 stackable supergroup  314689382 2022-11-23 15:01 /ny-taxi-data/raw/fhvhv_tripdata_2020-09.parquet

There should be one parquet file containing taxi trip data from September 2020.
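
In the notebook, this file is later read through the remote Spark session. A minimal sketch, assuming the session created above and that the HDFS cluster is the default filesystem configured via the discovery ConfigMap:

# Read the raw trip data from HDFS; the default filesystem comes from the
# core-site.xml that the discovery ConfigMap provides to the Spark Connect server.
df = spark.read.parquet("/ny-taxi-data/raw/fhvhv_tripdata_2020-09.parquet")
df.printSchema()
print(df.count())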

JupyterLab

Have a look at the available Pods before logging in:

$ kubectl get pods
NAME                                           READY   STATUS      RESTARTS   AGE
hdfs-datanode-default-0                        1/1     Running     0          38m
hdfs-journalnode-default-0                     1/1     Running     0          38m
hdfs-namenode-default-0                        2/2     Running     0          38m
hdfs-namenode-default-1                        2/2     Running     0          36m
jupyterlab-76d67b9bfb-thmtq                    1/1     Running     0          22m
load-test-data-hcj92                           0/1     Completed   0          26m
spark-connect-server-66db874cbb-9nbpf          1/1     Running     0          34m
spark-connect-server-9c6bfd9690213314-exec-1   1/1     Running     0          34m
spark-connect-server-9c6bfd9690213314-exec-2   1/1     Running     0          34m
spark-connect-server-9c6bfd9690213314-exec-3   1/1     Running     0          34m
spark-connect-server-9c6bfd9690213314-exec-4   1/1     Running     0          34m
zookeeper-server-default-0                     1/1     Running     0          38m

In order to reach the JupyterLab web interface, create a port-forward:

$ kubectl port-forward service/jupyterlab 8080:http

The jupyterlab service is created alongside the JupyterLab deployment.

Now access the JupyterLab web interface via http://localhost:8080.

You should see the JupyterLab login page.

jupyterlab login

Log in with token adminadmin. You should arrive at your workspace:

jupyterlab workspace

Now you can double-click on the notebook folder on the left, then open and run the contained notebook file. Click on the double arrow (⏩️) to run all cells (click on the image below to go to the notebook file).

jupyter hub run notebook

You can also inspect the hdfs folder where the core-site.xml and hdfs-site.xml from the discovery ConfigMap of the HDFS cluster are located.
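
A quick way to confirm this from a notebook cell (the relative path is an assumption about the workspace layout; adjust it if the notebook's working directory is not the workspace root):

import os

# List the HDFS client configuration files mounted from the discovery ConfigMap.
# "hdfs" is assumed to be a folder in the current working directory.
print(os.listdir("hdfs"))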

The Python notebook uses libraries such as pandas and scikit-learn to analyze the data. In addition, since the model training is delegated to a Spark Connect server, some of these dependencies, most notably scikit-learn, must also be made available on the Spark Connect pods. For convenience, a custom image is used in this demo that bundles all the required libraries for both the notebook and the Spark Connect server. The source of the image is available here.

In practice, clients of Spark Connect do not need a full-blown Spark installation available locally, but only the libraries that are used in the notebook.

Model details

The job uses an implementation of the Isolation Forest algorithm provided by the scikit-learn library: the model is trained and then invoked by a user-defined function (see this article for how to call the sklearn library with a PySpark UDF), all of which runs on the Spark Connect executors. This type of model attempts to isolate each data point by repeatedly partitioning the data. Data points that are closely packed together require more partitions to separate, whereas outliers require fewer: the number of partitions needed to isolate a particular data point is thus inversely proportional to its anomaly "score".
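
A condensed sketch of this pattern, not the notebook's exact code (the DataFrame df_agg and the feature column names are placeholders):

import pandas as pd
from pyspark.sql.functions import pandas_udf
from sklearn.ensemble import IsolationForest

# Hypothetical aggregated features per time window (names are placeholders).
features = ["norides", "total_bill", "avg_duration"]

# Fit the model on the (small) aggregated data pulled back to the client.
train_pdf = df_agg.select(*features).toPandas()
forest = IsolationForest(contamination=0.005, random_state=42).fit(train_pdf)

@pandas_udf("double")
def score(norides: pd.Series, total_bill: pd.Series, avg_duration: pd.Series) -> pd.Series:
    X = pd.DataFrame({"norides": norides, "total_bill": total_bill, "avg_duration": avg_duration})
    # score_samples returns one value per row; lower values mean "more anomalous".
    return pd.Series(forest.score_samples(X))

# The UDF (including the pickled model) is shipped to and run on the Spark Connect executors.
scored = df_agg.withColumn("anomaly_score", score(*features))

This is why scikit-learn has to be installed on the Spark Connect Pods as well: the UDF deserializes and calls the model there.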

Visualization

The notebook shows how to plot the outliers against a particular metric (e.g. "number of rides"):

jupyter hub graph

However, this is mainly for convenience - the anomaly score is derived from the entire feature space, i.e. it considers all dimensions (features/columns) when scoring data. This means that not only are the results challenging to visualize (how can multidimensional data be represented in just two or three dimensions?), but also that a root cause analysis has to be a separate process. It would be tempting to look at just one metric and assume causal effects, but the model "sees" all features as a set of numerical values and derives patterns accordingly.

We can tackle the first of these issues by collapsing - or projecting - our data into a manageable number of dimensions that can be plotted. Once the script has finished successfully, plots displayed at the bottom of the notebook show the same data in 2D and 3D representations. The 3D plot should look like this:

jupyter hub 3d isolation forest
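
A minimal sketch of how such a projection and plot might be produced, assuming the scored data has been collected back to the client as a pandas DataFrame named scored_pdf with the placeholder feature columns from above:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the full feature space down to three principal components.
coords = PCA(n_components=3).fit_transform(scored_pdf[features])

# Colour each point by its anomaly score so outliers stand out.
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2],
           c=scored_pdf["anomaly_score"], cmap="viridis")
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")
plt.show()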

The model has detected outliers even though they would not have been immediately apparent from the time-series representation alone.