Stackable Operator for Apache Hive

This is an operator for Kubernetes that can manage Apache Hive metastores. The Apache Hive metastore (HMS) was originally developed as part of Apache Hive. It stores information on the location of tables and partitions in file and blob storage such as Apache HDFS and S3, and is now also used by tools other than Hive to access tables stored in files. This operator does not support deploying Hive itself; Trino is recommended as an alternative query engine.

Getting started

Follow the Getting started guide, which walks you through the installation of the Stackable Hive operator and its dependencies. It shows you how to set up a Hive metastore and connect it to a demo PostgreSQL database and a MinIO instance to store data in.

Afterwards, you can consult the Usage guide to learn more about tailoring your Hive metastore configuration to your needs, or have a look at the demos for some example setups with either Trino or Spark.

Operator model

The operator manages the HiveCluster custom resource. The cluster implements a single metastore role.

A diagram depicting the Kubernetes resources created by the Stackable operator for Apache Hive

For every role group, the operator creates a ConfigMap and a StatefulSet; the StatefulSet can have multiple replicas (Pods). Every role group is accessible through its own Service, and there is a Service for the whole cluster.

The operator creates a service discovery ConfigMap for the Hive metastore instance. The discovery ConfigMap contains information on how to connect to the HMS.
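
As an illustration of how a client might consume the discovery information, here is a minimal Python sketch. It assumes that the discovery ConfigMap exposes the connection string as a thrift:// URI under a key named HIVE; both the key name and the example host name below are assumptions for illustration, so check the discovery ConfigMap in your own cluster for the actual contents.

```python
from typing import Tuple
from urllib.parse import urlsplit


def parse_metastore_uri(uri: str) -> Tuple[str, int]:
    """Split a thrift:// metastore URI into a (host, port) pair."""
    parts = urlsplit(uri)
    if parts.scheme != "thrift":
        raise ValueError(f"expected a thrift:// URI, got {uri!r}")
    if parts.hostname is None or parts.port is None:
        raise ValueError(f"URI must contain a host and a port: {uri!r}")
    return parts.hostname, parts.port


# Hypothetical ConfigMap data; the key "HIVE" and the host name are assumptions.
discovery_data = {"HIVE": "thrift://example-hive-metastore:9083"}

host, port = parse_metastore_uri(discovery_data["HIVE"])
print(f"metastore host={host} port={port}")
```

A client such as Trino or Spark would typically receive this URI through its own configuration rather than parsing it by hand; the sketch only shows what kind of information the ConfigMap carries.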

Dependencies

The Stackable operator for Apache Hive depends on the Stackable commons, secret and listener operators.

Required external component: An SQL database

The Hive metastore requires an SQL database to store metadata. Consult the required external components page for an overview of the supported databases and minimum supported versions.

Demos

Three demos make use of the Hive metastore.

The spark-k8s-anomaly-detection-taxi-data and trino-taxi-data demos use the HMS to store metadata about taxi data. The first demo then analyzes the data using Apache Spark, the second using Trino.

The data-lakehouse-iceberg-trino-spark demo is the biggest demo available. It uses both Spark and Trino for analysis.

Why is the Hive query engine not supported?

Only the metastore is supported, not Hive itself. There are several reasons why running Hive on Kubernetes may not be an optimal solution. The most obvious one is that Hive requires YARN as an execution framework, and YARN assumes much of the same role as Kubernetes, namely assigning resources. For this reason, the Stackable Data Platform provides Trino as a query engine instead of Hive. Trino still uses the Hive metastore, which is why this operator is included. Trino should offer all the capabilities Hive offers, plus a lot of additional functionality such as connections to other data sources.

Additionally, tables in the HMS can be accessed from Apache Spark.

Supported versions

The Stackable operator for Apache Hive currently supports the Hive versions listed below. To use a specific Hive version in your HiveCluster, specify an image as explained in the Product image selection documentation. The operator also supports running images from a custom registry or running entirely customized images; both cases are covered under Product image selection as well.

4.0.1 (experimental) - Spark Iceberg jobs may fail as described here
4.0.0 (LTS)
3.1.3 (deprecated)
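
To make the version selection more concrete, the following Python sketch builds a partial HiveCluster manifest that pins one of the versions above. It assumes the field layout spec.image.productVersion and the API group hive.stackable.tech/v1alpha1; treat both as assumptions and verify them against the HiveCluster CRD documentation and the Product image selection documentation. The cluster name and role group name are hypothetical, and required settings such as the database connection are deliberately left out, so the output is not a complete resource.

```python
import json


def hive_cluster_manifest(name: str, product_version: str) -> dict:
    """Build a partial HiveCluster manifest that pins the Hive version."""
    return {
        "apiVersion": "hive.stackable.tech/v1alpha1",  # assumed API group/version
        "kind": "HiveCluster",
        "metadata": {"name": name},
        "spec": {
            # Assumed field layout for selecting the product version.
            "image": {"productVersion": product_version},
            # Database and other cluster settings are omitted in this sketch.
            "metastore": {"roleGroups": {"default": {"replicas": 1}}},
        },
    }


# "example-hive" is a hypothetical cluster name.
manifest = hive_cluster_manifest("example-hive", "4.0.0")
print(json.dumps(manifest, indent=2))
```

Because Kubernetes manifests can be written as JSON as well as YAML, the printed structure has the same shape as the resource definition you would complete and apply.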

Useful links

The hive-operator GitHub repository
The operator feature overview in the feature tracker
The HiveCluster CRD documentation