Welcome to Stackable! This documentation gives you an overview of the Stackable Data Platform, explains how to install and manage it, and provides some tutorials.
The Stackable Data Platform allows you to deploy, scale and manage data infrastructure in any environment running Kubernetes.
Goal of the project
We are building a distribution of existing Open Source tools that together comprise the components of a modern data platform.
There are components to ingest, store, process and visualize data, and much more. While the platform started out in the Big Data ecosystem, it is in no way limited to big data workloads.
You can build these environments declaratively. We don’t stop at the tool level: we also provide ways for users to interact with the platform following the "as Code" approach.
We are leveraging the Open Policy Agent to provide Security-as-Code.
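To illustrate what "declarative" means here, the following is a minimal sketch in Python that builds a description of a desired ZooKeeper cluster as plain data and prints it as JSON, which Kubernetes tooling also accepts. The function name `zookeeper_manifest`, the cluster name and the version string are hypothetical, and the resource kind and field layout are assumptions made for illustration. Consult the documentation of the respective operator for the actual schema.

```python
import json


def zookeeper_manifest(name: str, replicas: int, product_version: str) -> dict:
    """Describe the desired state of a ZooKeeper cluster as plain data.

    The apiVersion, kind and spec layout below are assumptions for
    illustration; the real schema is defined by the operator's CRD.
    """
    return {
        "apiVersion": "zookeeper.stackable.tech/v1alpha1",  # assumed
        "kind": "ZookeeperCluster",  # assumed
        "metadata": {"name": name},
        "spec": {
            # Which product version to run (placeholder value).
            "image": {"productVersion": product_version},
            # How many server replicas the cluster should have.
            "servers": {"roleGroups": {"default": {"replicas": replicas}}},
        },
    }


if __name__ == "__main__":
    # Hypothetical name and version, chosen only for this example.
    manifest = zookeeper_manifest("simple-zk", replicas=3, product_version="3.8.3")
    print(json.dumps(manifest, indent=2))
```

The point of the sketch is that you state what you want (a cluster with three replicas) rather than the steps to get there. An operator is then responsible for making the running system match that description.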
We are building a distribution that includes the “best of breed” of existing Open Source tools and bundles them so that it is easy to deploy a fully working software stack. Most of the existing tools are “single purpose” tools, which often do not play nicely together out of the box.
The Stackable platform consists of multiple operators that work together. Periodically, a platform release is made that includes all components of the platform at a specific version. See the release notes for the latest release, 23.11.
We use Kubernetes as our deployment platform, and we are building operators for each of the products we support. The Stackable Data Platform supports the following products:
Apache Airflow is a workflow engine and a replacement for Apache Oozie, should you be using it.
Apache Druid is a real-time database to power modern analytics applications.
HBase is a distributed, scalable, big data store.
HDFS is a distributed file system that provides high-throughput access to application data.
The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. We support the Hive Metastore.
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
Apache NiFi is an easy-to-use, powerful, and reliable system to process and distribute data.
Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
Apache Superset is a modern data exploration and visualization platform.
Trino is a fast distributed SQL query engine for big data analytics that helps you explore your data universe.
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.