nifi-kafka-druid-earthquake-data

Install this demo on an existing Kubernetes cluster:

$ stackablectl demo install nifi-kafka-druid-earthquake-data

This demo only runs in the default namespace, because it creates a ServiceAccount. In addition, service names must be given as FQDNs (including the namespace) so that the TLS certificates in use are valid.

System requirements

To run this demo, your system needs at least:

9 CPU units (cores/hyperthreads)
42GiB memory (minimum of 16GiB per node)
75GiB disk storage

Overview

This demo will:

Install the required Stackable operators.
Spin up the following data products:
- Superset: A modern data exploration and visualization platform. This demo utilizes Superset to retrieve data from Druid via SQL queries and build dashboards on top of that data.
- Kafka: A distributed event streaming platform for high-performance data pipelines, streaming analytics and data integration. This demo uses it as an event streaming platform to stream the data in near real-time.
- NiFi: An easy-to-use, robust system to process and distribute data. This demo uses it to fetch earthquake data from the internet and ingest it into Kafka.
- Druid: A real-time database to power modern analytics applications. This demo uses it to ingest the near real-time data from Kafka, store it and enable access to the data via SQL.
- MinIO: An S3-compatible object store. This demo uses it as persistent storage for Druid to store all the data.
Continuously emit approximately 10,000 records/s of earthquake data into Kafka.
Start a Druid ingestion job that ingests the data into the Druid instance.
Create Superset dashboards for visualization of the data.

The whole data pipeline will have a very low latency, from putting a record into Kafka to showing up in the dashboard charts. You can see the deployed products and their relationship in the following diagram:

List the deployed Stackable services

To list the installed Stackable services run the following command:

$ stackablectl stacklet list

┌───────────┬───────────────┬───────────┬─────────────────────────────────────────────────────────────────────────────────────────────┬─────────────────────────────────┐
│ PRODUCT   ┆ NAME          ┆ NAMESPACE ┆ ENDPOINTS                                                                                   ┆ CONDITIONS                      │
╞═══════════╪═══════════════╪═══════════╪═════════════════════════════════════════════════════════════════════════════════════════════╪═════════════════════════════════╡
│ druid     ┆ druid         ┆ default   ┆ broker-https                                https://172.19.0.3:32293                        ┆ Available, Reconciling, Running │
│           ┆               ┆           ┆ coordinator-https                           https://172.19.0.4:31283                        ┆                                 │
│           ┆               ┆           ┆ router-https                                https://172.19.0.3:32286                        ┆                                 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ kafka     ┆ kafka         ┆ default   ┆ broker-default-0-listener-broker-kafka-tls  172.19.0.4:32321                                ┆ Available, Reconciling, Running │
│           ┆               ┆           ┆ broker-default-0-listener-broker-metrics    172.19.0.4:30556                                ┆                                 │
│           ┆               ┆           ┆ broker-default-bootstrap-kafka-tls          172.19.0.4:31352                                ┆                                 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ nifi      ┆ nifi          ┆ default   ┆ node-https                                  https://172.19.0.2:32348                        ┆ Available, Reconciling, Running │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ superset  ┆ superset      ┆ default   ┆ node-http                                   http://172.19.0.4:30769                         ┆ Available, Reconciling, Running │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ zookeeper ┆ zookeeper     ┆ default   ┆ server-zk                                   zookeeper-server.default.svc.cluster.local:2282 ┆ Available, Reconciling, Running │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ minio     ┆ minio-console ┆ default   ┆ http                                        http://172.19.0.3:32007                         ┆                                 │
└───────────┴───────────────┴───────────┴─────────────────────────────────────────────────────────────────────────────────────────────┴─────────────────────────────────┘

When a product instance has not finished starting yet, the service will have no endpoint. Depending on your internet connectivity, creating all the product instances might take considerable time. A warning might be shown if the product is not ready yet.

Inspect the data in Kafka

Kafka is an event streaming platform, used here to stream the data in near real-time. All messages written to and read from Kafka are organized in dedicated queues called topics. The test data is written to a topic called earthquakes. The records are produced (written) by the test data generator and then consumed (read) by Druid, in the same order in which they were written to each partition.

Kafka uses mutual TLS, so clients that want to connect to Kafka must present a valid TLS certificate. The easiest way to obtain one is to shell into the kafka-broker-default-0 Pod, as we will do in the following section for demonstration purposes. For a production setup, you should instead spin up a dedicated Pod that is provisioned with a certificate and acts as a Kafka client, rather than shelling into the Kafka Pod.

List the available Topics

You can execute a command on the Kafka broker to list the available topics as follows:

$ kubectl exec kafka-broker-default-0 -c kafka -- \
/stackable/kafka/bin/kafka-topics.sh \
--describe \
--bootstrap-server kafka-broker-default-headless.default.svc.cluster.local:9093 \
--command-config /stackable/config/client.properties
...
Topic: earthquakes	TopicId: ND51v_XcQPK4Ilm7A35Pag	PartitionCount: 8	ReplicationFactor: 1	Configs: min.insync.replicas=1,segment.bytes=100000000,retention.bytes=900000000
	Topic: earthquakes	Partition: 0	Leader: 1243966388	Replicas: 1243966388	Isr: 1243966388	Elr: 	LastKnownElr:
	Topic: earthquakes	Partition: 1	Leader: 1243966388	Replicas: 1243966388	Isr: 1243966388	Elr: 	LastKnownElr:
	Topic: earthquakes	Partition: 2	Leader: 1243966388	Replicas: 1243966388	Isr: 1243966388	Elr: 	LastKnownElr:
	Topic: earthquakes	Partition: 3	Leader: 1243966388	Replicas: 1243966388	Isr: 1243966388	Elr: 	LastKnownElr:
	Topic: earthquakes	Partition: 4	Leader: 1243966388	Replicas: 1243966388	Isr: 1243966388	Elr: 	LastKnownElr:
	Topic: earthquakes	Partition: 5	Leader: 1243966388	Replicas: 1243966388	Isr: 1243966388	Elr: 	LastKnownElr:
	Topic: earthquakes	Partition: 6	Leader: 1243966388	Replicas: 1243966388	Isr: 1243966388	Elr: 	LastKnownElr:
	Topic: earthquakes	Partition: 7	Leader: 1243966388	Replicas: 1243966388	Isr: 1243966388	Elr: 	LastKnownElr:

You can see that Kafka consists of one broker, and the topic earthquakes with eight partitions has been created. To see some records sent to Kafka, run the following command. You can change the number of records to print via the --max-messages parameter.

$ kubectl exec kafka-broker-default-0 -c kafka -- \
/stackable/kafka/bin/kafka-console-consumer.sh \
--bootstrap-server kafka-broker-default-headless.default.svc.cluster.local:9093 \
--consumer.config /stackable/config/client.properties \
--topic earthquakes \
--offset earliest \
--partition 0 \
--max-messages 1

Below is an example of the output of one record:

{
   "time":"1950-02-07T10:37:29.240Z",
   "latitude":45.949,
   "longitude":151.59,
   "depth":35.0,
   "mag":5.94,
   "magType":"mw",
   "nst":null,
   "gap":null,
   "dmin":null,
   "rms":null,
   "net":"iscgem",
   "id":"iscgem895202",
   "updated":"2022-04-26T18:23:38.377Z",
   "place":"Kuril Islands",
   "type":"earthquake",
   "horizontalError":null,
   "depthError":12.6,
   "magError":0.55,
   "magNst":null,
   "status":"reviewed",
   "locationSource":"iscgem",
   "magSource":"iscgem"
}
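
To work with such a record programmatically, any JSON library will do. The following is a minimal Python sketch, assuming the record has been copied from the consumer output into a string. The variable names and the shortened field selection are illustrative and not part of the demo.

```python
import json
from datetime import datetime

# One record as printed by kafka-console-consumer.sh (shortened to a few fields).
raw_record = """
{
  "time": "1950-02-07T10:37:29.240Z",
  "latitude": 45.949,
  "longitude": 151.59,
  "depth": 35.0,
  "mag": 5.94,
  "nst": null,
  "place": "Kuril Islands",
  "type": "earthquake"
}
"""

record = json.loads(raw_record)

# The timestamp is ISO 8601 in UTC. Replacing the trailing "Z" keeps this
# compatible with Python versions whose fromisoformat() does not accept "Z".
event_time = datetime.fromisoformat(record["time"].replace("Z", "+00:00"))

# JSON null values arrive as None, so optional fields need an explicit check.
missing_fields = sorted(key for key, value in record.items() if value is None)

print(event_time.year, record["place"], record["mag"], missing_fields)
```

Many of the measurement fields (such as nst or gap in the full record) can be null, so downstream code should not assume they are always numeric.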

If you are interested in how many records have been produced to the Kafka topic so far, use the following command.

$ kubectl exec kafka-broker-default-0 -c kafka -- \
/stackable/kafka/bin/kafka-get-offsets.sh \
--bootstrap-server kafka-broker-default-headless.default.svc.cluster.local:9093 \
--command-config /stackable/config/client.properties \
--topic earthquakes
...
earthquakes:0:757379
earthquakes:1:759282
earthquakes:2:761924
earthquakes:3:761339
earthquakes:4:759059
earthquakes:5:767695
earthquakes:6:771457
earthquakes:7:768301

With roughly 765,000 records in each of the 8 partitions, the topic holds about 6.1 million records in total.
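
The exact total can be computed by summing the per-partition offsets. The sketch below is one way to do that in Python, assuming the output of kafka-get-offsets.sh has been pasted into a string and that no records have been removed by retention yet (otherwise the end offsets overstate the number of records still stored). The function name total_records is made up for this example.

```python
# Output of kafka-get-offsets.sh: one "topic:partition:offset" entry per line.
offsets_output = """\
earthquakes:0:757379
earthquakes:1:759282
earthquakes:2:761924
earthquakes:3:761339
earthquakes:4:759059
earthquakes:5:767695
earthquakes:6:771457
earthquakes:7:768301
"""


def total_records(output: str) -> int:
    """Sum the end offsets of all partitions listed in the output."""
    total = 0
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split from the right: the last field is the offset.
        _topic, _partition, offset = line.rsplit(":", 2)
        total += int(offset)
    return total


print(total_records(offsets_output))  # 6106436
```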

NiFi

NiFi is used to fetch earthquake data from the internet and ingest it into Kafka. This demo includes a workflow ("process group") that downloads a large CSV file, converts it to individual JSON records and produces the records into Kafka.

View the testdata-generation Job

You can have a look at the ingestion job running in NiFi by opening the endpoint node-https from your stackablectl stacklet list command output. In this case, it is https://172.19.0.2:32348. Open it with your favourite browser. If you get a warning regarding the self-signed certificate generated by the Secret Operator (e.g. Warning: Potential Security Risk Ahead), you must tell your browser to trust the website and continue.

You can see the started ProcessGroup, which consists of three processors. The first one, InvokeHTTP, fetches the CSV file from the internet and puts it into the queue of the next processor. The second one, SplitRecord, takes the single FlowFile (NiFi record) that contains all CSV records and splits it into chunks of 2000 records, which are put into the queue of the next processor separately. The third one, PublishKafka, parses each CSV chunk, converts it to JSON records and writes them to Kafka.
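
The split-and-convert logic of the last two processors can be pictured with a few lines of Python. This is only a conceptual sketch, not how NiFi implements it: the function names, the miniature CSV input and the chunk size of 2 are assumptions chosen to keep the example short.

```python
import csv
import io
import json
from itertools import islice
from typing import Dict, Iterator, List


def split_records(csv_text: str, chunk_size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield the rows of a CSV document in chunks of at most chunk_size records."""
    reader = csv.DictReader(io.StringIO(csv_text))
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def to_json_messages(chunk: List[Dict[str, str]]) -> List[str]:
    """Convert every row of a chunk into one JSON document (one message each)."""
    return [json.dumps(row) for row in chunk]


# Hypothetical miniature input: a header line plus five records.
csv_text = "time,latitude,longitude,mag\n" + "\n".join(
    f"1950-01-0{day}T00:00:00.000Z,45.9,151.5,5.{day}" for day in range(1, 6)
)

# The demo splits into chunks of 2000 records; 2 keeps the example readable.
chunks = list(split_records(csv_text, chunk_size=2))
chunk_sizes = [len(chunk) for chunk in chunks]
messages = [message for chunk in chunks for message in to_json_messages(chunk)]

print(chunk_sizes)
print(messages[0])
```

Note that csv.DictReader keeps every value as a string, whereas the records shown earlier contain numbers and nulls, so a real pipeline also needs a schema or type conversion step.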

Double-click on the InvokeHTTP processor to show the processor details.

Head over to the Properties tab.

Here, you can see the setting HTTP URL, which specifies the download URL from where the CSV file is retrieved. Close the processor details popup by clicking Close. Afterwards, double-click on the processor PublishKafka.

Here, the Kafka connection service, which contains the connection details, and the topic name are specified. The processor uses the CSVReader to parse the downloaded CSV and the JsonRecordSetWriter to split it into individual JSON records before writing them out.

Druid

Druid is used to ingest the near real-time data from Kafka, store it and enable SQL access. The demo has started an ingestion job reading earthquake records from the Kafka topic earthquakes and saving them into Druid’s deep storage. The Druid deep storage is based on the S3 store provided by MinIO.

View the Ingestion job

You can have a look at the ingestion job running in Druid by opening the endpoint router-https from your stackablectl stacklet list command output (https://172.19.0.3:32286 in this case).

By clicking on Supervisors at the top, you can see the running ingestion jobs.

You can see additional information after clicking on the magnifying glass to the right of the RUNNING supervisor. On the tab Task stats on the left, you can see the number of processed records as well as the number of errors.

The statistics show that Druid ingested 13279 records per second within the last minute and has ingested around 600,000 records already. All entries have been consumed successfully, indicated by having no processWithError, thrownAway or unparseable records in the output of the View raw button at the top right.

Query the Data Source

The ingestion job has automatically created the Druid data source earthquakes. You can see the available data sources by clicking on Datasources at the top.

You can see the data source’s segments by clicking on segments under Availability for the earthquakes data source. In this case, the earthquakes data source is partitioned by the year of the earthquakes, resulting in 73 segments.

Druid offers a web-based way of querying the data sources via SQL. To achieve this, you must first click on Query at the top.

You can now enter any SQL statement. For example, to list 10 earthquakes, run:

select * from earthquakes limit 10

To count the number of earthquakes per year, run:

select
  time_format(__time, 'YYYY') as "year",
  count(*) as earthquakes
from earthquakes
group by 1
order by 1 desc
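
The same query can also be submitted without the web console, because Druid exposes an HTTP SQL API at the path /druid/v2/sql. The following Python sketch only builds the request and does not send it. The router URL is taken from the example output above and will differ in your cluster, and the helper name build_sql_request is made up for this example.

```python
import json
import urllib.request

# Assumption: router endpoint as shown in the stacklet list output above.
DRUID_ROUTER_URL = "https://172.19.0.3:32286"

QUERY = """
select
  time_format(__time, 'YYYY') as "year",
  count(*) as earthquakes
from earthquakes
group by 1
order by 1 desc
"""


def build_sql_request(base_url: str, query: str) -> urllib.request.Request:
    """Build a POST request for Druid's SQL endpoint without sending it."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/druid/v2/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_sql_request(DRUID_ROUTER_URL, QUERY)
print(request.get_method(), request.full_url)
```

Actually sending the request (for example with urllib.request.urlopen) additionally requires trusting the self-signed certificate and, depending on the cluster configuration, supplying credentials. Both are left out here.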

Superset

Superset provides the ability to execute SQL queries and build dashboards. Open the endpoint node-http in your browser (http://172.19.0.4:30769 in this case).

View the dashboard

The demo has created a Dashboard to visualize the earthquake data. To open it, click on the tab Dashboards at the top.

Click on the dashboard called Earthquakes. It might take some time until the dashboard renders all included charts.

View the charts

The dashboard Earthquakes consists of multiple charts. To list the charts, click on the tab Charts at the top.

Click on the Chart Number of earthquakes by magnitude. On the left side you can modify the chart and click on Update Chart to see the effect.

View the Earthquake Distribution on the World Map

To look at the geographical distribution of the earthquakes you have to click on the tab Charts at the top again. Afterwards click on the chart Earthquake distribution.

The distribution of the earthquakes matches the continental plate margins. This is the expected distribution from the Wikipedia article on Earthquakes.

Execute arbitrary SQL statements

Within Superset, you can not only create dashboards but also run arbitrary SQL statements. At the top, click on the tab SQL → SQL Lab.

On the left, select the database druid and the schema druid, and set See table schema to earthquakes.

In the text box on the right, enter the desired SQL statement. If you do not want to make one up, you can use the following:

select
  time_format(__time, 'YYYY') as "year",
  count(*) as earthquakes
from earthquakes
group by 1
order by 1 desc

MinIO

The S3 storage provided by MinIO serves as persistent deep storage for Druid, which stores all of its data there. Open the http endpoint of minio-console in your browser (http://172.19.0.3:32007 in this case).

Click on the bucket demo and open the folders data → earthquakes.

As you can see, Druid saved 201.5 MiB of data within 73 prefixes (folders). One prefix corresponds to one segment, which in turn contains all the data of one year. If you don’t see any folders or files, Druid has not yet flushed its data from memory to the deep storage. After roughly an hour, the data should have been written to S3 and show up.

If you open the prefix for a specific year, you can see that Druid has placed a file containing the data of that year there.

Summary

The demo streamed 10,000 earthquake records/s for a total of ~3 million earthquakes into a Kafka streaming pipeline. Druid ingested the data in near real-time into its data source and enabled SQL access to it. Superset was used as a web-based frontend to execute SQL statements and build dashboards.

Where to go from here

There are multiple paths to go from here. The following sections give you some ideas on what to explore next. You can find the description of the earthquake data on the United States Geological Survey website.

Execute arbitrary SQL statements

Within Superset (or the Druid web interface), you can execute arbitrary SQL statements to explore the earthquake data.

Create additional dashboards

You can also create additional charts and bundle them together in a dashboard. Have a look at the Superset documentation on how to do that.

Load additional data

You can use the NiFi web interface to collect arbitrary data and write it to Kafka (it’s recommended to use new Kafka topics for that). Alternatively, you can use a Kafka client like kcat to create new topics and ingest data. Using the Druid web interface, you can start an ingestion job that consumes and stores the data in an internal data source. There is an excellent tutorial from Druid on how to do this. Afterwards, the data source can be analyzed within Druid and Superset, like the earthquake data.