In parts 6 and 7 of the pipeline series we took a different path in the pipe/tunnel and explored PostgreSQL and Apache Superset, mainly from a functional perspective: how you can get JSON data into PostgreSQL from Apache Kafka Connect, and what it looks like in Superset. In part 8, we ran some initial load tests and found out how the capacity of the original Elasticsearch pipeline compared with the PostgreSQL variant. The results were surprising, so in true “Myth Busters” style we had another attempt at making them more comparable (part 9). In this blog, we sum up the results, compare the two approaches for functionality, robustness, and performance, and come to some final conclusions.
Note that for these pipeline experiments I used Instaclustr managed Open Distro for Elasticsearch (Apache 2 license), but Instaclustr now offers Managed OpenSearch.
1. Kafka Connect Evaluation
First, some general observations about the use of Kafka Connect for building a streaming data pipeline:
- It enabled heterogeneous source/sink “zero-code” integration of streaming JSON data
- There’s lots of connector configuration required
- Some custom mappings/ingest pipelines were required to get it working for Elasticsearch
- Custom SQL queries were needed for Superset and PostgreSQL
- It provided good scalability and throughput, aided by custom monitoring (using Prometheus and Grafana and the Instaclustr monitoring API)
- It was easy to scale the pipeline capacity by increasing connector tasks and Kafka partitions
- However, you have to keep an eye on sink system scalability including physical resources and configurations (e.g. Elasticsearch shards, PostgreSQL database connections).
- It was the most resource-intensive cluster type, however:
- It’s easy to add more nodes to the Kafka Connect cluster
- Using a managed Kafka Connect service made cluster provisioning and monitoring of connector tasks easier.
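The "scale by increasing connector tasks" point above can be sketched as a connector configuration. This is a hypothetical example (the connector name, topic, and class are placeholders, not the exact settings from the pipeline series), but the scaling knob is the standard `tasks.max` property:

```python
import json

# Hypothetical sink connector config: the name, topic, and connector class
# are placeholders, not the actual settings used in the pipeline series.
connector = {
    "name": "tides-sink",
    "config": {
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics": "tides-topic",
        # Scaling knob: one Kafka partition feeds at most one task, so
        # tasks.max beyond the topic's partition count buys nothing.
        "tasks.max": "3",
    },
}

# A real deployment would POST this JSON to the Kafka Connect REST API, e.g.:
#   curl -X POST -H "Content-Type: application/json" \
#        -d @connector.json http://connect:8083/connectors
print(json.dumps(connector, indent=2))
```

Because tasks map onto partitions, scaling the pipeline means increasing `tasks.max` and the topic partition count together, which is why the two are mentioned in the same breath above.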
Two excellent Kafka (and therefore Kafka Connect) superpowers are:
- Ability to buffer spiky loads:
- Downstream sink systems are not overloaded, including sink tasks, which can otherwise take a long time to recover
- Ability to replay data:
- Data is persisted and is read-many, which is very useful for testing/debugging, and adding new sink systems etc.
However, open source Kafka connectors are of varying quality, functionality (including security, e.g. ability to use TLS certificates and trust stores), robustness, and performance!
You need to devote sufficient effort to understanding your requirements, and to evaluating and testing connectors with both valid and invalid data. Robustness is a function of both the connectors and the sink systems (e.g. the ability of sink systems to detect incorrect JSON format or non-JSON messages, and the ability of the connectors to continue working in response).
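For the "continue working in response" part, Kafka Connect does ship some built-in error handling (from KIP-298) that is worth knowing about when evaluating connectors. A minimal sketch, with a placeholder dead letter queue topic name:

```python
# Sketch of Kafka Connect's built-in error-handling settings (KIP-298) for a
# sink connector. The DLQ topic name is a placeholder. These properties are
# merged into the connector's "config" section.
error_handling = {
    "errors.tolerance": "all",                        # keep the task running on bad records
    "errors.deadletterqueue.topic.name": "tides-dlq", # route failed records here
    "errors.deadletterqueue.context.headers.enable": "true",  # record why it failed
    "errors.log.enable": "true",                      # also log the failures
}
print(error_handling)
```

Note that this built-in handling covers conversion and transformation errors; failures inside the connector's own write logic still depend on the connector itself, which is why connector robustness mattered so much in our tests.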
Some connector code hacking was needed in order to get one connector (the PostgreSQL sink connector) working correctly with complex (structured), schemaless JSON NOAA tidal data. Finally, you need to adequately test connector performance and scalability in conjunction with the configured sink systems.
And now, here’s a final table-based comparison of the alternative pipeline technologies. (Note that this is a high-level summary; please read the previous parts of the pipeline series to understand some of the challenges we encountered and the workarounds we used to come to these conclusions. In particular, the final throughput results for the Elasticsearch pipeline bypassed Kafka Connect, as we couldn’t find an open source sink connector that supported the Elasticsearch Bulk API and had robust error handling.)
2. Elasticsearch vs. PostgreSQL Evaluation
| | Elasticsearch Pipeline | PostgreSQL Pipeline |
|---|---|---|
| Kafka Sink Connectors | Worked, but needs Bulk API | Modified for JSON |
| Robustness | 1 error type handled | 2 error types handled |
| Mappings | Multiple on ingest side | None on ingest side |
| Write Throughput | Comparable (with caveats) | Comparable (with caveats) |
In scope for this comparison are Elasticsearch and PostgreSQL, but also Kafka Connect and the two open source Kafka Connect sink connectors we used in previous blogs (the PostgreSQL sink connector and the Elasticsearch sink connector). The first rows of the table focus on the connectors; the remaining rows focus on the total system (Kafka Connect and target sink clusters).
Even though PostgreSQL didn’t require any mappings on the ingest side (unlike Elasticsearch which needed custom mappings for data types and an ingest pipeline for creating a new geo_point field), it did automatically check that the JSON data was well-formed, and offered the ability to check for constraints (e.g. that specific fields exist).
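The well-formedness and constraint checks mentioned above can be sketched as follows. The table and field names are illustrative (not the actual NOAA schema); the `?` operator is PostgreSQL's JSONB key-existence test, and the local `json.loads` check mimics the well-formedness validation PostgreSQL applies at insert time:

```python
import json

# Illustrative JSONB DDL: the table and field names are hypothetical.
DDL = """
CREATE TABLE tides (
    doc JSONB NOT NULL,
    CONSTRAINT has_station CHECK (doc ? 'station')  -- '?' tests key existence
);
"""

# PostgreSQL rejects malformed JSON at insert time; the same well-formedness
# check can be mimicked locally with json.loads:
def is_well_formed(payload: str) -> bool:
    try:
        json.loads(payload)
        return True
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return False

print(is_well_formed('{"station": "8454000"}'))  # True
print(is_well_formed("not json"))                # False
```

With a constraint like this, a record that parses as JSON but is missing a required field is rejected at the database rather than silently stored, which is one of the robustness advantages noted in the table.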
Scaling connector tasks was both easier and harder for Elasticsearch: scaling the tasks themselves was easy, but a sink connector with Bulk API support is needed for significant scalability (in part 9 we bypassed Kafka Connect to get improved performance results using a different Elasticsearch client). For PostgreSQL, it was harder to understand and optimize the database connections and pooling, but no changes were needed to the connector to scale up to the capacity limit of the PostgreSQL server.
So in essence, both approaches worked for inserting JSON data into the respective sink systems. PostgreSQL was slightly more robust for error handling, but needed code changes to the connector to work at all. The Elasticsearch connector didn’t need code changes, but would need the addition of support for the Elasticsearch Bulk API to be competitive for performance. In theory, Elasticsearch write scalability is “infinite” with horizontal scalability, whereas you may hit a limit with PostgreSQL write scalability at some point.
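To illustrate why Bulk API support matters so much for throughput: the Bulk API batches many index operations into a single request. A minimal sketch, assuming the official `elasticsearch-py` client and a placeholder index name:

```python
# Sketch of batching documents for the Elasticsearch Bulk API. The index
# name and documents are placeholders; the helper shape matches what the
# official elasticsearch-py bulk helper expects.
def as_bulk_actions(records, index="tides"):
    # Each action tells the bulk helper which index the document goes to.
    return [{"_index": index, "_source": record} for record in records]

actions = as_bulk_actions([{"station": "A"}, {"station": "B"}])
print(len(actions))  # 2

# Against a live cluster, this single call replaces many individual
# per-document index requests:
#   from elasticsearch import Elasticsearch, helpers
#   helpers.bulk(Elasticsearch("http://localhost:9200"), actions)
```

One bulk request amortizes the per-request HTTP and coordination overhead across the whole batch, which is why a connector without Bulk API support hits a throughput ceiling long before the cluster does.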
3. Kibana vs. Superset Evaluation
| | Kibana | Superset |
|---|---|---|
| Ease of Deployment | Trivial (managed service) | Possible (Docker) |
| Customizations | None on Kibana side | Custom JSONB SQL queries |
| Charts | Good | More, including geospatial |
| Database Support | Only Elasticsearch | Many SQL (and Elasticsearch) |
Because Kibana is available on the Instaclustr SaaS Platform (along with Elasticsearch), it was trivial to use. Superset was harder to deploy, but I eventually got it running locally with Docker. Kibana “just works” with Elasticsearch and the JSON NOAA tidal data (as long as the custom mappings and ingest pipeline are used), but PostgreSQL required some custom SQL queries over the JSONB data type to produce virtual datasets, which could then be used by the no-code Superset chart builder GUI.
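A sketch of the kind of virtual-dataset query involved, using PostgreSQL's JSONB operators (`->` descends into the document, `->>` extracts text). The field names here are guesses at the NOAA tidal data shape, not the exact queries from the earlier posts:

```python
# Illustrative Superset virtual-dataset query: extracts typed columns from a
# JSONB column so the no-code chart builder can use them. Table and field
# names are hypothetical, not the actual NOAA schema.
VIRTUAL_DATASET_SQL = """
SELECT
    (doc -> 'metadata' ->> 'name')       AS station_name,
    (doc -> 'metadata' ->> 'lat')::float AS lat,
    (doc -> 'metadata' ->> 'lon')::float AS lon,
    (doc -> 'data' -> 0 ->> 'v')::float  AS water_level
FROM tides;
"""
print(VIRTUAL_DATASET_SQL)
```

Once saved as a virtual dataset in Superset, columns like `lat`, `lon`, and `water_level` behave like ordinary table columns for time-series and map charts.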
I was impressed with the range of Superset chart types, particularly the powerful geospatial charts (although they were a bit tricky to select and use in ways that I wanted). However, I could graph the NOAA tidal data on both time-series and map/location charts using both Kibana and Superset.
Kibana is dedicated to Elasticsearch, whereas Superset can connect to many different SQL databases, and potentially Elasticsearch now as well.
So that’s the end of the pipeline (series) for the time being at least—pipes are frequently extended or need plumbing work done on them! It’s now time to enjoy the bounty of whatever comes out of the end of the pipe (often water)!
A talk based on the complete pipeline series was presented in the IoT track at ApacheCon 2021.