Comparison of Kafka Connect Pipelines Series

In parts 6 and 7 of the pipeline series we took a different path in the pipe/tunnel and explored PostgreSQL and Apache Superset, mainly from a functional perspective—how can you get JSON data into PostgreSQL from Apache Kafka Connect, and what does it look like in Superset. In part 8, we ran some initial load tests and found out how the capacity of the original Elasticsearch pipeline compared with the PostgreSQL variant. These results were surprising, so in true “Myth Busters” style we had another attempt to make them more comparable (Part 9). In this blog, we sum up the results, and compare the two approaches for functionality, robustness, and performance, and come to some final conclusions.

Note that for these pipeline experiments I used Instaclustr managed Open Distro for Elasticsearch (Apache 2 license), but Instaclustr now offers Managed OpenSearch.

(Source: Shutterstock)

1. Kafka Connect Evaluation

First, some general observations about the use of Kafka Connect for building a streaming data pipeline:

It enabled heterogeneous source/sink “zero-code” integration of streaming JSON data
- There’s lots of connector configuration required
- Some custom mappings/ingest pipelines were required to get it working for Elasticsearch
- Custom SQL queries were needed for Superset and PostgreSQL
It provided good scalability and throughput, aided by custom monitoring (using Prometheus and Grafana and the Instaclustr monitoring API)
- It was easy to scale the pipeline capacity by increasing connector tasks and Kafka partitions
- However, you have to keep an eye on sink system scalability including physical resources and configurations (e.g. Elasticsearch shards, PostgreSQL database connections).
- It was the most resource-intensive cluster type, however:
  - It’s easy to add more nodes to the Kafka Connect cluster
  - Using a managed Kafka Connect service made cluster provisioning and monitoring of connector tasks easier.

Two excellent Kafka (and therefore Kafka Connect) superpowers are:

Ability to buffer spiky loads:
- Downstream sink systems are not overloaded, including sink tasks which can take a long time to recover
Ability to replay data:
- Data is persisted and is read-many, which is very useful for testing/debugging, and adding new sink systems etc.

However, open source Kafka connectors are of varying quality, functionality (including security, e.g. ability to use TLS certificates and trust stores), robustness, and performance!

You need to devote sufficient effort to understand your requirements, and evaluate and test connectors, with both valid and invalid data. Robustness is a function of connectors and sink systems (e.g. the ability of sink systems to detect incorrect JSON format or non-JSON messages, and the ability of the connectors to continue working in response).

Some connector code hacking was needed in order to get one connector (the PostgreSQL sink connector) working correctly with complex (structured), schemaless JSON NOAA tidal data. Finally, you need to adequately test connector performance and scalability in conjunction with the configured sink systems.

And now, here’s a final table-based comparison of the alternative pipeline technologies (Note that this is a high level summary, please read the previous pipeline series parts to understand some of the challenges we encountered and workarounds we used to come to these conclusions. In particular, the final throughput results for the Elasticsearch pipeline bypassed Kafka Connect as we couldn’t find an open source sink connector that supported the Elasticsearch BULK API andhad robust error handling).

2. Elasticsearch vs. PostgreSQL Evaluation

Elasticsearch vs. PostgreSQL

Feature	Elasticsearch	PostgreSQL
Kafka Sink Connectors	<Worked, but need Bulk API	Modified for JSON
Robustness	1 error type handled	2 error types handled
Mappings	Multiple on ingest side	None on ingest side
Resources/Cost	Comparable	Comparable
Write Throughput	Comparable (with caveats)	Comparable (with caveats)
Horizontal Scalability	Excellent	Untested

In scope for this comparison are Elasticseach and PostgreSQL, but also Kafka Connect and the two open source Kafka Connect Sink Connectors we used in previous blogs (the PostgreSQL sink connector, and the Elasticsearch sink connector). The first 3 rows of the table focus on the connectors, the last 3 rows focus on the total system (Kafka Connect and target sink clusters).

Even though PostgreSQL didn’t require any mappings on the ingest side (unlike Elasticsearch which needed custom mappings for data types and an ingest pipeline for creating a new geo_point field), it did automatically check that the JSON data was well-formed, and offered the ability to check for constraints (e.g. that specific fields exist).

The ease of scaling of connector tasks was both easier and harder for Elasticsearch (i.e. scaling tasks was easy, but a sink connector with support for Bulk API is needed for significant scalability – in part 9 we bypassed Kafka connect to get improved performance results using a different Elasticsearch client). For PostgreSQL, it was harder to understand and optimize the database connections and pooling, but no changes were needed to the connector to scale up the capacity limit of the PostreSQL server.

So in essence, both approaches worked for inserting JSON data into the respective sink systems. PostgreSQL was slightly more robust for error handling, but needed code changes to the connector to work at all. The Elasticsearch connector didn’t need code changes, but would need the addition of support for the Elasticsearch Bulk API to be competitive for performance. In theory, Elasticsearch write scalability is “infinite” with horizontal scalability, whereas you may hit a limit with PostgreSQL write scalability at some point.

3. Kibana vs. Superset Evaluation

Kibana vs Apache Superset

Feature	Kibana	Superset
Ease of Deployment	Trivial (managed service)	Possible (docker)
Customizations	None on Kibana side	Custom JSONB SQL queries
Charts	Good	More, including geospatial
Database Support	Only Elasticsearch	Many SQL (and Elasticsearch)

Because Kibana is available on the Instaclustr SaaS Platform (along with Elasticsearch), it was trivial to use it. Superset was harder to deploy, but I eventually got it running locally with Docker. Kibana “just works” with Elasticsearch and the JSON NOAA tidal data (as long as custom mappings and the pipeline are used), but PostgreSQL required some custom SQL queries for the JSONB data type to produce virtual data sets which could then be used the no-code Superset chart builder GUI.

I was impressed with the range of Superset chart types, particularly the powerful geospatial charts (although they were a bit tricky to select and use in ways that I wanted). However, I could graph the NOAA tidal data on both time-series and map/location charts using both Kibana and Superset.

Kibana is dedicated to Elasticsearch, whereas Superset can connect to many different SQL databases, and potentially Elasticsearch now as well.

So that’s the end of the pipeline (series) for the time being at least—pipes are frequently extended or need plumbing work done on them! It’s now time to enjoy the bounty of whatever comes out of the end of the pipe (often water)!

My dog hates baths but loves water out of a “hosepipe” (Source: Paul Brebner)

Further Resources

A talk based on the complete pipeline series was presented in the IoT track at ApacheCon 2021.

Reliable and Scalable Apache Kafka® Integrations With Apache Kafka Connect

Free White Paper

Follow the Pipeline Series

Part 1: Building a Real-Time Tide Data Processing Pipeline: Using Apache Kafka, Kafka Connect, Elasticsearch, and Kibana
Part 2: Building a Real-Time Tide Data Processing Pipeline: Using Apache Kafka, Kafka Connect, Elasticsearch, and Kibana
Part 3: Getting to Know Apache Camel Kafka Connectors
Part 4: Monitoring Kafka Connect Pipeline Metrics with Prometheus
Part 5: Scaling Kafka Connect Streaming Data Processing
Part 6: Streaming JSON Data Into PostgreSQL Using Open Source Kafka Sink Connectors
Part 7: Using Apache Superset to Visualize PostgreSQL JSON Data
Part 8: Kafka Connect Elasticsearch vs. PostgreSQL Pipelines: Initial Performance Results
Part 9: Kafka Connect and Elasticsearch vs. PostgreSQL Pipelines: Final Performance Results
Part 10: Comparison of Apache Kafka Connect, Plus Elasticsearch/Kibana vs. PostgreSQL/Apache Superset Pipelines

Comparison of Apache Kafka Connect, Plus Elasticsearch™/Kibana™ vs. PostgreSQL®/Apache Superset Pipelines: Conclusions (Pipeline Series Part 10)

1. Kafka Connect Evaluation

2. Elasticsearch vs. PostgreSQL Evaluation

3. Kibana vs. Superset Evaluation

Further Resources

Follow the Pipeline Series

About the author

Comparison of Apache Kafka Connect, Plus Elasticsearch™/Kibana™ vs. PostgreSQL®/Apache Superset Pipelines: Conclusions (Pipeline Series Part 10)

1. Kafka Connect Evaluation

2. Elasticsearch vs. PostgreSQL Evaluation

3. Kibana vs. Superset Evaluation

Further Resources

Follow the Pipeline Series

About the author

Get the latest articles for open sourceIn your inbox

Sign upto ourNewsletter

Get the latest articles for open source
In your inbox

Sign up
to our
Newsletter