Where we left off

In part 1, we deployed a managed Apache Kafka cluster with Terraform. That gave us a reliable message broker, but a broker sitting alone isn’t a pipeline—it’s just infrastructure waiting for a purpose.

Today, we add the pieces that turn Kafka into a real-time analytics system:

  • ClickHouse®: A columnar database that can query hundreds of millions of rows in milliseconds
  • Kafka® Connect: A distributed framework that streams data between systems without writing code

By the end, you’ll have three clusters talking to each other, all deployed from a single Terraform configuration.

How to build a streaming analytics pipeline with Terraform and Instaclustr—Part 2: Designing the complete data pipeline screenshot

Why this architecture?

Before we write code, let’s understand why this specific combination matters.

The problem

Imagine you’re processing user events—clicks, purchases, and page views. Kafka handles the ingestion beautifully: millions of events per second, no problem. But then someone asks, “How many users from California purchased something in the last hour?”

Kafka excels at moving data, not answering questions about it. It’s a streaming log optimized for throughput and durability, not ad hoc queries. To answer a question like “what happened last Tuesday?”, you need to land that data in a system built for analysis. That’s where ClickHouse comes in.
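Once the events land in ClickHouse, the “California purchases” question becomes a single query. This is a sketch assuming a hypothetical events table; the table and column names (user_id, event_type, state, event_time) are illustrative:

```sql
-- Hypothetical events table; names are illustrative, not from the pipeline config
SELECT count(DISTINCT user_id) AS purchasers
FROM events
WHERE event_type = 'purchase'
  AND state = 'California'
  AND event_time >= now() - INTERVAL 1 HOUR;
```

A row store would read every column of every event to answer this; a column store reads only the four columns involved.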

The traditional solution (and why it’s painful)

Most teams solve this with custom code: a consumer application that reads from Kafka, transforms the data, and writes to a database. It works, but now you’re maintaining:

  • A consumer application
  • Connection pooling
  • Retry logic
  • Schema evolution handling
  • Monitoring and alerting
  • Deployment pipelines for all of the above

That’s a lot of code for “move data from A to B.”

The better way: Kafka Connect

Kafka Connect exists precisely for this use case. Instead of writing custom consumers, you deploy connectors—pre-built plugins that handle the heavy lifting. Want data in ClickHouse? Deploy a ClickHouse sink connector. Need data from PostgreSQL? Deploy a PostgreSQL source connector.

No custom code. No deployment pipelines for data movement. Just configuration.
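To give a taste of what “just configuration” means, here is roughly what a ClickHouse sink connector definition looks like. This is a sketch based on the ClickHouse Kafka Connect sink; verify the connector class and property names against the connector’s documentation, and note that the topic name, host, and credentials are placeholders:

```json
{
  "name": "clickhouse-sink",
  "config": {
    "connector.class": "com.clickhouse.kafka.connect.ClickHouseSinkConnector",
    "topics": "user-events",
    "hostname": "<clickhouse-host>",
    "port": "8443",
    "database": "default",
    "username": "<clickhouse-user>",
    "password": "<clickhouse-password>"
  }
}
```

You POST that JSON to the Connect REST API and the cluster takes it from there; there is no application to build or deploy.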

Who uses this pattern?

This isn’t experimental. Major companies run this exact architecture at scale.

The pattern works because it separates concerns: Kafka handles ingestion, Connect handles movement, ClickHouse handles queries. Each component does one thing well.

The architecture

Here’s what we’re building:

Three clusters, each in its own VPC:

Cluster        Network       Purpose
Kafka          10.0.0.0/16   Message broker for event ingestion
Kafka Connect  10.5.0.0/16   Streams data from Kafka to ClickHouse
ClickHouse     10.6.0.0/16   Columnar database for analytics

NetApp Instaclustr supports a number of deployment models, such as co-locating these clusters inside the same VPC, but for simplicity today we’ll put them into separate VPCs.

The firewall rules create a secure data flow: Kafka Connect can reach both Kafka and ClickHouse, but they can’t reach each other directly. Why does this matter? Each component only has access to what it needs. Kafka does not need to query ClickHouse. ClickHouse doesn’t need to read from Kafka (because that’s Connect’s job). By not opening unnecessary paths, you limit the blast radius if something goes wrong.

For example, if an attacker compromises ClickHouse, they can’t pivot directly to Kafka; they would have to go through Connect first, which adds another barrier.

The complete Terraform configuration

This builds on part 1’s foundation. The code below includes everything from part 1, plus new resources marked with # ClickHouse and # Kafka Connect comments.

Create a new directory with these files:

main.tf
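A sketch of the new resources follows. It assumes the Kafka cluster and provider setup from part 1 already exist in this file as instaclustr_kafka_cluster_v2.kafka; the attribute names mirror the Instaclustr Terraform provider, but check the provider documentation for the exact schema and current node sizes before applying:

```hcl
# ClickHouse — columnar analytics cluster (attribute names are illustrative;
# confirm against the Instaclustr provider docs)
resource "instaclustr_clickhouse_cluster_v2" "clickhouse" {
  name     = "pipeline-clickhouse"
  sla_tier = "NON_PRODUCTION"

  data_centre {
    cloud_provider = "AWS_VPC"
    region         = "US_EAST_1"
    name           = "AWS_VPC_US_EAST_1"
    network        = "10.6.0.0/16"            # isolated CIDR for ClickHouse
    node_size      = "<clickhouse-node-size>" # pick from Instaclustr's size list
    shards         = 1                        # single shard for simplicity
    replicas       = 3                        # three copies for fault tolerance
  }
}

# Kafka Connect — streams data from Kafka to ClickHouse
resource "instaclustr_kafka_connect_cluster_v2" "connect" {
  name     = "pipeline-kafka-connect"
  sla_tier = "NON_PRODUCTION"

  data_centre {
    cloud_provider  = "AWS_VPC"
    region          = "US_EAST_1"
    name            = "AWS_VPC_US_EAST_1"
    network         = "10.5.0.0/16"
    node_size       = "<connect-node-size>"
    number_of_nodes = 3
  }

  target_cluster {
    managed_cluster {
      # Implicit dependency: Kafka is created before Connect
      target_kafka_cluster_id = instaclustr_kafka_cluster_v2.kafka.id
      kafka_connect_vpc_type  = "KAFKA_VPC"
    }
  }
}
```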

terraform.tfvars
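A minimal terraform.tfvars might look like the following. The variable names are whatever you declared in part 1 (the ones below are hypothetical), and the values are placeholders, not real credentials:

```hcl
# Placeholder values — substitute your own Instaclustr provisioning credentials
instaclustr_username = "your-instaclustr-username"
instaclustr_api_key  = "your-provisioning-api-key"
```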


Understanding the new components

ClickHouse: The analytics engine

ClickHouse is a columnar database, which means it stores data by column rather than by row. This makes it extraordinarily fast for analytical queries—the kind that scan millions of rows but only need a few columns.

A few configuration notes:

  • shards = 1: We’re using a single shard for simplicity. Production deployments shard data across multiple nodes for horizontal scaling.
  • replicas = 3: Three copies of the data for fault tolerance. If a node fails, queries keep working.
  • network = "10.6.0.0/16": A separate CIDR block from Kafka. Each cluster lives in isolation.

Kafka Connect: The data bridge

The critical piece here is the target_cluster block. This tells Kafka Connect which Kafka cluster to connect to. Notice the Terraform reference:
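The reference looks something like this. The kafka resource label is whatever you named the part 1 cluster, and the nested block structure should be confirmed against the provider docs:

```hcl
target_cluster {
  managed_cluster {
    # Terraform resolves this reference, so Kafka must be created first
    target_kafka_cluster_id = instaclustr_kafka_cluster_v2.kafka.id
    kafka_connect_vpc_type  = "KAFKA_VPC"
  }
}
```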

This creates an implicit dependency: Terraform won’t create the Connect cluster until Kafka exists. You don’t need to manage ordering manually—Terraform figures it out from the references.

The kafka_connect_vpc_type = "KAFKA_VPC" setting places Kafka Connect in the same VPC as Kafka, enabling private communication without traversing the public internet.

Firewall rules: Security by design

Look at how the firewall rules create a directed data flow between the clusters.
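In Terraform terms, the rules open Connect’s network into Kafka and into ClickHouse, and nothing else. A sketch, with resource and attribute names that are illustrative and should be checked against the Instaclustr provider’s firewall rule resource:

```hcl
# Kafka accepts traffic from Kafka Connect's network only
resource "instaclustr_cluster_network_firewall_rule_v2" "kafka_from_connect" {
  cluster_id = instaclustr_kafka_cluster_v2.kafka.id
  network    = "10.5.0.0/16" # Kafka Connect's CIDR
  type       = "KAFKA"
}

# ClickHouse accepts traffic from Kafka Connect's network only
resource "instaclustr_cluster_network_firewall_rule_v2" "clickhouse_from_connect" {
  cluster_id = instaclustr_clickhouse_cluster_v2.clickhouse.id
  network    = "10.5.0.0/16"
  type       = "CLICKHOUSE"
}
```

There is no rule from Kafka to ClickHouse or back, which is exactly the point.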


Kafka Connect is the only thing that needs to talk to both systems.

Deploy the pipeline

Running these commands will provision all three clusters, your pipeline VPCs, subnets, route tables, and an EC2 test instance pre-loaded with the Kafka CLI. Think of the Terraform configuration as a recipe—it describes exactly what you want built, and these commands hand that recipe to AWS and Instaclustr to provision everything.
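The workflow is the standard Terraform one; nothing here is specific to this pipeline:

```shell
terraform init    # download the required providers
terraform plan    # preview the resources to be created
terraform apply   # provision everything (confirm when prompted)
```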


This deployment takes 20-30 minutes. Terraform creates the resources in dependency order: Kafka first, then ClickHouse and Kafka Connect in parallel, then all the firewall rules.

When complete, you’ll see outputs for all three clusters:
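The exact outputs depend on what you declared in your configuration; a minimal version might just surface the cluster IDs, something like:

```hcl
output "kafka_cluster_id" {
  value = instaclustr_kafka_cluster_v2.kafka.id
}

output "clickhouse_cluster_id" {
  value = instaclustr_clickhouse_cluster_v2.clickhouse.id
}

output "kafka_connect_cluster_id" {
  value = instaclustr_kafka_connect_cluster_v2.connect.id
}
```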


Verify the deployment

Check Kafka Connect

Kafka Connect exposes a REST API. Verify it’s running:
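From a machine the firewall rules allow in (such as the EC2 test instance), hit the connectors endpoint. The host, port, and credentials below are placeholders for the values shown on your cluster’s connection details page:

```shell
curl -u "<connect-user>:<connect-password>" \
  "https://<connect-node-ip>:8083/connectors"
```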

You should see an empty array [] — no connectors deployed yet, but the cluster is healthy.

Check ClickHouse

Connect using the ClickHouse client:
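Again, host and credentials are placeholders for your cluster’s connection details; 9440 is ClickHouse’s default secure native-protocol port:

```shell
clickhouse-client \
  --host <clickhouse-node-ip> \
  --secure --port 9440 \
  --user <clickhouse-user> --password '<clickhouse-password>'
```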

Run a test query:
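For example:

```sql
SELECT version();
```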

You should see the ClickHouse version confirming the connection works.

Handling Terraform state issues

During deployment, you might encounter a 404 error if a cluster is recreated.


This happens when Terraform’s state file references a resource that no longer exists in Instaclustr. The fix is to remove the orphaned reference from state:
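For example, if an orphaned firewall rule is stuck in state, you would list the state and remove the stale address. The resource address below is hypothetical; use the output of terraform state list to find the real one:

```shell
terraform state list   # find the orphaned resource address
terraform state rm 'instaclustr_cluster_network_firewall_rule_v2.kafka_from_connect'
```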

Then run terraform apply again. Terraform will recreate the firewall rules for the current cluster.

This is a normal part of working with Terraform and external providers. The state file is Terraform’s view of reality—sometimes it needs correction.

What we’ve built

At this point, you have:

  • A 3-broker Kafka cluster for event ingestion
  • A 3-replica ClickHouse cluster for analytics
  • A 3-worker Kafka Connect cluster ready for connectors
  • Firewall rules securing communication between them

All from ~150 lines of Terraform.

The infrastructure is ready. In part 3, we’ll add AWS VPC integration and switch Kafka Connect to VPC_PEERED mode, giving each cluster its own network so that you can connect your own AWS resources to all three over private networking.

Clean up

If you’re done for now, run:
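```shell
terraform destroy   # prompts for confirmation before deleting anything
```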

This removes all three clusters and their firewall rules. When you’re ready to continue with part 3, run terraform apply again with the settings we’ll introduce there.

Key takeaways

  1. Kafka Connect eliminates custom consumer code—use connectors instead of writing data movement logic
  2. Each cluster gets its own network—isolation by design, with firewall rules controlling communication
  3. Terraform references create implicit dependencies—the target_kafka_cluster_id reference ensures correct ordering
  4. ClickHouse is purpose-built for analytics—columnar storage makes it orders of magnitude faster for aggregation queries than row-based databases

Why managed infrastructure? Building a streaming pipeline is one challenge. Keeping it running in production is another. Running Kafka, ClickHouse, and Kafka Connect yourself means your team owns the operations too—version upgrades, security patches, and on-call rotations. Instaclustr handles that so your team can stay focused on what the pipeline does, not on keeping it alive. The Terraform configuration in this series is a good example of that division: you simply describe what you want, and Instaclustr provisions and operates it from there.

See you in part 3, where we connect all of this to your AWS infrastructure, move the entire pipeline into your own AWS account, and connect everything over private networking.