End-to-end distributed tracing of complex applications is a critical pillar of observability for modern distributed systems development and operations. Previously, I’ve used OpenTelemetry to trace a complex Kafka application, including Kafka Streams, and looked at the traces in several open source OpenTelemetry GUIs, including Jaeger, Uptrace, and SigNoz. And earlier, I used its predecessor, OpenTracing, to trace Anomalia Machina.

In part 1 of this blog, I explored option 1 using Jaeger. That approach was relatively easy to do, but it didn’t give the full functionality of the OpenSearch Trace Analytics visualizations. Now it’s time to investigate option 2 using a Data Prepper pipeline and see if we can unlock the full functionality.

OpenTelemetry Java auto-instrumentation recap

In part 1, we ran some Kafka clients with Java OpenTelemetry auto-instrumentation enabled. To recap, on the Kafka client side, this is what you have to run to automatically produce traces from a Java program:
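Roughly, it looks like this (a sketch only: the JAR name and OTLP endpoint are placeholders, and part 1 has the exact settings I used):

```bash
# Attach the OpenTelemetry Java agent so traces are produced automatically.
# kafka-client-app.jar and the endpoint below are placeholders.
java -javaagent:./opentelemetry-javaagent.jar \
     -Dotel.service.name=lotsofboxes \
     -Dotel.traces.exporter=otlp \
     -Dotel.exporter.otlp.endpoint=http://localhost:4317 \
     -jar kafka-client-app.jar
```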

The documentation and opentelemetry-javaagent.jar are available here.

For a more interesting use case and topology, I ran a producer (service.name=lotsofboxes) writing to a topic, and two consumers in different groups (service.name=consumerg1, service.name=consumerg2) reading from the topic.
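Concretely, that means three JVMs along these lines (the JAR names are placeholders; only the service.name changes per process):

```bash
# Producer writing to the topic
java -javaagent:./opentelemetry-javaagent.jar -Dotel.service.name=lotsofboxes -jar producer.jar

# Two consumers in different consumer groups, both reading the same topic
java -javaagent:./opentelemetry-javaagent.jar -Dotel.service.name=consumerg1 -jar consumer.jar
java -javaagent:./opentelemetry-javaagent.jar -Dotel.service.name=consumerg2 -jar consumer.jar
```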

Option 2: A complete and robust pipeline using OpenTelemetry Collectors and OpenSearch Data Prepper

In practice, the preferred way of getting OpenTelemetry traces into OpenSearch is as follows:

How to get OpenTelemetry traces from Apache Kafka clients into OpenSearch using Data Prepper and OpenTelemetry Collectors diagram 1

It looks like I need two extra components/steps: OpenTelemetry Collectors on the application side, and OpenSearch Data Prepper between the Collectors and OpenSearch to collect the traces from the Collectors, process them, and insert the data correctly in OpenSearch to enable visualizations.

Why is Data Prepper essential for this workflow? From this diagram, it looks like three internal pipelines are needed to process the traces correctly for use by OpenSearch:

How to get OpenTelemetry traces from Apache Kafka clients into OpenSearch using Data Prepper and OpenTelemetry Collectors diagram 2

The documentation explains them as follows. There are three processors for the trace analytics feature:

  • otel_traces
    • The otel_traces processor receives a collection of span records from otel-trace-source, and performs stateful processing, extraction, and completion of trace-group-related fields.
    • I assume this is part of the otel-trace-pipeline.
  • otel_traces_group
    • The otel_traces_group processor fills in the missing trace-group-related fields in the collection of span records by looking up the OpenSearch backend.
    • I was getting an error about “missing trace groups” in option 1 above, so this is presumably the missing step.
  • service_map
    • The service_map processor performs the required preprocessing for trace data and builds metadata to display the service-map dashboards.
    • I assume this is part of the service-map-pipeline.

I was puzzled about “trace groups” (as this is not an OpenTelemetry concept), but the documentation is sparse. AI suggested the following explanation:

In OpenSearch, when integrating with OpenTelemetry for distributed tracing, a trace group refers to a logical grouping of related traces or spans that share common characteristics or are associated with a particular service, operation, or application. It provides a way to organize and filter your trace data for easier analysis and visualization.

And groups are apparently customizable:

Trace groups can be defined based on various common characteristics, such as:

  • Service Name: All traces originating from or passing through a specific service can belong to a particular trace group (this appears to be the default).
  • Operation Name: Traces related to a specific operation within a service (e.g., “checkout,” “login”) can form a trace group.
  • Application: Traces from a particular application can be grouped together.
  • Custom Attributes: You can leverage custom attributes added to your OpenTelemetry spans to create highly specific trace groups based on business logic or other relevant metadata.

Data Prepper provides a generic sink that writes data to OpenSearch as the destination. The OpenSearch sink has configuration options related to the OpenSearch cluster, such as endpoint, SSL, username/password, index name, index template, and index state management.

The sink provides specific configurations for the trace analytics feature. These configurations allow the sink to use indexes and index templates specific to trace analytics. The following OpenSearch indexes are specific to trace analytics:

  • otel-v1-apm-span
    • The otel-v1-apm-span index stores the output from the otel_traces processor.
  • otel-v1-apm-service-map
    • The otel-v1-apm-service-map index stores the output from the service_map processor.

So, a lot is going on in these pipelines, which explains why the service maps were not working using just Jaeger in option 1.

Get, configure, and run Data Prepper

According to the documentation, Data Prepper is a server-side data collector capable of filtering, enriching, transforming, normalizing, and aggregating data for downstream analysis and visualization.

In practice, this probably means you would be running it on substantial servers and/or in the cloud. However, I just wanted to test the pipeline, so I ran it locally on my Mac. The getting started guide is here, and I got a local copy with this command:
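Assuming the Docker distribution, that amounts to pulling the image (the tag is a placeholder):

```bash
docker pull opensearchproject/data-prepper:latest
```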

Given the number of pipelines needed, and to run it with our managed OpenSearch, you need to write a specific configuration file; here’s an example.

Here’s my configuration file (opensearch_otel_config.yaml):
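In outline it looks like this (a sketch only: hosts, username, and password are placeholders, and the exact processor names and options vary by Data Prepper version; the otel_traces_group processor, with its own OpenSearch connection settings, can also be added to the raw pipeline):

```yaml
# Sketch of a Data Prepper trace-analytics pipelines file (placeholder hosts/credentials).
entry-pipeline:
  source:
    otel_trace_source:
      ssl: false                 # listens for OTLP traces on port 21890 by default
  sink:
    - pipeline:
        name: "raw-trace-pipeline"
    - pipeline:
        name: "service-map-pipeline"

raw-trace-pipeline:
  source:
    pipeline:
      name: "entry-pipeline"
  processor:
    - otel_traces:               # stateful processing of spans and trace-group fields
  sink:
    - opensearch:
        hosts: ["https://<opensearch-host>:9200"]
        username: "<username>"
        password: "<password>"
        insecure: true           # testing only; set back to false in production
        index_type: trace-analytics-raw

service-map-pipeline:
  source:
    pipeline:
      name: "entry-pipeline"
  processor:
    - service_map:               # builds the metadata behind the service-map dashboards
  sink:
    - opensearch:
        hosts: ["https://<opensearch-host>:9200"]
        username: "<username>"
        password: "<password>"
        insecure: true
        index_type: trace-analytics-service-map
```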

This pipeline takes data from the OpenTelemetry Collector (entry-pipeline) and uses two other pipelines as sinks. These two separate pipelines serve two different purposes and write to different OpenSearch indexes. The first pipeline (raw-trace-pipeline) prepares trace data for OpenSearch and enriches and ingests the span documents into a span index within OpenSearch. The second pipeline (service-map-pipeline) aggregates traces into a service map and writes service map documents into a service map index within OpenSearch.

Note that because there are two sinks, you need to provide the OpenSearch cluster host address and username/password twice! Also note that I changed the default sink setting for insecure from false to true, as I was just testing; in production you will likely set this back to false (the Data Prepper configuration documentation has more details).

You run Data Prepper like this (note the mapping of port 21890):
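For the Docker distribution, something like this (a sketch; the in-container path of the pipelines file differs between Data Prepper versions):

```bash
docker run --rm --name data-prepper \
  -p 21890:21890 \
  -v "$(pwd)/opensearch_otel_config.yaml:/usr/share/data-prepper/pipelines/pipelines.yaml" \
  opensearchproject/data-prepper:latest
```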

Note that on a Mac with Apple Silicon, Docker warns that the image’s platform (linux/amd64) does not match the host platform, but it still works ok.

If everything is configured correctly, Data Prepper logs a startup message confirming that the pipelines have been created and are running.

Get, configure, and run the OpenTelemetry Collector

The final piece of the puzzle is the OpenTelemetry Collector. To understand why the Collector is important, we need to backtrack a bit and have a look at the OpenTelemetry architecture. Here’s the high-level diagram, which clearly shows the central role of the Collector:

How to get OpenTelemetry traces from Apache Kafka clients into OpenSearch using Data Prepper and OpenTelemetry Collectors diagram 3

Source: https://opentelemetry.io/docs/

The Collector functions like a spider in a web and is documented further here. The Collector offers a vendor-agnostic implementation of how to receive, process, and export telemetry data. It removes the need to run, operate, and maintain multiple agents/collectors, scales better, and supports open source observability data formats (e.g. Jaeger, Prometheus, etc.), sending to one or more backends (also called frontends in the above diagram). Collectors are complex and support multiple components, including receivers, processors, and exporters, as shown in this diagram:

How to get OpenTelemetry traces from Apache Kafka clients into OpenSearch using Data Prepper and OpenTelemetry Collectors diagram 4

I installed a Mac version of the Collector and used these installation instructions.

Next, you need to configure it to send data to Data Prepper according to these instructions (you also need to add otlp: protocols: http). Here’s my configuration file (otel-collector-config.yaml):
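In outline it looks like this (a sketch following the Data Prepper documentation; the endpoint should point at wherever Data Prepper is running, and older Collector versions set insecure directly on the exporter rather than under tls):

```yaml
# OpenTelemetry Collector: receive OTLP from the Java agents and forward to Data Prepper.
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlp/data-prepper:
    endpoint: localhost:21890    # Data Prepper's otel_trace_source port
    tls:
      insecure: true             # testing only

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/data-prepper]
```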

And run it like this:
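Assuming the prebuilt otelcol binary (the binary name differs between Collector distributions, e.g. otelcol vs. otelcol-contrib):

```bash
./otelcol --config=otel-collector-config.yaml
```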

If you get an error about the port being in use, then it’s likely that you already have another Collector running. I did – by accident! It turns out that the version of Jaeger I was using has a Collector built in – which was why it was working without an extra Collector and getting traces directly from the Java clients. Shutting down Jaeger (or running it on a different port) solves the problem and allows the new dedicated Collector to start running with Data Prepper as the sink.

If it’s working correctly, you will see the Collector’s startup log messages confirming that the receivers and exporters are up and that it has started processing data.
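Another way to confirm that traces are flowing end to end is to check that the trace-analytics indexes mentioned earlier are being populated (host and credentials are placeholders):

```bash
curl -k -u "<username>:<password>" \
  "https://<opensearch-host>:9200/_cat/indices/otel-v1-apm-*?v"
```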

Finally, go to the OpenSearch Dashboards Trace Analytics page and select Data Prepper as the data source. The full functionality of OpenSearch Dashboards Trace Analytics should now be available, including the Service Map, which visualizes the dependencies and service topologies, as shown in this screenshot:

How to get OpenTelemetry traces from Apache Kafka clients into OpenSearch using Data Prepper and OpenTelemetry Collectors diagram 5

Here’s a diagram of the final setup showing all the components and their relationships:

How to get OpenTelemetry traces from Apache Kafka clients into OpenSearch using Data Prepper and OpenTelemetry Collectors diagram 6

Note that the order of running this pipeline matters. You have to work backwards from the final component to the first in this order for everything to work correctly:

  1. OpenSearch
  2. Data Prepper
  3. OpenTelemetry Collector
  4. Kafka Clients

That’s it for part 2 of this exploration of how to get OpenTelemetry traces into OpenSearch. To summarize:

  • Option 1 (part 1), using OpenSearch as a Jaeger database, works for an experiment but offers limited trace visualization functionality.
  • Option 2 takes more effort, has more components (Data Prepper and OpenTelemetry Collectors), and requires more configuration, but it unlocks the full OpenSearch Trace Analytics visualization functionality and is more scalable and robust for a production deployment.

In a future article, we’ll explore how to use OpenSearch Observability and Trace Analytics with Apache Kafka in more detail, including what’s missing and what could be improved.