Grafana series (XV): Exemplars

This article was last updated on: February 7, 2024

Introduction to Exemplars

An exemplar is a specific trace that is representative of a measurement taken in a given time interval. Metrics excel at giving you an aggregated view of a system, while traces give you a fine-grained view of a single request; exemplars are one way to connect the two.

Let’s say your company website is experiencing a spike in traffic. While more than eighty percent of users can access the website within two seconds, some users see higher than normal response times, resulting in a poor user experience.

To determine what is causing latency, you must compare a fast-responding trace to a slow-responding trace. Given the large amount of data in a typical production environment, this would be very laborious and time-consuming.

Use exemplars to help isolate problems within your data distribution by pinpointing query traces that exhibit high latency within a time interval. Once you have localized the latency problem to a few exemplar traces, you can combine them with additional system-based information or location properties to perform a faster root-cause analysis and resolve performance issues quickly.

Exemplar support is available only for the Prometheus data source. Once you enable the feature, exemplar data is available by default.

Grafana displays exemplars alongside metrics in the Explore view and in dashboards. Each exemplar appears as a highlighted star. You can hover over an exemplar to see its unique trace ID, presented as a key-value pair. To investigate further, click the blue button next to the traceID property. For example:

Screenshot showing the detail window of an exemplar

Background

Exemplars are a hot topic in the observability space lately, and for good reason.

Much as Prometheus began disrupting the cost structure of storing metrics at scale in 2012 and achieved it in earnest in 2015, and as Grafana Loki disrupted the cost structure of storing logs at scale in 2018, exemplars are doing the same for traces. To understand why, let’s look at the history of observability in the cloud-native ecosystem and at the optimizations exemplars make possible.

At its core, an exemplar is a way to jump from meaningful metrics and logs to traces by ID. Grafana Tempo, Grafana Labs’ open source, massively scalable distributed tracing backend, is built around this idea, because exemplars greatly improve the cost and performance characteristics of distributed tracing. Ideally, you never need to sample your traces, and Tempo makes that realistic.

Prometheus

Ignoring Prometheus’ excellent scalability, compression, and performance for a moment, let’s focus on label sets. They are metadata about your time series. Which cluster, which service, which customer, which deployment tier, and so on can all be encoded as non-hierarchical key-value pairs. If you’re reading this, I probably don’t need to convince you how disruptive, impactful, and long-lasting this change has been for the industry; I just want to remind you of it because it matters for the rest of the text.

This was revolutionary a few years ago:

acme_http_router_request_seconds_sum{path="/api/v1",method="GET"} 9036.32
acme_http_router_request_seconds_count{path="/api/v1",method="GET"} 807283.0
acme_http_router_request_seconds_sum{path="/api/v2",method="POST"} 479.3
acme_http_router_request_seconds_count{path="/api/v2",method="POST"} 34.0

OpenMetrics

As early as 2015-2016, developers planned to use the same label sets for logs and traces. That is why, since 2017, OpenMetrics has lived in a GitHub organization called OpenObservability, not “just” in one called OpenMetrics.

Grafana Loki

With Loki, that dream came true in 2018: you can move seamlessly between your metrics and logs. This is where the slogan “Like Prometheus, but for logs” comes from.

So it would seem natural to apply label sets to traces as well, right?

OpenMetrics & OpenCensus

In 2017, OpenMetrics and OpenCensus met to see whether the two projects could be merged. Although the attempt failed because of incompatible design goals, operating models, and data models, the meeting changed the course of OpenMetrics and Prometheus, and led to Grafana Tempo’s core design.

Design ideas behind exemplars

Essentially, exemplars are about three ideas:

  1. Tightly couple traces with other observability data.
  2. Jump into traces by ID only.
  3. Jump into a trace only if you know which trace you’re interested in and why. Avoid searching for a needle in a haystack.

Tight coupling

Attaching a trace ID to a metric via exemplars is simple. Add a “#” after your metric value (and optional timestamp) to indicate that an exemplar is present, and then add your data.

Borrowing an example from the OpenMetrics specification:

# TYPE foo histogram
foo_bucket{le="0.01"} 0
foo_bucket{le="0.1"} 8 # {} 0.054
foo_bucket{le="1"} 11 # {trace_id="KOO5S4vxi0o"} 0.67
foo_bucket{le="10"} 17 # {trace_id="oHg5SJYRHA0"} 9.8 1520879607.789
foo_bucket{le="+Inf"} 17
foo_count 17
foo_sum 324789.3
foo_created 1520430000.123

If the name and value of the trace_id label remind you of the specification proposed by the W3C Distributed Tracing Working Group, that is no coincidence. We deliberately adopted the W3C specification without mandating it. This lets us build on existing specification work without tying OpenMetrics down before the distributed tracing space stabilizes.

Let’s look at the actual exemplars in that example:

The histogram bucket for latencies below 1 second has a trace with a runtime of 0.67 seconds and the ID KOO5S4vxi0o.

The histogram bucket for latencies below 10 seconds has a trace with a runtime of 9.8 seconds, recorded at timestamp 1520879607.789, with the ID oHg5SJYRHA0.

That’s it!
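To make the layout of such a line concrete, here is a minimal Python sketch that pulls the exemplar out of one exposition line. It is not a full OpenMetrics parser: the names `Exemplar` and `parse_exemplar` are made up for this illustration, and it assumes label values contain neither `}` nor the sequence ` # `.

```python
import re
from typing import NamedTuple, Optional


class Exemplar(NamedTuple):
    labels: dict                 # e.g. {"trace_id": "KOO5S4vxi0o"}
    value: float                 # the observed value, e.g. 0.67 seconds
    timestamp: Optional[float]   # optional, may be absent


# Matches what follows " # ": {labels} value [timestamp]
_EXEMPLAR_RE = re.compile(
    r"^\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)(?:\s+(?P<ts>\S+))?$"
)
_LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_exemplar(line: str) -> Optional[Exemplar]:
    """Return the exemplar attached to one exposition line, or None."""
    _sample, sep, rest = line.partition(" # ")
    if not sep:
        return None  # no exemplar marker on this line
    match = _EXEMPLAR_RE.match(rest.strip())
    if match is None:
        return None
    labels = dict(_LABEL_RE.findall(match.group("labels")))
    ts = match.group("ts")
    return Exemplar(labels, float(match.group("value")),
                    float(ts) if ts else None)


line = 'foo_bucket{le="10"} 17 # {trace_id="oHg5SJYRHA0"} 9.8 1520879607.789'
print(parse_exemplar(line))
```

Lines without a “#” marker, such as `foo_bucket{le="0.01"} 0`, simply yield `None`.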

ID only

Indexes are expensive. Putting the full context and metadata on traces means that you need to search for traces through that metadata, which means indexing it. But you want the same metadata on your metrics, logs, and traces (as well as conprof, crash dumps, etc.). Since you already have this metadata on your other data, how about reusing the same index to save cost and time?

You can do this by attaching a trace to a specific time series or log line at a specific point in time. For the traces themselves, you index only the ID and you’re good to go.
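The idea can be sketched in a few lines of Python. Everything here is hypothetical and only illustrative: `Sample`, `TRACE_STORE`, `slow_trace_ids`, and `fetch_trace` are invented names, and the sample data is made up (reusing the two trace IDs from the example above). The point is that the selection happens on metric labels and values, while the trace store is a plain lookup by ID.

```python
from typing import Dict, List, NamedTuple, Optional


class Sample(NamedTuple):
    """One metric sample; its labels are already indexed by the metrics system."""
    labels: Dict[str, str]
    timestamp: float
    value: float                 # observed latency in seconds
    trace_id: Optional[str]      # exemplar: the only link to the trace


# Hypothetical trace store: keyed by trace ID and nothing else.
TRACE_STORE: Dict[str, List[str]] = {
    "KOO5S4vxi0o": ["span: router", "span: db query"],
    "oHg5SJYRHA0": ["span: router", "span: slow upstream call"],
}

# Made-up samples for illustration.
SAMPLES = [
    Sample({"path": "/api/v1", "method": "GET"}, 1520879600.0, 0.67, "KOO5S4vxi0o"),
    Sample({"path": "/api/v1", "method": "GET"}, 1520879607.789, 9.8, "oHg5SJYRHA0"),
    Sample({"path": "/api/v2", "method": "POST"}, 1520879610.0, 0.05, None),
]


def slow_trace_ids(samples: List[Sample], path: str,
                   threshold_seconds: float) -> List[str]:
    """Pick trace IDs using only metric labels and values."""
    return [
        s.trace_id
        for s in samples
        if s.trace_id is not None
        and s.labels.get("path") == path
        and s.value > threshold_seconds
    ]


def fetch_trace(trace_id: str) -> List[str]:
    """A single key lookup; the trace store needs no label index."""
    return TRACE_STORE[trace_id]


for trace_id in slow_trace_ids(SAMPLES, "/api/v1", 1.0):
    print(trace_id, fetch_trace(trace_id))
```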

Only traces of interest

Automated trace analysis is a broad field; a great deal of engineering effort has gone into making this haystack searchable.

What if there was a cheaper and more efficient way?

Logs can already tell you about an error state or something similar. You don’t need to analyze traces to find that error.

Counters, histograms, and so on in your metrics are already a highly condensed and optimized form of data, distilled to what matters in this context. You don’t need to analyze all traces to find the one that shows high latency.

Your logs and your metrics have already told you why a trace deserves closer investigation. Your labels give you context on how and where the trace was generated. When you jump into the trace, you already know what you’re looking for and why. This greatly speeds up discovery.

Enabling the exemplar storage feature in Prometheus

📚️ Reference:
Exemplars storage | Prometheus Doc

--enable-feature=exemplar-storage

OpenMetrics describes the ability for scrape targets to add exemplars to certain metrics. Exemplars are references to data outside of the MetricSet. A common use case is the ID of a trace.
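As a rough illustration of what a scrape target has to emit, the following Python sketch renders one sample line with an optional exemplar in the text layout shown earlier. The function names are invented for this example; a real application would normally rely on a Prometheus client library instead of formatting lines by hand.

```python
from typing import Dict, Optional, Tuple

# (exemplar labels, exemplar value, optional exemplar timestamp)
ExemplarData = Tuple[Dict[str, str], float, Optional[float]]


def _render_labels(labels: Dict[str, str]) -> str:
    """Render {k="v",...}, escaping backslashes, quotes, and newlines."""
    parts = []
    for key, value in labels.items():
        escaped = (value.replace("\\", "\\\\")
                        .replace('"', '\\"')
                        .replace("\n", "\\n"))
        parts.append(f'{key}="{escaped}"')
    return "{" + ",".join(parts) + "}"


def format_sample(name: str, labels: Dict[str, str], value,
                  exemplar: Optional[ExemplarData] = None) -> str:
    """Render one sample line, appending ' # {labels} value [ts]' if given."""
    line = f"{name}{_render_labels(labels) if labels else ''} {value}"
    if exemplar is not None:
        ex_labels, ex_value, ex_ts = exemplar
        line += f" # {_render_labels(ex_labels)} {ex_value}"
        if ex_ts is not None:
            line += f" {ex_ts}"
    return line


print(format_sample("foo_bucket", {"le": "1"}, 11,
                    ({"trace_id": "KOO5S4vxi0o"}, 0.67, None)))
```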

Exemplar storage is implemented as a fixed-size circular buffer that stores exemplars in memory for all series. Enabling this feature makes it possible to store the exemplars scraped by Prometheus. The storage/exemplars block of the configuration file can be used to control the size of the circular buffer. An exemplar with just a traceID=<jaeger-trace-id> uses roughly 100 bytes of memory in the in-memory exemplar store. If exemplar storage is enabled, exemplars are also appended to the WAL for local persistence (for the duration of the WAL).
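The behavior of a fixed-size circular buffer, and the resulting memory bound, can be sketched as follows. This is a toy model in Python, not Prometheus’ actual implementation; `ExemplarBuffer` and `estimated_bytes` are hypothetical names, and the buffer size of 100,000 is only an illustrative figure.

```python
from collections import deque

BYTES_PER_EXEMPLAR = 100  # rough per-exemplar figure quoted above


class ExemplarBuffer:
    """Toy fixed-size circular buffer: the oldest exemplar is overwritten first."""

    def __init__(self, max_exemplars: int) -> None:
        # deque with maxlen drops items from the left once it is full
        self._buf = deque(maxlen=max_exemplars)

    def append(self, series: str, trace_id: str, value: float, ts: float) -> None:
        self._buf.append((series, trace_id, value, ts))

    def trace_ids(self) -> list:
        return [entry[1] for entry in self._buf]

    def __len__(self) -> int:
        return len(self._buf)


def estimated_bytes(max_exemplars: int) -> int:
    """Upper bound on memory for a full buffer of trace-ID-only exemplars."""
    return max_exemplars * BYTES_PER_EXEMPLAR


buf = ExemplarBuffer(max_exemplars=3)
for i in range(5):
    buf.append("foo_bucket", f"t{i}", 0.1 * i, 1520879600.0 + i)
print(buf.trace_ids())           # only the three most recent remain
print(estimated_bytes(100_000))  # bytes for an illustrative buffer size
```

Because the buffer size is fixed, memory use stays bounded regardless of how many exemplars are scraped; the cost is that older exemplars are discarded.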

Configure Exemplar in the Prometheus data source

📚️ Reference:

For more information about Exemplar configuration and how to enable Exemplar, see Configure Exemplar in the Prometheus data source

📝 Notes:

This feature is available on Prometheus 2.26+ and Grafana 7.4+.

Grafana 7.4 and later can display exemplar data alongside a metric in both Explore and dashboards. Exemplars are a way to associate high-cardinality metadata from a specific event with traditional time series data.

Configure Exemplars in the data source settings by adding external or internal links.

Screenshot of the exemplars configuration

View Exemplar data

📚️ Reference:

Refer to “View exemplar data” to learn how to drill into and view exemplar trace details from metrics and logs.

When exemplar support is enabled in the Prometheus data source, you can view exemplar data in the Explore view or from the Loki log details.
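Exemplar data can also be inspected outside Grafana through the `query_exemplars` endpoint of the Prometheus HTTP API. The Python sketch below only builds the request URL and extracts trace IDs from a response body; it makes no network call. The base URL `http://localhost:9090`, the helper names, and the hand-written `SAMPLE_RESPONSE` are assumptions for illustration, and the label that carries the trace ID (`trace_id` here) depends on how your targets expose exemplars.

```python
import json
from urllib.parse import urlencode


def build_exemplar_query_url(base_url: str, query: str,
                             start: float, end: float) -> str:
    """Build a GET URL for /api/v1/query_exemplars."""
    params = urlencode({"query": query, "start": start, "end": end})
    return f"{base_url.rstrip('/')}/api/v1/query_exemplars?{params}"


def extract_trace_ids(payload: dict, label: str = "trace_id") -> list:
    """Collect the trace-ID label from every exemplar in a response."""
    ids = []
    for series in payload.get("data", []):
        for exemplar in series.get("exemplars", []):
            trace_id = exemplar.get("labels", {}).get(label)
            if trace_id is not None:
                ids.append(trace_id)
    return ids


# Hand-written response in the shape returned by the endpoint (illustrative data).
SAMPLE_RESPONSE = """
{"status": "success",
 "data": [{"seriesLabels": {"__name__": "foo_bucket", "le": "1"},
           "exemplars": [{"labels": {"trace_id": "KOO5S4vxi0o"},
                          "value": "0.67",
                          "timestamp": 1520879607.789}]}]}
"""

print(build_exemplar_query_url("http://localhost:9090", "foo_bucket",
                               1520879000, 1520880000))
print(extract_trace_ids(json.loads(SAMPLE_RESPONSE)))
```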

Explore

Explore visualizes exemplar trace data as highlighted stars alongside metric data. For more information on how Explore visualizes trace data, refer to “Tracing in Explore”.

To examine the details of an exemplar trace:

  1. Place your cursor over an exemplar (highlighted star). Depending on your backend trace data source, you’ll see a blue button labeled Query with <data source name>. In the following example, the tracing data source is Tempo.

    Screenshot showing exemplar details

  2. Click the Query with Tempo option next to the traceID property. The trace details, including the spans within the trace, are listed in a separate panel on the right.

    Explore view with a panel showing trace details

Logs

You can also view exemplar trace details from Loki logs in Explore. Use a regex in Loki’s Derived fields links to extract the traceID information. Then, when you expand a Loki log line, the traceID property appears in the Detected fields section. To learn more about how to extract part of a log message into an internal or external link, refer to “Use derived fields in Loki”.
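To get a feel for what such a derived-field regex does, here is a small Python sketch that applies a pattern with one capture group to a log line. The pattern `traceID=(\w+)` and the log line are assumptions for illustration; the right pattern depends on your log format, and in Grafana the regex is evaluated by Grafana itself, not by Python.

```python
import re
from typing import Optional

# One capture group: whatever it captures becomes the derived field's value.
TRACE_ID_RE = re.compile(r"traceID=(\w+)")


def extract_trace_id(log_line: str) -> Optional[str]:
    """Return the captured trace ID, or None if the line has none."""
    match = TRACE_ID_RE.search(log_line)
    return match.group(1) if match else None


# Hypothetical log line for illustration.
line = 'level=info msg="request done" traceID=4bf92f3577b34da6 duration=1.2s'
print(extract_trace_id(line))
```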

To view the details of an Exemplar trace:

  1. Expand a log line and scroll down to the Detected fields section. Depending on your backend trace data source, you’ll see a blue button labeled <data source name>.

  2. Click the blue button next to the traceID property. Typically, it carries the name of the backend data source. In the following example, the tracing data source is Tempo. The trace details, including the spans within the trace, are listed in a separate panel on the right.

Explore view with a panel showing trace details

Summary

That is all there is to exemplars. Engineering is always about trade-offs that accommodate design goals and constraints.

Prometheus shifted the entire industry to a new set of trade-offs, creating the cornerstone of cloud-native observability. Grafana Loki is doing the same for logging. Grafana Tempo is doing the same for distributed tracing through the power of exemplars.

Tempo’s job is to store a large number of traces, put them in object storage, and retrieve them by ID. Since all of this follows a holistic design, seamless movement between metrics, logs, and traces is already possible, and at true cloud-native scale.

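Diagram: the concrete software that implements seamless movement between metrics, logs, and traces

Diagram: the technical details behind seamless movement between metrics, logs, and traces

Exemplars have been supported in Grafana since version 7.4.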

Reference documentation