Reverse Metric Lookup in Prometheus
Hey all, it’s been a while since I’ve written anything, and it’s time to get back in the habit. While I’ve been away, I’ve been digging into my company’s observability systems, and I’m back with lessons to share!
Axon uses Prometheus for collecting metrics and Grafana for visualizing them. Often in Prometheus, we start with a metric and work forwards, filtering and transforming it to produce some useful result that we then display in a dashboard. Sometimes though, I find myself in situations where I want to work backwards from a metric label to a metric that I’m trying to identify.
In this post, I’m going to show y’all a handy trick that you can use to solve this problem.
Dude, Where’s My Metric?
The other day, I had a teammate come to me with a question. He said (paraphrasing to add context):
I’ve got some Scala code that looks like this:
recordEndpointMetrics(endpoint="myEndpoint") {
  // endpoint implementation
  // ...
  // ...
}
Where recordEndpointMetrics is an internal library function that records some Prometheus metrics about the invocations of service endpoints, and the endpoint name is being passed in to be used as a metric label.
How do I figure out which metric on the Prometheus side this function maps to?
In hindsight, the easiest way to answer this question would have been to just look at the method documentation, which stated very clearly what the metric name was. However, I’m a little embarrassed to admit that neither of us thought to do that until after we had worked out the answer the hard way 😬.
Still, not all methods come with such helpful, up-to-date documentation (or any documentation, for that matter), so I think it’s worth knowing some other ways to answer that kind of question.
Metric Queries
A typical PromQL query might look something like this:
sum(queue_size{topic="my-topic"})
To break down what’s going on here, we start with a base metric like queue_size. We then use curly braces to filter that metric down to some subset of data, and then we apply functions or aggregations, like sum, to transform the data in some useful way.
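For instance, a slightly fancier (and, to be clear, hypothetical) version of that query might filter topics by a regex and then aggregate per topic:
sum by (topic) (queue_size{topic=~"my-.*"})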
This workflow is useful for querying and transforming metrics that we already know about so that we can display them in a dashboard, but it’s not super helpful if we’re still trying to figure out what metrics we want to use in the first place.
A High (Syntax) Sugar Diet
If you’ve ever played around with Grafana’s Metric Explorer feature, you may have noticed that all Prometheus metrics have a __name__ label that contains the name of the metric series.
This detail may not appear particularly remarkable at first glance, but it turns out to be pretty important for us. You see, the metric name isn’t a special or privileged piece of metadata. It’s just another label, and that means we can work with it in the same ways that we work with other labels.
To be fair, we have special syntax for it:
metric{label="value"}
But it turns out this is only syntactic sugar that reduces down to
{__name__="metric", label="value"}
For example, these two queries (using queue_size as a stand-in for any metric) return exactly the same series:
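queue_size{topic="my-topic"}
{__name__="queue_size", topic="my-topic"}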
This is really important for us, because it means that we can do things like regex matches on metric names. For example, a query like this one (the queue_ prefix is again just an illustration) matches every metric whose name starts with queue_:
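{__name__=~"queue_.*"}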
Or we can even omit the metric name entirely and just query for every metric that carries a specific label and value:
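{topic="my-topic"}
(One caveat: Prometheus requires that a selector contain at least one matcher that doesn’t match the empty string, so you can’t take this to the extreme and query {} for everything.)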
Dude, Here’s Your Metric!
Let’s now return to the problem that my co-worker and I were looking at. The first thing that we did was take a peek under the hood at the implementation of recordEndpointMetrics. We didn’t see the metric name, but we did see a line of code like this one:
val metricAttributes = Attributes.builder().put("api", aspect)
From the method invocation, we know that aspect has a value of "myEndpoint", meaning that we’re looking for some metric that has a label/value pair of "api" and "myEndpoint".
From here, it was pretty trivial to query all metrics that look like that with
{api="myEndpoint"}
which quickly led us to the solution.
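For the curious, here’s a rough sketch of what a helper like recordEndpointMetrics might look like on the inside. This is not our actual library code: it assumes the OpenTelemetry metrics API (which is where that Attributes.builder() call comes from), and the instrument name endpoint_invocations is made up, since the real name is exactly the piece we couldn’t see:
import io.opentelemetry.api.GlobalOpenTelemetry
import io.opentelemetry.api.common.Attributes

object EndpointMetrics {
  private val meter = GlobalOpenTelemetry.getMeter("endpoint-metrics")

  // Hypothetical instrument name; only the "api" label key is taken from the real code.
  private val invocations = meter.counterBuilder("endpoint_invocations").build()

  def recordEndpointMetrics[T](endpoint: String)(body: => T): T = {
    // In the real library, this variable was named `aspect`.
    val metricAttributes = Attributes.builder().put("api", endpoint).build()
    try body
    finally invocations.add(1, metricAttributes)
  }
}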
Wrapping Up
Headless metric queries like the ones we’ve looked at in this post are a really great tool for doing metric discovery and for mapping out the landscape of your Prometheus metrics. That, in turn, is invaluable for building operational fluency and being able to quickly figure out what’s happening when you get paged at 2 A.M.
If you’re looking for ways to get started building that fluency, a good place to start might be querying all of the metrics that are produced by a specific service with something like {host=~"my-service-host-.*"} and seeing where that leads you.
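And if you’d rather see an inventory of metric names than a pile of raw series, you can lean on the fact that __name__ is just a label and group by it. Something like this (with the same hypothetical host label) lists each metric name the service exposes, along with how many series it has:
count by (__name__) ({host=~"my-service-host-.*"})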
Until next time!