Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate metrics vs composability #901

Open
zorba128 opened this issue Dec 2, 2023 · 11 comments
Open

Duplicate metrics vs composability #901

zorba128 opened this issue Dec 2, 2023 · 11 comments

Comments

@zorba128
Copy link

zorba128 commented Dec 2, 2023

Hi there

I started upgrading our (rather big and complex) project to new prometheus client, and got seriously stuck.

Our monitoring includes small service spawned on each physical host, that monitors (using some low-level inter-process api) services deployed nearby. Then results are exposed as metrics, each tagged properly with identity of service it comes from.

So we have our class MyCollector(target: ServiceRef) that scrapes all the metrics related to given service instance, it used to be instantiated and registered several times. Now that is impossible, cause both registry, and MetricSnapshots do not allow merging.

I come from functional world, and design like this seems really odd for me. To implement workaround I need to rewrite MyCollector to internally handle multiple services and do the aggregation internally. I believe this brings unnecessary complexity to otherwise well defined and nice component.

And technically speaking - I'd really prefer MetricSnapshots.builder to keep lookup table, and merge once duplicate metrics is added. Simpe to implement, it keeps MetricSnapshots semantic (in the end still there is a guarantee that there are no duplicates), but makes api way more flexible for scenarios like mine.

Similarly - registry behavior. If you look at code - there is nothing that really prevents it from allowing to register multiple collectors that expose same metrics. With merging builder it would work correctly, directly making scenarios like mine possible.

See my (sorry for scala) implementation of merging builder, that now I need to inject at all layers:

  /** Builder, allowing to merge metrics reported in separate snapshots into one.
   * Its like MetricSnapshots.builder(), but trying to merge rather than failing.
   * Picks metadata from first occurrence of given metrics.
   */
   class MetricSnapshotsMergingBuilder {
    val counters = mutable.Map.empty[String, (MetricMetadata, JArrayList[CounterDataPointSnapshot])]
    val gauges = mutable.Map.empty[String, (MetricMetadata, JArrayList[GaugeDataPointSnapshot])]

    /** Append another snapshot */
    def metricSnapshot(snapshot: MetricSnapshot): this.type = {
      snapshot match {
        case snapshot: CounterSnapshot =>
          val points = counters.getOrElseUpdate(snapshot.getMetadata.getPrometheusName, snapshot.getMetadata -> new JArrayList)._2
          points.addAll(snapshot.getDataPoints)
        case snapshot: GaugeSnapshot =>
          val points = gauges.getOrElseUpdate(snapshot.getMetadata.getPrometheusName, snapshot.getMetadata -> new JArrayList)._2
          points.addAll(snapshot.getDataPoints)
        case ...
      }
      this
    }

    def build(): MetricSnapshots = {
      val builder = MetricSnapshots.builder()
      counters.values.foreach { case (metadata, points) => builder.metricSnapshot(new CounterSnapshot(metadata, points)) }
      gauges.values.foreach { case (metadata, points) => builder.metricSnapshot(new GaugeSnapshot(metadata, points)) }
      builder.build()
    }
  }

It would be event nicer if MetricSnapshot.Builder allowed to inject ready-made metadata - then lookups would become
Map[String, MetricSnapshot.Builder]

I believe such change would make it way easier for everyone to migrate; in old version such scenarios were possible, and were working fine, and now to be honest I see no reason to prevent it - as it can still be handled nicely, still providing guarantees about MetricSnapshots consistency, and consistency of output in the very end.

@fstab
Copy link
Member

fstab commented Dec 2, 2023

Hi @zorba128, thanks a lot for reaching out. Do you think the GaugeWithCallback or CounterWithCallback would help in your case? Here's an example from the JvmMemoryMetrics:

GaugeWithCallback.builder()
        .name("jvm_memory_bytes_used")
        .help("Used bytes of a given JVM memory area.")
        .unit(Unit.BYTES)
        .labelNames("area")
        .callback(callback -> {
            callback.call(memoryBean.getHeapMemoryUsage().getUsed(), "heap");
            callback.call(memoryBean.getNonHeapMemoryUsage().getUsed(), "nonheap");
        })
        .register();

Of course you can define the callback explicitly:

Consumer<GaugeWithCallback.Callback> gaugeCallback = callback -> {
    // get value1 and value2 via the inter-process api
    callback.call(value1, "label value 1", "label value 2");
    callback.call(value2, "label value a", "label value b");
};

Then you can register a GaugeWithCallback with that callback:

GaugeWithCallback.builder()
        .name("my_gauge")
        .labelNames("label name 1", "label name 2")
        .callback(gaugeCallback)
        .register();

Please let me know if that helps.

@zorba128
Copy link
Author

zorba128 commented Dec 2, 2023

As an example of what I mean, please take a look at JvmThreadsMetrics.

And now imagine you want to expose jvm thread stats from multiple jvms. One might think its simple - just have multiple JvmThreadsMetrics instances, each bound to correct MBean. But this will not work, because of built-in design decision that "given metric can be produced by one and only one collector instance".

So developers, rather than simply wiring what is already there - need to build the "multi source merging" logic into each of praticular collectors.

Again - like trying to provide FolderStatsCollector, that monitors number of files and their sizes in specific folder.
Note how expressive code below is:

class FolderStatsCollector(root: Path) { /* list files, expose total size, use du, etc */ }
registry.register(new FolderStatsCollector("/usr").withLabel("folder", "usr"))
registry.register(new FolderStatsCollector("/home").withLabel("folder", "home"))
registry.register(new FolderStatsCollector("/dev/shm").withLabel("folder", "shm"))

But now this is impossible. One needs to rebuild FolderStatsCollector to allow it to scrape multiple folders. This also requires way to inject labels.

record MonitoredPathDef(root: Path, nameLabel: String)
class FolderStatsCollector(paths: List<MonitoredPathDef>) {...}

This responsibility is in wrong place....

See how some of collectors from Jvm monitoring package would get simplified; some of them take list of mbeans already. Rather than that, one could focus on sole collector logic - just get data from one source and expose it. And then its one-liner to apply it for all the mem pools, etc.

To sum up.
I really like changes in new version of client; using immutable structures is a good direction.
But adding some simple operators (like relabeling), and changing metrics aggregation semantics (from "fail on duplicates" to "merge if valid and possible") would turn this api into nice algebra, that makes lots of things possible and easy both to implement, and to reason about.

If anyone is interested, you can take a look at my another comment in #902. I believe such update (although harder to implement without breaking compatibility with existing code) would allow to simplify collector logic to single class. And this in turn makes it possible to easily create transformers, what now would be a bit of pain, cause one needs to handle Collector and MultiCollector separately.

marcin

@fstab
Copy link
Member

fstab commented Dec 2, 2023

Thanks a lot for your feedback Marcin, I appreciate your input.

I'm open to extending the API, but I'm still trying to understand if the current API can be used for what you want to achieve.

I like your FolderStats example, and as this is best discussed with code I made a little example:

static class FolderStats {

    final String path;

    public FolderStats(String path) {
        this.path = path;
    }

    public long getDiskUsageInBytes() {
        long diskUsage = 134; // TODO: run something like `du -b $path`.
        return diskUsage;
    }

    public int getNumberOfFiles() {
        int nFiles = 7; // TODO: run something like `find $path | wc -l`
        return nFiles;
    }

    public String getPath() {
        return path;
    }
}

public void registerFolderStats() {

    FolderStats usrStats = new FolderStats("/user");
    FolderStats homeStats = new FolderStats("/home");

    GaugeWithCallback.builder()
            .name("disk_usage_bytes")
            .unit(Unit.BYTES)
            .labelNames("path")
            .callback(callback -> {
                callback.call(usrStats.getDiskUsageInBytes(), usrStats.getPath());
                callback.call(homeStats.getDiskUsageInBytes(), homeStats.getPath());
            })
            .register();

    GaugeWithCallback.builder()
            .name("file_count")
            .labelNames("path")
            .callback(callback -> {
                callback.call(usrStats.getNumberOfFiles(), usrStats.getPath());
                callback.call(homeStats.getNumberOfFiles(), homeStats.getPath());
            })
            .register();
}

I like the separation of concerns here, as FolderStats (the "business logic") is independent of the metrics library.

What do you think?

@zorba128
Copy link
Author

zorba128 commented Dec 3, 2023

Ah, I see your point here. Thanks for comment.

Notice - the composition of multiple input paths is at the level of specific metric. It needs to be aware if one or more folders are to be monitored. And now imagine responsibility separation

  • admin is to decide what (which paths) to monitor
  • developer is to be decide how do to it (which metrics are to be exposed, how is all that wired)

If developer adds another metrics - you need another XXXWithCallback metrics defined.
If admin adds another folder - each of metrics defined above needs to be updated.
Of course it is possible to define monitored folders as collection, and call paths.forEach(path -> callback.call(...)).
But this code needs to be repeated for every metrics. So again developer needs to be aware who, and how will use his code.

Please compare it to (again, sorry for scala)

  class FolderStats(path: Path) {
    def getDiskUsageInBytes() = 123
    def getNumberOfFiles() = 234
  }

  // actually this could be simply `new UnionCollector(diskUsage, fileCount)`
  // UnionCollector is generic and can be part of library...
  class FolderStatsCollector(path: Path) extends MultiCollector {
    private val folderStats = new FolderStats(path)

    private val diskUsage = GaugeWithCallback.builder()
      .name("disk_usage_bytes")
      .callback(_.call(folderStats.getDiskUsageInBytes()))
      .build()

    private val fileCount = GaugeWithCallback.builder()
      .name("file_count")
      .callback(_.call(folderStats.getNumberOfFiles()))
      .build()

    override def collect(): MetricSnapshots = {
      MetricSnapshots.builder()
        .metricSnapshot(diskUsage.collect())
        .metricSnapshot(fileCount.collect())
        .build()
    }
  }

  registry.register(new FolderStatsCollector(Paths.get("/dev/shm")).withLabel("path", "shm"))
  registry.register(new FolderStatsCollector(Paths.get("/home")).withLabel("path", "home"))

This way:

  • if developer adds another metrics to monitored folder (say - most recent touch date of any of of files) - he adds it to FolderStatsCollector (and underlying FolderStats)
  • if admin adds another monitored folder - he just adds line where collectors are registered
  • if someone is to monitor just one folder - he uses it as any other ready-made collecor

This separation seems theoretical, but I currently face it. I have one project responsible for attaching to process and translating low level data to metrics. Then someone decided it makes sense to expose all that, for multiple services deployed on single host from within single metrics endpoint. This should be a change only to server that exposes metrics, while now, trying to use new client - it turns our every subproject needs to upgrade its low-level metrics collection code to be aware in the end someone will try to publish it merged...

I'm just finishing my custom implementation of CollectorRegistry (already implemented things like collector.appendLabels, merging MetricSnapshots builder, collector union, etc) and it will all work as I described, but thought maybe it would be worth share my thoughts.

Main problem, compared sample with folder monitoring - is that I don't know what metrics will be produced by low level adapters. I just want to pick them, and expose as metrics endpoints.

In the end - some cosmetics - to me callback is just some kind of specialized metric data point builder, so I'd see it more like

     .dataPointBuilder(builder -> {
                builder.add(usrStats.getNumberOfFiles(), usrStats.getPath());
                builder.add(homeStats.getNumberOfFiles(), homeStats.getPath());
            })

I'd pick one of dataPointBuilder/dataPointProvider/poller/... Matter of taste, but callback.call signals something is probably wrong with wording :).

marcin

@zorba128
Copy link
Author

zorba128 commented Dec 4, 2023

After spending few hours on my code, just take a look how I'd try to design it
(scala, but should get the point):

// part of library
// collector interface should have one and only method; remaining are just for convenience of implementers,
// and with modular design noone will ever need to implement it directly
trait Collector {
  def collect(nameFilter: String => Boolean, scrapeRequest: PrometheusScrapeRequest): MetricSnapshots
}

// part of library
object CompositeCollector {
  def builder(): Builder = new Builder
  class Builder {
    def add(c: Collector): Builder = ???
    def build(): Collector = ???
  }
}

// user code - business logic
class FolderStats(path: Path) {
  def getDiskUsageInBytes() = 123
  def getNumberOfFiles() = 234
}

// user code - wiring
object FolderStatsCollector {
  def build(path: Path): Collector = {
    val folderStats = new FolderStats(path)
    CompositeCollector.builder()
      .add(Gauge.builder()
        .name("disk_usage_bytes")
        .callback(_.add(folderStats.getDiskUsageInBytes()))
        .build())
      .add(Gauge.builder()
        .name("file_count")
        .callback(_.add(folderStats.getNumberOfFiles()))
        .build())
      .build()
  }
}

// user code - final consumer
object MyService {
  val registry = CompositeCollector.builder()
    .add(FolderStatsCollector.build(Paths.get("/dev/shm")).withLabel("path", "shm"))
    .add(FolderStatsCollector.build(Paths.get("/home")).withLabel("path", "home"))
    .add(JvmMetrics.builder().build())
    .build()
  PrometheusHttpServer.create(123, registry)
}

Please consider this code in relation to #902 #903 #904. I just wanted to sum this up somehow - I just started looking at v1.1, so it took me a while to figure out how to use it and what design it has.

@fstab
Copy link
Member

fstab commented Dec 5, 2023

Thanks a lot for your thoughts, highly appreciated. Sorry for not responding yet, but I put it on top of my todo list for tomorrow.

@fstab
Copy link
Member

fstab commented Dec 7, 2023

Thanks for your patience. I think I understand your approach.

One thing in the current implementation that prevents this is that metric names must be unique: You cannot register multiple collectors that produce the same metric name.

As always in software design, there's a trade-off: Checking for name conflicts early allows us to fail fast. In most cases we can fail already during registration and don't need to wait for the scrape. With your design, registering collectors that produce the same metric name should be allowed as long as those metrics produce different label values. The downside is that this does not allow a fail-fast approach on metric name conflicts.

As you already pointed out, the code sample above can be refactored to use a List of FolderStats to avoid hard-coding the folder stats into the metrics:

List<FolderStats> stats = new ArrayList<>();
stats.add(new FolderStats("/user"));
stats.add(new FolderStats("/home"));

GaugeWithCallback.builder()
        .name("file_count")
        .labelNames("path")
        .callback(callback -> stats.forEach(s -> callback.call(s.getDiskUsageInBytes(), s.getPath())))
        .register();

GaugeWithCallback.builder()
        .name("file_count")
        .labelNames("path")
        .callback(callback -> stats.forEach(s -> callback.call(s.getNumberOfFiles(), s.getPath())))
        .register();

Do you think it would help to make this easier? It would be good to have both, making sure that each metric name gets registered only once, but also making it easy to add new FolderStats objects.

@zorba128
Copy link
Author

zorba128 commented Dec 7, 2023

About the requirement of preventing to register collector producing given metrics two times - as friend of mine says - "what problem are you trying to solve with it?" Why do you think its needed to do it?

I think this comes from another questionable design decision - to expose global collector registry.
This makes problems like that appear - because the collector lifecycle is unclear.

And last example to prove this is wrong.

Say apart from JvmMetrics, you create sth like HttpMetrics - for tracking REST calls. It would contain
request_count (by method, request type, response status), request_time histogram, other useful utils.
So you publish your prometheus-metrics-instrumentation-http project so that everyone can use it.
With plugins to support OkHttp, native java http client, etc. It turns out really useful, people start using it in their projects.

And now someone develops "github-monitor" project, that somehow observes issue list. With rest calls, he wants to know what is going on - so he adds prometheus-metrics-instrumentation-http dependency, instantiates it, and viola - ready to graph what is going on.

And in the end someone else is developing his project to generate issue progress stats for management. Found ready-made github-monitor, added few calls to you company's internal api, and with few lines of code - he has solution working in few hours. Then he, found there is ready-made github-monitor-prometheus, so he enabled it. Then he wants also his company's api monitored - found prometheus-metrics-instrumentation-http (great, already in dependencies!), so wrote code like:

 new GithubMonitorPrometheus(...).register(registry)
 new MyProjectMetrics(...).register(registry) // which under the hood uses prometheus-metrics-instrumentation-http
 new PrometheusHttpServer(1234, registry)

And it crashes there is no way to make it work.

You know why you do not see this issue with prometheus-metrics-instrumentation-jvm? Cause by definition it makes no sense to have two instances of jvm metrics in single java virtual machine. Unless you remotely connect to other jvms...

But this does not hold for anything else.

m.

zorba128 pushed a commit to zorba128/prometheus_client_java that referenced this issue Dec 8, 2023
zorba128 added a commit to zorba128/prometheus_client_java that referenced this issue Dec 8, 2023
@ndimitry
Copy link

ndimitry commented May 21, 2024

@fstab at the moment all Callback metrics (GaugeWithCallback) are rather getting fixed values (an example in javadocs will collect memory usage once during creation of the Guage) and will return it always.

I think a callable with Functional Interface like this one would make more sense:

    @FunctionalInterface
    public interface Callback {
        void call(DoubleSupplier supplier, String... labelValues);
    }

Then collecting metrics would look like this:

    // ...
    callback.accept((value, labelValues) -> {
        dataPoints.add(new GaugeSnapshot.GaugeDataPointSnapshot(
                supplier.getAsDouble(), makeLabels(labelValues), null, 0L));
    });
    return new GaugeSnapshot(getMetadata(), dataPoints);

If you see this as a useful feature, I may add a pull request for this.
I will anyway implement such a thing for a project I'm working on. It would be a bit more readable then lambda inside a lambda (i.e. lambda inside the Callable) which is called in a lambda (in the collect() method call) : )

@fstab
Copy link
Member

fstab commented May 21, 2024

Callback metrics (GaugeWithCallback) are rather getting fixed values (an example in javadocs will collect memory usage once during creation of the Guage) and will return it always.

I think this is a misunderstanding. The example from the Javadoc will update the value every time metrics are scraped.

GaugeWithCallback.builder()
    .name("jvm_memory_bytes_used")
    .help("Used bytes of a given JVM memory area.")
    .unit(Unit.BYTES)
    .labelNames("area")
    .callback(callback -> {
        callback.call(memoryBean.getHeapMemoryUsage().getUsed(), "heap");
        callback.call(memoryBean.getNonHeapMemoryUsage().getUsed(), "nonheap");
    })
    .register();

The code is literally copy-and-paste from the JvmMemoryMetrics used in JvmMetrics.

If you want to try it, run the example in examples/example-exporter-httpserver/, and query http://localhost:9400/metrics?name[]=jvm_memory_used_bytes. You'll see different values every time you scrape (at least for nonheap, the heap doesn't change much).

@ndimitry
Copy link

@fstab you are totally right. Something clashed in my mind when I was writing my comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants