
Service Telemetry Prometheus metrics server does not properly report error #11417

Open
omzmarlon opened this issue Oct 11, 2024 · 1 comment · May be fixed by #12111
Labels: bug Something isn't working

omzmarlon commented Oct 11, 2024

Describe the bug
While using the Prometheus server for service.telemetry.metrics, if the Prometheus server cannot start (e.g. the port it needs is already occupied), the OTel framework seems to swallow that error and does not properly report it, so the user/caller cannot do anything about it.

Steps to reproduce
It's pretty easy to reproduce with a very simple setup.

I am using the tag v0.111.0 (the latest available at the time of writing this issue).

First I built an otelcol binary using the OTel Builder tool. For the builder config I used the config in the repo here, with some small adjustments:

dist:
  name: otelcol-custom
  description: Local OpenTelemetry Collector binary
  output_path: /tmp/dist
exporters:
  - gomod: go.opentelemetry.io/collector/exporter/nopexporter latest

receivers:
  - gomod: go.opentelemetry.io/collector/receiver/nopreceiver latest
  
replaces:
  - go.opentelemetry.io/collector => <path to my local cloned latest of this repo>/open-telemetry/opentelemetry-collector/
  # ...
  - go.opentelemetry.io/collector/service => <path to my local cloned latest of this repo>/opentelemetry-collector/service

The config used to run the otelcol binary is this:

receivers:
  nop:
exporters:
  nop:
service:
  pipelines:
    logs:
      receivers:
        - nop
      exporters:
        - nop
  telemetry:
    metrics:
      level: Basic
      address: 8889

Now if I run 2 instances of the otelcol binary using this same config (i.e. both will try to bind to port 8889 for metrics), the logs of both instances are exactly the same:

2024-10-11T14:59:21.948-0700    info    service@v0.111.0/service.go:135 Setting up own telemetry...
2024-10-11T14:59:21.948-0700    info    telemetry/metrics.go:73 Serving metrics {"address": "localhost:8888", "metrics level": "Basic"}
2024-10-11T14:59:21.949-0700    info    service@v0.111.0/service.go:207 Starting otelcol-custom...      {"Version": "1.0.0", "NumCPU": 16}
2024-10-11T14:59:21.949-0700    info    extensions/extensions.go:39     Starting extensions...
2024-10-11T14:59:21.949-0700    info    service@v0.111.0/service.go:230 Everything is ready. Begin running and processing data.

Neither instance's logs say anything about port 8889 already being in use.

Since I was using replace in the otelcol builder config, I was able to manually insert some fmt.Printf calls into the OTel source code to print the error.

I basically changed this line to this:

		if serveErr := server.ListenAndServe(); serveErr != nil {
			fmt.Printf("%s\n", serveErr) // debug print added locally
			if !errors.Is(serveErr, http.ErrServerClosed) {
				select {
				case asyncErrorChannel <- serveErr:
				case <-time.After(1 * time.Second):
				}
			}
		}
Now if I rebuild the otelcol binary and run 2 instances of it again, I do see the "address already in use" error from one of the two instances:

2024-10-11T14:59:31.857-0700    info    service@v0.111.0/service.go:135 Setting up own telemetry...
2024-10-11T14:59:31.857-0700    info    telemetry/metrics.go:73 Serving metrics {"address": "localhost:8888", "metrics level": "Basic"}
2024-10-11T14:59:31.858-0700    info    service@v0.111.0/service.go:207 Starting otelcol-custom...      {"Version": "1.0.0", "NumCPU": 16}
2024-10-11T14:59:31.858-0700    info    extensions/extensions.go:39     Starting extensions...
2024-10-11T14:59:31.858-0700    info    service@v0.111.0/service.go:230 Everything is ready. Begin running and processing data.
listen tcp 127.0.0.1:8889: bind: address already in use

I think the cause of this issue is the following:

When OTel tries to report the metrics server error to the asyncErrorChannel, the channel is actually not set and is nil. I also manually added some print statements and verified that the asyncErrorChannel param passed to here is indeed nil. A send on a nil channel blocks forever, so the case asyncErrorChannel <- serveErr: branch can never be taken; the select only ever falls through to the 1-second timeout and the error is dropped.
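
For illustration, here is a minimal, self-contained Go sketch (not collector code) of why the error disappears: sending on a nil channel blocks forever, so the select always takes the timeout branch.

package main

import (
	"errors"
	"fmt"
	"time"
)

func main() {
	// nil channel, as when AsyncErrorChannel is never set in telemetry.Settings
	var asyncErrorChannel chan error

	serveErr := errors.New("listen tcp 127.0.0.1:8889: bind: address already in use")

	// A send on a nil channel blocks forever, so this select can only ever take
	// the timeout branch: the error is dropped without any log or return value.
	select {
	case asyncErrorChannel <- serveErr:
		fmt.Println("error reported:", serveErr)
	case <-time.After(1 * time.Second):
		fmt.Println("error silently dropped:", serveErr)
	}
}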

I think the channel is nil because here, when telemetry.Settings is set up, the AsyncErrorChannel field is never set.

We actually didn't have this problem when we were using an earlier version of OTel (e.g. v0.108.1) because it did set up the async error channel properly here.

This is quite problematic: if an OTel pipeline cannot bind to a port to report metrics, we have no proper way to know (other than noticing repeated failures once Prometheus actually starts scraping), so we cannot, for example, retry on a different port; see the sketch below.
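
To make that concrete, here is a purely hypothetical caller-side sketch of what we would like to do if the bind error were surfaced; runCollectorWithMetricsPort is a made-up placeholder, not a real collector API.

package collectorboot

import (
	"context"
	"errors"
	"fmt"
	"syscall"
)

// runCollectorWithMetricsPort is a made-up placeholder for "build a config with
// service::telemetry::metrics on the given port and run the collector",
// returning any startup error. It is not a real collector API.
func runCollectorWithMetricsPort(ctx context.Context, port int) error {
	// ... would run the collector and return any startup error ...
	return nil
}

// startWithPortFallback tries a list of metrics ports, moving on whenever the
// bind error (which today is swallowed) is reported back to us.
func startWithPortFallback(ctx context.Context, ports []int) error {
	var lastErr error
	for _, port := range ports {
		lastErr = runCollectorWithMetricsPort(ctx, port)
		if lastErr == nil {
			return nil
		}
		// On Unix-like systems a bind failure unwraps to syscall.EADDRINUSE.
		if errors.Is(lastErr, syscall.EADDRINUSE) {
			continue
		}
		return lastErr
	}
	return fmt.Errorf("no free metrics port: %w", lastErr)
}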

What did you expect to see?
OTel should either report the metrics server error through the asyncErrorChannel or return the error so that it can be detected at start time: Start(ctx context.Context, host Host) error
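
As a rough illustration of the second option (a sketch with hypothetical names, not the collector's actual code), the listener could be bound synchronously so that "address already in use" is returned to the caller at start time, while only the Serve loop runs in the background:

package telemetry // hypothetical package, for illustration only

import (
	"errors"
	"net"
	"net/http"
	"time"
)

// startMetricsServer binds the listener synchronously so bind errors surface
// immediately to the caller; only the Serve loop runs in a goroutine, and any
// later errors are reported best-effort on the async channel.
func startMetricsServer(addr string, handler http.Handler, asyncErrCh chan error) (*http.Server, error) {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return nil, err // e.g. "bind: address already in use" is returned from Start
	}
	srv := &http.Server{Handler: handler}
	go func() {
		if serveErr := srv.Serve(ln); serveErr != nil && !errors.Is(serveErr, http.ErrServerClosed) {
			select {
			case asyncErrCh <- serveErr:
			case <-time.After(1 * time.Second):
			}
		}
	}()
	return srv, nil
}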

What did you see instead?
The metrics server error is completely swallowed and there is not even a log about it.

What version did you use?
v0.111.0 (see above).

What config did you use?
See the builder and collector configs above.
Environment
OS: macOS and Linux
Compiler: Go 1.22

Additional context
n/a

omzmarlon added the bug label on Oct 11, 2024
bogdandrutu (Member) commented:

@omzmarlon first things first, please change to use readers. See the deprecation notice in the v0.111.0 release notes: https://github.com/open-telemetry/opentelemetry-collector/releases/tag/v0.111.0
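
For anyone hitting the same deprecation, a readers-based equivalent of the address setting would look roughly like the following (field names follow the declarative telemetry configuration schema referenced in the release notes; treat this as a sketch, not a verified config for that release):

service:
  telemetry:
    metrics:
      level: basic
      readers:
        - pull:
            exporter:
              prometheus:
                host: localhost
                port: 8889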
