Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return an http error during scraping if metrics collide when escaped to underscores #1641

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

ywwg
Copy link
Member

@ywwg ywwg commented Oct 8, 2024

fixes #1638

This change prevents a possible issue when UTF-8 metrics are scraped by a system that does not support UTF-8. The metric names will be escaped, and if two metrics escape to the same legacy name, they will appear in the exposition "twice", causing collision problems. Therefore, during scraping we return an http error if two metrics escape to the same legacy string.

This change creates a new uint64 per metric and two new maps in the Registry. The new maps are only populated if the hash of the original metric data contains UTF-8 data. This means memory usage will be lower in the near term and will go up in the long term or for OTLP-heavy users. Eventually (i.e. Prom 4.0 or later) we could deprecate the compatibility modes with legacy systems to save this memory, but it's also not clear to me that the memory usage will be that high.

To avoid performance issues checking for collisions on every scrape, the actual collision detection is done at registration time but only reported at scrape time.

@ywwg ywwg changed the title WIP / Sacrificial: By default, return an error if metrics collide when escaped to underscores WIP / strawman: By default, return an error if metrics collide when escaped to underscores Oct 8, 2024
@ywwg
Copy link
Member Author

ywwg commented Oct 9, 2024

So the edge case here is:

  1. collision mode is Compat by default
  2. a UTF-8 metric is created in an init function
  3. the collision mode is flipped to UTF-8
  4. an identical UTF-8 metric is created

This should collide, but because the first metric's hash was Compat mode, it doesn't match the hash of the second metric, and we don't detect the collision.

I think this is another case where we can say "don't do that", but I'm not sure how best to define what "that" is. In general, I think users should not define metrics in init() functions. OR, we could say, don't define UTF-8 metrics in init functions.

@ywwg
Copy link
Member Author

ywwg commented Oct 9, 2024

Another option is to store hashes for both UTF-8 and compat. This adds 64bits per metric and another collectorsByID map (since we'll need to store both the compat and utf8 collector ids depending on the mode). don't love that either

@ywwg
Copy link
Member Author

ywwg commented Oct 9, 2024

I am leaning towards having a second map in the registerer that only checks for compat fqname+labelname collisions. It's a memory increase but it gets around all these issues.

@@ -140,15 +142,45 @@ func (v2) NewDesc(fqName, help string, variableLabels ConstrainableLabels, const
d.err = fmt.Errorf("duplicate label names in constant and variable labels for metric %q", fqName)
return d
}

d.id, d.dimHash = makeHashes(labelNames, labelValues, help, true)
d.compatID, d.compatDimHash = makeHashes(labelNames, labelValues, help, false)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this mutates labelNames so it's not pretty

uncheckedCollectors []Collector
pedanticChecksEnabled bool
utf8Collision bool
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we're adding a uint64, empty struct, and Collector interface obj for every metric registered. This could be a lot of memory with many millions of metrics. Perhaps we could only store the compat IDs if they are different from the default?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I have now implemented this)

@@ -391,6 +481,10 @@ func (r *Registry) Unregister(c Collector) bool {
defer r.mtx.Unlock()

delete(r.collectorsByID, collectorID)
if _, exists := r.collectorsByCompatID[compatID]; !exists {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is other cleanup needed

@ywwg
Copy link
Member Author

ywwg commented Oct 9, 2024

I think this new approach is a decent solution. We can flip UTF-8 mode on and off at will and we don't pay a huge memory or performance penalty (that I can see)

@ywwg ywwg requested a review from ArthurSens October 10, 2024 13:28
prometheus/registry_test.go Show resolved Hide resolved
@@ -57,6 +57,8 @@ type Desc struct {
// must be unique among all registered descriptors and can therefore be
// used as an identifier of the descriptor.
id uint64
// compatID is similar to id, but is a hash of all the relevant names escaped with underscores.
compatID uint64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I first saw this variable I thought it was a typo for compactID, but I see you're are using compact is other places as well.

What does compat means? 😅

Copy link
Member

@ArthurSens ArthurSens Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And still related to this variable, have we thought about using id to store the hash of the already normalized descriptor(instead of creating a new hash)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"compatible." as in, "compatible with legacy systems." I need to pick better names for sure

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have we thought about using id to store the hash of the already normalized descriptor?

do you mean only storing the normalized descriptor? We need both in case the user switches the mode to UTF-8 comparison with no legacy compatibility. Or do you mean something else?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I meant just storing the normalized descriptor. Good point that's it's possible to change the validation in-flight!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why someone would like to do that though 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the record (as discussed in slack), it's because metrics are often created in init() functions before the user has a chance to actually configure the correct escaping scheme. So we have to be ready for both scenarios.

@ywwg ywwg changed the title WIP / strawman: By default, return an error if metrics collide when escaped to underscores By default, return an error if metrics collide when escaped to underscores Oct 11, 2024
@ywwg ywwg force-pushed the owilliams/collide branch from b08de89 to 8e9ef8c Compare October 11, 2024 14:41
@ywwg ywwg marked this pull request as ready for review October 11, 2024 14:41
@ywwg ywwg force-pushed the owilliams/collide branch from 97cad93 to 118a3c1 Compare October 11, 2024 14:55
if hasEscapedCollisions {
switch contentType.ToEscapingScheme() {
case model.UnderscoreEscaping, model.DotsEscaping:
opts.ErrorLog.Println("error: one or more metrics collide when escaped")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need a better error message?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe something like

"One or more metrics collide when UTF-8 characters in name/label are translated to underscores"?

Copy link
Member

@ArthurSens ArthurSens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good!

Since we always allow registration regardless of collision, I would just move the test closer to the HTTP package. We can keep the same test cases you've added, but the test would look similar to this:

func TestEscapedCollisions(t *testing.T) {
	reg := prometheus.NewRegistry()
	reg.MustRegister(prometheus.NewCounter(prometheus.CounterOpts{
		Name: "test_metric",
		Help: "A test metric with underscores",
	}))
	reg.MustRegister(prometheus.NewCounter(prometheus.CounterOpts{
		Name: "test.metric",
		Help: "A test metric with dots",
	}))


	
	handler := HandlerFor(reg, HandlerOpts{})
	writer := httptest.NewRecorder()
	request, _ := http.NewRequest("GET", "/metrics", nil)
	request.Header.Add(acceptEncodingHeader, "the header that requires escaping")
	handler.ServeHTTP(writer, request)
	require.Equal(t, "some http error", writer.Code)

	request, _ := http.NewRequest("GET", "/metrics", nil)
	request.Header.Add(acceptEncodingHeader, "the header that allows utf8")
	handler.ServeHTTP(writer, request)
	require.Equal(t, "no erro error", writer.Code)
}

prometheus/registry_test.go Outdated Show resolved Hide resolved
prometheus/registry_test.go Outdated Show resolved Hide resolved
@ywwg
Copy link
Member Author

ywwg commented Oct 16, 2024

I would just move the test closer to the HTTP package.

I'd like to keep the current unit test since the implementation is a little weird, but I will add this too.

@ywwg
Copy link
Member Author

ywwg commented Oct 16, 2024

I'll squash this down one we're done with notes

Help: "A test metric with dots",
}))

handler := HandlerFor(reg, HandlerOpts{
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the sucky part, it's necessary to specify the registry here or else we don't detect the collision case (because Gatherer does not know about collisions)

@ywwg ywwg changed the title By default, return an error if metrics collide when escaped to underscores Return an http error during scraping if metrics collide when escaped to underscores Oct 16, 2024
@ywwg ywwg force-pushed the owilliams/collide branch 2 times, most recently from aad8c01 to 7f3763c Compare October 16, 2024 15:51
Copy link
Member

@ArthurSens ArthurSens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Just one comment about adding another dependency, sorry about that 😬

prometheus/registry.go Outdated Show resolved Hide resolved
Comment on lines 1342 to 1349
require.NoError(t, err)
err = reg.Register(tc.counterB())
if !tc.expectErr {
require.NoError(t, err)
} else {
require.Error(t, err)
}
require.Equal(t, tc.expectLegacyCollision, reg.HasEscapedCollision())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for creating the confusion here, that's my bad! Bartek made a good point for not using require here, would you mind reverting the change? 🙈

We always have problems when we introduce new dependencies, let's avoid it if we can

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverted

@ywwg ywwg force-pushed the owilliams/collide branch from 7f3763c to 42825b6 Compare October 17, 2024 15:20
Signed-off-by: Owen Williams <[email protected]>
@@ -134,6 +136,7 @@ func HandlerForTransactional(reg prometheus.TransactionalGatherer, opts HandlerO
}
}
}
hasEscapedCollisions = reg.HasEscapedCollision()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm... should I use the singular or the plural? :)

Help: "A test metric with dots",
}))

handler := HandlerFor(reg, HandlerOpts{})
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tada

@@ -194,6 +205,10 @@ func (gf GathererFunc) Gather() ([]*dto.MetricFamily, error) {
return gf()
}

func (gf GathererFunc) HasEscapedCollision() bool {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the most risky change. both mimir and avalanche use gathererfunc

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -57,6 +57,9 @@ type Desc struct {
// must be unique among all registered descriptors and can therefore be
// used as an identifier of the descriptor.
id uint64
// escapedID is similar to id, but with the metric and label names escaped
// with underscores.
escapedID uint64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use id for escaped ID?

}
d.id = xxh.Sum64()
d.escapedID = escapedXXH.Sum64()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, why not use id as escaped ID always?

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this!

I think I would love to understand this change more before committing here. Asked some likely stupid questions on the issue #1638 (comment)

We never did those kind of checks on HTTP handler only, only on registration layer, I wonder if it's worth to keep that. Having your app working fine and suddenly /metrics are erroring only because 2 series collide is bit unexpected (especially by default) and wasteful, scrape could just ignore one problematic metric.

If anything we could start with a special Gatherer that checks that, without changes to descriptor ever, and see who would actually use it (who would like to opt-in for extra safety here). I would also be happy to consider adding this logic on-registry check/error when escaped metric name is colliding OR only check escaped ones (is there a case when escaped don't collide, but non-escaped collide?)

Is it worth to double check those first @ArthurSens ?

@ywwg
Copy link
Member Author

ywwg commented Oct 18, 2024

We are going to put this on hold and have a meeting to discuss the various approaches and their pros and cons

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optionally return error if two metrics would collide under escaping schemes
3 participants