Add timeout to endpointset metric collector #7336

Merged
merged 1 commit into thanos-io:main on May 21, 2024

Conversation

fpetkovski (Contributor)

We have seen deadlocks with endpoint discovery caused by the metric collector hanging and not releasing the store labels lock. This causes the endpoint update to hang, which also makes all endpoint readers hang on acquiring a read lock for the resolved endpoints slice.

This commit makes sure the Collect method on the metrics collector has a built-in timeout to guard against cases where an upstream call never reads from the collection channel.

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.
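
A minimal sketch of the guarded send described above, extracted into a helper for illustration (the helper name, the boolean return, and the exact timeout handling are assumptions, not the Thanos code):

```go
package endpointset

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// sendWithTimeout tries to deliver a metric to the collection channel, but
// gives up after the given timeout. If the receiver never reads from ch, the
// send no longer blocks forever, so any lock held by the collector can be
// released.
func sendWithTimeout(ch chan<- prometheus.Metric, m prometheus.Metric, timeout time.Duration) bool {
	select {
	case ch <- m:
		return true
	case <-time.After(timeout):
		return false
	}
}
```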

@@ -277,7 +279,12 @@ func (c *endpointSetNodeCollector) Collect(ch chan<- prometheus.Metric) {
 				lbls = append(lbls, storeTypeStr)
 			}
 		}
-		ch <- prometheus.MustNewConstMetric(c.connectionsDesc, prometheus.GaugeValue, float64(occurrences), lbls...)
+		select {
+		case ch <- prometheus.MustNewConstMetric(c.connectionsDesc, prometheus.GaugeValue, float64(occurrences), lbls...):
Member:
Why can this take even 1 second? 🤔

fpetkovski (Contributor, author):

I don't yet understand why the send would block forever here; the only explanation is that the caller does not read from the channel. The 1s timeout is arbitrary, though; it's there to make sure we do not end up in a deadlock.

We noticed queriers getting stuck occasionally and the goroutine profile showed that the mutex in this function was constantly locked.
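
For reference, a minimal way to dump the goroutine profile mentioned above from inside a Go process (in a deployed querier the same data is typically scraped from the /debug/pprof HTTP endpoint; this standalone sketch uses runtime/pprof directly):

```go
package main

import (
	"os"
	"runtime/pprof"
)

// Write full stack traces for every goroutine (debug level 2) to stderr,
// which is what reveals goroutines parked on a channel send or waiting
// for a mutex.
func main() {
	if err := pprof.Lookup("goroutine").WriteTo(os.Stderr, 2); err != nil {
		panic(err)
	}
}
```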

Member:

I think something else might be at play here, e.g., this goroutine is starved of CPU resources. Do you have a lot of goroutines running when this happens?

fpetkovski (Contributor, author) commented on May 9, 2024:

There is only one goroutine here, but new goroutines are constantly piling up here. That is because the mutex for the collector is locked, which causes the call here to hang, so this mutex is also locked.

I do not yet understand why the channel reader would stop or block though.
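
A self-contained sketch of the pile-up described in this thread (all names are illustrative): a collector that holds a mutex while sending on a channel nobody reads from never returns, and every other goroutine that needs the same mutex queues up behind it.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type endpointSet struct {
	mtx sync.Mutex
}

// collect simulates a metric collector that holds the set's mutex while
// sending on the collection channel. If nobody reads from ch, the send
// blocks forever and the mutex is never released.
func (e *endpointSet) collect(ch chan<- int) {
	e.mtx.Lock()
	defer e.mtx.Unlock()
	ch <- 42
}

// update simulates the endpoint update path, which needs the same mutex.
func (e *endpointSet) update() {
	e.mtx.Lock()
	defer e.mtx.Unlock()
}

func main() {
	e := &endpointSet{}
	ch := make(chan int) // unbuffered and never read: a receiver that went away

	go e.collect(ch)                  // grabs the mutex, then hangs on the send
	time.Sleep(50 * time.Millisecond) // let the collector acquire the lock first

	done := make(chan struct{})
	go func() {
		e.update() // queues up behind the stuck collector
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("update finished")
	case <-time.After(500 * time.Millisecond):
		fmt.Println("update is stuck behind the collector")
	}
}
```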

MichaHoffmann (Contributor) left a comment:

It looks like a bug in the Prometheus client if this is a new thing. The fix is practical, though; let's merge and debug what happens later.

fpetkovski (Contributor, author):

I will merge this for now and we can dig into the root cause once we get some time.

@fpetkovski fpetkovski merged commit 9db31c2 into thanos-io:main May 21, 2024
20 checks passed