MaxTime is set to too large a number when doing larger latency requests? #7319

Open
wiardvanrij opened this issue May 1, 2024 · 1 comment

Comments

@wiardvanrij (Member) commented May 1, 2024

I'm having a vague problem, but I will try my best to explain what is happening.

From time to time, I'm getting this error:

rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: NO_ERROR

Basically I'm seeing a bunch of http2 connection resets.

When debugging, I'm using the Thanos Querier directly and I have been able to reproduce this behaviour under a very specific set of variables.

  1. This only seems to happen against one of our larger Prometheus instances.
  2. We have Prometheus instances across multiple regions and the Thanos Querier in a fixed region.
  3. When the Querier isn't in the same region as the largest Prometheus instance, it's 100% reproducible. When I moved the Querier to the same region as that Prometheus (but still talking to it via an ingress), it somehow magically isn't reproducible anymore.
  4. The range of the query seems to have an impact on it: a 2-hour query works fine, while 6 hours breaks.

I had been somewhat staring blindly at this, until eventually it hit me.

Looking deeper at a failing query, it actually shows more detail: the response code is a 422 Unprocessable Entity.

This I can explain from the error message.

Let me show you the query in question:

query: count(sum by (namespace, cluster) (kube_pod_info{})) by (cluster)
dedup: true
partial_response: false
storeMatch[]: {__address__="thanos-{redacted}.thanos.svc.cluster.local:10901"}
start: 1714550267.884
end: 1714564667.884
step: 57
max_source_resolution: 0s
engine: prometheus
explain: false
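
For reference, here is a minimal Go sketch of the same request issued against the Querier's Prometheus-compatible HTTP API; the base URL/port and the /api/v1/query_range path are assumptions on my side, while the parameters are copied verbatim from above:

package main

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
)

func main() {
    params := url.Values{}
    params.Set("query", `count(sum by (namespace, cluster) (kube_pod_info{})) by (cluster)`)
    params.Set("dedup", "true")
    params.Set("partial_response", "false")
    params.Add("storeMatch[]", `{__address__="thanos-{redacted}.thanos.svc.cluster.local:10901"}`)
    params.Set("start", "1714550267.884")
    params.Set("end", "1714564667.884")
    params.Set("step", "57")
    params.Set("max_source_resolution", "0s")
    params.Set("engine", "prometheus")

    // Placeholder address for the Thanos Querier HTTP endpoint.
    resp, err := http.Get("http://localhost:10902/api/v1/query_range?" + params.Encode())
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    body, _ := io.ReadAll(resp.Body)
    fmt.Println(resp.Status) // 422 Unprocessable Entity when the failure reproduces
    fmt.Println(string(body))
}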

Response:

{
    "status": "error",
    "errorType": "execution",
    "error": "expanding series: proxy Series(): rpc error: code = Aborted desc = receive series from Addr: thanos-{redacted}.thanos.svc.cluster.local:10901 LabelSets: {} MinTime: 1714543044113 MaxTime: 9223372036854775807: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: NO_ERROR"
}

Which makes sense, as the MaxTime is 9223372036854775807, which isn't a legit value. However, as you can see, I make the query with an end: 1714564667.884 argument.
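
As a side note, that MaxTime is not some random garbage value; it is exactly 2^63-1, i.e. Go's math.MaxInt64, which a quick one-liner confirms:

package main

import (
    "fmt"
    "math"
)

func main() {
    // 2^63-1 == 9223372036854775807, the MaxTime from the error above.
    fmt.Println(int64(math.MaxInt64))
}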

This only seems to happen when the query reaches a certain response latency. If I only query the last 2 hours instead of 6, it's "fine". Likewise, if I move my Querier closer to the Prometheus in question, I can query the last 6 hours just fine.

Somehow, and I don't know why or how, when the response latency becomes too long, the MaxTime magically ends up as an invalid value?

@MichaHoffmann (Contributor) commented May 1, 2024

It must be something other than the maxtime of the range request being messed with. That logging is the String representation of the endpoint ref and corresponds to the time range of the store, see

func (er *endpointRef) String() string {

Do you see errors on the matching store?

Edit: The max time probably comes from here if it's a sidecar:

m.UpdateTimestamps(minTime, math.MaxInt64)

I think the logic is that sidecars will not get filtered out if you request the latest data.
