MaxTime is set to too large a number when doing larger latency requests? #7319

Open
wiardvanrij opened this issue May 1, 2024 · 1 comment

Comments

@wiardvanrij (Member) commented May 1, 2024

I'm having a vague problem, but I will try my best to explain what is happening.

From time to time, I'm getting this error:

rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: NO_ERROR

Basically I'm seeing a bunch of http2 connection resets.

When debugging, I'm using the Thanos Querier directly and I have been able to reproduce this behaviour under a very specific set of variables.

  1. This only seems to happen against one of our larger Prometheus instances.
  2. We have Prometheus instances across multiple regions and the Thanos Querier in a fixed region.
  3. When the Querier isn't in the same region as the largest Prometheus instance, it's 100% reproducible. When I moved the Querier to the same region as that Prometheus (but still talking to it via an ingress), it somehow magically isn't reproducible anymore.
  4. The range of the query seems to have an impact on it: a 2-hour query works fine, while 6 hours breaks.

I had been somewhat staring blindly at this, until eventually it hit me.

Looking deeper at a failing query, it actually shows more detail: the response code is a 422 Unprocessable Entity.

This I can explain from the error message.

Let me show you the query in question:

query: count(sum by (namespace, cluster) (kube_pod_info{})) by (cluster)
dedup: true
partial_response: false
storeMatch[]: {__address__="thanos-{redacted}.thanos.svc.cluster.local:10901"}
start: 1714550267.884
end: 1714564667.884
step: 57
max_source_resolution: 0s
engine: prometheus
explain: false
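
For reference, here is a minimal Go sketch of the same request issued against the Querier's Prometheus-compatible HTTP API; the base URL/port and the /api/v1/query_range path are assumptions on my side, while the parameters are copied verbatim from above:

package main

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
)

func main() {
    params := url.Values{}
    params.Set("query", `count(sum by (namespace, cluster) (kube_pod_info{})) by (cluster)`)
    params.Set("dedup", "true")
    params.Set("partial_response", "false")
    params.Add("storeMatch[]", `{__address__="thanos-{redacted}.thanos.svc.cluster.local:10901"}`)
    params.Set("start", "1714550267.884")
    params.Set("end", "1714564667.884")
    params.Set("step", "57")
    params.Set("max_source_resolution", "0s")
    params.Set("engine", "prometheus")

    // Placeholder address for the Thanos Querier HTTP endpoint.
    resp, err := http.Get("http://localhost:10902/api/v1/query_range?" + params.Encode())
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    body, _ := io.ReadAll(resp.Body)
    fmt.Println(resp.Status) // 422 Unprocessable Entity when the failure reproduces
    fmt.Println(string(body))
}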

Response:

{
    "status": "error",
    "errorType": "execution",
    "error": "expanding series: proxy Series(): rpc error: code = Aborted desc = receive series from Addr: thanos-{redacted}.thanos.svc.cluster.local:10901 LabelSets: {} MinTime: 1714543044113 MaxTime: 9223372036854775807: rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: NO_ERROR"
}

Which makes sense, as the MaxTime is 9223372036854775807, which isn't a legit value. However, as you can see, I make the query with an end: 1714564667.884 argument.
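
As a side note, that MaxTime is not some random garbage value; it is exactly 2^63-1, i.e. Go's math.MaxInt64, which a quick one-liner confirms:

package main

import (
    "fmt"
    "math"
)

func main() {
    // 2^63-1 == 9223372036854775807, the MaxTime from the error above.
    fmt.Println(int64(math.MaxInt64))
}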

This only seems to happen when the query reaches a certain response latency. If I only query the last 2 hours instead of 6, it's "fine". Likewise, if I move my Querier closer to the Prometheus in question, I can query the last 6 hours just fine.

Somehow, and I don't know why or how, when the response latency becomes too long, the MaxTime magically ends up as an invalid value?

@MichaHoffmann (Contributor) commented May 1, 2024

It must be something other than the maxtime of the range request being messed with. That logging is the String representation of the endpoint ref and corresponds to the time range of the store, see

func (er *endpointRef) String() string {

Do you see errors on the matching store?

Edit: The max time probably comes from here if it's a sidecar:

m.UpdateTimestamps(minTime, math.MaxInt64)

I think the logic is that sidecars will not get filtered out if you request the latest data.
