0.35: Panic with query mode distributed #7328

Open
jkroepke opened this issue May 2, 2024 · 1 comment

jkroepke commented May 2, 2024

Thanos, Prometheus and Golang version used: 0.35.0

Object Storage Provider: Azure Storage Account

What happened:

After enabling --query.mode=distributed, my querier gets a lot of panics. Removing --query.mode=distributed stops all panics.

What you expected to happen:

No panics

How to reproduce it (as minimally and precisely as possible):

At the moment, I'm unable to provide a minimal reproducible environment. However, according to the ruler logs, all queries like absent(up{job="kube-proxy"} == 1) (the job label can be any value) seem to be affected.

We are using stateless rulers and Thanos Receive; no sidecars.

Thanos querier arguments:


query
  --log.level=info
  --log.format=json
  --grpc-address=0.0.0.0:10901
  --http-address=0.0.0.0:10902
  --query.replica-label=replica
  --query.replica-label=prometheus_replica
  --query.replica-label=thanos_receive_replica
  --query.replica-label=thanos_ruler_replica
  --endpoint=opsstack-thanos-storegateway.opsstack.svc.cluster.local.:10901
  --endpoint=opsstack-thanos-receive.opsstack.svc.cluster.local.:10901
  --alert.query-url=http://opsstack-thanos-query.opsstack.svc.cluster.local:10902
  --enable-auto-gomemlimit
  --query.promql-engine=thanos
  --query.mode=distributed
  --query.auto-downsampling
  --query.default-tenant-id=opsstack
  --web.disable-cors
  --web.prefix-header=X-Forwarded-Prefix

Full logs to relevant components:


Ruler Logs

Just a few examples; they keep repeating:

{"caller":"rule.go:968","component":"rules","err":"rpc error: code = Internal desc = runtime error: index out of range [0] with length 0","level":"error","query":"absent(up{job=\"kube-proxy\"} == 1)","ts":"2024-05-02T13:03:09.954559928Z"}
{"caller":"rule.go:938","component":"rules","err":"read query instant response: perform POST request against http://opsstack-thanos-query.opsstack.svc.cluster.local:10902/api/v1/query: Post \"http://opsstack-thanos-query.opsstack.svc.cluster.local:10902/api/v1/query\": EOF","level":"error","query":"absent(up{job=\"apiserver\"} == 1)","ts":"2024-05-02T12:59:26.372712478Z"}

Querier Logs

https://gist.github.com/jkroepke/9fc58319bf819866138a8dae4f1c8d92

Anything else we need to know:

Environment:

  • OS (e.g. from /etc/os-release): Linux (official Thanos images)
  • Version: quay.io/thanos/thanos:v0.35.0
MichaHoffmann (Contributor) commented May 2, 2024

The issue here was that the querier was configured to point at store APIs rather than query APIs (we should probably guard against that better); this leads to the promql-engine distributing to zero other engines, which exposes a bug where we don't guard against that.

Edit: see https://cloud-native.slack.com/archives/CK5RSSC10/p1714654267669009
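
For illustration only, here is a minimal Go sketch of the kind of guard described above; it is not the actual Thanos/promql-engine code, and RemoteEngine and distributeQuery are hypothetical names. The point is that with zero remote query engines, indexing the engine list reproduces "index out of range [0] with length 0", whereas a length check turns it into a clear error.

```go
package main

import (
	"errors"
	"fmt"
)

// RemoteEngine is a hypothetical stand-in for a remote query engine
// discovered from an --endpoint that exposes the Query API.
type RemoteEngine struct{ addr string }

// distributeQuery sketches the guard: without it, engines[0] on an empty
// slice would panic with "index out of range [0] with length 0".
func distributeQuery(engines []RemoteEngine, query string) (string, error) {
	if len(engines) == 0 {
		// Guard (assumption): surface a clear error instead of panicking
		// when no remote Query APIs were discovered.
		return "", errors.New("distributed mode: no remote query APIs discovered; check that --endpoint targets queriers, not only store APIs")
	}
	return fmt.Sprintf("fan out %q to %d engines, starting with %s", query, len(engines), engines[0].addr), nil
}

func main() {
	// With only store-API endpoints (store gateway, receive), the remote
	// engine list is empty and the guard triggers instead of a panic.
	if _, err := distributeQuery(nil, `absent(up{job="kube-proxy"} == 1)`); err != nil {
		fmt.Println("error:", err)
	}
}
```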
