Rate query failing from Grafana #7249

Open
Aransh opened this issue Apr 1, 2024 · 2 comments

Comments


Aransh commented Apr 1, 2024

Thanos, Prometheus and Golang version used:
Thanos 0.34.0
Prometheus 2.51.0

Object Storage Provider: Linode

What happened:
I have Grafana deployed to my k8s cluster as part of the kube-prometheus-stack Helm chart.
Its main datasource is my Thanos Querier, which in turn is connected to several Thanos sidecars.

One of our performance engineers brought to my attention an issue that shows up specifically in Grafana, with the following query (note this uses custom metrics from our apps):
sum (irate(starlord_http_requests_total{container="starlord-cyber-feed",namespace="app", cluster="qa-1"}[1m])) by (cluster)

The problem is:
In the local Prometheus UI and in the Thanos Querier UI, this query works without problems.
In Grafana (in a dashboard or in Explore), as soon as we increase the time range beyond 12h, the graph flattens to 0…
Since the query works fine in both Prometheus and Thanos Querier, I am led to believe the issue must be with Grafana
(Thanos Querier is its datasource, so why would it return a different response?)

Some example screenshots:
Here is the query set to 1 hour, in both Grafana and Thanos Querier. Both look fine:
[Screenshot: Thanos-1]
[Screenshot: Grafana-1]

Now, here it is in both, set to 24 hours:
[Screenshot: Thanos-2]
[Screenshot: Grafana-2]

I’ve tried debugging this and haven’t found much. What I tried:

  • Tried using “rate” instead of “irate”, same issue
  • Tried changing the datasource’s “scrape interval” to 30s (from the default 15s), same issue
  • Tried updating Prometheus + Grafana + Thanos to the latest versions

The only lead I found is this Grafana log line, with HTTP status 400, matching my query:
logger=context userId=3 orgId=1 uname=<my-email> t=2024-04-01T16:07:47.268619112Z level=info msg="Request Completed" method=POST path=/api/ds/query status=400 remote_addr=10.2.1.129 time_ms=16 duration=16.278802ms size=13513 referer="https://<my-domain>/explore?orgId=1&panes=%7B%22r5m%22%3A%7B%22datasource%22%3A%22P5DCFC7561CCDE821%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22expr%22%3A%22sum+%28rate%28starlord_http_requests_total%7Bcontainer%3D%5C%22starlord-cyber-feed%5C%22%2Cnamespace%3D%5C%22app%5C%22%2C+cluster%3D%5C%22qa-1%5C%22%7D%5B1m%5D%29%29+by+%28cluster%29%22%2C%22range%22%3Atrue%2C%22instant%22%3Atrue%2C%22datasource%22%3A%7B%22type%22%3A%22prometheus%22%2C%22uid%22%3A%22P5DCFC7561CCDE821%22%7D%2C%22editorMode%22%3A%22code%22%2C%22legendFormat%22%3A%22__auto%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%22now-24h%22%2C%22to%22%3A%22now%22%7D%7D%7D&schemaVersion=1" handler=/api/ds/query status_source=downstream
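The referer in that log line carries the Explore state URL-encoded, which makes it hard to see what was actually requested. A minimal sketch for decoding it, using only the Python standard library; the query string is copied from the log line above, and the function name `decode_explore_panes` is made up for illustration:

```python
import json
from urllib.parse import parse_qs

# Query string of the referer in the log line above (everything after "explore?").
REFERER_QUERY = (
    "orgId=1&panes=%7B%22r5m%22%3A%7B%22datasource%22%3A%22P5DCFC7561CCDE821%22%2C"
    "%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22expr%22%3A%22"
    "sum+%28rate%28starlord_http_requests_total%7Bcontainer%3D%5C%22starlord-cyber-feed%5C%22"
    "%2Cnamespace%3D%5C%22app%5C%22%2C+cluster%3D%5C%22qa-1%5C%22%7D%5B1m%5D%29%29+by+%28cluster%29%22"
    "%2C%22range%22%3Atrue%2C%22instant%22%3Atrue"
    "%2C%22datasource%22%3A%7B%22type%22%3A%22prometheus%22%2C%22uid%22%3A%22P5DCFC7561CCDE821%22%7D"
    "%2C%22editorMode%22%3A%22code%22%2C%22legendFormat%22%3A%22__auto%22%7D%5D"
    "%2C%22range%22%3A%7B%22from%22%3A%22now-24h%22%2C%22to%22%3A%22now%22%7D%7D%7D"
    "&schemaVersion=1"
)


def decode_explore_panes(query_string: str) -> dict:
    """Return the Explore 'panes' state embedded in a Grafana referer query string."""
    params = parse_qs(query_string)  # percent-decodes and turns '+' into spaces
    return json.loads(params["panes"][0])


panes = decode_explore_panes(REFERER_QUERY)
for pane_id, pane in panes.items():
    print("pane:", pane_id, "time range:", pane["range"])
    for q in pane["queries"]:
        print("  expr:", q["expr"])
        print("  range:", q["range"], "instant:", q["instant"])
```

Decoded, the failing request is the “rate” variant of the query over now-24h to now, with both the range and instant query types enabled in Explore.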

So, possibly Thanos Querier is failing to handle the query from Grafana for some reason? Thanos itself isn't showing anything in its logs about this...
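One way to narrow this down without Grafana in the path is to send the same range query straight to the Querier's Prometheus-compatible HTTP API (`/api/v1/query_range`) over 24 hours with a few different step values, and see whether any of them returns an error or all-zero data. The sketch below is only that, a sketch: `THANOS_URL` and the step values are placeholders I made up, not values taken from this setup or from what Grafana actually sends.

```python
import json
import time
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: placeholder address, replace with your Thanos Querier HTTP endpoint.
THANOS_URL = "http://thanos-querier.example:9090"

QUERY = (
    'sum (irate(starlord_http_requests_total{container="starlord-cyber-feed",'
    'namespace="app", cluster="qa-1"}[1m])) by (cluster)'
)


def build_query_range_url(base_url, query, start, end, step_seconds):
    """Build a Prometheus HTTP API range-query URL (start/end are Unix seconds)."""
    params = urlencode(
        {"query": query, "start": start, "end": end, "step": step_seconds}
    )
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def summarize(base_url, query, hours, step_seconds):
    """Run one range query and print how many points came back and their maximum."""
    end = int(time.time())
    start = end - hours * 3600
    url = build_query_range_url(base_url, query, start, end, step_seconds)
    try:
        with urlopen(url, timeout=30) as resp:
            body = json.load(resp)
    except HTTPError as err:
        # A 4xx here would point at the Querier rejecting the request itself.
        print(f"step={step_seconds}s -> HTTP {err.code}: {err.read().decode()[:300]}")
        return
    for series in body["data"]["result"]:
        values = [float(v) for _, v in series["values"]]
        print(f"step={step_seconds}s -> {len(values)} points, max={max(values)}")


if __name__ == "__main__":
    # Hypothetical step values; compare small steps with larger ones.
    for step in (15, 60, 300):
        summarize(THANOS_URL, QUERY, hours=24, step_seconds=step)
```

If every step returns sensible data here, the difference is more likely in how Grafana builds the request; if one of them fails or comes back flat, that response is worth attaching to this issue.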

What you expected to happen:
I expected the same query to return the same results whether it is run in Thanos Querier directly or in Grafana (which queries Thanos Querier).

How to reproduce it (as minimally and precisely as possible):
Not sure how to reproduce this without our specific metrics, but the general setup is kube-prometheus-stack + Thanos Querier + Thanos sidecar(s).


Aransh commented Apr 1, 2024

Got a response from Grafana support; the issue was on their side.

Aransh closed this as completed Apr 1, 2024

Aransh commented Apr 1, 2024

Never mind, Grafana support actually said this might indeed be a Thanos bug. I would appreciate it if you could take a look.
https://community.grafana.com/t/rate-query-failing-only-on-grafana/118325/4

Aransh reopened this Apr 1, 2024