Grafana only shows raw data from Thanos #7296

Open
stalemate3 opened this issue Apr 22, 2024 · 3 comments

Comments

@stalemate3

Thanos, Prometheus, Grafana and Golang version used:

Thanos: Bitnami Thanos helm chart version 12.23.2 (application version: 0.34.0)
Prometheus: Kube-prometheus-stack helm chart version 56.4.0 (application version 2.49.1)
Grafana: Bitnami Grafana operator helm chart version 3.5.14 (application version 10.2.3)
Golang: go1.21.6

Object Storage Provider:

AWS S3

What happened:

I have Thanos installed on 3 AWS EKS clusters, and I recently discovered an issue on all of them: when I query data from Thanos in Grafana, it only shows the raw data and not the downsampled data.

The configurations and versions are all the same on all 3 k8s clusters.
I'm using the default retention resolutions:

compactor.retentionResolutionRaw 30d
compactor.retentionResolution5m 30d
compactor.retentionResolution1h 10y
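
If it matters, my understanding is that these chart values end up as the following retention flags on the Compactor container. This is just a sketch derived from the values above; I haven't verified the exact args the Bitnami chart renders:

      - args:
        - compact
        # raw and 5m-resolution blocks kept for 30 days
        - --retention.resolution-raw=30d
        - --retention.resolution-5m=30d
        # 1h-resolution (downsampled) blocks kept for 10 years
        - --retention.resolution-1h=10y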

But here's what I see on Grafana:
[screenshot: Grafana panel (chrome_ISsDv6JyHD)]

I'm fairly sure it isn't using the downsampled data, because last month I discovered an issue with the Compactor: it didn't compact the raw data because the default PVC size (8Gi) was not enough for it. While it was in that broken state, Thanos only had the raw data, and Grafana showed all of it, unlike in the picture above. After increasing the PVC size for the Compactor, it compacted almost 1 year of data successfully, and it looks fine to me on the Bucketweb:
[screenshot: Bucketweb block view (chrome_Xf6CAIK9Ng)]

From these two pictures it is clear that the earliest data point in Grafana matches the Start Time of the raw data.

What you expected to happen:

To see the 1h auto-downsampled data in Grafana instead of the raw data, which is only retained for 30 days by default.

Noteworthy information:

TBH I'm not sure whether this is a Thanos or a Grafana issue, but given that the Grafana dashboards worked perfectly fine with the raw data and not with the downsampled data, my best guess is that the issue is on the Thanos side. I'm happy to be proven wrong here; this has already taken more of my time than it should have.

Full logs to relevant components:

Thanos Query config

      - args:
        - query
        - --log.level=info
        - --log.format=logfmt
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --query.replica-label=replica
        - --endpoint=dnssrv+_grpc._tcp.prometheus-operated.monitoring-prometheus-ns.svc.cluster.local
        - --endpoint=dnssrv+_grpc._tcp.monitoring-thanos-storegateway.monitoring-prometheus-ns.svc.cluster.local
        - --web.route-prefix=/thanos
        - --web.external-prefix=/thanos

Thanos Query Logs

ts=2024-04-22T14:23:14.41442275Z caller=options.go:26 level=info protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
ts=2024-04-22T14:23:14.414991696Z caller=query.go:813 level=info msg="starting query node"
ts=2024-04-22T14:23:14.415078037Z caller=intrumentation.go:75 level=info msg="changing probe status" status=healthy
ts=2024-04-22T14:23:14.415117759Z caller=http.go:73 level=info service=http/server component=query msg="listening for requests and metrics" address=0.0.0.0:10902
ts=2024-04-22T14:23:14.415253245Z caller=intrumentation.go:56 level=info msg="changing probe status" status=ready
ts=2024-04-22T14:23:14.415381003Z caller=grpc.go:131 level=info service=gRPC/server component=query msg="listening for serving gRPC" address=0.0.0.0:10901
ts=2024-04-22T14:23:14.415390033Z caller=tls_config.go:274 level=info service=http/server component=query msg="Listening on" address=[::]:10902
ts=2024-04-22T14:23:14.415418956Z caller=tls_config.go:277 level=info service=http/server component=query msg="TLS is disabled." http2=false address=[::]:10902
ts=2024-04-22T14:23:19.427087236Z caller=endpointset.go:425 level=info component=endpointset msg="adding new sidecar with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI]" address=REMOVED extLset="{prometheus=\"monitoring-prometheus-ns/monitoring-prometheus-prometheus\", prometheus_replica=\"prometheus-monitoring-prometheus-prometheus-0\"}"
ts=2024-04-22T14:23:19.427133639Z caller=endpointset.go:425 level=info component=endpointset msg="adding new store with [storeEndpoints]" address=REMOVED extLset="{customer=\"REMOVED\", prometheus=\"monitoring-prometheus-ns/monitoring-prometheus-prometheus\", prometheus_replica=\"prometheus-monitoring-prometheus-prometheus-0\"},{prometheus=\"monitoring-prometheus-ns/monitoring-prometheus-prometheus\", prometheus_replica=\"prometheus-monitoring-prometheus-prometheus-0\"}"

Please let me know if you need any other logs, config, or other useful information.

@douglascamata
Contributor

@stalemate3 you are telling Grafana that you want the minimum step to be 1 hour. This means one datapoint per hour at minimum... and at this step size Thanos decided to use raw data to answer the query. By default, Thanos does not answer queries by mixing data from different downsampling levels -- only a single level is used per query. That is why you don't see the downsampled data there.

You can do two things:

  1. Add the CLI arg --query-range.request-downsampled=true to your Query Frontend, if you run it (see the sketch below). This makes Thanos query all the downsampling levels, going up in resolution one by one, to check if there's data.
  2. Remove or change that min step in your query to try to hit a specific downsampling level.

Personally I would go with option 1.
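
Something roughly like this on the Query Frontend container (just a sketch; keep whatever other args your deployment already has):

      - args:
        - query-frontend
        # also ask for downsampled data, walking up the resolution
        # levels one by one when a level has no data
        - --query-range.request-downsampled=true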

@stalemate3
Author

Sorry for the late reply, and thanks for the suggestions; there is some progress with my issue. I've tried both of them: the first one doesn't seem to change anything on the Grafana side when I redeploy the Query Frontend with that arg. As for the second one, that min step was only there out of desperation, trying random steps to magically fix my problem, sorry about that.

Based on another issue here, I set --query.auto-downsampling on Query, and as I played around with the min step I set it to 5h, and voilà:
[screenshot: Grafana panel with a 5h min step (chrome_8UwxcQI6Ee)]
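
For reference, the only change compared to the Query config in my original post is the extra flag at the end (sketched here, the other args are unchanged):

      - args:
        - query
        # ... same args as in the original post ...
        # let Thanos pick the downsampling level automatically based on the query step
        - --query.auto-downsampling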

Anything below that gives back the same data as in my original post. I'm not sure why 5h is the magic spot, but now there are gaps in the data, and if I set it above 5h the gaps widen. Any ideas on how to fix these gaps?

Thanks for the ideas, I really appreciate it!

@MichaHoffmann
Contributor

I think there is a heuristic somewhere that the max resolution is step/5 or something like that! A 5h step would then be answered from the 1h downsampled data.
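
If that heuristic really is max source resolution = step / 5, the numbers would line up: a 5h min step gives 5h / 5 = 1h, exactly the 1h downsampling level, while anything below 5h resolves to a finer level that, with the 30d retention above, presumably only holds about a month of data.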
