
USE method dashboards are broken #30

Open
gouthamve opened this issue Jun 15, 2018 · 27 comments

@gouthamve
Contributor

Some recording rules are missing; we should either add them or remove the dashboards entirely. WDYT @tomwilkie @brancz?

@brancz
Member

brancz commented Jun 15, 2018

Can you show what you mean? For me everything works as far as I can tell.

@gouthamve
Contributor Author

gouthamve commented Jun 15, 2018

This is on GKE, with the following ksonnet config:

prometheus {
  _config+:: {
    namespace: 'default',

    // Kubernetes mixin
    kubeApiserverSelector: 'job="default/kubernetes"',
    jobs: {
      KubeAPI: $._config.kubeApiserverSelector,
    },
  },
}

USE - Cluster:
[screenshot]

USE - Node:
[screenshot]

Broken recording rule:
[screenshot]

@brancz
Member

brancz commented Jun 15, 2018

For me all of these work. I think you'll need to dig deeper into the recording rules and figure out which labeling is off or which metrics you are missing; most likely it's the same problem for all of them. My first guess would be the node-exporter -> node name mapping that is done through kube-state-metrics metrics. If I recall correctly, @tomwilkie mentioned that for the kausal ksonnet Prometheus setup he had to set the podLabel config to instance.

@conradj87

We saw this when using node_exporter version v0.16.0. There are breaking changes to many metric names: https://github.com/prometheus/node_exporter/blob/master/CHANGELOG.md#0160--2018-05-15

Reverting node_exporter to v0.15.2 fixed it for us.
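
For example, the CPU metric these recording rules select was renamed in that release (one of the changes in the linked changelog):

# node_exporter <= v0.15.x
node_cpu{job="node-exporter",mode="idle"}

# node_exporter >= v0.16.0
node_cpu_seconds_total{job="node-exporter",mode="idle"}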

@tomwilkie
Member

tomwilkie commented Jun 19, 2018 via email

@adamdecaf
Contributor

adamdecaf commented Jun 20, 2018

I'm having the same problem. Looking at one empty dashboard, this recording rule doesn't return any data. There are others, but they look like the same issue.

record: node:node_cpu_utilisation:avg1m
expr: 1
  - avg by(node) (rate(node_cpu{job="node-exporter",mode="idle"}[1m])
  * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:)

A quick change to the window ([5m]) shows data.

1 - avg by(node) (rate(node_cpu{job="node-exporter",mode="idle"}[5m]) * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:)

This works, but the recording rules would need their names changed when the metrics are fixed.
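
For reference, a sketch of the adjusted rule (node:node_cpu_utilisation:avg5m is a hypothetical name chosen to match the widened window):

record: node:node_cpu_utilisation:avg5m
expr: 1
  - avg by(node) (rate(node_cpu{job="node-exporter",mode="idle"}[5m])
  * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:)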

@serathius
Contributor

I also have this problem, but changing the range window to [5m] didn't help.

@brancz
Member

brancz commented Jun 21, 2018

@serathius do the individual metrics return results for you?

node_namespace_pod:kube_pod_info:

and

node_cpu{job="node-exporter",mode="idle"}

@serathius
Contributor

Yes. In #38 I also checked whether I'm missing any labels.

@brancz
Member

brancz commented Jun 21, 2018

Could you share some results for each of them so we can see whether the join should be possible? This might be due to the labeling of your time series.

@serathius
Contributor

Sure
For node_cpu{job="node-exporter",mode="idle"}

node_cpu{beta_kubernetes_io_arch="amd64",beta_kubernetes_io_os="linux",cpu="cpu0",instance="2fzqr",job="node-exporter",kubernetes_io_hostname="2fzqr",mode="idle",node="2fzqr"} | 68689.88
node_cpu{beta_kubernetes_io_arch="amd64",beta_kubernetes_io_os="linux",cpu="cpu0",instance="2vvkw",job="node-exporter",kubernetes_io_hostname="2vvkw",mode="idle",node="2vvkw"} | 69287.56
node_cpu{beta_kubernetes_io_arch="amd64",beta_kubernetes_io_os="linux",cpu="cpu0",instance="7z47q",job="node-exporter",kubernetes_io_hostname="7z47q",mode="idle",node="7z47q"} | 69089.03
node_cpu{beta_kubernetes_io_arch="amd64",beta_kubernetes_io_os="linux",cpu="cpu0",instance="vd69z",job="node-exporter",kubernetes_io_hostname="vd69z",mode="idle",node="vd69z"} | 64547.3
node_cpu{beta_kubernetes_io_arch="amd64",beta_kubernetes_io_os="linux",cpu="cpu1",instance="2fzqr",job="node-exporter",kubernetes_io_hostname="2fzqr",mode="idle",node="2fzqr"} | 68588.2
node_cpu{beta_kubernetes_io_arch="amd64",beta_kubernetes_io_os="linux",cpu="cpu1",instance="2vvkw",job="node-exporter",kubernetes_io_hostname="2vvkw",mode="idle",node="2vvkw"} | 69252.69
node_cpu{beta_kubernetes_io_arch="amd64",beta_kubernetes_io_os="linux",cpu="cpu1",instance="7z47q",job="node-exporter",kubernetes_io_hostname="7z47q",mode="idle",node="7z47q"} | 69022.58
node_cpu{beta_kubernetes_io_arch="amd64",beta_kubernetes_io_os="linux",cpu="cpu1",instance="vd69z",job="node-exporter",kubernetes_io_hostname="vd69z",mode="idle",node="vd69z"} | 64524.6

For node_namespace_pod:kube_pod_info:

node_namespace_pod:kube_pod_info:{namespace="kube-system",node="2fzqr",pod="alertmanager-0"} | 1
node_namespace_pod:kube_pod_info:{namespace="kube-system",node="2fzqr",pod="kube-proxy-4pgmp"} | 1
node_namespace_pod:kube_pod_info:{namespace="kube-system",node="2fzqr",pod="kube-state-metrics-57bbfb8457-wmmqc"} | 1
node_namespace_pod:kube_pod_info:{namespace="kube-system",node="2fzqr",pod="node-exporter-6x8j9"} | 1
node_namespace_pod:kube_pod_info:{namespace="kube-system",node="2fzqr",pod="prometheus-1"} | 1
node_namespace_pod:kube_pod_info:{namespace="kube-system",node="2vvkw",pod="alertmanager-2"} | 1
node_namespace_pod:kube_pod_info:{namespace="kube-system",node="2vvkw",pod="grafana-0"} | 1
node_namespace_pod:kube_pod_info:{namespace="kube-system",node="2vvkw",pod="kube-proxy-4j9hc"} | 1
node_namespace_pod:kube_pod_info:{namespace="kube-system",node="2vvkw",pod="node-exporter-mzn2t"} | 1
node_namespace_pod:kube_pod_info:{namespace="kube-system",node="7z47q",pod="alertmanager-1"} | 1
node_namespace_pod:kube_pod_info:{namespace="kube-system",node="7z47q",pod="kube-proxy-x8qtk"} | 1
node_namespace_pod:kube_pod_info:{namespace="kube-system",node="7z47q",pod="node-exporter-cv7k8"} | 1
node_namespace_pod:kube_pod_info:{namespace="kube-system",node="7z47q",pod="prometheus-0"} | 1
node_namespace_pod:kube_pod_info:{namespace="kube-system",node="vd69z",pod="kube-dns-86f4d74b45-2f8m5"} | 1
node_namespace_pod:kube_pod_info:{namespace="kube-system",node="vd69z",pod="kube-proxy-tggdk"} | 1
node_namespace_pod:kube_pod_info:{namespace="kube-system",node="vd69z",pod="node-exporter-7kmk7"} | 1

@brancz
Member

brancz commented Jun 21, 2018

Your node_cpu time series need the namespace and pod labels, as that's what's being joined on.

@serathius
Contributor

Isn't node_cpu coming from node-exporter, so it would carry the node-exporter's own namespace and pod, which have nothing to do with the result?

@brancz
Member

brancz commented Jun 21, 2018

The point is that the kube_pod_info metric has further information about Pods, and the node-exporter is a pod. In order to find which entry of the kube_pod_info metric is the correct one, we need to match them somehow; this is where this monitoring mixin defaults to using the pod and namespace labels (note that this is configurable if you wish to make a different choice). Furthermore, a target's labels should identify it, so namespace and pod labels are an appropriate choice for Kubernetes.

I recommend changing the relabeling rules of your node-exporter scrape job to relabel the namespace and pod labels onto those targets.
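
A minimal sketch of such a scrape job, assuming kubernetes_sd with the pod role and an app=node-exporter pod label (adjust the keep rule to whatever labels your daemonset actually uses):

scrape_configs:
  - job_name: node-exporter
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # keep only the node-exporter pods (the app label here is an assumption)
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: node-exporter
        action: keep
      # relabel the identifying Kubernetes metadata onto the targets
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod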

@adamdecaf
Contributor

I'm working around this another way. I have federation set up across a few Prometheus instances and was scraping every 60s, which I think was too slow for [1m] ranges. After lowering scrape_interval to 30s I'm seeing graphs appear.
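
That is roughly this one-line change in the global config (a sketch; a per-job scrape_interval on the node-exporter job would work too):

global:
  scrape_interval: 30s  # was 60s; rate(...[1m]) needs at least two samples inside the window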

@serathius
Contributor

Migrating the node_exporter job from node to pod and labeling with pod and namespace helped in my case. Thx

@ravishivt

ravishivt commented Aug 23, 2018

I'm having a similar issue where node_cpu_seconds_total isn't getting the pod or namespace labels. It doesn't even have a node label. Output of node_cpu_seconds_total{mode="idle"}:

node_cpu_seconds_total{app="prometheus",chart="prometheus-7.0.2",component="node-exporter",cpu="0",heritage="Tiller",instance="10.80.20.46:9100",job="kubernetes-service-endpoints",kubernetes_name="get-prometheus-node-exporter",kubernetes_namespace="default",mode="idle",release="get-prometheus"} | 423673.44
node_cpu_seconds_total{app="prometheus",chart="prometheus-7.0.2",component="node-exporter",cpu="0",heritage="Tiller",instance="10.80.20.52:9100",job="kubernetes-service-endpoints",kubernetes_name="get-prometheus-node-exporter",kubernetes_namespace="default",mode="idle",release="get-prometheus"} | 417097.16
node_cpu_seconds_total{app="prometheus",chart="prometheus-7.0.2",component="node-exporter",cpu="0",heritage="Tiller",instance="10.80.20.54:9100",job="kubernetes-service-endpoints",kubernetes_name="get-prometheus-node-exporter",kubernetes_namespace="default",mode="idle",release="get-prometheus"} | 430181.32
node_cpu_seconds_total{app="prometheus",chart="prometheus-7.0.2",component="node-exporter",cpu="0",heritage="Tiller",instance="10.80.20.55:9100",job="kubernetes-service-endpoints",kubernetes_name="get-prometheus-node-exporter",kubernetes_namespace="default",mode="idle",release="get-prometheus"} | 428958.78
node_cpu_seconds_total{app="prometheus",chart="prometheus-7.0.2",component="node-exporter",cpu="0",heritage="Tiller",instance="10.80.20.59:9100",job="kubernetes-service-endpoints",kubernetes_name="get-prometheus-node-exporter",kubernetes_namespace="default",mode="idle",release="get-prometheus"} | 429040.3
node_cpu_seconds_total{app="prometheus",chart="prometheus-7.0.2",component="node-exporter",cpu="0",heritage="Tiller",instance="10.80.20.60:9100",job="kubernetes-service-endpoints",kubernetes_name="get-prometheus-node-exporter",kubernetes_namespace="default",mode="idle",release="get-prometheus"} | 423115.3
node_cpu_seconds_total{app="prometheus",chart="prometheus-7.0.2",component="node-exporter",cpu="1",heritage="Tiller",instance="10.80.20.46:9100",job="kubernetes-service-endpoints",kubernetes_name="get-prometheus-node-exporter",kubernetes_namespace="default",mode="idle",release="get-prometheus"} | 424035.26
node_cpu_seconds_total{app="prometheus",chart="prometheus-7.0.2",component="node-exporter",cpu="1",heritage="Tiller",instance="10.80.20.52:9100",job="kubernetes-service-endpoints",kubernetes_name="get-prometheus-node-exporter",kubernetes_namespace="default",mode="idle",release="get-prometheus"} | 417366.19
node_cpu_seconds_total{app="prometheus",chart="prometheus-7.0.2",component="node-exporter",cpu="1",heritage="Tiller",instance="10.80.20.54:9100",job="kubernetes-service-endpoints",kubernetes_name="get-prometheus-node-exporter",kubernetes_namespace="default",mode="idle",release="get-prometheus"} | 430547.33
node_cpu_seconds_total{app="prometheus",chart="prometheus-7.0.2",component="node-exporter",cpu="1",heritage="Tiller",instance="10.80.20.55:9100",job="kubernetes-service-endpoints",kubernetes_name="get-prometheus-node-exporter",kubernetes_namespace="default",mode="idle",release="get-prometheus"} | 429587.41
node_cpu_seconds_total{app="prometheus",chart="prometheus-7.0.2",component="node-exporter",cpu="1",heritage="Tiller",instance="10.80.20.59:9100",job="kubernetes-service-endpoints",kubernetes_name="get-prometheus-node-exporter",kubernetes_namespace="default",mode="idle",release="get-prometheus"} | 429598.8
node_cpu_seconds_total{app="prometheus",chart="prometheus-7.0.2",component="node-exporter",cpu="1",heritage="Tiller",instance="10.80.20.60:9100",job="kubernetes-service-endpoints",kubernetes_name="get-prometheus-node-exporter",kubernetes_namespace="default",mode="idle",release="get-prometheus"} | 423558.67

I deployed prometheus server (+ kube state metrics + node exporter + alertmanager) through the prometheus helm chart using the chart's default values, including the chart's default scrape_configs. What would I need to change to make the custom rules and thus the dashboards work? Sorry if this is obvious, I'm new to prometheus.

Some additional info:

  • Helm installed node-exporter 0.16.0, so I'm using the metric name changes from "update the node exporter metrics" #65. Most dashboards are working properly.
  • I changed all the job selectors to match the job names that Helm deployed.
  • I lowered scrape_interval from 1m to 30s to fix the rate queries.

@ravishivt

I figured out how to get these labels added, thanks to the helpful Prometheus service-discovery status page in the UI. Below is a diff of what I changed in the helm chart's scrape_configs. With this config change and all the other changes in my previous comment, all dashboards are working properly.

       - job_name: 'kubernetes-service-endpoints'
+        honor_labels: true

         kubernetes_sd_configs:
           - role: endpoints

         relabel_configs:
           - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
             action: keep
             regex: true
           - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
             action: replace
             target_label: __scheme__
             regex: (https?)
           - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
             action: replace
             target_label: __metrics_path__
             regex: (.+)
           - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
             action: replace
             target_label: __address__
             regex: ([^:]+)(?::\d+)?;(\d+)
             replacement: $1:$2
           - action: labelmap
             regex: __meta_kubernetes_service_label_(.+)
           - source_labels: [__meta_kubernetes_namespace]
             action: replace
-            target_label: kubernetes_namespace
+            target_label: namespace
           - source_labels: [__meta_kubernetes_service_name]
             action: replace
-            target_label: kubernetes_name
+            target_label: service
+          - source_labels: [__meta_kubernetes_pod_name]
+            action: replace
+            target_label: pod
+          - source_labels: [__meta_kubernetes_pod_node_name]
+            action: replace
+            target_label: node

@brancz
Member

brancz commented Aug 25, 2018

Generally speaking, I recommend not adding non-identifying information to a target's label-set and instead adding this information at query time. That's also the approach this mixin largely takes; it makes queries, and thus dashboards and alerting rules, a lot more reusable and not as tightly coupled to the individual configuration of Prometheus.

@ravishivt

@brancz Are you saying there is a better way to solve the problem? So many things in this mixin seem to rely on having namespace, pod, and node labels for the following join:

on (namespace, pod) group_left(node)
        node_namespace_pod:kube_pod_info:

If these labels aren't being added at the Prometheus configuration level, I don't know of a way to add them at "query time".

@brancz
Member

brancz commented Sep 5, 2018

The namespace and pod labels are expected to be on the target, as that's how it is identified. The node label is then joined onto the metric with the snippet you shared, since the node label exists on the kube_pod_info metric.
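
To illustrate with made-up series (all label values below are hypothetical), the join matches on namespace/pod and copies node from kube_pod_info onto the result:

  node_cpu_seconds_total{namespace="monitoring",pod="node-exporter-abc12",cpu="0",mode="idle"}
* on(namespace, pod) group_left(node)
  node_namespace_pod:kube_pod_info:{namespace="monitoring",pod="node-exporter-abc12",node="worker-1"}

=> {namespace="monitoring",pod="node-exporter-abc12",cpu="0",mode="idle",node="worker-1"}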

@tbarrella

In case it helps, I was experiencing the same issue, but it turned out to be because I was using an installation from (a random commit on) master instead of a particular release. When I checked out and applied v0.28.0, the metric collection and dashboards were fixed.

@serathius
Contributor

@brancz Why do we need those joins?

For calculating "node:node_num_cpu:sum" we join "node_cpu_seconds_total" with "node_namespace_pod:kube_pod_info:".

I'm bad at PromQL, but my understanding is that the only change the join makes is that we only use node_cpu_seconds_total values whose pod and namespace labels match existing pods.

The node-exporter metric is already filtered by nodeExporterSelector, so I don't understand the benefit of this join. It also prevents scraping node_exporter from outside the cluster or outside Kubernetes (e.g. as a systemd service).

I would prefer not to have my node_exporter metrics labeled with pod and namespace. My main problem is that all dashboard queries need to be rewritten to remove those labels, because every time we redeploy node_exporter (update, config change) I get different time series.

@brancz
Member

brancz commented Feb 20, 2019

It's being joined in order to reliably get the node label. Node-exporter exposes metrics both about itself and about the node, so it's correct that the metrics are labelled with the new pod, but for the actual node metrics we should throw away the namespace/pod labels after we've joined. We can do this with an aggregation function; then the resulting time series should be stable.
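
In other words, something of this shape (a sketch using the post-0.16.0 metric name; the exact expressions in the mixin may differ):

avg by(node) (
  rate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[5m])
  * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:
)

Because the result is keyed only by node, redeploying the node-exporter pods does not create new series for the recorded metric.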

@serathius
Contributor

serathius commented Feb 20, 2019

Shouldn't node_exporter then have a separate port for its own metrics, like kube-state-metrics does?

We don't have any dashboards or alerts on node_exporter's internal metrics. Should we care about them being properly labeled? For me this is a tradeoff where I would prefer simpler node metrics over correctly labeled node_exporter internal metrics.

@brancz
Member

brancz commented Feb 21, 2019

I do agree with your first statement, but there are various tradeoffs at work. For example, node-exporter, as opposed to kube-state-metrics, runs on every node, so on very large Kubernetes clusters a separate port would double the number of scrape requests for node metrics.

While there are no dashboards or alerts, it's still very valuable information that has helped us numerous times in detecting CPU/memory (mainly CPU) issues and leaks. Having all the labels is the common denominator, so while I understand your point, I think the trade-off we have currently chosen is the more appropriate and exact one.


This issue has not had any activity in the past 30 days, so the stale label has been added to it.

  • The stale label will be removed if there is new activity
  • The issue will be closed in 7 days if there is no new activity
  • Add the keepalive label to exempt this issue from the stale check action

Thank you for your contributions!

@github-actions bot added the stale label Dec 12, 2024