There is a label matching problem in the following PromQL queries:
sum by (instance) (rate(node_netstat_Tcp_RetransSegs{%(clusterLabel)s="$cluster"}[%(grafanaIntervalVar)s]) / rate(node_netstat_Tcp_OutSegs{%(clusterLabel)s="$cluster"}[%(grafanaIntervalVar)s]) * on (%(clusterLabel)s,namespace,pod) kube_pod_info{host_network="false"})
sum by (instance) (rate(node_netstat_TcpExt_TCPSynRetrans{%(clusterLabel)s="$cluster"}[%(grafanaIntervalVar)s]) / rate(node_netstat_Tcp_RetransSegs{%(clusterLabel)s="$cluster"}[%(grafanaIntervalVar)s]) * on (%(clusterLabel)s,namespace,pod) kube_pod_info{host_network="false"})
Problem
The `sum by (instance)` aggregation is applied to the ratio calculations, but the `kube_pod_info` metric is not aggregated on the `instance` label, and `instance` does not appear in the `on` clause. As a result, the join is performed only on the `cluster`, `namespace`, and `pod` labels, which can lead to incorrect comparisons or misleading results.
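For readability, here is the first query with the mixin templating replaced by concrete label names and a fixed range (purely illustrative, not taken from the dashboard), annotated with the labels each side typically carries:

```
sum by (instance) (
    rate(node_netstat_Tcp_RetransSegs{cluster="$cluster"}[5m])  # node_exporter: has instance/job, no namespace/pod
  /
    rate(node_netstat_Tcp_OutSegs{cluster="$cluster"}[5m])      # node_exporter: has instance/job, no namespace/pod
  * on (cluster, namespace, pod)                                # instance is not part of the match
    kube_pod_info{host_network="false"}                         # kube-state-metrics: has namespace/pod/node
)
```

Note that `instance`, the label the outer `sum` groups by, never participates in the vector matching.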
Steps to Reproduce
Execute the above PromQL queries in Prometheus:
sum by (instance) (rate(node_netstat_Tcp_RetransSegs{%(clusterLabel)s="$cluster"}[%(grafanaIntervalVar)s]) / rate(node_netstat_Tcp_OutSegs{%(clusterLabel)s="$cluster"}[%(grafanaIntervalVar)s]) * on (%(clusterLabel)s,namespace,pod) kube_pod_info{host_network="false"})
sum by (instance) (rate(node_netstat_TcpExt_TCPSynRetrans{%(clusterLabel)s="$cluster"}[%(grafanaIntervalVar)s]) / rate(node_netstat_Tcp_RetransSegs{%(clusterLabel)s="$cluster"}[%(grafanaIntervalVar)s]) * on (%(clusterLabel)s,namespace,pod) kube_pod_info{host_network="false"})
Observe the rendered queries (with the Grafana interval variable expanded to 1m0s):
sum by (instance) (rate(node_netstat_Tcp_RetransSegs{%(clusterLabel)s="$cluster"}[1m0s]) / rate(node_netstat_Tcp_OutSegs{%(clusterLabel)s="$cluster"}[1m0s]) * on (%(clusterLabel)s,namespace,pod) kube_pod_info{host_network="false"})
sum by (instance) (rate(node_netstat_TcpExt_TCPSynRetrans{%(clusterLabel)s="$cluster"}[1m0s]) / rate(node_netstat_Tcp_RetransSegs{%(clusterLabel)s="$cluster"}[1m0s]) * on (%(clusterLabel)s,namespace,pod) kube_pod_info{host_network="false"})
Notice that the results are incorrect due to the mismatch in labels used in the join operation.
Expected Behavior
The queries should correctly aggregate and join the metrics on the appropriate labels to avoid misleading results.
Possible Solution
To fix the issue, ensure that the `instance` label is considered in the join operation, or modify the aggregation strategy. One possible solution might be to aggregate `kube_pod_info` on the `instance` label as well.
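A minimal sketch of what that suggestion could look like, assuming `kube_pod_info` exposed an `instance` label matching the node_exporter one (an assumption a later comment points out often does not hold):

```
sum by (instance) (
    rate(node_netstat_Tcp_RetransSegs{%(clusterLabel)s="$cluster"}[%(grafanaIntervalVar)s])
  /
    rate(node_netstat_Tcp_OutSegs{%(clusterLabel)s="$cluster"}[%(grafanaIntervalVar)s])
  # assumption: kube_pod_info carries an instance label identical to the node_exporter one
  * on (%(clusterLabel)s,instance) group_left ()
    max by (%(clusterLabel)s,instance) (kube_pod_info{host_network="false"})
)
```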
Changes
The label matching problem was introduced in the following commit: d63872c
* on (%(clusterLabel)s,namespace,pod) group_left ()
  topk by (%(clusterLabel)s,namespace,pod) (
    1,
    max by (%(clusterLabel)s,namespace,pod) (kube_pod_info{host_network="false"})
  )
)
The `instance` label on node_exporter metrics is usually the node, but I think the `instance` label on kube-state-metrics is often the kube-state-metrics scrape target, and the node name is usually on the `node` label, at least in my scrape config. So `instance` != `instance`, at least for me.
`node_netstat_Tcp_RetransSegs` doesn't have `namespace` or `pod` labels, so joining on these labels doesn't make sense here - only `cluster` will match in the end.
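Roughly, with a typical scrape configuration the two sides look like this (label values are made up for illustration):

```
# node_exporter: instance is the node's scrape address; no namespace or pod labels
node_netstat_Tcp_RetransSegs{cluster="prod", instance="node-1:9100", job="node-exporter"}

# kube-state-metrics: instance is the kube-state-metrics endpoint; the node name lives on the node label
kube_pod_info{cluster="prod", instance="kube-state-metrics:8080", namespace="default",
              pod="example-pod", node="node-1", host_network="false"}
```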
The original problem behind the commit mentioned in the issue description (#972) seems valid, but the current solution of joining against `kube_pod_info` is likely to work only with queries based on cAdvisor/kube-state-metrics, not with node_exporter.
My suggestion would be to remove these new joins from the node exporter queries and keep them where they make sense.
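A sketch of what that would look like for the first panel, simply dropping the join (illustrative only, not a tested patch):

```
sum by (instance) (
    rate(node_netstat_Tcp_RetransSegs{%(clusterLabel)s="$cluster"}[%(grafanaIntervalVar)s])
  /
    rate(node_netstat_Tcp_OutSegs{%(clusterLabel)s="$cluster"}[%(grafanaIntervalVar)s])
)
```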