Issue Description
There is a label matching problem in the following PromQL queries:
sum by (instance) (rate(node_netstat_Tcp_RetransSegs{%(clusterLabel)s="$cluster"}[%(grafanaIntervalVar)s]) / rate(node_netstat_Tcp_OutSegs{%(clusterLabel)s="$cluster"}[%(grafanaIntervalVar)s]) * on (%(clusterLabel)s,namespace,pod) kube_pod_info{host_network="false"})
sum by (instance) (rate(node_netstat_TcpExt_TCPSynRetrans{%(clusterLabel)s="$cluster"}[%(grafanaIntervalVar)s]) / rate(node_netstat_Tcp_RetransSegs{%(clusterLabel)s="$cluster"}[%(grafanaIntervalVar)s]) * on (%(clusterLabel)s,namespace,pod) kube_pod_info{host_network="false"})
Problem
The sum by (instance) aggregation is applied to the ratio calculations, but the kube_pod_info metric is not aggregated on the instance label, and instance does not appear in the on clause. As a result, the join operation is performed only on the cluster, namespace, and pod labels, which might lead to incorrect comparisons or misleading results.
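To make the mismatch concrete, here is a minimal Python sketch of how a one-to-one match with on (...) behaves. In PromQL, the result of such a match carries only the labels listed in on, so instance is no longer present by the time sum by (instance) runs. The series, label values, and helper names (match_on, sum_by) are hypothetical and exist only for illustration; it is also an assumption here that the node_netstat series carry namespace and pod labels at all.

```python
from collections import defaultdict


def match_on(lhs, rhs, on):
    """Model `lhs * on(...) rhs` as a one-to-one match.

    Series pair up when their `on` labels agree, and the result keeps
    only the `on` labels. Assumes at most one rhs series per key.
    """
    def key(labels):
        return tuple((name, labels.get(name, "")) for name in on)

    rhs_by_key = {key(labels): value for labels, value in rhs}
    out = []
    for labels, value in lhs:
        k = key(labels)
        if k in rhs_by_key:
            out.append((dict(k), value * rhs_by_key[k]))
    return out


def sum_by(series, by):
    """Model `sum by (...)`; a missing label is treated as the empty string."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[tuple((name, labels.get(name, "")) for name in by)] += value
    return dict(totals)


# Hypothetical retransmission ratios, one per node.
ratio = [
    ({"cluster": "c1", "namespace": "monitoring", "pod": "exporter-a", "instance": "node-a"}, 0.125),
    ({"cluster": "c1", "namespace": "monitoring", "pod": "exporter-b", "instance": "node-b"}, 0.25),
]
# Hypothetical kube_pod_info series (value is always 1).
pod_info = [
    ({"cluster": "c1", "namespace": "monitoring", "pod": "exporter-a", "host_network": "false"}, 1.0),
    ({"cluster": "c1", "namespace": "monitoring", "pod": "exporter-b", "host_network": "false"}, 1.0),
]

joined = match_on(ratio, pod_info, on=("cluster", "namespace", "pod"))
result = sum_by(joined, by=("instance",))
print(result)  # {(('instance', ''),): 0.375}
```

Both nodes collapse into a single series with an empty instance label, so the per-instance breakdown the outer aggregation asks for is lost.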
Steps to Reproduce
- Execute the above PromQL queries in Prometheus:
sum by (instance) (rate(node_netstat_Tcp_RetransSegs{%(clusterLabel)s="$cluster"}[%(grafanaIntervalVar)s]) / rate(node_netstat_Tcp_OutSegs{%(clusterLabel)s="$cluster"}[%(grafanaIntervalVar)s]) * on (%(clusterLabel)s,namespace,pod) kube_pod_info{host_network="false"})
sum by (instance) (rate(node_netstat_TcpExt_TCPSynRetrans{%(clusterLabel)s="$cluster"}[%(grafanaIntervalVar)s]) / rate(node_netstat_Tcp_RetransSegs{%(clusterLabel)s="$cluster"}[%(grafanaIntervalVar)s]) * on (%(clusterLabel)s,namespace,pod) kube_pod_info{host_network="false"})
- Observe the results, which are shown as:
sum by (instance) (rate(node_netstat_Tcp_RetransSegs{%(clusterLabel)s="$cluster"}[1m0s]) / rate(node_netstat_Tcp_OutSegs{%(clusterLabel)s="$cluster"}[1m0s]) * on (%(clusterLabel)s,namespace,pod) kube_pod_info{host_network="false"})
sum by (instance) (rate(node_netstat_TcpExt_TCPSynRetrans{%(clusterLabel)s="$cluster"}[1m0s]) / rate(node_netstat_Tcp_RetransSegs{%(clusterLabel)s="$cluster"}[1m0s]) * on (%(clusterLabel)s,namespace,pod) kube_pod_info{host_network="false"})
- Notice that the results are incorrect due to the mismatch in labels used in the join operation.
Expected Behavior
The queries should correctly aggregate and join the metrics on the appropriate labels to avoid misleading results.
Possible Solution
To fix the issue, ensure that the instance label is considered in the join operation or modify the aggregation strategy. One possible solution might be to aggregate kube_pod_info on the instance label as well.
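One way to have the join take instance into account is a many-to-one match, which is what PromQL's group_left modifier provides: the result keeps every label of the left-hand series, including instance. The Python sketch below models that behavior only. It is not the fix adopted by the project, and the series, label values, and helper names (join_group_left, sum_by) are hypothetical.

```python
from collections import defaultdict


def join_group_left(lhs, rhs, on):
    """Model `lhs * on(...) group_left() rhs`.

    Series pair up on the `on` labels, but the result keeps all labels
    of the left-hand series. Assumes at most one rhs series per key.
    """
    def key(labels):
        return tuple(labels.get(name, "") for name in on)

    rhs_by_key = {key(labels): value for labels, value in rhs}
    return [
        (dict(labels), value * rhs_by_key[key(labels)])
        for labels, value in lhs
        if key(labels) in rhs_by_key
    ]


def sum_by(series, by):
    """Model `sum by (...)`; a missing label is treated as the empty string."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[tuple(labels.get(name, "") for name in by)] += value
    return dict(totals)


# Hypothetical retransmission ratios, one per node.
ratio = [
    ({"cluster": "c1", "namespace": "monitoring", "pod": "exporter-a", "instance": "node-a"}, 0.125),
    ({"cluster": "c1", "namespace": "monitoring", "pod": "exporter-b", "instance": "node-b"}, 0.25),
]
# Hypothetical kube_pod_info series (value is always 1).
pod_info = [
    ({"cluster": "c1", "namespace": "monitoring", "pod": "exporter-a", "host_network": "false"}, 1.0),
    ({"cluster": "c1", "namespace": "monitoring", "pod": "exporter-b", "host_network": "false"}, 1.0),
]

joined = join_group_left(ratio, pod_info, on=("cluster", "namespace", "pod"))
result = sum_by(joined, by=("instance",))
print(result)  # {('node-a',): 0.125, ('node-b',): 0.25}
```

With the left-hand labels preserved, the outer aggregation yields one value per instance. Whether this is appropriate for the real dashboards depends on which labels the scrape configuration attaches to the node_netstat series.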
Changes
The label matching problem was introduced in the following commit: