Label Matching Problem in PromQL Queries for TCP Retransmission and Syn Retransmission Rates #949

@apankevics

Issue Description

There is a label matching problem in the following PromQL queries:

sum by (instance) (rate(node_netstat_Tcp_RetransSegs{%(clusterLabel)s="$cluster"}[%(grafanaIntervalVar)s]) / rate(node_netstat_Tcp_OutSegs{%(clusterLabel)s="$cluster"}[%(grafanaIntervalVar)s]) * on (%(clusterLabel)s,namespace,pod) kube_pod_info{host_network="false"})
sum by (instance) (rate(node_netstat_TcpExt_TCPSynRetrans{%(clusterLabel)s="$cluster"}[%(grafanaIntervalVar)s]) / rate(node_netstat_Tcp_RetransSegs{%(clusterLabel)s="$cluster"}[%(grafanaIntervalVar)s]) * on (%(clusterLabel)s,namespace,pod) kube_pod_info{host_network="false"})

Problem

The sum by (instance) aggregation is applied to the ratio, but the instance label does not appear in the on clause, and kube_pod_info is not aggregated on instance. As a result, the join is performed only on the cluster, namespace, and pod labels, which can lead to incorrect comparisons or misleading results.
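
The interaction between the join labels and the aggregation label can be illustrated with a toy model of vector matching. This is a minimal sketch, not Prometheus code: `match_on`, `sum_by`, and all label values are hypothetical. It assumes the PromQL rule that a one-to-one match with on (...) and no group modifier keeps only the labels listed in the on clause.

```python
# Toy model of PromQL one-to-one vector matching with on(...).
# A series is a (labels, value) pair, where labels is a dict.


def match_on(lhs, rhs, on):
    """Multiply lhs by rhs, matching series one-to-one on the `on` labels.

    Mirrors PromQL without group_left/group_right: the result keeps
    only the labels listed in `on`. Duplicate match keys, which PromQL
    rejects with an error, are not handled in this sketch.
    """
    def key(labels):
        return tuple(labels.get(name, "") for name in on)

    rhs_by_key = {key(labels): value for labels, value in rhs}
    out = []
    for labels, value in lhs:
        k = key(labels)
        if k in rhs_by_key:
            out.append((dict(zip(on, k)), value * rhs_by_key[k]))
    return out


def sum_by(series, by):
    """Sum values grouped by the `by` labels; a missing label counts as ""."""
    totals = {}
    for labels, value in series:
        k = tuple(labels.get(name, "") for name in by)
        totals[k] = totals.get(k, 0.0) + value
    return totals


ON = ("cluster", "namespace", "pod")

# Hypothetical ratio series for two nodes (left-hand side of the join).
ratio = [
    ({"cluster": "c1", "namespace": "ns", "pod": "pod-a", "instance": "node-a"}, 0.125),
    ({"cluster": "c1", "namespace": "ns", "pod": "pod-b", "instance": "node-b"}, 0.25),
]
# Hypothetical kube_pod_info series (the sample value is always 1).
pod_info = [
    ({"cluster": "c1", "namespace": "ns", "pod": "pod-a", "host_network": "false"}, 1.0),
    ({"cluster": "c1", "namespace": "ns", "pod": "pod-b", "host_network": "false"}, 1.0),
]

joined = match_on(ratio, pod_info, ON)
per_instance = sum_by(joined, ("instance",))
print(per_instance)
```

In this model the joined series no longer carry an instance label, so the final aggregation folds both nodes into a single group keyed by the empty string instead of producing one value per instance.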

Steps to Reproduce

  1. Execute the above PromQL queries in Prometheus:
    sum by (instance) (rate(node_netstat_Tcp_RetransSegs{%(clusterLabel)s="$cluster"}[%(grafanaIntervalVar)s]) / rate(node_netstat_Tcp_OutSegs{%(clusterLabel)s="$cluster"}[%(grafanaIntervalVar)s]) * on (%(clusterLabel)s,namespace,pod) kube_pod_info{host_network="false"})
    
    sum by (instance) (rate(node_netstat_TcpExt_TCPSynRetrans{%(clusterLabel)s="$cluster"}[%(grafanaIntervalVar)s]) / rate(node_netstat_Tcp_RetransSegs{%(clusterLabel)s="$cluster"}[%(grafanaIntervalVar)s]) * on (%(clusterLabel)s,namespace,pod) kube_pod_info{host_network="false"})
    
  2. Observe the results. With the interval resolved to 1m0s, the queries are shown as:
    sum by (instance) (rate(node_netstat_Tcp_RetransSegs{%(clusterLabel)s="$cluster"}[1m0s]) / rate(node_netstat_Tcp_OutSegs{%(clusterLabel)s="$cluster"}[1m0s]) * on (%(clusterLabel)s,namespace,pod) kube_pod_info{host_network="false"})
    
    sum by (instance) (rate(node_netstat_TcpExt_TCPSynRetrans{%(clusterLabel)s="$cluster"}[1m0s]) / rate(node_netstat_Tcp_RetransSegs{%(clusterLabel)s="$cluster"}[1m0s]) * on (%(clusterLabel)s,namespace,pod) kube_pod_info{host_network="false"})
    
  3. Notice that the results are incorrect because the labels used in the join (cluster, namespace, pod) do not match the label used in the aggregation (instance).

Expected Behavior

The queries should correctly aggregate and join the metrics on the appropriate labels to avoid misleading results.

Possible Solution

To fix the issue, either include the instance label in the join operation or change the aggregation strategy. One possible solution is to aggregate kube_pod_info on the instance label as well.
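
One way to read "consider the instance label in the join" is to let the left-hand labels survive the join, which in PromQL would correspond to adding group_left after the on (...) clause. That reading is an assumption, not a fix confirmed in this issue. The sketch below models it with hypothetical helpers (`match_on_group_left`, `sum_by`) and made-up label values.

```python
# Toy model of a many-to-one match (PromQL group_left) followed by sum by.
# A series is a (labels, value) pair, where labels is a dict.


def match_on_group_left(lhs, rhs, on):
    """Multiply lhs by rhs, matching on the `on` labels.

    Mirrors PromQL `on (...) group_left`: several lhs series may match
    one rhs series, and the result keeps all labels of the lhs series.
    Duplicate rhs keys, which PromQL rejects, are not handled here.
    """
    def key(labels):
        return tuple(labels.get(name, "") for name in on)

    rhs_by_key = {key(labels): value for labels, value in rhs}
    out = []
    for labels, value in lhs:
        k = key(labels)
        if k in rhs_by_key:
            out.append((dict(labels), value * rhs_by_key[k]))
    return out


def sum_by(series, by):
    """Sum values grouped by the `by` labels; a missing label counts as ""."""
    totals = {}
    for labels, value in series:
        k = tuple(labels.get(name, "") for name in by)
        totals[k] = totals.get(k, 0.0) + value
    return totals


ON = ("cluster", "namespace", "pod")

# Hypothetical ratio series for two nodes (left-hand side of the join).
ratio = [
    ({"cluster": "c1", "namespace": "ns", "pod": "pod-a", "instance": "node-a"}, 0.125),
    ({"cluster": "c1", "namespace": "ns", "pod": "pod-b", "instance": "node-b"}, 0.25),
]
# Hypothetical kube_pod_info series (the sample value is always 1).
pod_info = [
    ({"cluster": "c1", "namespace": "ns", "pod": "pod-a", "host_network": "false"}, 1.0),
    ({"cluster": "c1", "namespace": "ns", "pod": "pod-b", "host_network": "false"}, 1.0),
]

joined = match_on_group_left(ratio, pod_info, ON)
per_instance = sum_by(joined, ("instance",))
print(per_instance)
```

Because the instance label is still present after the join in this model, the aggregation yields one value per instance. Whether the real series carry matching namespace and pod labels depends on the scrape configuration, which this sketch does not cover.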

Changes

The label matching problem was introduced in the following commit:

d63872c

Metadata

Labels

bug, help wanted, keepalive
