Bug 1908655 - "Evaluating rule failed" for "record: node:node_num_cpu:sum" rule
Summary: "Evaluating rule failed" for "record: node:node_num_cpu:sum" rule
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.8.0
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Duplicates: 1924864 1982795
Depends On:
Blocks: 1922053
 
Reported: 2020-12-17 09:32 UTC by Junqi Zhao
Modified: 2021-07-27 22:35 UTC
CC: 11 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1922053
Environment:
[sig-instrumentation] Prometheus when installed on the cluster shouldn't have failing rules evaluation [Suite:openshift/conformance/parallel]
Last Closed: 2021-07-27 22:35:07 UTC
Target Upstream Version:
Embargoed:


Attachments
prometheus container logs (32.29 KB, text/plain)
2020-12-17 09:32 UTC, Junqi Zhao
pod scheduled to other node during a short time (130.29 KB, image/png)
2021-01-19 10:07 UTC, Junqi Zhao


Links
Github kubernetes-monitoring/kubernetes-mixin pull 553 (closed): rules/node.libsonnet: fix many-to-many errors for node:node_num_cpu:sum (last updated 2021-04-20 11:21:24 UTC)
Github openshift/cluster-monitoring-operator pull 1044 (closed): Bug 1923984: Refactor jsonnet to include latest kube-prometheus (last updated 2021-04-20 11:21:24 UTC)
Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 22:35:48 UTC)

Description Junqi Zhao 2020-12-17 09:32:02 UTC
Created attachment 1739918 [details]
prometheus container logs

Description of problem:
Enabled user workload monitoring and upgraded from 4.6.8 to 4.7.0-0.nightly-2020-12-14-165231; the "record: node:node_num_cpu:sum" rule fails to evaluate with "many-to-many matching not allowed: matching labels must be unique on one side".

record: node:node_num_cpu:sum
expr: count by(cluster, node) (sum by(node, cpu) (node_cpu_seconds_total{job="node-exporter"} * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:))
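
For context: the "* on(namespace, pod) group_left(node)" part of the expression is a one-to-many join, so the right-hand vector (node_namespace_pod:kube_pod_info:) must have exactly one series per (namespace, pod) pair. When a pod is rescheduled to another node during the upgrade, two series for the same pod briefly coexist and the whole group evaluation fails. A quick troubleshooting query (not part of the shipped rules, just an illustration) to list the offending match groups:

# Returns one series per (namespace, pod) pair that currently has more than
# one node_namespace_pod:kube_pod_info: series, i.e. the pairs that turn the
# one-to-many join above into a many-to-many match.
count by(namespace, pod) (node_namespace_pod:kube_pod_info:) > 1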

Checked the current placement: alertmanager-main-2 is on node ip-10-0-67-241.us-east-2.compute.internal and prometheus-user-workload-0 is on node ip-10-0-60-250.us-east-2.compute.internal, yet the errors below show each of those pods being reported on two different nodes.

# oc -n openshift-monitoring logs -c prometheus prometheus-k8s-0
...
level=warn ts=2020-12-17T07:57:05.334Z caller=manager.go:598 component="rule manager" group=node.rules msg="Evaluating rule failed" rule="record: node:node_num_cpu:sum\nexpr: count by(cluster, node) (sum by(node, cpu) (node_cpu_seconds_total{job=\"node-exporter\"} * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:))\n" err="found duplicate series for the match group {namespace=\"openshift-monitoring\", pod=\"alertmanager-main-2\"} on the right hand-side of the operation: [{__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-monitoring\", node=\"ip-10-0-70-152.us-east-2.compute.internal\", pod=\"alertmanager-main-2\"}, {__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-monitoring\", node=\"ip-10-0-67-241.us-east-2.compute.internal\", pod=\"alertmanager-main-2\"}];many-to-many matching not allowed: matching labels must be unique on one side"
level=warn ts=2020-12-17T07:57:35.333Z caller=manager.go:598 component="rule manager" group=node.rules msg="Evaluating rule failed" rule="record: node:node_num_cpu:sum\nexpr: count by(cluster, node) (sum by(node, cpu) (node_cpu_seconds_total{job=\"node-exporter\"} * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:))\n" err="found duplicate series for the match group {namespace=\"openshift-monitoring\", pod=\"alertmanager-main-2\"} on the right hand-side of the operation: [{__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-monitoring\", node=\"ip-10-0-70-152.us-east-2.compute.internal\", pod=\"alertmanager-main-2\"}, {__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-monitoring\", node=\"ip-10-0-67-241.us-east-2.compute.internal\", pod=\"alertmanager-main-2\"}];many-to-many matching not allowed: matching labels must be unique on one side"
level=warn ts=2020-12-17T07:58:05.331Z caller=manager.go:598 component="rule manager" group=node.rules msg="Evaluating rule failed" rule="record: node:node_num_cpu:sum\nexpr: count by(cluster, node) (sum by(node, cpu) (node_cpu_seconds_total{job=\"node-exporter\"} * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:))\n" err="found duplicate series for the match group {namespace=\"openshift-monitoring\", pod=\"prometheus-k8s-0\"} on the right hand-side of the operation: [{__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-monitoring\", node=\"ip-10-0-70-152.us-east-2.compute.internal\", pod=\"prometheus-k8s-0\"}, {__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-monitoring\", node=\"ip-10-0-67-241.us-east-2.compute.internal\", pod=\"prometheus-k8s-0\"}];many-to-many matching not allowed: matching labels must be unique on one side"
level=warn ts=2020-12-17T07:58:35.316Z caller=manager.go:598 component="rule manager" group=node.rules msg="Evaluating rule failed" rule="record: node:node_num_cpu:sum\nexpr: count by(cluster, node) (sum by(node, cpu) (node_cpu_seconds_total{job=\"node-exporter\"} * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:))\n" err="found duplicate series for the match group {namespace=\"openshift-user-workload-monitoring\", pod=\"prometheus-user-workload-0\"} on the right hand-side of the operation: [{__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-user-workload-monitoring\", node=\"ip-10-0-67-241.us-east-2.compute.internal\", pod=\"prometheus-user-workload-0\"}, {__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-user-workload-monitoring\", node=\"ip-10-0-60-250.us-east-2.compute.internal\", pod=\"prometheus-user-workload-0\"}];many-to-many matching not allowed: matching labels must be unique on one side"
level=warn ts=2020-12-17T07:59:05.311Z caller=manager.go:598 component="rule manager" group=node.rules msg="Evaluating rule failed" rule="record: node:node_num_cpu:sum\nexpr: count by(cluster, node) (sum by(node, cpu) (node_cpu_seconds_total{job=\"node-exporter\"} * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:))\n" err="found duplicate series for the match group {namespace=\"openshift-user-workload-monitoring\", pod=\"prometheus-user-workload-0\"} on the right hand-side of the operation: [{__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-user-workload-monitoring\", node=\"ip-10-0-67-241.us-east-2.compute.internal\", pod=\"prometheus-user-workload-0\"}, {__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-user-workload-monitoring\", node=\"ip-10-0-60-250.us-east-2.compute.internal\", pod=\"prometheus-user-workload-0\"}];many-to-many matching not allowed: matching labels must be unique on one side"
level=warn ts=2020-12-17T07:59:35.306Z caller=manager.go:598 component="rule manager" group=node.rules msg="Evaluating rule failed" rule="record: node:node_num_cpu:sum\nexpr: count by(cluster, node) (sum by(node, cpu) (node_cpu_seconds_total{job=\"node-exporter\"} * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:))\n" err="found duplicate series for the match group {namespace=\"openshift-user-workload-monitoring\", pod=\"prometheus-user-workload-0\"} on the right hand-side of the operation: [{__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-user-workload-monitoring\", node=\"ip-10-0-67-241.us-east-2.compute.internal\", pod=\"prometheus-user-workload-0\"}, {__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-user-workload-monitoring\", node=\"ip-10-0-60-250.us-east-2.compute.internal\", pod=\"prometheus-user-workload-0\"}];many-to-many matching not allowed: matching labels must be unique on one side"
level=warn ts=2020-12-17T08:00:05.304Z caller=manager.go:598 component="rule manager" group=node.rules msg="Evaluating rule failed" rule="record: node:node_num_cpu:sum\nexpr: count by(cluster, node) (sum by(node, cpu) (node_cpu_seconds_total{job=\"node-exporter\"} * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:))\n" err="found duplicate series for the match group {namespace=\"openshift-user-workload-monitoring\", pod=\"prometheus-user-workload-0\"} on the right hand-side of the operation: [{__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-user-workload-monitoring\", node=\"ip-10-0-67-241.us-east-2.compute.internal\", pod=\"prometheus-user-workload-0\"}, {__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-user-workload-monitoring\", node=\"ip-10-0-60-250.us-east-2.compute.internal\", pod=\"prometheus-user-workload-0\"}];many-to-many matching not allowed: matching labels must be unique on one side"
*****************************************************
# oc -n openshift-monitoring get pod -o wide | grep -E "ip-10-0-70-152.us-east-2.compute.internal|ip-10-0-67-241.us-east-2.compute.internal|ip-10-0-60-250.us-east-2.compute.internal"
alertmanager-main-0                            5/5     Running   0          86m    10.129.2.22   ip-10-0-67-241.us-east-2.compute.internal   <none>           <none>
alertmanager-main-1                            5/5     Running   0          93m    10.131.0.15   ip-10-0-60-250.us-east-2.compute.internal   <none>           <none>
alertmanager-main-2                            5/5     Running   0          86m    10.129.2.21   ip-10-0-67-241.us-east-2.compute.internal   <none>           <none>
grafana-65fd97ffcc-gxdsk                       2/2     Running   0          87m    10.129.2.11   ip-10-0-67-241.us-east-2.compute.internal   <none>           <none>
kube-state-metrics-7b8cdbc644-8tmft            3/3     Running   0          87m    10.129.2.12   ip-10-0-67-241.us-east-2.compute.internal   <none>           <none>
node-exporter-49bk6                            2/2     Running   0          107m   10.0.70.152   ip-10-0-70-152.us-east-2.compute.internal   <none>           <none>
node-exporter-lmc94                            2/2     Running   0          108m   10.0.60.250   ip-10-0-60-250.us-east-2.compute.internal   <none>           <none>
node-exporter-vd28r                            2/2     Running   0          107m   10.0.67.241   ip-10-0-67-241.us-east-2.compute.internal   <none>           <none>
openshift-state-metrics-5d9b6f864d-cfjlk       3/3     Running   0          90m    10.131.0.9    ip-10-0-60-250.us-east-2.compute.internal   <none>           <none>
prometheus-adapter-f89987c8d-lcvlz             1/1     Running   0          70m    10.128.2.20   ip-10-0-70-152.us-east-2.compute.internal   <none>           <none>
prometheus-adapter-f89987c8d-pfgmv             1/1     Running   0          70m    10.131.0.34   ip-10-0-60-250.us-east-2.compute.internal   <none>           <none>
prometheus-k8s-0                               7/7     Running   1          86m    10.129.2.23   ip-10-0-67-241.us-east-2.compute.internal   <none>           <none>
prometheus-k8s-1                               7/7     Running   1          93m    10.131.0.18   ip-10-0-60-250.us-east-2.compute.internal   <none>           <none>
telemeter-client-5744dd57b-q45jq               3/3     Running   0          90m    10.131.0.13   ip-10-0-60-250.us-east-2.compute.internal   <none>           <none>
thanos-querier-969f6558d-9m6td                 5/5     Running   0          87m    10.129.2.10   ip-10-0-67-241.us-east-2.compute.internal   <none>           <none>
thanos-querier-969f6558d-z28t2                 5/5     Running   0          90m    10.131.0.14   ip-10-0-60-250.us-east-2.compute.internal   <none>           <none>

# oc -n openshift-user-workload-monitoring get pod -o wide
NAME                                   READY   STATUS    RESTARTS   AGE   IP            NODE                                        NOMINATED NODE   READINESS GATES
prometheus-operator-65d48f7b88-t6clv   2/2     Running   0          83m   10.129.0.33   ip-10-0-70-172.us-east-2.compute.internal   <none>           <none>
prometheus-user-workload-0             5/5     Running   1          88m   10.131.0.20   ip-10-0-60-250.us-east-2.compute.internal   <none>           <none>
prometheus-user-workload-1             5/5     Running   1          84m   10.129.2.18   ip-10-0-67-241.us-east-2.compute.internal   <none>           <none>
thanos-ruler-user-workload-0           3/3     Running   1          84m   10.129.2.16   ip-10-0-67-241.us-east-2.compute.internal   <none>           <none>
thanos-ruler-user-workload-1           3/3     Running   1          88m   10.131.0.19   ip-10-0-60-250.us-east-2.compute.internal   <none>           <none>

The current result of node_namespace_pod:kube_pod_info:{pod="alertmanager-main-2"} is:
node_namespace_pod:kube_pod_info:{namespace="openshift-monitoring", node="ip-10-0-67-241.us-east-2.compute.internal", pod="alertmanager-main-2"}        1

The current result of node_namespace_pod:kube_pod_info:{pod="prometheus-user-workload-0"} is:
node_namespace_pod:kube_pod_info:{namespace="openshift-user-workload-monitoring", node="ip-10-0-60-250.us-east-2.compute.internal", pod="prometheus-user-workload-0"}   1
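
Note that the instant queries above only show the series that is still live; the stale series for the old node has already dropped out. To confirm the transient duplicate after the fact, a range query like the following (a troubleshooting sketch, assuming the lookback window still covers the upgrade) shows every node the pod was reported on:

# One output series per node label value; two results for the same pod mean
# it was reported on two nodes within the window, which is what tripped the join.
count_over_time(node_namespace_pod:kube_pod_info:{pod="alertmanager-main-2"}[1h])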



How reproducible:
sometimes

Steps to Reproduce:
1. Enable user workload monitoring and upgrade from 4.6.8 to 4.7.0-0.nightly-2020-12-14-165231.
2. Check the prometheus container logs of the prometheus-k8s pods in the openshift-monitoring namespace for "Evaluating rule failed" warnings.

Actual results:
The node:node_num_cpu:sum recording rule intermittently fails to evaluate with "many-to-many matching not allowed: matching labels must be unique on one side".

Expected results:
Recording rules evaluate without errors, even while pods are rescheduled across nodes during the upgrade.

Additional info:

Comment 6 Junqi Zhao 2021-01-19 10:07:36 UTC
Created attachment 1748680 [details]
pod scheduled to other node during a short time

Comment 7 Sergiusz Urbaniak 2021-02-04 09:55:49 UTC
*** Bug 1924864 has been marked as a duplicate of this bug. ***

Comment 10 Simon Pasquier 2021-04-20 11:21:25 UTC
The fix was included as part of https://github.com/openshift/cluster-monitoring-operator/pull/1044 but I forgot to update this BZ.
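
For reference, the linked kubernetes-mixin PR 553 avoids the many-to-many errors by ensuring the right-hand side of the join carries only one kube_pod_info series per (namespace, pod). A sketch of that deduplication approach (the exact expression merged upstream may differ slightly):

record: node:node_num_cpu:sum
expr: count by(cluster, node) (
        sum by(node, cpu) (
          node_cpu_seconds_total{job="node-exporter"}
          * on(namespace, pod) group_left(node)
          # topk keeps at most one kube_pod_info series per (namespace, pod),
          # so the join stays one-to-many even while a pod moves between nodes
          topk by(namespace, pod) (1, node_namespace_pod:kube_pod_info:)
        )
      )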

Comment 12 Junqi Zhao 2021-04-29 14:11:46 UTC
Upgraded from 4.7.8 to 4.8.0-fc.1; the "Evaluating rule failed" warning no longer appears.
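
As an extra check (not part of the original verification), Prometheus also exposes rule evaluation failures as a metric, so the query below should return nothing on a fixed cluster. The rule_group regex is an assumption about how the group label is rendered (it usually ends with the group name):

# Any result means a node.rules evaluation failed within the last hour;
# node.rules is the group that contains node:node_num_cpu:sum.
increase(prometheus_rule_evaluation_failures_total{rule_group=~".*node.rules"}[1h]) > 0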

Comment 14 Simon Pasquier 2021-07-22 07:42:03 UTC
*** Bug 1982795 has been marked as a duplicate of this bug. ***

Comment 16 errata-xmlrpc 2021-07-27 22:35:07 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

