Created attachment 1739918 [details]
prometheus container logs

Description of problem:
Enabled user workload monitoring and upgraded from 4.6.8 to 4.7.0-0.nightly-2020-12-14-165231; after the upgrade, evaluation of the "record: node:node_num_cpu:sum" recording rule fails with "many-to-many matching not allowed: matching labels must be unique on one side".

record: node:node_num_cpu:sum
expr: count by(cluster, node) (sum by(node, cpu) (node_cpu_seconds_total{job="node-exporter"} * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:))

Checked current pod placement: alertmanager-main-2 is on node ip-10-0-67-241.us-east-2.compute.internal, and prometheus-user-workload-0 is on node ip-10-0-60-250.us-east-2.compute.internal.

# oc -n openshift-monitoring logs -c prometheus prometheus-k8s-0
...
level=warn ts=2020-12-17T07:57:05.334Z caller=manager.go:598 component="rule manager" group=node.rules msg="Evaluating rule failed" rule="record: node:node_num_cpu:sum\nexpr: count by(cluster, node) (sum by(node, cpu) (node_cpu_seconds_total{job=\"node-exporter\"} * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:))\n" err="found duplicate series for the match group {namespace=\"openshift-monitoring\", pod=\"alertmanager-main-2\"} on the right hand-side of the operation: [{__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-monitoring\", node=\"ip-10-0-70-152.us-east-2.compute.internal\", pod=\"alertmanager-main-2\"}, {__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-monitoring\", node=\"ip-10-0-67-241.us-east-2.compute.internal\", pod=\"alertmanager-main-2\"}];many-to-many matching not allowed: matching labels must be unique on one side"
level=warn ts=2020-12-17T07:57:35.333Z caller=manager.go:598 component="rule manager" group=node.rules msg="Evaluating rule failed" rule="record: node:node_num_cpu:sum\nexpr: count by(cluster, node) (sum by(node, cpu) (node_cpu_seconds_total{job=\"node-exporter\"} * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:))\n" err="found duplicate series for the match group {namespace=\"openshift-monitoring\", pod=\"alertmanager-main-2\"} on the right hand-side of the operation: [{__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-monitoring\", node=\"ip-10-0-70-152.us-east-2.compute.internal\", pod=\"alertmanager-main-2\"}, {__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-monitoring\", node=\"ip-10-0-67-241.us-east-2.compute.internal\", pod=\"alertmanager-main-2\"}];many-to-many matching not allowed: matching labels must be unique on one side"
level=warn ts=2020-12-17T07:58:05.331Z caller=manager.go:598 component="rule manager" group=node.rules msg="Evaluating rule failed" rule="record: node:node_num_cpu:sum\nexpr: count by(cluster, node) (sum by(node, cpu) (node_cpu_seconds_total{job=\"node-exporter\"} * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:))\n" err="found duplicate series for the match group {namespace=\"openshift-monitoring\", pod=\"prometheus-k8s-0\"} on the right hand-side of the operation: [{__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-monitoring\", node=\"ip-10-0-70-152.us-east-2.compute.internal\", pod=\"prometheus-k8s-0\"}, {__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-monitoring\", node=\"ip-10-0-67-241.us-east-2.compute.internal\", pod=\"prometheus-k8s-0\"}];many-to-many matching not allowed: matching labels must be unique on one side"
level=warn ts=2020-12-17T07:58:35.316Z caller=manager.go:598 component="rule manager" group=node.rules msg="Evaluating rule failed" rule="record: node:node_num_cpu:sum\nexpr: count by(cluster, node) (sum by(node, cpu) (node_cpu_seconds_total{job=\"node-exporter\"} * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:))\n" err="found duplicate series for the match group {namespace=\"openshift-user-workload-monitoring\", pod=\"prometheus-user-workload-0\"} on the right hand-side of the operation: [{__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-user-workload-monitoring\", node=\"ip-10-0-67-241.us-east-2.compute.internal\", pod=\"prometheus-user-workload-0\"}, {__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-user-workload-monitoring\", node=\"ip-10-0-60-250.us-east-2.compute.internal\", pod=\"prometheus-user-workload-0\"}];many-to-many matching not allowed: matching labels must be unique on one side"
level=warn ts=2020-12-17T07:59:05.311Z caller=manager.go:598 component="rule manager" group=node.rules msg="Evaluating rule failed" rule="record: node:node_num_cpu:sum\nexpr: count by(cluster, node) (sum by(node, cpu) (node_cpu_seconds_total{job=\"node-exporter\"} * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:))\n" err="found duplicate series for the match group {namespace=\"openshift-user-workload-monitoring\", pod=\"prometheus-user-workload-0\"} on the right hand-side of the operation: [{__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-user-workload-monitoring\", node=\"ip-10-0-67-241.us-east-2.compute.internal\", pod=\"prometheus-user-workload-0\"}, {__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-user-workload-monitoring\", node=\"ip-10-0-60-250.us-east-2.compute.internal\", pod=\"prometheus-user-workload-0\"}];many-to-many matching not allowed: matching labels must be unique on one side"
level=warn ts=2020-12-17T07:59:35.306Z caller=manager.go:598 component="rule manager" group=node.rules msg="Evaluating rule failed" rule="record: node:node_num_cpu:sum\nexpr: count by(cluster, node) (sum by(node, cpu) (node_cpu_seconds_total{job=\"node-exporter\"} * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:))\n" err="found duplicate series for the match group {namespace=\"openshift-user-workload-monitoring\", pod=\"prometheus-user-workload-0\"} on the right hand-side of the operation: [{__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-user-workload-monitoring\", node=\"ip-10-0-67-241.us-east-2.compute.internal\", pod=\"prometheus-user-workload-0\"}, {__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-user-workload-monitoring\", node=\"ip-10-0-60-250.us-east-2.compute.internal\", pod=\"prometheus-user-workload-0\"}];many-to-many matching not allowed: matching labels must be unique on one side"
level=warn ts=2020-12-17T08:00:05.304Z caller=manager.go:598 component="rule manager" group=node.rules msg="Evaluating rule failed" rule="record: node:node_num_cpu:sum\nexpr: count by(cluster, node) (sum by(node, cpu) (node_cpu_seconds_total{job=\"node-exporter\"} * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:))\n" err="found duplicate series for the match group {namespace=\"openshift-user-workload-monitoring\", pod=\"prometheus-user-workload-0\"} on the right hand-side of the operation: [{__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-user-workload-monitoring\", node=\"ip-10-0-67-241.us-east-2.compute.internal\", pod=\"prometheus-user-workload-0\"}, {__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-user-workload-monitoring\", node=\"ip-10-0-60-250.us-east-2.compute.internal\", pod=\"prometheus-user-workload-0\"}];many-to-many matching not allowed: matching labels must be unique on one side"

*****************************************************

# oc -n openshift-monitoring get pod -o wide | grep -E "ip-10-0-70-152.us-east-2.compute.internal|ip-10-0-67-241.us-east-2.compute.internal|ip-10-0-60-250.us-east-2.compute.internal"
alertmanager-main-0                        5/5   Running   0   86m    10.129.2.22   ip-10-0-67-241.us-east-2.compute.internal   <none>   <none>
alertmanager-main-1                        5/5   Running   0   93m    10.131.0.15   ip-10-0-60-250.us-east-2.compute.internal   <none>   <none>
alertmanager-main-2                        5/5   Running   0   86m    10.129.2.21   ip-10-0-67-241.us-east-2.compute.internal   <none>   <none>
grafana-65fd97ffcc-gxdsk                   2/2   Running   0   87m    10.129.2.11   ip-10-0-67-241.us-east-2.compute.internal   <none>   <none>
kube-state-metrics-7b8cdbc644-8tmft        3/3   Running   0   87m    10.129.2.12   ip-10-0-67-241.us-east-2.compute.internal   <none>   <none>
node-exporter-49bk6                        2/2   Running   0   107m   10.0.70.152   ip-10-0-70-152.us-east-2.compute.internal   <none>   <none>
node-exporter-lmc94                        2/2   Running   0   108m   10.0.60.250   ip-10-0-60-250.us-east-2.compute.internal   <none>   <none>
node-exporter-vd28r                        2/2   Running   0   107m   10.0.67.241   ip-10-0-67-241.us-east-2.compute.internal   <none>   <none>
openshift-state-metrics-5d9b6f864d-cfjlk   3/3   Running   0   90m    10.131.0.9    ip-10-0-60-250.us-east-2.compute.internal   <none>   <none>
prometheus-adapter-f89987c8d-lcvlz         1/1   Running   0   70m    10.128.2.20   ip-10-0-70-152.us-east-2.compute.internal   <none>   <none>
prometheus-adapter-f89987c8d-pfgmv         1/1   Running   0   70m    10.131.0.34   ip-10-0-60-250.us-east-2.compute.internal   <none>   <none>
prometheus-k8s-0                           7/7   Running   1   86m    10.129.2.23   ip-10-0-67-241.us-east-2.compute.internal   <none>   <none>
prometheus-k8s-1                           7/7   Running   1   93m    10.131.0.18   ip-10-0-60-250.us-east-2.compute.internal   <none>   <none>
telemeter-client-5744dd57b-q45jq           3/3   Running   0   90m    10.131.0.13   ip-10-0-60-250.us-east-2.compute.internal   <none>   <none>
thanos-querier-969f6558d-9m6td             5/5   Running   0   87m    10.129.2.10   ip-10-0-67-241.us-east-2.compute.internal   <none>   <none>
thanos-querier-969f6558d-z28t2             5/5   Running   0   90m    10.131.0.14   ip-10-0-60-250.us-east-2.compute.internal   <none>   <none>

# oc -n openshift-user-workload-monitoring get pod -o wide
NAME                                   READY   STATUS    RESTARTS   AGE   IP            NODE                                        NOMINATED NODE   READINESS GATES
prometheus-operator-65d48f7b88-t6clv   2/2     Running   0          83m   10.129.0.33   ip-10-0-70-172.us-east-2.compute.internal   <none>           <none>
prometheus-user-workload-0             5/5     Running   1          88m   10.131.0.20   ip-10-0-60-250.us-east-2.compute.internal   <none>           <none>
prometheus-user-workload-1             5/5     Running   1          84m   10.129.2.18   ip-10-0-67-241.us-east-2.compute.internal   <none>           <none>
thanos-ruler-user-workload-0           3/3     Running   1          84m   10.129.2.16   ip-10-0-67-241.us-east-2.compute.internal   <none>           <none>
thanos-ruler-user-workload-1           3/3     Running   1          88m   10.131.0.19   ip-10-0-60-250.us-east-2.compute.internal   <none>           <none>

Querying the info metric for the pods named in the errors now returns only a single series each:

node_namespace_pod:kube_pod_info:{pod="alertmanager-main-2"} result is
node_namespace_pod:kube_pod_info:{namespace="openshift-monitoring", node="ip-10-0-67-241.us-east-2.compute.internal", pod="alertmanager-main-2"}   1

node_namespace_pod:kube_pod_info:{pod="prometheus-user-workload-0"} result is
node_namespace_pod:kube_pod_info:{namespace="openshift-user-workload-monitoring", node="ip-10-0-60-250.us-east-2.compute.internal", pod="prometheus-user-workload-0"}   1
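For reference (this query is not from the original logs, only a suggested check), duplicates like the ones in the errors can be looked for directly; once each pod maps to a single node again, the following should return no results:

count by (namespace, pod) (node_namespace_pod:kube_pod_info:) > 1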
How reproducible:
Sometimes

Steps to Reproduce:
1. Enable user workload monitoring and upgrade from 4.6.8 to 4.7.0-0.nightly-2020-12-14-165231.
2. Check the prometheus container logs of prometheus-k8s-0.
3.

Actual results:
The node:node_num_cpu:sum recording rule intermittently fails to evaluate with "many-to-many matching not allowed: matching labels must be unique on one side".

Expected results:
Recording rules evaluate without errors.

Additional info:
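As a sketch of another way to spot the symptom without tailing container logs, Prometheus' own rule-evaluation failure counter can be queried (narrowing by the rule_group label is possible too, but its exact value depends on how the rule files are mounted, so that part is an assumption):

increase(prometheus_rule_evaluation_failures_total[1h]) > 0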
Created attachment 1748680 [details]
Pod scheduled to another node for a short time
*** Bug 1924864 has been marked as a duplicate of this bug. ***
The fix was included as part of https://github.com/openshift/cluster-monitoring-operator/pull/1044 but I forgot to update this BZ.
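For context, a rough sketch of the kind of change involved (based on the upstream kubernetes-mixin; see the linked PR for the exact rule that shipped): the generating rule for node_namespace_pod:kube_pod_info: is made duplicate-proof by collapsing multiple kube_pod_info series for the same pod before the join, along these lines:

record: node_namespace_pod:kube_pod_info:
expr: topk by(cluster, namespace, pod) (1, max by(cluster, node, namespace, pod) (kube_pod_info{job="kube-state-metrics",node!=""}))

With at most one series per (namespace, pod) on the right-hand side, the group_left join in node:node_num_cpu:sum can no longer hit the many-to-many error, even if a pod briefly appears on two nodes during rescheduling.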
Upgraded from 4.7.8 to 4.8.0-fc.1; the warning no longer appears in the logs.
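For anyone re-verifying, a check like the following (adjust the pod name as needed) should come back empty on a fixed cluster:

# oc -n openshift-monitoring logs prometheus-k8s-0 -c prometheus | grep 'many-to-many matching not allowed'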
*** Bug 1982795 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438