Created attachment 1862787 [details]
response code for Pod count graph - 422

Description of problem:
As cluster admin, log in to the console and open the Overview page. In the "Cluster utilization" section, the graph for Pod count alternates between showing and disappearing periodically.

The Pod count prometheus expr is:
****************************
count(
  (
    kube_running_pod_ready
    *
    on(pod,namespace) group_left(node) (node_namespace_pod:kube_pod_info:)
  )
  *
  on(node) group_left(role) (max by (node) (kube_node_role{role=~".+"}))
)
****************************

Checked the responses: the response code for the Pod count graph alternates between 422 and 200 periodically. For the 422 case, see the "response code for Pod count graph - 422" attachment:

Request URL: https://console-openshift-console.apps.qe-ui411-0223.qe.devcluster.openshift.com/api/prometheus/api/v1/query_range?start=1645585529.018&end=1645589129.018&step=60&query=%0A++++++count%28%0A++++++++%28%0A++++++++++kube_running_pod_ready%0A++++++++++*%0A++++++++++on%28pod%2Cnamespace%29+group_left%28node%29+%28node_namespace_pod%3Akube_pod_info%3A%29%0A++++++++%29%0A++++++++*%0A++++++++on%28node%29+group_left%28role%29+%28max+by+%28node%29+%28kube_node_role%7Brole%3D%7E%22.%2B%22%7D%29%29%0A++++++%29%0A++++
Request Method: GET
Status Code: 422 Unprocessable Entity
Remote Address: 10.68.5.41:3128
Referrer Policy: strict-origin-when-cross-origin

The response is:
{"status":"error","errorType":"execution","error":"found duplicate series for the match group {namespace=\"openshift-monitoring\", pod=\"prometheus-k8s-0\"} on the right hand-side of the operation: [{__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-monitoring\", node=\"ip-10-0-163-32.us-east-2.compute.internal\", pod=\"prometheus-k8s-0\", prometheus=\"openshift-monitoring/k8s\"}, {__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-monitoring\", node=\"ip-10-0-147-176.us-east-2.compute.internal\", pod=\"prometheus-k8s-0\", prometheus=\"openshift-monitoring/k8s\"}];many-to-many matching not allowed: matching labels must be unique on one side"}

After about 30 seconds the response is 200 and the graph is shown, see the "response code for Pod count graph - 200" picture.

Checked further: the reason is that prometheus-k8s-0 was killed and rescheduled from ip-10-0-163-32.us-east-2.compute.internal to another node, ip-10-0-147-176.us-east-2.compute.internal.

# oc -n openshift-monitoring get event | grep "Successfully assigned openshift-monitoring/prometheus-k8s-0"
176m   Normal   Scheduled   pod/prometheus-k8s-0   Successfully assigned openshift-monitoring/prometheus-k8s-0 to ip-10-0-163-32.us-east-2.compute.internal
163m   Normal   Scheduled   pod/prometheus-k8s-0   Successfully assigned openshift-monitoring/prometheus-k8s-0 to ip-10-0-163-32.us-east-2.compute.internal
27m    Normal   Scheduled   pod/prometheus-k8s-0   Successfully assigned openshift-monitoring/prometheus-k8s-0 to ip-10-0-147-176.us-east-2.compute.internal

# oc -n openshift-monitoring get pod -o wide | grep prometheus-k8s
prometheus-k8s-0   6/6   Running   0   27m    10.129.2.20   ip-10-0-147-176.us-east-2.compute.internal   <none>   <none>
prometheus-k8s-1   6/6   Running   0   164m   10.128.2.13   ip-10-0-197-160.us-east-2.compute.internal   <none>   <none>

Continued to monitor for half an hour; there were no more 422 errors.

# oc -n openshift-monitoring get pod -o wide | grep prometheus-k8s
prometheus-k8s-0   6/6   Running   0   65m     10.129.2.20   ip-10-0-147-176.us-east-2.compute.internal   <none>   <none>
prometheus-k8s-1   6/6   Running   0   3h22m   10.128.2.13   ip-10-0-197-160.us-east-2.compute.internal   <none>   <none>

Version-Release number of selected component (if applicable):
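For reference (not in the original report), the duplicate-series condition can be confirmed directly against Prometheus with a small helper query; any result points at a (pod, namespace) pair that the recording rule still reports from more than one node:

****************************
# Helper query (an added illustration, not taken from the console): lists
# (pod, namespace) pairs for which node_namespace_pod:kube_pod_info: still has
# more than one series, i.e. the condition that makes the Pod count join fail
# with 422.
count by (pod, namespace) (node_namespace_pod:kube_pod_info:) > 1
****************************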
4.11.0-0.nightly-2022-02-18-121223

How reproducible:
prometheus-k8s-0 is killed and rescheduled

Steps to Reproduce:
1. see the description
2.
3.

Actual results:
The graph for Pod count alternates between showing and disappearing periodically.

Expected results:
The graph for Pod count should show correctly.

Additional info:
this is related to prometheus
Also monitored this in another cluster: if pods are rescheduled (not only the prometheus-k8s pods), the Pod count graph does not show for the first few minutes, I think almost 30 minutes.

Request URL: https://console-openshift-console.apps.qe-daily-0223.qe.devcluster.openshift.com/api/prometheus/api/v1/query_range?start=1645599083.446&end=1645602683.446&step=60&query=%0A++++++count%28%0A++++++++%28%0A++++++++++kube_running_pod_ready%0A++++++++++*%0A++++++++++on%28pod%2Cnamespace%29+group_left%28node%29+%28node_namespace_pod%3Akube_pod_info%3A%29%0A++++++++%29%0A++++++++*%0A++++++++on%28node%29+group_left%28role%29+%28max+by+%28node%29+%28kube_node_role%7Brole%3D%7E%22.%2B%22%7D%29%29%0A++++++%29%0A++++

Response code: 422, error:
{"status":"error","errorType":"execution","error":"found duplicate series for the match group {namespace=\"openshift-monitoring\", pod=\"alertmanager-main-1\"} on the right hand-side of the operation: [{__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-monitoring\", node=\"ip-10-0-194-133.ap-south-1.compute.internal\", pod=\"alertmanager-main-1\", prometheus=\"openshift-monitoring/k8s\"}, {__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-monitoring\", node=\"ip-10-0-128-68.ap-south-1.compute.internal\", pod=\"alertmanager-main-1\", prometheus=\"openshift-monitoring/k8s\"}];many-to-many matching not allowed: matching labels must be unique on one side"}

# oc -n openshift-monitoring get pod -o wide | grep alertmanager-main-1
alertmanager-main-1   6/6   Running   0   26m   10.128.2.62   ip-10-0-128-68.ap-south-1.compute.internal   <none>   <none>
(In reply to Junqi Zhao from comment #5)
> Also monitored this in another cluster: if pods are rescheduled (not only
> the prometheus-k8s pods), the Pod count graph does not show for the first
> few minutes, I think almost 30 minutes.
>
> Request URL:
> https://console-openshift-console.apps.qe-daily-0223.qe.devcluster.openshift.
> com/api/prometheus/api/v1/query_range?start=1645599083.446&end=1645602683.
> 446&step=60&query=%0A++++++count%28%0A++++++++%28%0A++++++++++kube_running_po
> d_ready%0A++++++++++*%0A++++++++++on%28pod%2Cnamespace%29+group_left%28node%2
> 9+%28node_namespace_pod%3Akube_pod_info%3A%29%0A++++++++%29%0A++++++++*%0A+++
> +++++on%28node%29+group_left%28role%29+%28max+by+%28node%29+%28kube_node_role
> %7Brole%3D%7E%22.%2B%22%7D%29%29%0A++++++%29%0A++++

Maybe it is the first hour during which we won't see the Pod count graph; the start to end time of the query range is 1 hour.

# date --date='@1645599083.446' -u
Wed Feb 23 06:51:23 UTC 2022
# date --date='@1645602683.446' -u
Wed Feb 23 07:51:23 UTC 2022
I think group_left is misused here. Normally group_left indicates that the left side can have multiple series that match a single series on the right. In the query

****************************
count(
  (
    kube_running_pod_ready
    *
    on(pod,namespace) group_left(node) (node_namespace_pod:kube_pod_info:)
  )
  *
  on(node) group_left(role) (max by (node) (kube_node_role{role=~".+"}))
)
****************************

kube_running_pod_ready doesn't have a node label. This label comes from the right side. The error occurs when a pod is reported from two different nodes. Whether this is a staleness issue or an exporter bug, I think this query expression should be fixed. Additionally, the group_left(role) in the second part of the query is superfluous, as no operand has a role label.

I think the following might fix the issue:

****************************
count(
  (
    kube_running_pod_ready
    *
    ignoring(node,uid) group_right
    node_namespace_pod:kube_pod_info:
  )
  *
  on(node) group_left
  (max by (node) (kube_node_role{role=~".+"}))
)
****************************
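For context (an illustration added here, not part of the proposal above): in PromQL the group_* modifier names the side that is allowed to carry several series per match group, and the labels listed in its parentheses are copied over from the "one" side. A minimal sketch with the metrics from this query:

****************************
# Sketch only: the LEFT operand may have many series per node (many pods per
# node), while the right operand must be unique per node, which max by (node)
# guarantees. group_right is the mirror image, with the "many" side on the
# right.
node_namespace_pod:kube_pod_info:
  * on(node) group_left()
max by (node) (kube_node_role{role=~".+"})
****************************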
@juzhao Any chance you can help me check the proposed fix? I have so far failed to reproduce this by deleting one or both prometheus-k8s pods. Does this maybe require another precondition?
(In reply to Jan Fajerski from comment #9)
> @juzhao Any chance you can help me check the proposed fix? I have so far
> failed to reproduce this by deleting one or both prometheus-k8s pods. Does
> this maybe require another precondition?

It seems it's not easy to reproduce: I rescheduled the prometheus-k8s-0 pod to a node different from the one it was scheduled on before, but did not reproduce the issue. I will keep an eye on it.
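One way to tell whether the precondition is present after a reschedule (a suggestion added for illustration, not something tried in this thread) is to query Prometheus directly for the pod that was moved; as long as two series come back, one per node, the console query should still hit the 422:

****************************
# Hypothetical check, assuming prometheus-k8s-0 is the pod that was rescheduled:
# two results (old node and new node) mean the stale series that triggers the
# many-to-many error is still being returned.
node_namespace_pod:kube_pod_info:{namespace="openshift-monitoring", pod="prometheus-k8s-0"}
****************************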
My intuition is that the many-to-many errors are due to staleness. "node_namespace_pod:kube_pod_info:" returns 2 series for the same (pod,namespace) tuple: 1 for the node where the old pod was scheduled and 1 for the node where the new pod is scheduled.

The usual fix is to add a surrounding "topk by(...) (1, ...)" clause grouping by the same labels as used by the on() join.

****************************
count(
  (
    kube_running_pod_ready
    *
    on(pod,namespace) group_left(node) (topk by(pod,namespace) (1, node_namespace_pod:kube_pod_info:))
  )
  *
  on(node) group_left() (max by (node) (kube_node_role{role=~".+"}))
)
****************************

Jan is also right that "group_left(role)" for the last part of the query can be replaced by "group_left()", though I'm not even sure that it is needed since kube_node_role should always have a role label?
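For illustration (not part of the comment above), the dedup clause on its own collapses the duplicates to a single series per join key, which is what makes the on(pod,namespace) join legal again:

****************************
# Picks at most one series per (pod, namespace), the same grouping used by the
# on(pod,namespace) join; which of the duplicate series survives does not
# matter for a plain count of pods.
topk by(pod,namespace) (1, node_namespace_pod:kube_pod_info:)
****************************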
(In reply to Simon Pasquier from comment #11)
> My intuition is that the many-to-many errors are due to staleness.
> "node_namespace_pod:kube_pod_info:" returns 2 series for the same
> (pod,namespace) tuple: 1 for the node where the old pod was scheduled and 1
> for the node where the new pod is scheduled.
> The usual fix is to add a surrounding "topk by(...) (1, ...)" clause
> grouping by the same labels as used by the on() join.

Right, though I think in this case we can forgo the topk since we're not really interested in the label that causes the many-to-many error.

@apickering
Sorry, didn't finish the comment above. @anpicker Can you point me to where this query lives?
Applying brute-force grep to openshift/console, I think it's here:
https://github.com/openshift/console/blob/79236193e2bfd3bfa93929145b11d7c778c6ceda/frontend/packages/console-shared/src/promql/cluster-dashboard.ts#L375-L386

But it also seems that other queries would need to be sanitized.
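As an illustration of what "sanitized" would mean for those other queries (a sketch of the pattern, not the console's actual code): any metric used as the "one" side of an on(<labels>) join gets wrapped so that it is unique per those labels, for example:

****************************
# Unique per node, for joins done with on(node):
max by (node) (kube_node_role{role=~".+"})

# Unique per (pod, namespace), for joins done with on(pod,namespace):
topk by(pod,namespace) (1, node_namespace_pod:kube_pod_info:)
****************************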
Checked with 4.11.0-0.nightly-2022-06-15-222801: the graph for "Pod count" is shown consistently, no matter whether we reschedule the prometheus pod to another node or not.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069