Bug 2057251
| Summary: | response code for Pod count graph changed from 422 to 200 periodically for about 30 minutes if pod is rescheduled | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Junqi Zhao <juzhao> |
| Component: | Management Console | Assignee: | Jan Fajerski <jfajersk> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.11 | CC: | anpicker, aos-bugs, hongyli, spasquie |
| Target Milestone: | --- | | |
| Target Release: | 4.11.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-10 10:50:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description (Junqi Zhao, 2022-02-23 04:29:05 UTC)
Also monitored on another cluster: if pods are rescheduled (not only the prometheus-k8s pod), the Pod count graph does not show for the first few minutes, probably close to 30 minutes.

Request URL: https://console-openshift-console.apps.qe-daily-0223.qe.devcluster.openshift.com/api/prometheus/api/v1/query_range?start=1645599083.446&end=1645602683.446&step=60&query=%0A++++++count%28%0A++++++++%28%0A++++++++++kube_running_pod_ready%0A++++++++++*%0A++++++++++on%28pod%2Cnamespace%29+group_left%28node%29+%28node_namespace_pod%3Akube_pod_info%3A%29%0A++++++++%29%0A++++++++*%0A++++++++on%28node%29+group_left%28role%29+%28max+by+%28node%29+%28kube_node_role%7Brole%3D%7E%22.%2B%22%7D%29%29%0A++++++%29%0A++++

Response code: 422, error:
{"status":"error","errorType":"execution","error":"found duplicate series for the match group {namespace=\"openshift-monitoring\", pod=\"alertmanager-main-1\"} on the right hand-side of the operation: [{__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-monitoring\", node=\"ip-10-0-194-133.ap-south-1.compute.internal\", pod=\"alertmanager-main-1\", prometheus=\"openshift-monitoring/k8s\"}, {__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-monitoring\", node=\"ip-10-0-128-68.ap-south-1.compute.internal\", pod=\"alertmanager-main-1\", prometheus=\"openshift-monitoring/k8s\"}];many-to-many matching not allowed: matching labels must be unique on one side"}

# oc -n openshift-monitoring get pod -o wide | grep alertmanager-main-1
alertmanager-main-1   6/6   Running   0   26m   10.128.2.62   ip-10-0-128-68.ap-south-1.compute.internal   <none>   <none>

(In reply to Junqi Zhao from comment #5)
> also monitored in other cluster, if pods are rescheduled, not only for
> prometheus-k8s pod, in the first a few minutes, I think almost 30 minutes,
> the Pod count graph will not show

Maybe for the first hour we won't see the Pod count graph; the start-to-end time of the query range is 1 hour:

# date --date='@1645599083.446' -u
Wed Feb 23 06:51:23 UTC 2022
# date --date='@1645602683.446' -u
Wed Feb 23 07:51:23 UTC 2022

I think group_left is misused here. Normally group_left indicates that the left side has multiple series that should match with a single series on the right. In the query
****************************
count(
(
kube_running_pod_ready
*
on(pod,namespace) group_left(node) (node_namespace_pod:kube_pod_info:)
)
*
on(node) group_left(role) (max by (node) (kube_node_role{role=~".+"}))
)
****************************
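For reference, the same expression can be run directly against the in-cluster query endpoint, which makes the 422 and the many-to-many error easier to capture than watching the console graph. This is only a rough sketch: it assumes the thanos-querier route exists in openshift-monitoring and that the logged-in user is allowed to query it, and the TOKEN and HOST variables are just illustrative names.
****************************
# sketch, not verified on this cluster: fetch a token and the query host
TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
# run the expression the console sends; a 422 response with "many-to-many
# matching not allowed" in the body reproduces the graph failure
curl -sk -H "Authorization: Bearer $TOKEN" \
  "https://$HOST/api/v1/query" \
  --data-urlencode 'query=count((kube_running_pod_ready * on(pod,namespace) group_left(node) (node_namespace_pod:kube_pod_info:)) * on(node) group_left(role) (max by (node) (kube_node_role{role=~".+"})))'
****************************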
kube_running_pod_ready doesn't have a node label; that label comes from the right-hand side of the join. The error occurs when a pod is reported from two different nodes. Whether this is a staleness issue or an exporter bug, I think this query expression should be fixed.
Additionally, the group_left(role) in the second part of the query is superfluous, as neither operand of that join has a role label (the max by (node) drops it).
I think the following might fix the issue:
****************************
count(
(
kube_running_pod_ready
*
ignoring(node,uid) group_right node_namespace_pod:kube_pod_info:
)
*
on(node) group_left() (max by (node) (kube_node_role{role=~".+"}))
)
****************************
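Since the error can only occur while node_namespace_pod:kube_pod_info: carries two series for the same pod, one quick way to see whether that precondition currently holds is to count the series per (namespace, pod). Again only a sketch, reusing the TOKEN and HOST assumptions from the curl example above:
****************************
# sketch: any result returned here means duplicate series exist right now,
# which is exactly the precondition for the many-to-many error
curl -sk -H "Authorization: Bearer $TOKEN" \
  "https://$HOST/api/v1/query" \
  --data-urlencode 'query=count by (namespace, pod) (node_namespace_pod:kube_pod_info:) > 1'
****************************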
@juzhao Any chance you can help me check the proposed fix? I have so far failed to reproduce this by deleting one or both prometheus-k8s pods. Does this maybe require another precondition?

(In reply to Jan Fajerski from comment #9)
> @juzhao Any chance you can help me check the proposed fix? I have so far
> failed to reproduce this by deleting one or both prometheus-k8s pods. Does
> this maybe require another precondition?

It seems it is not easy to reproduce: I rescheduled the prometheus-k8s-0 pod to another node, different from the one it was originally scheduled on, and did not reproduce the issue. I will keep an eye on it.

My intuition is that the many-to-many errors are due to staleness.
"node_namespace_pod:kube_pod_info:" returns 2 series for the same (pod,namespace) tuple: 1 for the node where the old pod was scheduled and 1 for the node where the new node is scheduled.
The usual fix is to add a surrounding "topk by(...) (1, ...)" clause grouping by the same labels as used by the on() join.
****************************
count(
(
kube_running_pod_ready
*
on(pod,namespace) group_left(node) (topk by(pod,namespace) (1,node_namespace_pod:kube_pod_info:))
)
*
on(node) group_left() (max by (node) (kube_node_role{role=~".+"}))
)
****************************
Jan is also right that "group_left(role)" in the last part of the query can be replaced by "group_left()", though I'm not even sure that it is needed since kube_node_role should always have a role label?
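One way to settle whether kube_node_role always carries a role label is to compare the series counts with and without the role matcher. A sketch along the same lines as the earlier curl examples, with the same assumed route and token:
****************************
# sketch: prints a sample value of 1 if every kube_node_role series has a
# non-empty role label, 0 otherwise
curl -sk -H "Authorization: Bearer $TOKEN" \
  "https://$HOST/api/v1/query" \
  --data-urlencode 'query=count(kube_node_role) == bool count(kube_node_role{role=~".+"})'
****************************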
(In reply to Simon Pasquier from comment #11)
> My intuition is that the many-to-many errors are due to staleness.
> "node_namespace_pod:kube_pod_info:" returns 2 series for the same
> (pod,namespace) tuple: 1 for the node where the old pod was scheduled and 1
> for the node where the new pod is scheduled.
> The usual fix is to add a surrounding "topk by(...) (1, ...)" clause
> grouping by the same labels as used by the on() join.

Right, although I think in this case we can forgo the topk, since we're not really interested in the label that causes the many-to-many error.

@apickering Sorry, didn't finish the comment above.

@anpicker Can you point me to where this query lives? Applying brute-force grep to openshift/console, I think it's here:
https://github.com/openshift/console/blob/79236193e2bfd3bfa93929145b11d7c778c6ceda/frontend/packages/console-shared/src/promql/cluster-dashboard.ts#L375-L386
But it also seems that other queries would need to be sanitized.

Checked with 4.11.0-0.nightly-2022-06-15-222801: the "Pod count" graph is consistent whether or not we reschedule the prometheus pod to another node.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069