Bug 2057251
| Summary: | response code for Pod count graph changed from 422 to 200 periodically for about 30 minutes if pod is rescheduled | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Junqi Zhao <juzhao> |
| Component: | Management Console | Assignee: | Jan Fajerski <jfajersk> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.11 | CC: | anpicker, aos-bugs, hongyli, spasquie |
| Target Milestone: | --- | | |
| Target Release: | 4.11.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-10 10:50:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description (Junqi Zhao, 2022-02-23 04:29:05 UTC)
Also monitored on another cluster: if pods are rescheduled (not only the prometheus-k8s pod), the Pod count graph does not show for the first few minutes, probably close to 30 minutes.

Request URL: https://console-openshift-console.apps.qe-daily-0223.qe.devcluster.openshift.com/api/prometheus/api/v1/query_range?start=1645599083.446&end=1645602683.446&step=60&query=%0A++++++count%28%0A++++++++%28%0A++++++++++kube_running_pod_ready%0A++++++++++*%0A++++++++++on%28pod%2Cnamespace%29+group_left%28node%29+%28node_namespace_pod%3Akube_pod_info%3A%29%0A++++++++%29%0A++++++++*%0A++++++++on%28node%29+group_left%28role%29+%28max+by+%28node%29+%28kube_node_role%7Brole%3D%7E%22.%2B%22%7D%29%29%0A++++++%29%0A++++

Response code: 422, error:
{"status":"error","errorType":"execution","error":"found duplicate series for the match group {namespace=\"openshift-monitoring\", pod=\"alertmanager-main-1\"} on the right hand-side of the operation: [{__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-monitoring\", node=\"ip-10-0-194-133.ap-south-1.compute.internal\", pod=\"alertmanager-main-1\", prometheus=\"openshift-monitoring/k8s\"}, {__name__=\"node_namespace_pod:kube_pod_info:\", namespace=\"openshift-monitoring\", node=\"ip-10-0-128-68.ap-south-1.compute.internal\", pod=\"alertmanager-main-1\", prometheus=\"openshift-monitoring/k8s\"}];many-to-many matching not allowed: matching labels must be unique on one side"}

# oc -n openshift-monitoring get pod -o wide | grep alertmanager-main-1
alertmanager-main-1   6/6   Running   0   26m   10.128.2.62   ip-10-0-128-68.ap-south-1.compute.internal   <none>   <none>

(In reply to Junqi Zhao from comment #5)
> also monitored in other cluster, if pods are rescheduled, not only for
> prometheus-k8s pod, in the first a few minutes, I think almost 30 minutes,
> the Pod count graph will not show

Maybe for the first hour we won't see the Pod count graph; the start-to-end time of the query range is 1 hour:

# date --date='@1645599083.446' -u
Wed Feb 23 06:51:23 UTC 2022
# date --date='@1645602683.446' -u
Wed Feb 23 07:51:23 UTC 2022

I think group_left is misused here. Normally group_left indicates that the left side has multiple series that should match with a single series on the right. In the query
****************************
count(
(
kube_running_pod_ready
*
on(pod,namespace) group_left(node) (node_namespace_pod:kube_pod_info:)
)
*
on(node) group_left(role) (max by (node) (kube_node_role{role=~".+"}))
)
****************************
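For reference, the same expression can be run directly against the in-cluster query endpoint, which makes the 422 and the many-to-many error easier to capture than watching the console graph. This is only a rough sketch: it assumes the thanos-querier route exists in openshift-monitoring and that the logged-in user is allowed to query it, and the TOKEN and HOST variables are just illustrative names.
****************************
# sketch, not verified on this cluster: fetch a token and the query host
TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
# run the expression the console sends; a 422 response with "many-to-many
# matching not allowed" in the body reproduces the graph failure
curl -sk -H "Authorization: Bearer $TOKEN" \
  "https://$HOST/api/v1/query" \
  --data-urlencode 'query=count((kube_running_pod_ready * on(pod,namespace) group_left(node) (node_namespace_pod:kube_pod_info:)) * on(node) group_left(role) (max by (node) (kube_node_role{role=~".+"})))'
****************************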
kube_running_pod_ready doesn't have a node label; that label comes from the right-hand side of the join. The error occurs when a pod is reported from two different nodes. Whether this is a staleness issue or an exporter bug, I think this query expression should be fixed.
Additionally, the group_left(role) in the second part of the query is superfluous, as neither operand of that join has a role label (the max by (node) drops it).
I think the following might fix the issue:
****************************
count(
(
kube_running_pod_ready
*
ignoring(node,uid) group_right node_namespace_pod:kube_pod_info:
)
*
on(node) group_left() (max by (node) (kube_node_role{role=~".+"}))
)
****************************
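Since the error can only occur while node_namespace_pod:kube_pod_info: carries two series for the same pod, one quick way to see whether that precondition currently holds is to count the series per (namespace, pod). Again only a sketch, reusing the TOKEN and HOST assumptions from the curl example above:
****************************
# sketch: any result returned here means duplicate series exist right now,
# which is exactly the precondition for the many-to-many error
curl -sk -H "Authorization: Bearer $TOKEN" \
  "https://$HOST/api/v1/query" \
  --data-urlencode 'query=count by (namespace, pod) (node_namespace_pod:kube_pod_info:) > 1'
****************************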
@juzhao Any chance you can help me check the proposed fix? I have so far failed to reproduce this by deleting one or both prometheus-k8s pods. Does this maybe require another precondition?

(In reply to Jan Fajerski from comment #9)
> @juzhao Any chance you can help me check the proposed fix? I have so far
> failed to reproduce this by deleting one or both prometheus-k8s pods. Does
> this maybe require another precondition?

It seems it is not easy to reproduce: I rescheduled the prometheus-k8s-0 pod to another node, different from the one it was originally scheduled on, and did not reproduce the issue. I will keep an eye on it.

My intuition is that the many-to-many errors are due to staleness.
"node_namespace_pod:kube_pod_info:" returns 2 series for the same (pod,namespace) tuple: 1 for the node where the old pod was scheduled and 1 for the node where the new node is scheduled.
The usual fix is to add a surrounding "topk by(...) (1, ...)" clause grouping by the same labels as used by the on() join.
****************************
count(
(
kube_running_pod_ready
*
on(pod,namespace) group_left(node) (topk by(pod,namespace) (1,node_namespace_pod:kube_pod_info:))
)
*
on(node) group_left() (max by (node) (kube_node_role{role=~".+"}))
)
****************************
Jan is also right that "group_left(role)" in the last part of the query can be replaced by "group_left()", though I'm not even sure that it is needed since kube_node_role should always have a role label?
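One way to settle whether kube_node_role always carries a role label is to compare the series counts with and without the role matcher. A sketch along the same lines as the earlier curl examples, with the same assumed route and token:
****************************
# sketch: prints a sample value of 1 if every kube_node_role series has a
# non-empty role label, 0 otherwise
curl -sk -H "Authorization: Bearer $TOKEN" \
  "https://$HOST/api/v1/query" \
  --data-urlencode 'query=count(kube_node_role) == bool count(kube_node_role{role=~".+"})'
****************************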
(In reply to Simon Pasquier from comment #11)
> My intuition is that the many-to-many errors are due to staleness.
> "node_namespace_pod:kube_pod_info:" returns 2 series for the same
> (pod,namespace) tuple: 1 for the node where the old pod was scheduled and 1
> for the node where the new pod is scheduled.
> The usual fix is to add a surrounding "topk by(...) (1, ...)" clause
> grouping by the same labels as used by the on() join.

Right, although I think in this case we can forgo the topk, since we're not really interested in the label that causes the many-to-many error.

@apickering Sorry, didn't finish the comment above.

@anpicker Can you point me to where this query lives? Applying brute-force grep to openshift/console, I think it's here:
https://github.com/openshift/console/blob/79236193e2bfd3bfa93929145b11d7c778c6ceda/frontend/packages/console-shared/src/promql/cluster-dashboard.ts#L375-L386
But it also seems that other queries would need to be sanitized.

Checked with 4.11.0-0.nightly-2022-06-15-222801: the "Pod count" graph is consistent whether or not we reschedule the prometheus pod to another node.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069