Bug 2003788 - CSR reconciler report error constantly when BYOH CSR approved by other Approver
Summary: CSR reconciler report error constantly when BYOH CSR approved by other Approver
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Joel Speed
QA Contact: Milind Yadav
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-09-13 17:31 UTC by Joel Speed
Modified: 2022-03-10 16:10 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of: 2002961
Environment:
Last Closed: 2022-03-10 16:10:01 UTC
Target Upstream Version:
Embargoed:


Links
System                  ID                                           Private  Priority  Status  Summary                                                                            Last Updated
Github                  openshift cluster-machine-approver pull 129  0        None      open    Bug 2003788: Prevent error loop when a CSR is queued and then approved externally  2021-09-13 17:42:55 UTC
Red Hat Product Errata  RHSA-2022:0056                               0        None      None    None                                                                               2022-03-10 16:10:28 UTC

Comment 2 Milind Yadav 2021-10-07 03:14:41 UTC
Installed a cluster with Windows workers and ran it for more than 20 hours; did not see any errors in the logs.

[miyadav@miyadav ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-10-05-151518   True        False         20h     Cluster version is 4.10.0-0.nightly-2021-10-05-151518
[miyadav@miyadav ~]$ 
[miyadav@miyadav ~]$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-142-79.us-east-2.compute.internal    Ready    worker   20h   v1.22.0-rc.0+1bcce0f
ip-10-0-150-207.us-east-2.compute.internal   Ready    master   20h   v1.22.0-rc.0+1bcce0f
ip-10-0-158-236.us-east-2.compute.internal   Ready    worker   20h   v1.22.1-1660+bbcc9aea9e4bef
ip-10-0-159-91.us-east-2.compute.internal    Ready    worker   19h   v1.22.1-1660+bbcc9aea9e4bef
ip-10-0-170-225.us-east-2.compute.internal   Ready    master   20h   v1.22.0-rc.0+1bcce0f
ip-10-0-175-245.us-east-2.compute.internal   Ready    worker   20h   v1.22.0-rc.0+1bcce0f
ip-10-0-223-46.us-east-2.compute.internal    Ready    master   20h   v1.22.0-rc.0+1bcce0f
ip-10-0-223-51.us-east-2.compute.internal    Ready    worker   20h   v1.22.0-rc.0+1bcce0f


[miyadav@miyadav ~]$ oc get machines -o wide -n openshift-machine-api
NAME                                                 PHASE     TYPE        REGION      ZONE         AGE   NODE                                         PROVIDERID                              STATE
miyadav-0610-4qqzq-master-0                          Running   m5.xlarge   us-east-2   us-east-2a   20h   ip-10-0-150-207.us-east-2.compute.internal   aws:///us-east-2a/i-065e32297436ece3c   running
miyadav-0610-4qqzq-master-1                          Running   m5.xlarge   us-east-2   us-east-2b   20h   ip-10-0-170-225.us-east-2.compute.internal   aws:///us-east-2b/i-07402e59ded8fa897   running
miyadav-0610-4qqzq-master-2                          Running   m5.xlarge   us-east-2   us-east-2c   20h   ip-10-0-223-46.us-east-2.compute.internal    aws:///us-east-2c/i-0357a86d2114a71f1   running
miyadav-0610-4qqzq-windows-worker-us-east-2a-gx658   Running   m5a.large   us-east-2   us-east-2a   20h   ip-10-0-159-91.us-east-2.compute.internal    aws:///us-east-2a/i-07de282a2c8a934bd   running
miyadav-0610-4qqzq-windows-worker-us-east-2a-tbfcm   Running   m5a.large   us-east-2   us-east-2a   20h   ip-10-0-158-236.us-east-2.compute.internal   aws:///us-east-2a/i-06a6c0806ffef04d4   running
miyadav-0610-4qqzq-worker-us-east-2a-8p7ns           Running   m5.large    us-east-2   us-east-2a   20h   ip-10-0-142-79.us-east-2.compute.internal    aws:///us-east-2a/i-01795b687e454c275   running
miyadav-0610-4qqzq-worker-us-east-2b-c6bqd           Running   m5.large    us-east-2   us-east-2b   20h   ip-10-0-175-245.us-east-2.compute.internal   aws:///us-east-2b/i-07c93ea96d93efebe   running
miyadav-0610-4qqzq-worker-us-east-2c-nqj9k           Running   m5.large    us-east-2   us-east-2c   20h   ip-10-0-223-51.us-east-2.compute.internal    aws:///us-east-2c/i-03cc509adef21fe63   running


Logs from oc logs windows-machine-config-operator-568d65f59f-qvxtw:
https://privatebin-it-iso.int.open.paas.redhat.com/?341fab9d51cc2ac9#GdwgbQWWHEosBxcHwEfLEGrLdbuGqpnuF4GJAYWyvq93


@joel, please help review the logs. Also, it looks like DEBUG logging is enabled, which may cause issues later.

Comment 3 Milind Yadav 2021-10-07 03:25:37 UTC
Adding more logs for comment 2 (Windows CSRs):

[miyadav@miyadav ~]$ oc get csr
NAME        AGE   SIGNERNAME                                    REQUESTOR                                                REQUESTEDDURATION   CONDITION
csr-72257   60m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-223-46.us-east-2.compute.internal    <none>              Approved,Issued
csr-778f6   52m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-142-79.us-east-2.compute.internal    <none>              Approved,Issued
csr-kskdc   65m   kubernetes.io/kube-apiserver-client-kubelet   system:node:ip-10-0-170-225.us-east-2.compute.internal   <none>              Approved,Issued
csr-qkvsm   61m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-159-91.us-east-2.compute.internal    <none>              Approved,Issued
csr-shnn2   77m   kubernetes.io/kube-apiserver-client-kubelet   system:node:ip-10-0-150-207.us-east-2.compute.internal   <none>              Approved,Issued
csr-wjbnm   78m   kubernetes.io/kube-apiserver-client-kubelet   system:node:ip-10-0-175-245.us-east-2.compute.internal   <none>              Approved,Issued
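
The CONDITION column in a listing like the one above is what shows whether a CSR is still waiting for an approver. As a rough sketch only, the following script scans `oc get csr` text output and returns the names of CSRs that are neither approved nor denied. The function name `undecided_csrs` and the `csr-hypothetical` row are made up for illustration; the other two rows are copied from the listing above.

```python
from typing import List


def undecided_csrs(table: str) -> List[str]:
    """Return names of CSRs whose CONDITION column shows no decision yet."""
    lines = [ln for ln in table.splitlines() if ln.strip()]
    names = []
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        # NAME is the first column and CONDITION is the last one.
        name, condition = fields[0], fields[-1]
        if "Approved" not in condition and "Denied" not in condition:
            names.append(name)
    return names


# Two rows copied from the listing above, plus one hypothetical Pending row.
sample = """\
NAME        AGE   SIGNERNAME                                    REQUESTOR                                                REQUESTEDDURATION   CONDITION
csr-72257   60m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-223-46.us-east-2.compute.internal    <none>              Approved,Issued
csr-kskdc   65m   kubernetes.io/kube-apiserver-client-kubelet   system:node:ip-10-0-170-225.us-east-2.compute.internal   <none>              Approved,Issued
csr-hypothetical   1m   kubernetes.io/kubelet-serving           system:node:example-node                                 <none>              Pending
"""

print(undecided_csrs(sample))  # ['csr-hypothetical']
```

The parser relies on none of the columns containing spaces, which holds for the listing above but is an assumption in general.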

oc logs deployment.apps/machine-approver machine-approver-controller -n openshift-cluster-machine-approver

https://privatebin-it-iso.int.open.paas.redhat.com/?ee2d69bec6c1c7be#DAju7NuwFJ4tnyEjUVt8mwszsuQStuhuoLfL3aPn1hKk 

Logs for another Windows worker CSR:
https://privatebin-it-iso.int.open.paas.redhat.com/?93211568b74d08f8#HJ6bzDP4SLy1HA7zbifUqAGPLv6hPVHzHs24c7an56uQ

Comment 4 Joel Speed 2021-10-07 12:03:09 UTC
I can't see any issues in the logs, but I'm not sure the fixed code path was actually executed. Reproducing this issue is very difficult, because the CSR has to be queued and then approved externally. I'm not sure it's worth our time hunting for an explicit event where this happens. It's likely to be sporadic: it needs the WMCO and the CSR approver running simultaneously, and even then it might only happen when we add a new Windows machine.
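
To illustrate the race in question (a CSR sits in the reconciler's queue and another approver decides it first), here is a minimal sketch of a guard that skips already-decided CSRs instead of returning an error and requeueing. This is not the cluster-machine-approver code. The `CSR` class, `is_decided`, `reconcile`, and the `authorize` callback are hypothetical names, and the condition strings only mirror the Kubernetes CSR condition types `Approved` and `Denied`.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CSR:
    """Hypothetical stand-in for a CertificateSigningRequest object."""
    name: str
    # Condition types already recorded on the CSR, e.g. "Approved" or "Denied".
    conditions: List[str] = field(default_factory=list)


def is_decided(csr: CSR) -> bool:
    """Return True if any approver has already approved or denied the CSR."""
    return any(c in ("Approved", "Denied") for c in csr.conditions)


def reconcile(csr: CSR, authorize: Callable[[CSR], bool]) -> str:
    """Process one queued CSR and report what happened.

    Returns "skipped" when the CSR was decided elsewhere, so the caller has
    no error to retry; raises only when authorization actually fails.
    """
    if is_decided(csr):
        # Another approver handled it while it was waiting in the queue.
        return "skipped"
    if not authorize(csr):
        raise RuntimeError(f"CSR {csr.name} not authorized")
    csr.conditions.append("Approved")
    return "approved"


# A CSR that was approved externally while queued is skipped, even though
# this approver's own authorization check would have rejected it.
queued = CSR(name="csr-example", conditions=["Approved"])
print(reconcile(queued, authorize=lambda c: False))  # skipped
```

Without the `is_decided` check, the last call would raise on every retry, which is the kind of constant error loop the bug title describes.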

Comment 5 Milind Yadav 2021-10-08 07:43:30 UTC
Thanks @jspeed for the review. I also need input on the debug messages, I mean the granularity of messages in the logs: https://privatebin-it-iso.int.open.paas.redhat.com/?341fab9d51cc2ac9#GdwgbQWWHEosBxcHwEfLEGrLdbuGqpnuF4GJAYWyvq93

Comment 6 Joel Speed 2021-10-08 10:10:41 UTC
Those logs are part of the WMCO, so it's probably best to check with that team about how much they are logging. I don't think that necessarily affects this bug, but it might be worth asking on https://bugzilla.redhat.com/show_bug.cgi?id=2002961, in which the WMCO team is solving the same sort of issue as here.

Comment 9 Milind Yadav 2021-10-14 02:21:31 UTC
Thanks Joel and Mansi for the comments; moving to VERIFIED.

Comment 12 errata-xmlrpc 2022-03-10 16:10:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

