Bug 2003788

Summary: CSR reconciler report error constantly when BYOH CSR approved by other Approver
Product: OpenShift Container Platform
Reporter: Joel Speed <jspeed>
Component: Cloud Compute
Assignee: Joel Speed <jspeed>
Cloud Compute sub component: Other Providers
QA Contact: Milind Yadav <miyadav>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: medium
CC: aos-bugs, mankulka, mohashai, sgao, team-winc
Version: 4.9   
Target Milestone: ---   
Target Release: 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 2002961
Environment:
Last Closed: 2022-03-10 16:10:01 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Comment 2 Milind Yadav 2021-10-07 03:14:41 UTC
Installed a cluster with Windows workers and left it running for more than 20 hours; did not see any errors in the logs.

[miyadav@miyadav ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-10-05-151518   True        False         20h     Cluster version is 4.10.0-0.nightly-2021-10-05-151518
[miyadav@miyadav ~]$ 
[miyadav@miyadav ~]$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-142-79.us-east-2.compute.internal    Ready    worker   20h   v1.22.0-rc.0+1bcce0f
ip-10-0-150-207.us-east-2.compute.internal   Ready    master   20h   v1.22.0-rc.0+1bcce0f
ip-10-0-158-236.us-east-2.compute.internal   Ready    worker   20h   v1.22.1-1660+bbcc9aea9e4bef
ip-10-0-159-91.us-east-2.compute.internal    Ready    worker   19h   v1.22.1-1660+bbcc9aea9e4bef
ip-10-0-170-225.us-east-2.compute.internal   Ready    master   20h   v1.22.0-rc.0+1bcce0f
ip-10-0-175-245.us-east-2.compute.internal   Ready    worker   20h   v1.22.0-rc.0+1bcce0f
ip-10-0-223-46.us-east-2.compute.internal    Ready    master   20h   v1.22.0-rc.0+1bcce0f
ip-10-0-223-51.us-east-2.compute.internal    Ready    worker   20h   v1.22.0-rc.0+1bcce0f


[miyadav@miyadav ~]$ oc get machines -o wide -n openshift-machine-api
NAME                                                 PHASE     TYPE        REGION      ZONE         AGE   NODE                                         PROVIDERID                              STATE
miyadav-0610-4qqzq-master-0                          Running   m5.xlarge   us-east-2   us-east-2a   20h   ip-10-0-150-207.us-east-2.compute.internal   aws:///us-east-2a/i-065e32297436ece3c   running
miyadav-0610-4qqzq-master-1                          Running   m5.xlarge   us-east-2   us-east-2b   20h   ip-10-0-170-225.us-east-2.compute.internal   aws:///us-east-2b/i-07402e59ded8fa897   running
miyadav-0610-4qqzq-master-2                          Running   m5.xlarge   us-east-2   us-east-2c   20h   ip-10-0-223-46.us-east-2.compute.internal    aws:///us-east-2c/i-0357a86d2114a71f1   running
miyadav-0610-4qqzq-windows-worker-us-east-2a-gx658   Running   m5a.large   us-east-2   us-east-2a   20h   ip-10-0-159-91.us-east-2.compute.internal    aws:///us-east-2a/i-07de282a2c8a934bd   running
miyadav-0610-4qqzq-windows-worker-us-east-2a-tbfcm   Running   m5a.large   us-east-2   us-east-2a   20h   ip-10-0-158-236.us-east-2.compute.internal   aws:///us-east-2a/i-06a6c0806ffef04d4   running
miyadav-0610-4qqzq-worker-us-east-2a-8p7ns           Running   m5.large    us-east-2   us-east-2a   20h   ip-10-0-142-79.us-east-2.compute.internal    aws:///us-east-2a/i-01795b687e454c275   running
miyadav-0610-4qqzq-worker-us-east-2b-c6bqd           Running   m5.large    us-east-2   us-east-2b   20h   ip-10-0-175-245.us-east-2.compute.internal   aws:///us-east-2b/i-07c93ea96d93efebe   running
miyadav-0610-4qqzq-worker-us-east-2c-nqj9k           Running   m5.large    us-east-2   us-east-2c   20h   ip-10-0-223-51.us-east-2.compute.internal    aws:///us-east-2c/i-03cc509adef21fe63   running


Logs: oc logs windows-machine-config-operator-568d65f59f-qvxtw
https://privatebin-it-iso.int.open.paas.redhat.com/?341fab9d51cc2ac9#GdwgbQWWHEosBxcHwEfLEGrLdbuGqpnuF4GJAYWyvq93


@joel, please help review the logs. Also, it looks like DEBUG logging is enabled, which may cause issues later.

Comment 3 Milind Yadav 2021-10-07 03:25:37 UTC
Adding more logs for comment 2 (Windows CSRs):

[miyadav@miyadav ~]$ oc get csr
NAME        AGE   SIGNERNAME                                    REQUESTOR                                                REQUESTEDDURATION   CONDITION
csr-72257   60m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-223-46.us-east-2.compute.internal    <none>              Approved,Issued
csr-778f6   52m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-142-79.us-east-2.compute.internal    <none>              Approved,Issued
csr-kskdc   65m   kubernetes.io/kube-apiserver-client-kubelet   system:node:ip-10-0-170-225.us-east-2.compute.internal   <none>              Approved,Issued
csr-qkvsm   61m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-159-91.us-east-2.compute.internal    <none>              Approved,Issued
csr-shnn2   77m   kubernetes.io/kube-apiserver-client-kubelet   system:node:ip-10-0-150-207.us-east-2.compute.internal   <none>              Approved,Issued
csr-wjbnm   78m   kubernetes.io/kube-apiserver-client-kubelet   system:node:ip-10-0-175-245.us-east-2.compute.internal   <none>              Approved,Issued
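
A minimal sketch, assuming the table layout shown above, of how such a listing could be scanned for CSRs that have not yet been approved. The helper names `parse_csr_table` and `pending_csrs` are hypothetical, and the `csr-example` row is invented for illustration; it does not come from this cluster. The parser assumes columns are separated by two or more spaces, which holds for the output above but is not guaranteed in general.

```python
import re

def parse_csr_table(text):
    """Parse `oc get csr` table output into a list of dicts keyed by column header.

    Assumption: columns are separated by at least two spaces and no cell
    contains a run of two or more spaces.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    headers = re.split(r"\s{2,}", lines[0].strip())
    rows = []
    for ln in lines[1:]:
        fields = re.split(r"\s{2,}", ln.strip())
        rows.append(dict(zip(headers, fields)))
    return rows

def pending_csrs(rows):
    """Return the names of CSRs whose CONDITION column lacks 'Approved'."""
    return [
        r["NAME"]
        for r in rows
        if "Approved" not in r.get("CONDITION", "").split(",")
    ]

# Two rows copied from the listing above (spacing compacted), plus one
# hypothetical Pending row added purely for illustration.
SAMPLE = """\
NAME   AGE   SIGNERNAME   REQUESTOR   REQUESTEDDURATION   CONDITION
csr-72257   60m   kubernetes.io/kubelet-serving   system:node:ip-10-0-223-46.us-east-2.compute.internal   <none>   Approved,Issued
csr-qkvsm   61m   kubernetes.io/kubelet-serving   system:node:ip-10-0-159-91.us-east-2.compute.internal   <none>   Approved,Issued
csr-example   1m   kubernetes.io/kubelet-serving   system:node:example-node   <none>   Pending
"""

print(pending_csrs(parse_csr_table(SAMPLE)))
```

For anything beyond a quick check, reading `oc get csr -o json` and inspecting `status.conditions` would likely be more robust than parsing the table text.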

oc logs deployment.apps/machine-approver machine-approver-controller -n openshift-cluster-machine-approver

https://privatebin-it-iso.int.open.paas.redhat.com/?ee2d69bec6c1c7be#DAju7NuwFJ4tnyEjUVt8mwszsuQStuhuoLfL3aPn1hKk 

CSR logs for another Windows worker:
https://privatebin-it-iso.int.open.paas.redhat.com/?93211568b74d08f8#HJ6bzDP4SLy1HA7zbifUqAGPLv6hPVHzHs24c7an56uQ

Comment 4 Joel Speed 2021-10-07 12:03:09 UTC
I can't see any issues in the logs, but I'm not sure the cluster actually executed the affected code path. Reproducing this issue is very difficult, because we need to queue the CSR and then have it approved by another approver. I'm not sure it's worth our time trying to hunt for an explicit event where this happens. It's likely to be sporadic: it needs the WMCO and the CSR approver running simultaneously, and even then it might only happen when we add a new Windows machine.
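
To make the race concrete, here is a minimal sketch of a reconcile step that tolerates a CSR being approved by someone else while it sat in the queue. This is not the actual machine-approver fix; `reconcile`, `is_approved`, and the `approve` callback are hypothetical names, and the CSR is modelled as a plain dict following the Kubernetes CertificateSigningRequest shape (`status.conditions[].type`).

```python
def is_approved(csr):
    """Return True if the CSR status already carries an Approved condition."""
    conditions = csr.get("status", {}).get("conditions") or []
    return any(c.get("type") == "Approved" for c in conditions)

def reconcile(csr, approve):
    """Process one queued CSR.

    If another approver handled the CSR after it was queued, return
    "skipped" without calling approve() and without raising, so the
    caller has no error to report and no reason to requeue.
    """
    if is_approved(csr):
        return "skipped"
    approve(csr)
    return "approved"

# Hypothetical usage: the first CSR was approved elsewhere, the second is pending.
approved_calls = []
already_done = {
    "metadata": {"name": "csr-qkvsm"},
    "status": {"conditions": [{"type": "Approved"}]},
}
still_pending = {"metadata": {"name": "csr-example"}, "status": {}}

print(reconcile(already_done, approved_calls.append))
print(reconcile(still_pending, approved_calls.append))
```

The point of the sketch is only the early return: treating "already approved" as a normal outcome rather than a failure is what would stop the constant error reports described in the summary.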

Comment 5 Milind Yadav 2021-10-08 07:43:30 UTC
Thanks @jspeed for the review. I also need input on the debug messages, that is, the granularity of messages in the logs: https://privatebin-it-iso.int.open.paas.redhat.com/?341fab9d51cc2ac9#GdwgbQWWHEosBxcHwEfLEGrLdbuGqpnuF4GJAYWyvq93

Comment 6 Joel Speed 2021-10-08 10:10:41 UTC
Those logs come from the WMCO, so it's probably best to check with that team about how much they are logging. I don't think that necessarily affects this bug, but it might be worth asking on https://bugzilla.redhat.com/show_bug.cgi?id=2002961, where the WMCO team is solving the same sort of issue as here.

Comment 9 Milind Yadav 2021-10-14 02:21:31 UTC
Thanks, Joel and Mansi, for the comments. Moving to VERIFIED.

Comment 12 errata-xmlrpc 2022-03-10 16:10:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056