Installed a cluster with Windows workers and left it running for more than 20 hours; did not see any errors in the logs.

[miyadav@miyadav ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-10-05-151518   True        False         20h     Cluster version is 4.10.0-0.nightly-2021-10-05-151518

[miyadav@miyadav ~]$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-142-79.us-east-2.compute.internal    Ready    worker   20h   v1.22.0-rc.0+1bcce0f
ip-10-0-150-207.us-east-2.compute.internal   Ready    master   20h   v1.22.0-rc.0+1bcce0f
ip-10-0-158-236.us-east-2.compute.internal   Ready    worker   20h   v1.22.1-1660+bbcc9aea9e4bef
ip-10-0-159-91.us-east-2.compute.internal    Ready    worker   19h   v1.22.1-1660+bbcc9aea9e4bef
ip-10-0-170-225.us-east-2.compute.internal   Ready    master   20h   v1.22.0-rc.0+1bcce0f
ip-10-0-175-245.us-east-2.compute.internal   Ready    worker   20h   v1.22.0-rc.0+1bcce0f
ip-10-0-223-46.us-east-2.compute.internal    Ready    master   20h   v1.22.0-rc.0+1bcce0f
ip-10-0-223-51.us-east-2.compute.internal    Ready    worker   20h   v1.22.0-rc.0+1bcce0f

[miyadav@miyadav ~]$ oc get machines -o wide -n openshift-machine-api
NAME                                                 PHASE     TYPE        REGION      ZONE         AGE   NODE                                         PROVIDERID                              STATE
miyadav-0610-4qqzq-master-0                          Running   m5.xlarge   us-east-2   us-east-2a   20h   ip-10-0-150-207.us-east-2.compute.internal   aws:///us-east-2a/i-065e32297436ece3c   running
miyadav-0610-4qqzq-master-1                          Running   m5.xlarge   us-east-2   us-east-2b   20h   ip-10-0-170-225.us-east-2.compute.internal   aws:///us-east-2b/i-07402e59ded8fa897   running
miyadav-0610-4qqzq-master-2                          Running   m5.xlarge   us-east-2   us-east-2c   20h   ip-10-0-223-46.us-east-2.compute.internal    aws:///us-east-2c/i-0357a86d2114a71f1   running
miyadav-0610-4qqzq-windows-worker-us-east-2a-gx658   Running   m5a.large   us-east-2   us-east-2a   20h   ip-10-0-159-91.us-east-2.compute.internal    aws:///us-east-2a/i-07de282a2c8a934bd   running
miyadav-0610-4qqzq-windows-worker-us-east-2a-tbfcm   Running   m5a.large   us-east-2   us-east-2a   20h   ip-10-0-158-236.us-east-2.compute.internal   aws:///us-east-2a/i-06a6c0806ffef04d4   running
miyadav-0610-4qqzq-worker-us-east-2a-8p7ns           Running   m5.large    us-east-2   us-east-2a   20h   ip-10-0-142-79.us-east-2.compute.internal    aws:///us-east-2a/i-01795b687e454c275   running
miyadav-0610-4qqzq-worker-us-east-2b-c6bqd           Running   m5.large    us-east-2   us-east-2b   20h   ip-10-0-175-245.us-east-2.compute.internal   aws:///us-east-2b/i-07c93ea96d93efebe   running
miyadav-0610-4qqzq-worker-us-east-2c-nqj9k           Running   m5.large    us-east-2   us-east-2c   20h   ip-10-0-223-51.us-east-2.compute.internal    aws:///us-east-2c/i-03cc509adef21fe63   running

Logs (oc logs windows-machine-config-operator-568d65f59f-qvxtw):
https://privatebin-it-iso.int.open.paas.redhat.com/?341fab9d51cc2ac9#GdwgbQWWHEosBxcHwEfLEGrLdbuGqpnuF4GJAYWyvq93

@joel, please help review the logs. Also, it looks like DEBUG logging is enabled, which will cause issues later.
Adding more logs for comment #2 (Windows CSRs):

[miyadav@miyadav ~]$ oc get csr
NAME        AGE   SIGNERNAME                                    REQUESTOR                                                REQUESTEDDURATION   CONDITION
csr-72257   60m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-223-46.us-east-2.compute.internal    <none>              Approved,Issued
csr-778f6   52m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-142-79.us-east-2.compute.internal    <none>              Approved,Issued
csr-kskdc   65m   kubernetes.io/kube-apiserver-client-kubelet   system:node:ip-10-0-170-225.us-east-2.compute.internal   <none>              Approved,Issued
csr-qkvsm   61m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-159-91.us-east-2.compute.internal    <none>              Approved,Issued
csr-shnn2   77m   kubernetes.io/kube-apiserver-client-kubelet   system:node:ip-10-0-150-207.us-east-2.compute.internal   <none>              Approved,Issued
csr-wjbnm   78m   kubernetes.io/kube-apiserver-client-kubelet   system:node:ip-10-0-175-245.us-east-2.compute.internal   <none>              Approved,Issued

Machine approver logs (oc logs deployment.apps/machine-approver machine-approver-controller -n openshift-cluster-machine-approver):
https://privatebin-it-iso.int.open.paas.redhat.com/?ee2d69bec6c1c7be#DAju7NuwFJ4tnyEjUVt8mwszsuQStuhuoLfL3aPn1hKk

CSR logs for another Windows worker:
https://privatebin-it-iso.int.open.paas.redhat.com/?93211568b74d08f8#HJ6bzDP4SLy1HA7zbifUqAGPLv6hPVHzHs24c7an56uQ
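For repeated checks, a small filter over the `oc get csr` output could flag any CSR that has not reached Approved,Issued. This is only a rough sketch: the function name `pending_csrs` is hypothetical, and it assumes the default column layout shown in this comment, where CONDITION is the last field on each row.

```shell
#!/bin/sh
# Hypothetical helper (not part of oc): reads `oc get csr` table output on
# stdin and prints the NAME of every CSR whose CONDITION column is anything
# other than "Approved,Issued" (for example "Pending").
# Assumption: default column layout, with CONDITION as the last field.
pending_csrs() {
    # NR > 1 skips the header row; NF > 0 skips blank lines.
    awk 'NR > 1 && NF > 0 && $NF != "Approved,Issued" { print $1 }'
}

# Usage against a live cluster (requires a logged-in oc session):
#   oc get csr | pending_csrs
```

Empty output means every listed CSR is Approved,Issued, as in the listing above.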
I can't see any issues in the logs, but I'm not sure the run actually exercised the affected code path. Reproducing this issue is very difficult, as we need to queue the CSR and then have it approved. I'm not sure it's worth our time trying to hunt for an explicit event where this happens. It's likely to be sporadic: it needs the WMCO and the CSR approver running simultaneously, and then it might happen when we add a new Windows machine.
Thanks @jspeed for the review. I also need input on the debug messages, that is, the granularity of the messages in the logs: https://privatebin-it-iso.int.open.paas.redhat.com/?341fab9d51cc2ac9#GdwgbQWWHEosBxcHwEfLEGrLdbuGqpnuF4GJAYWyvq93
Those logs are part of the WMCO, so it's probably best to check with the WMCO team about how much they are logging. I don't think that necessarily affects this bug, but it might be worth asking on https://bugzilla.redhat.com/show_bug.cgi?id=2002961, in which the WMCO team are solving the same sort of issue as here.
Thanks Joel and Mansi for the comments; moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056