Description of problem:
We are seeing occasional failures where WMCO is unable to approve a CSR. CSR approval (and any subsequent content validation) fails because WMCO tries to update the CSR resource at the same time as another entity.

Version-Release number of selected component (if applicable):
WMCO: built from master@7b42724c
OCP: 4.10.0-0.nightly-2021-12-03-213835

How reproducible:
Sometimes

Steps to Reproduce:
1. Add a BYOH Windows VM to the windows-instances ConfigMap
2. Wait for WMCO to start configuring the VM as a node
3. Watch the CSR controller logs while it processes the CSR related to this node

Actual results:
The CSR approval update fails with the following error:
"error": "WMCO CSR Approver could not approve csr-wb56c CSR: could not update conditions for approval CSR: csr-wb56c: Operation cannot be fulfilled on certificatesigningrequests.certificates.k8s.io \"csr-wb56c\": the object has been modified; please apply your changes to the latest version and try again"

Expected results:
The BYOH node CSR is approved by the WMCO CSR approver.

Additional info:
Example failed job: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/843/pull-ci-openshift-windows-machine-config-operator-master-azure-e2e-operator/1467643330635501568
Build logs: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/843/pull-ci-openshift-windows-machine-config-operator-master-azure-e2e-operator/1467643330635501568/artifacts/azure-e2e-operator/windows-e2e-operator-test/build-log.txt
Target Release adjusted.
Something is restarting the machine approver while machines are being provisioned. This is the cause of the CSR approval conflicts: the cluster-machine-approver does not stay scaled down, so it approves Node CSRs even for our BYOH nodes in CI (which, for convenience, are Machine-backed via machine-api without the os=Windows label).

We first add an "unmanaged":true override so that CVO doesn't revert the replica count, then scale down the openshift-cluster-machine-approver/machine-approver deployment, but this does not seem to be enough. Within a few seconds, the machine approver pods are back up. The following log output, added to the e2e tests, shows this in both failed and successful CI runs (hinting at a race between the WMCO CSR approver and the machine approver controller):

2021/12/22 18:58:47 scaling deployment openshift-cluster-machine-approver/machine-approver... currently at 1 replicas, scaling to 0 replicas
2021/12/22 18:59:02 done scaling deployment openshift-cluster-machine-approver/machine-approver... currently at 0 replicas, expected 0 replicas
2021/12/22 18:59:03 at start of waitForWindowsMachines()... deployment openshift-cluster-machine-approver/machine-approver currently at 0 replicas, expected 0 replicas
2021/12/22 18:59:08 at end of waitForWindowsMachines()... deployment openshift-cluster-machine-approver/machine-approver currently at 1 replicas, expected 0 replicas

This issue is not present in older nightly builds, e.g. 4.10.0-0.nightly-2021-12-01-164437, but is reproducible in newer ones such as 4.10.0-0.nightly-2021-12-20-231053.
Logic was added to CVO [0] that relies on the override's group field. However, the CVO docs [1] contained an error around this group field, and we likely followed them when creating our override patch data. When that change went into CVO, it broke how CVO interprets our override: CVO effectively ignores it even though the override exists. The fix [2] is to use the proper group value.

[0] https://github.com/openshift/cluster-version-operator/pull/689/files#diff-2010b5bb18e3579c7c8a1c79ab439955a723894f85549689a4401790b0315f00R1063-R1064
[1] https://github.com/openshift/enhancements/commit/6491ef6057d8d5167742e4d5390c21c7c324b2b3
[2] https://github.com/openshift/windows-machine-config-operator/pull/844/commits/78978857eff4fd029fbe34a604c18505800eca6a
Verified on 660f5a3. No such error occurs while bootstrapping a BYOH node:

controllers.certificatesigningrequests CSR approved {"CSR": "csr-xblsq"}
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Windows Container Support for Red Hat OpenShift 5.0.0 [security update]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0577