Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2032048

Summary:	CSR approval failures caused by update conflicts
Product:	OpenShift Container Platform	Reporter:	Mohammad Saif Shaikh <mohashai>
Component:	Windows Containers	Assignee:	Mohammad Saif Shaikh <mohashai>
Status:	CLOSED ERRATA	QA Contact:	Ronnie Rasouli <rrasouli>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4.10	CC:	aos-bugs, jvaldes, rrasouli
Target Milestone:	---
Target Release:	4.10.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: An entity other than WMCO modified CertificateSigningRequest resources associated with Windows BYOH nodes. Consequence: WMCO would have a stale reference to the CSR and would be unable to approve it. Fix: Retry CSR approval until a specified timeout if an update conflict is detected. Result: CSR processing goes through cleanly when unrelated changes to BYOH CSRs are made by entities external to WMCO.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-03-28 09:36:28 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2057161

Description Mohammad Saif Shaikh 2021-12-13 22:41:43 UTC

Description of problem:
We are seeing occasional failures where WMCO is unable to approve a CSR. CSR approval (and any subsequent content validation) fails because WMCO tries to update the CSR resource at the same time as another entity.

Version-Release number of selected component (if applicable):
WMCO: built from master@7b42724c
OCP: 4.10.0-0.nightly-2021-12-03-213835

How reproducible:
sometimes

Steps to Reproduce:
1. add BYOH Windows VM to windows-instances ConfigMap
2. wait for WMCO to start configuring the VM as a node
3. watch CSR controller logs when processing CSR related to this node

Actual results:
CSR approval update fails with the following error:
"error": "WMCO CSR Approver could not approve csr-wb56c CSR: could not update conditions for approval CSR: csr-wb56c: Operation cannot be fulfilled on certificatesigningrequests.certificates.k8s.io \"csr-wb56c\": the object has been modified; please apply your changes to the latest version and try again"

Expected results:
expected BYOH node CSR to be approved by WMCO CSR approver

Additional info:
Example failed job:
- https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/843/pull-ci-openshift-windows-machine-config-operator-master-azure-e2e-operator/1467643330635501568
- build logs: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/843/pull-ci-openshift-windows-machine-config-operator-master-azure-e2e-operator/1467643330635501568/artifacts/azure-e2e-operator/windows-e2e-operator-test/build-log.txt

Comment 1 jvaldes 2021-12-14 18:46:19 UTC

Target Release adjusted.

Comment 3 Mohammad Saif Shaikh 2021-12-23 15:50:05 UTC

Something is restarting the machine approver while machines are being provisioned. This causes the CSR approval conflicts: the cluster-machine-approver does not stay scaled down, so it approves Node CSRs even for our BYOH nodes in CI (which, for convenience, are Machine-backed via machine-api without the os=Windows label).

We first add an "unmanaged":true override so that CVO doesn't stomp the replica count back, then scale down the openshift-cluster-machine-approver/machine-approver deployment, but this does not seem to be enough. Within a few seconds, the machine approver pods are back up. The following log, added to the e2e tests, shows this in both failed and successful CI runs (hinting at a race between the WMCO CSR approver and the machine approver controller):
2021/12/22 18:58:47 scaling deployment openshift-cluster-machine-approver/machine-approver... currently at 1 replicas, scaling to 0 replicas
2021/12/22 18:59:02 done scaling deployment openshift-cluster-machine-approver/machine-approver... currently at 0 replicas, expected 0 replicas
2021/12/22 18:59:03 at start of waitForWindowsMachines()... deployment openshift-cluster-machine-approver/machine-approver currently at 0 replicas, expected 0 replicas
2021/12/22 18:59:08 at end of waitForWindowsMachines()... deployment openshift-cluster-machine-approver/machine-approver currently at 1 replicas, expected 0 replicas

This issue is not present in older nightly builds, e.g. 4.10.0-0.nightly-2021-12-01-164437, but is reproducible in newer ones like 4.10.0-0.nightly-2021-12-20-231053.

Comment 4 Mohammad Saif Shaikh 2022-01-06 15:27:27 UTC

Some logic was added to CVO [0] that relies on the override's group field. However, the CVO docs [1] contained an error about this group field, and we likely followed them when creating our override patch data. So when that change went into CVO, it broke how CVO interprets our override (it is effectively ignored even though it exists). The fix [2] is to use the proper group value.

[0] https://github.com/openshift/cluster-version-operator/pull/689/files#diff-2010b5bb18e3579c7c8a1c79ab439955a723894f85549689a4401790b0315f00R1063-R1064
[1] https://github.com/openshift/enhancements/commit/6491ef6057d8d5167742e4d5390c21c7c324b2b3
[2] https://github.com/openshift/windows-machine-config-operator/pull/844/commits/78978857eff4fd029fbe34a604c18505800eca6a

Comment 5 Ronnie Rasouli 2022-01-27 15:23:48 UTC

Verified on 660f5a3
No such error occurs while bootstrapping a BYOH node
controllers.certificatesigningrequests	CSR approved	{"CSR": "csr-xblsq"}

Comment 8 errata-xmlrpc 2022-03-28 09:36:28 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Windows Container Support for Red Hat OpenShift 5.0.0 [security update]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0577