Bug 2032048 - CSR approval failures caused by update conflicts
Summary: CSR approval failures caused by update conflicts
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Windows Containers
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Mohammad Saif Shaikh
QA Contact: Ronnie Rasouli
URL:
Whiteboard:
Depends On:
Blocks: 2057161
Reported: 2021-12-13 22:41 UTC by Mohammad Saif Shaikh
Modified: 2022-03-28 09:36 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: An entity other than WMCO modified CertificateSigningRequest resources associated with Windows BYOH nodes.
Consequence: WMCO would have a stale reference to the CSR and would be unable to approve it.
Fix: Retry CSR approval until a specified timeout if an update conflict is detected.
Result: CSR processing goes through cleanly when unrelated changes to BYOH CSRs are made by entities external to WMCO.
Clone Of:
Environment:
Last Closed: 2022-03-28 09:36:28 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift windows-machine-config-operator pull 844 0 None open Bug 2032048: Retry Update of CSRs approval status 2021-12-14 18:46:46 UTC
Red Hat Product Errata RHSA-2022:0577 0 None None None 2022-03-28 09:36:45 UTC

Description Mohammad Saif Shaikh 2021-12-13 22:41:43 UTC
Description of problem:
We are seeing occasional failures where WMCO is unable to approve a CSR. CSR approval (and any subsequent content validation) fails because WMCO tries to update the CSR resource at the same time as another entity.

Version-Release number of selected component (if applicable):
WMCO: built from master@7b42724c
OCP: 4.10.0-0.nightly-2021-12-03-213835

How reproducible:
sometimes

Steps to Reproduce:
1. add BYOH Windows VM to windows-instances ConfigMap
2. wait for WMCO to start configuring the VM as a node
3. watch CSR controller logs when processing CSR related to this node

Actual results:
CSR approval update fails with the following error:
"error": "WMCO CSR Approver could not approve csr-wb56c CSR: could not update conditions for approval CSR: csr-wb56c: Operation cannot be fulfilled on certificatesigningrequests.certificates.k8s.io \"csr-wb56c\": the object has been modified; please apply your changes to the latest version and try again"

Expected results:
expected BYOH node CSR to be approved by WMCO CSR approver

Additional info:
example failed job 
- https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/843/pull-ci-openshift-windows-machine-config-operator-master-azure-e2e-operator/1467643330635501568
- build logs: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/843/pull-ci-openshift-windows-machine-config-operator-master-azure-e2e-operator/1467643330635501568/artifacts/azure-e2e-operator/windows-e2e-operator-test/build-log.txt
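
The error under Actual results is an optimistic-concurrency conflict: the update was based on a snapshot of the CSR that another writer had already superseded. The retry-until-timeout approach named in the Doc Text can be sketched as below. This is a minimal illustration, not the WMCO implementation: FakeCSRStore, ConflictError, and approve_with_retry are hypothetical names, and the in-memory store only imitates the API server's resourceVersion check.

```python
import time


class ConflictError(Exception):
    """Stands in for the 'object has been modified' response (hypothetical)."""


class FakeCSRStore:
    """In-memory stand-in for the API server (hypothetical, for illustration)."""

    def __init__(self):
        self.csr = {"name": "csr-example", "resourceVersion": 1, "conditions": []}

    def get(self):
        # Return a copy so the caller holds a snapshot, as a real client would.
        return {**self.csr, "conditions": list(self.csr["conditions"])}

    def update(self, obj):
        # Reject writes that were built from a stale snapshot.
        if obj["resourceVersion"] != self.csr["resourceVersion"]:
            raise ConflictError("the object has been modified")
        self.csr = {
            **obj,
            "conditions": list(obj["conditions"]),
            "resourceVersion": obj["resourceVersion"] + 1,
        }


def approve_with_retry(store, timeout=5.0, interval=0.01):
    """Re-read and re-approve the CSR until it succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        csr = store.get()  # always start from the latest version
        if "Approved" not in csr["conditions"]:
            csr["conditions"].append("Approved")
        try:
            store.update(csr)
            return True
        except ConflictError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

The key point is that each attempt re-reads the object before modifying it; retrying the same stale snapshot would fail every time.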

Comment 1 jvaldes 2021-12-14 18:46:19 UTC
Target Release adjusted.

Comment 3 Mohammad Saif Shaikh 2021-12-23 15:50:05 UTC
Something is restarting the machine approver while machines are being provisioned, and this is the cause of the CSR approval conflicts. Because the cluster-machine-approver does not stay scaled down, it approves Node CSRs even for our BYOH nodes in CI (which, for convenience, are Machine-backed via machine-api without the os=Windows label).

We first add an "unmanaged":true override so CVO doesn't stomp back the replica count, then scale down the openshift-cluster-machine-approver/machine-approver deployment, but this does not seem to be enough. Within a few seconds, the machine approver pods are back up. The following log, added to the e2e tests, shows this in both failed and successful CI runs (hinting at a race between the WMCO CSR approver and the machine approver controller):
2021/12/22 18:58:47 scaling deployment openshift-cluster-machine-approver/machine-approver... currently at 1 replicas, scaling to 0 replicas
2021/12/22 18:59:02 done scaling deployment openshift-cluster-machine-approver/machine-approver... currently at 0 replicas, expected 0 replicas
2021/12/22 18:59:03 at start of waitForWindowsMachines()... deployment openshift-cluster-machine-approver/machine-approver currently at 0 replicas, expected 0 replicas
2021/12/22 18:59:08 at end of waitForWindowsMachines()... deployment openshift-cluster-machine-approver/machine-approver currently at 1 replicas, expected 0 replicas
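
The log above samples the replica count only at a few fixed points. A check that the deployment stays scaled down over a whole window could look roughly like the sketch below. It is an illustration only: stays_scaled_down is a hypothetical helper, and get_replicas is an assumed caller-supplied function that returns the deployment's current replica count.

```python
import time


def stays_scaled_down(get_replicas, expected=0, window=10.0, interval=1.0,
                      sleep=time.sleep):
    """Return True only if every sample taken within the window equals expected."""
    elapsed = 0.0
    while elapsed <= window:
        if get_replicas() != expected:
            # Something scaled the deployment back up during the window.
            return False
        sleep(interval)
        elapsed += interval
    return True
```

Passing the sleep function as a parameter keeps the helper easy to exercise without real delays.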

This issue is not present in older nightly builds, e.g. 4.10.0-0.nightly-2021-12-01-164437, but is reproducible in newer ones like 4.10.0-0.nightly-2021-12-20-231053.

Comment 4 Mohammad Saif Shaikh 2022-01-06 15:27:27 UTC
There was some logic added to CVO [0] that relies on the override's group field. However, the CVO docs [1] contained an error around this group field, which we likely followed when creating our override patch data. So when that change went into CVO, it broke how CVO interprets the override (it is basically ignored despite existing). The fix [2] is to use the proper group value.

[0] https://github.com/openshift/cluster-version-operator/pull/689/files#diff-2010b5bb18e3579c7c8a1c79ab439955a723894f85549689a4401790b0315f00R1063-R1064
[1] https://github.com/openshift/enhancements/commit/6491ef6057d8d5167742e4d5390c21c7c324b2b3
[2] https://github.com/openshift/windows-machine-config-operator/pull/844/commits/78978857eff4fd029fbe34a604c18505800eca6a
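
For illustration, override patch data of the kind discussed above might be built as in the sketch below. Treat the details as assumptions: build_override_patch is a hypothetical helper, the body is shaped as a merge patch that replaces the whole spec.overrides list, and "apps" (the API group name without a version) is assumed to be the proper group value; the authoritative change is in [2].

```python
import json


def build_override_patch(kind, group, namespace, name, unmanaged=True):
    """Build a merge-patch body that sets a single ClusterVersion override."""
    override = {
        "kind": kind,
        "group": group,  # assumed: API group name only, e.g. "apps"
        "namespace": namespace,
        "name": name,
        "unmanaged": unmanaged,
    }
    # Note: a merge patch replaces the entire overrides list.
    return json.dumps({"spec": {"overrides": [override]}})


patch = build_override_patch(
    kind="Deployment",
    group="apps",
    namespace="openshift-cluster-machine-approver",
    name="machine-approver",
)
```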

Comment 5 Ronnie Rasouli 2022-01-27 15:23:48 UTC
Verified on 660f5a3.
No such error occurs while bootstrapping a BYOH node:
controllers.certificatesigningrequests	CSR approved	{"CSR": "csr-xblsq"}

Comment 8 errata-xmlrpc 2022-03-28 09:36:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Windows Container Support for Red Hat OpenShift 5.0.0 [security update]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0577

