Bug 2032048

| Summary: | CSR approval failures caused by update conflicts |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Windows Containers |
| Version: | 4.10 |
| Target Release: | 4.10.0 |
| Status: | CLOSED ERRATA |
| Severity: | medium |
| Priority: | medium |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Type: | Bug |
| Doc Type: | Bug Fix |
| Reporter: | Mohammad Saif Shaikh <mohashai> |
| Assignee: | Mohammad Saif Shaikh <mohashai> |
| QA Contact: | Ronnie Rasouli <rrasouli> |
| CC: | aos-bugs, jvaldes, rrasouli |
| Last Closed: | 2022-03-28 09:36:28 UTC |
| Bug Blocks: | 2057161 |

Doc Text:
Cause: An entity other than WMCO modified CertificateSigningRequest resources associated with Windows BYOH nodes
Consequence: WMCO would have a stale reference to the CSR and would be unable to approve it
Fix: Retry CSR approval until a specified timeout if an update conflict is detected
Result: CSR processing goes through cleanly when unrelated changes to BYOH CSRs are made by entities external to WMCO

Description
Mohammad Saif Shaikh
2021-12-13 22:41:43 UTC
Target Release adjusted.

Something is restarting the machine approver while machines are being provisioned. This is the cause of the CSR approval conflicts: the cluster-machine-approver does not stay scaled down, so it approves Node CSRs even for our BYOH nodes in CI (which, for convenience, are Machine-backed via machine-api, without the os=Windows label).

We first add an "unmanaged": true override so that CVO does not stomp the replica count back, then scale down the openshift-cluster-machine-approver/machine-approver deployment, but this does not seem to be enough. Within a few seconds, the machine approver pods are back up. The following log output, added to the e2e tests, shows this in both failing and successful CI runs (hinting at a race between the WMCO CSR approver and the machine approver controller):

2021/12/22 18:58:47 scaling deployment openshift-cluster-machine-approver/machine-approver... currently at 1 replicas, scaling to 0 replicas
2021/12/22 18:59:02 done scaling deployment openshift-cluster-machine-approver/machine-approver... currently at 0 replicas, expected 0 replicas
2021/12/22 18:59:03 at start of waitForWindowsMachines()... deployment openshift-cluster-machine-approver/machine-approver currently at 0 replicas, expected 0 replicas
2021/12/22 18:59:08 at end of waitForWindowsMachines()... deployment openshift-cluster-machine-approver/machine-approver currently at 1 replicas, expected 0 replicas

This issue is not present in older nightly builds, e.g. 4.10.0-0.nightly-2021-12-01-164437, but is reproducible in newer ones such as 4.10.0-0.nightly-2021-12-20-231053.

Some logic was added to CVO [0] that relies on the override's group field. However, there was an error in the CVO docs [1] concerning this group field, which we likely followed when creating our override patch data. When that change went into CVO, it broke how CVO interprets the override (it is effectively ignored even though it exists). The fix [2] is to use the proper group value.
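For illustration, the override patch data discussed above might be built roughly as follows. This is a minimal sketch, not the code from the WMCO e2e suite: the type and function names are hypothetical, and it assumes the proper group value is the bare Kubernetes API group for Deployments (`apps`). The value actually used in the fix is in the commit linked as [2].

```go
package main

import (
	"encoding/json"
	"fmt"
)

// componentOverride mirrors the fields of one ClusterVersion
// spec.overrides entry (hypothetical local type for this sketch).
type componentOverride struct {
	Kind      string `json:"kind"`
	Group     string `json:"group"`
	Namespace string `json:"namespace"`
	Name      string `json:"name"`
	Unmanaged bool   `json:"unmanaged"`
}

// patchOp is a single JSON Patch (RFC 6902) operation.
type patchOp struct {
	Op    string              `json:"op"`
	Path  string              `json:"path"`
	Value []componentOverride `json:"value"`
}

// machineApproverOverridePatch returns JSON Patch data that marks the
// machine-approver deployment as unmanaged. Note that "add" on
// /spec/overrides sets the whole list, replacing any existing entries.
func machineApproverOverridePatch() ([]byte, error) {
	ops := []patchOp{{
		Op:   "add",
		Path: "/spec/overrides",
		Value: []componentOverride{{
			Kind: "Deployment",
			// Assumption: the API group only, without a version suffix.
			Group:     "apps",
			Namespace: "openshift-cluster-machine-approver",
			Name:      "machine-approver",
			Unmanaged: true,
		}},
	}}
	return json.Marshal(ops)
}

func main() {
	patch, err := machineApproverOverridePatch()
	if err != nil {
		panic(err)
	}
	fmt.Println(string(patch))
}
```

The point of the sketch is only that the group field has to match what CVO compares against. A mismatched value leaves an override that exists on the ClusterVersion but is never applied to the deployment.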
[0] https://github.com/openshift/cluster-version-operator/pull/689/files#diff-2010b5bb18e3579c7c8a1c79ab439955a723894f85549689a4401790b0315f00R1063-R1064
[1] https://github.com/openshift/enhancements/commit/6491ef6057d8d5167742e4d5390c21c7c324b2b3
[2] https://github.com/openshift/windows-machine-config-operator/pull/844/commits/78978857eff4fd029fbe34a604c18505800eca6a

Verified on 660f5a3.
No such error occurs while bootstrapping a BYOH node:
controllers.certificatesigningrequests CSR approved {"CSR": "csr-xblsq"}
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Windows Container Support for Red Hat OpenShift 5.0.0 [security update]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:0577