Bug 1923111
| Field | Value |
|---|---|
| Summary | Install plans permanently fail due to CRD resource modified or similar transient errors |
| Product | OpenShift Container Platform |
| Component | OLM |
| OLM sub component | OLM |
| Reporter | Kevin Fan <chfan> |
| Assignee | Joe Lanford <jlanford> |
| QA Contact | Bruno Andrade <bandrade> |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| CC | alkazako, anbhatta, astefanu, bluddy, davegord, dsover, krizza, llowinge, mjobanek, tflannag |
| Version | 4.6.z |
| Keywords | Triaged |
| Target Release | 4.9.0 |
| Hardware | Unspecified |
| OS | Linux |
| Doc Type | Bug Fix |
| Clones | 1989779 (view as bug list) |
| Bug Blocks | 1989779 |
| Last Closed | 2021-10-18 17:29:03 UTC |

Doc Text:
- Cause: On occasion, a transient error occurs when OLM attempts to update a CRD object in the cluster.
- Consequence: OLM permanently fails the install plan containing the CRD.
- Fix: Update OLM to retry CRD updates on resource modified (conflict) errors.
- Result: OLM is now more resilient to this class of transient errors. Install plans no longer permanently fail on conflict errors that OLM is able to retry and resolve.
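To make the Doc Text concrete, here is a minimal shell sketch of the "resource modified" (conflict) error class that OLM now retries. It uses the etcdclusters CRD from the verification steps below; the save-then-replace flow is only an illustration, not how OLM itself applies CRDs.

```
# Illustration only: reproduce a "resource modified" (conflict) error by hand.
# Assumes the etcdclusters CRD from the verification steps below is installed.
oc get crd etcdclusters.etcd.database.coreos.com -o json > /tmp/crd.json

# Another writer bumps the object's resourceVersion in the meantime:
oc patch crd etcdclusters.etcd.database.coreos.com --type=json \
  -p='[{"op": "add", "path": "/metadata/labels/race", "value": "demo"}]'

# Writing back the stale copy is rejected by the API server with a conflict error
# (typically "the object has been modified; please apply your changes to the latest
# version and try again"). Retrying from a fresh read, as OLM now does, resolves it.
oc replace -f /tmp/crd.json
```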
*** Bug 1919454 has been marked as a duplicate of this bug. ***

This seems like it could potentially be a race condition, or error handling in OLM that should be improved. For now, I'm marking this with the UpcomingSprint label; it will be investigated in a future sprint.

*** Bug 1925113 has been marked as a duplicate of this bug. ***

Given that this has been triaged, we are just waiting for it to be prioritized. Marking as reviewed in sprint.

*** Bug 1975353 has been marked as a duplicate of this bug. ***

What is the status of this issue? This looks like a pretty bad bug, and it causes serious issues in Dev Sandbox for OpenShift clusters :( Our operators often fail to update because of this bug. Is there anything we can do to help prioritize this issue?

Verified on:
OLM version: 0.18.3
git commit: cf7140bf3c404454892c9c972b0d9e839a46f619
OCP: 4.9.0-0.nightly-2021-08-02-044755
1. Install an operator into a namespace (e.g., 3scale from OperatorHub); this will install the operator's CRDs onto the cluster. A hedged CLI sketch of this step follows the CSV output below.
```
oc get csv -n test-1
NAME                  DISPLAY   VERSION   REPLACES   PHASE
etcdoperator.v0.9.4   etcd      0.9.4                Succeeded
```
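For reference, a hedged CLI equivalent of this install step. The verification output above shows the etcd operator, so the sketch uses it; the channel and catalog source names are assumptions that may differ in your catalog.

```
# Sketch only: install the etcd operator into namespace test-1 from the CLI.
# Channel and catalog source names are assumptions; the OperatorHub console flow
# creates equivalent OperatorGroup/Subscription objects.
oc create namespace test-1
cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: test-1-og
  namespace: test-1
spec:
  targetNamespaces:
  - test-1
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: etcd
  namespace: test-1
spec:
  channel: singlenamespace-alpha        # assumed channel name
  name: etcd
  source: community-operators           # assumed catalog source
  sourceNamespace: openshift-marketplace
  installPlanApproval: Automatic
EOF
```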
2. Run a script that constantly updates the CRDs of the installed operator, to mimic some event updating the CRD:
```
while true; do
  oc patch crd etcdclusters.etcd.database.coreos.com --type=json -p='[{"op" : "add", "path" : "/metadata/labels/test", "value": "test"}]'
  oc patch crd etcdclusters.etcd.database.coreos.com --type=json -p='[{"op" : "remove", "path" : "/metadata/labels/test"}]'
done
```

Output (the patch requests fail while an operator is being installed):

```
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
The request is invalid
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
```
3. Install the same operator into a different namespace
```
oc get csv --all-namespaces
NAMESPACE                              NAME                  DISPLAY          VERSION   REPLACES   PHASE
openshift-operator-lifecycle-manager   packageserver         Package Server   0.18.3               Succeeded
test-1                                 etcdoperator.v0.9.4   etcd             0.9.4                Succeeded
test-2                                 etcdoperator.v0.9.4   etcd             0.9.4                Succeeded
test-3                                 etcdoperator.v0.9.4   etcd             0.9.4                Succeeded
```
4. Inspect for failed install plans (a hedged inspection sketch follows this output):
```
oc get ip --all-namespaces
NAMESPACE   NAME            CSV                   APPROVAL    APPROVED
test-1      install-bxcm2   etcdoperator.v0.9.4   Automatic   true
test-2      install-p6qjp   etcdoperator.v0.9.4   Automatic   true
test-3      install-qmv2h   etcdoperator.v0.9.4   Automatic   true
test-4      install-4z7qj   etcdoperator.v0.9.4   Automatic   true
```
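A hedged way to spot failed install plans and the underlying conflict: the phase and conditions fields below are standard InstallPlan status fields, but the exact failure message wording can vary.

```
# Show each install plan's phase; plans hit by this bug end up in phase "Failed".
oc get installplans --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase'

# For a failed plan (substitute its namespace and name), the status conditions usually
# carry the conflict text, e.g. "the object has been modified; please apply your
# changes to the latest version and try again".
oc -n <namespace> describe installplan <name>
```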
LGTM, marking as VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
User-Agent: Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36
Build Identifier:

Initial install of an operator in a namespace succeeds; however, if the same operator is installed in another namespace, the install plan can fail with a CRD "resource modified" error during install. Possibly some event updates the CRD during the install, causing the install plan to fail.

Reproducible: Sometimes

Steps to Reproduce:
1. Install an operator into a namespace (e.g., 3scale from OperatorHub). This will install CRDs onto the cluster.
2. Run a script that constantly updates the CRDs of the installed operator to mimic some event updating the CRD:
```
while true; do
  oc patch crd apimanagers.apps.3scale.net --type=json -p='[{"op" : "add", "path" : "/metadata/labels/test", "value": "test"}]'
  oc patch crd apimanagers.apps.3scale.net --type=json -p='[{"op" : "remove", "path" : "/metadata/labels/test"}]'
done
```
3. Install the same operator into a different namespace.
4. Inspect for a failed install plan.
5. If there is no failed install plan, uninstall and reinstall the operator until a failed install plan occurs.

Actual Results: The install plan sometimes fails due to a "resource modified" error.

Expected Results: The install plan should retry the install if the resource is stale.

This was reproduced on some installs of RHOAM on OSD when installing user SSO, and a similar error was observed for the openshift route monitor operator on the same cluster: https://issues.redhat.com/browse/MGDAPI-1098