Bug 1923111
Summary: | Install plans permanently fail due to CRD resource modified or similar transient errors | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Kevin Fan <chfan>
Component: | OLM | Assignee: | Joe Lanford <jlanford>
OLM sub component: | OLM | QA Contact: | Bruno Andrade <bandrade>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | high | |
Priority: | high | CC: | alkazako, anbhatta, astefanu, bluddy, davegord, dsover, krizza, llowinge, mjobanek, tflannag
Version: | 4.6.z | Keywords: | Triaged
Target Milestone: | --- | |
Target Release: | 4.9.0 | |
Hardware: | Unspecified | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | Cause: On occasion, a transient error occurs when OLM attempts to update a CRD object in the cluster. Consequence: OLM permanently fails the install plan containing the CRD. Fix: Update OLM to retry CRD updates on resource modified (conflict) errors. Result: OLM is now more resilient to this class of transient errors. Install plans no longer permanently fail on conflict errors that OLM is able to retry and resolve. | |
Story Points: | --- | |
Clone Of: | | |
: | 1989779 | Environment: |
Last Closed: | 2021-10-18 17:29:03 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1989779 | |
Description
Kevin Fan
2021-02-01 12:39:24 UTC
*** Bug 1919454 has been marked as a duplicate of this bug. ***

This looks like it could be a race condition or an error-handling gap in OLM. For now, I'm marking this with the UpcomingSprint label, and it will be investigated in a future sprint.

*** Bug 1925113 has been marked as a duplicate of this bug. ***

Given that this has been triaged, we are just waiting on this to be prioritized. Marking as reviewed in sprint.

*** Bug 1975353 has been marked as a duplicate of this bug. ***

What is the status of this issue? This looks like a pretty bad bug, and it causes serious issues in Dev Sandbox for OpenShift clusters :( Our operators often fail to update because of this bug. Is there anything we can do to help prioritize this issue?

OLM version: 0.18.3
git commit: cf7140bf3c404454892c9c972b0d9e839a46f619
OCP: 4.9.0-0.nightly-2021-08-02-044755

1. Install an operator into a namespace (e.g. 3scale from OperatorHub). This installs CRDs onto the cluster.

    oc get csv -n test-1
    NAME                  DISPLAY   VERSION   REPLACES   PHASE
    etcdoperator.v0.9.4   etcd      0.9.4                Succeeded

2. Run a script that constantly updates the CRDs of the installed operator, to mimic some event updating the CRD.

```
while true; do
  oc patch crd etcdclusters.etcd.database.coreos.com --type=json -p='[{"op" : "add", "path" : "/metadata/labels/test", "value": "test"}]'
  oc patch crd etcdclusters.etcd.database.coreos.com --type=json -p='[{"op" : "remove", "path" : "/metadata/labels/test"}]'
done
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
The request is invalid
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
```

The requests fail when an Operator is being installed.

3. Install the same operator into a different namespace.

    oc get csv --all-namespaces
    NAMESPACE                              NAME                  DISPLAY          VERSION   REPLACES   PHASE
    openshift-operator-lifecycle-manager   packageserver         Package Server   0.18.3               Succeeded
    test-1                                 etcdoperator.v0.9.4   etcd             0.9.4                Succeeded
    test-2                                 etcdoperator.v0.9.4   etcd             0.9.4                Succeeded
    test-3                                 etcdoperator.v0.9.4   etcd             0.9.4                Succeeded

4. Inspect for failed install plans.

    oc get ip --all-namespaces
    NAMESPACE   NAME            CSV                   APPROVAL    APPROVED
    test-1      install-bxcm2   etcdoperator.v0.9.4   Automatic   true
    test-2      install-p6qjp   etcdoperator.v0.9.4   Automatic   true
    test-3      install-qmv2h   etcdoperator.v0.9.4   Automatic   true
    test-4      install-4z7qj   etcdoperator.v0.9.4   Automatic   true

LGTM, marking as VERIFIED

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
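
For readers who want a feel for the fix named in the Doc Text (retrying CRD updates on resource modified / conflict errors), the idea can be sketched at the shell level. This is not OLM's implementation, which lives in OLM's Go code and is not shown in this report; it is only a rough illustration of the retry-on-conflict pattern. The helper name `retry_on_conflict`, the attempt limit, and the `RETRY_DELAY` variable are all assumptions made for this sketch. It treats an error as a conflict when the output contains "the object has been modified", the wording the Kubernetes API server uses for update conflicts.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of OLM or oc): re-run a command while it
# fails with a conflict error, up to MAX_ATTEMPTS times.
# Usage: retry_on_conflict MAX_ATTEMPTS COMMAND [ARGS...]
retry_on_conflict() {
  local max_attempts=$1
  shift
  local attempt=1 output status

  while :; do
    # Capture stdout and stderr together so the error text can be inspected.
    if output=$("$@" 2>&1); then
      printf '%s\n' "$output"
      return 0
    else
      status=$?
    fi

    case $output in
      *"the object has been modified"*)
        # Conflict: treated as transient, continue to the retry logic below.
        ;;
      *)
        # Any other error is not retried.
        printf '%s\n' "$output" >&2
        return "$status"
        ;;
    esac

    # Give up once the attempt budget is spent.
    if [ "$attempt" -ge "$max_attempts" ]; then
      printf '%s\n' "$output" >&2
      return "$status"
    fi

    attempt=$((attempt + 1))
    # RETRY_DELAY (seconds) is an assumed knob for this sketch; default 1.
    sleep "${RETRY_DELAY:-1}"
  done
}
```

A call might look like `retry_on_conflict 5 oc replace -f crd.yaml`, where `crd.yaml` is a hypothetical manifest. Only conflict errors are retried here; other failures, such as the "The request is invalid" message seen in step 2, are returned immediately so that real problems are not masked.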