Bug 1923111

Summary: Install plans permanently fail due to CRD resource modified or similar transient errors
Product: OpenShift Container Platform Reporter: Kevin Fan <chfan>
Component: OLM Assignee: Joe Lanford <jlanford>
OLM sub component: OLM QA Contact: Bruno Andrade <bandrade>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: alkazako, anbhatta, astefanu, bluddy, davegord, dsover, krizza, llowinge, mjobanek, tflannag
Version: 4.6.z Keywords: Triaged
Target Milestone: ---   
Target Release: 4.9.0   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: On occasion, a transient error occurs when OLM attempts to update a CRD object in the cluster. Consequence: OLM permanently fails the install plan containing the CRD. Fix: Update OLM to retry CRD updates on resource modified (conflict) errors. Result: OLM is now more resilient to this class of transient errors. Install plans no longer permanently fail on conflict errors that OLM is able to retry and resolve.
Story Points: ---
Clone Of:
: 1989779 (view as bug list) Environment:
Last Closed: 2021-10-18 17:29:03 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1989779    

Description Kevin Fan 2021-02-01 12:39:24 UTC
User-Agent:       Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36
Build Identifier: 

The initial install of an operator in a namespace succeeds. However, if the same operator is then installed in another namespace, the install plan can fail with a CRD "resource modified" error during the install.

Possibly some event updates the CRD during the install, causing the install plan to fail.


Reproducible: Sometimes

Steps to Reproduce:
1. Install an operator into a namespace (e.g. 3scale from OperatorHub). This will install CRDs onto the cluster.
2. Run a script that constantly updates one of the installed operator's CRDs, to mimic some event updating the CRD:
```
 while true; do        
    oc patch crd apimanagers.apps.3scale.net  --type=json -p='[{"op" : "add", "path" : "/metadata/labels/test", "value": "test"}]'
    oc patch crd apimanagers.apps.3scale.net --type=json -p='[{"op" : "remove", "path" : "/metadata/labels/test"}]'
done
```
3. Install the same operator into a different namespace 
4. Inspect for failed install plan 
5. If there is no failed install plan, uninstall and reinstall operator until failed install plan occurs
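Steps 4 and 5 can be scripted. The following is a minimal sketch, not part of the original report: `failed_install_plans` is a hypothetical helper name, and it assumes the input is three whitespace-separated columns (namespace, name, phase) and that a failed install plan reports the phase `Failed`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "namespace/name" for every install plan whose
# phase column is "Failed". Reads "namespace name phase" lines on stdin.
failed_install_plans() {
  awk '$3 == "Failed" { print $1 "/" $2 }'
}

# Against a live cluster (assumes oc is logged in with sufficient rights):
# oc get installplan --all-namespaces --no-headers \
#   -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase \
#   | failed_install_plans
```

If the helper prints nothing, no install plan has failed yet and the uninstall/reinstall loop from step 5 applies.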
Actual Results:  
The install plan sometimes fails with a resource modified (conflict) error

Expected Results:  
The install plan should retry the install if the resource is stale
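To illustrate the expected behavior, here is a client-side sketch of retry-on-conflict in shell. It is not OLM's implementation: `retry_on_conflict` and `RETRY_DELAY` are hypothetical names, and the sketch assumes a conflict can be recognized by the API server's "the object has been modified" error text. Any other error is treated as permanent and fails immediately.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command, retrying only while it fails with a
# Kubernetes conflict ("the object has been modified") error.
# Usage: retry_on_conflict MAX_ATTEMPTS COMMAND [ARGS...]
retry_on_conflict() {
  local max_attempts="$1"; shift
  local attempt=1 output status
  while true; do
    # Capture stdout and stderr together so the error text can be inspected.
    if output="$("$@" 2>&1)"; then
      printf '%s\n' "$output"
      return 0
    else
      status=$?
    fi
    # Anything other than a conflict is not considered transient.
    if ! printf '%s' "$output" | grep -q 'the object has been modified'; then
      printf '%s\n' "$output" >&2
      return "$status"
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      printf 'giving up after %d attempts: %s\n' "$attempt" "$output" >&2
      return "$status"
    fi
    attempt=$((attempt + 1))
    sleep "${RETRY_DELAY:-1}"   # RETRY_DELAY is an assumed tuning knob (seconds)
  done
}
```

For the retry to make progress, the wrapped command should re-read the latest version of the object before writing it back, so that each attempt works against a fresh resourceVersion rather than resubmitting the stale one.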

This was reproduced on some RHOAM installs on OSD when installing user sso, and a similar error was observed for the openshift route monitor operator on the same cluster:
https://issues.redhat.com/browse/MGDAPI-1098

Comment 1 Daniel Sover 2021-02-01 15:22:56 UTC
*** Bug 1919454 has been marked as a duplicate of this bug. ***

Comment 2 Kevin Rizza 2021-02-01 16:20:47 UTC
This seems like it could be potentially a race condition or some error handling that should be improved in OLM. For now, I'm marking this with the UpcomingSprint label and this will be investigated in a future sprint.

Comment 4 Ben Luddy 2021-02-25 04:53:53 UTC
*** Bug 1925113 has been marked as a duplicate of this bug. ***

Comment 5 Kevin Rizza 2021-03-25 18:54:46 UTC
Given that this has been triaged, we are just waiting on this to be prioritized. Marking as reviewed in sprint.

Comment 9 Ben Luddy 2021-06-23 16:10:59 UTC
*** Bug 1975353 has been marked as a duplicate of this bug. ***

Comment 10 Alexey Kazakov 2021-06-24 07:00:36 UTC
What is the status of this issue?
This looks like a pretty bad bug and it causes serious issues in Dev Sandbox for OpenShift clusters :( Our operators often fail to update because of this bug.
Is there anything we can do to help to prioritize this issue?

Comment 13 Bruno Andrade 2021-08-02 17:34:50 UTC
OLM version: 0.18.3
git commit: cf7140bf3c404454892c9c972b0d9e839a46f619
OCP: 4.9.0-0.nightly-2021-08-02-044755

1. Install operator into a namespace (e.g. 3scale from OperatorHub). This will install CRDs onto the cluster

oc get csv -n test-1           
NAME                  DISPLAY   VERSION   REPLACES   PHASE
etcdoperator.v0.9.4   etcd      0.9.4                Succeeded


2. Run some script that constantly updates the CRDs of the installed operator to mimic some event updating the CRD
```
 while true; do        
    oc patch crd etcdclusters.etcd.database.coreos.com  --type=json -p='[{"op" : "add", "path" : "/metadata/labels/test", "value": "test"}]'
    oc patch crd etcdclusters.etcd.database.coreos.com --type=json -p='[{"op" : "remove", "path" : "/metadata/labels/test"}]'
done

customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
The request is invalid
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched

The requests fail when an Operator is being installed

```
3. Install the same operator into a different namespace 

oc get csv --all-namespaces                  
NAMESPACE                              NAME                  DISPLAY          VERSION   REPLACES   PHASE
openshift-operator-lifecycle-manager   packageserver         Package Server   0.18.3               Succeeded
test-1                                 etcdoperator.v0.9.4   etcd             0.9.4                Succeeded
test-2                                 etcdoperator.v0.9.4   etcd             0.9.4                Succeeded
test-3                                 etcdoperator.v0.9.4   etcd             0.9.4                Succeeded

4. Inspect for failed install plan 

oc get ip --all-namespaces  
NAMESPACE   NAME            CSV                   APPROVAL    APPROVED
test-1      install-bxcm2   etcdoperator.v0.9.4   Automatic   true
test-2      install-p6qjp   etcdoperator.v0.9.4   Automatic   true
test-3      install-qmv2h   etcdoperator.v0.9.4   Automatic   true
test-4      install-4z7qj   etcdoperator.v0.9.4   Automatic   true

LGTM, marking as VERIFIED

Comment 16 errata-xmlrpc 2021-10-18 17:29:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759