1923111 – Install plans permanently fail due to CRD resource modified or similar transient errors

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	OLM
Sub Component:
Version:	4.6.z
Hardware:	Unspecified
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.9.0
Assignee:	Joe Lanford
QA Contact:	Bruno Andrade
Docs Contact:
URL:
Whiteboard:
Duplicates (3):	1919454 1925113 1975353 (view as bug list)
Depends On:
Blocks:	1989779

Reported:	2021-02-01 12:39 UTC by Kevin Fan
Modified:	2021-10-18 17:29 UTC (History)
CC List:	10 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: On occasion, a transient error occurs when OLM attempts to update a CRD object in the cluster. Consequence: OLM permanently fails the install plan containing the CRD. Fix: OLM now retries CRD updates on resource modified (conflict) errors. Result: OLM is now more resilient to this class of transient errors. Install plans no longer permanently fail on conflict errors that OLM is able to retry and resolve.
Clone Of:
Clones:	1989779 (view as bug list)
Environment:
Last Closed:	2021-10-18 17:29:03 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift operator-framework-olm pull 143	0	None	None	None	2021-07-29 20:55:56 UTC
Red Hat Product Errata	RHSA-2021:3759	0	None	None	None	2021-10-18 17:29:49 UTC

Description Kevin Fan 2021-02-01 12:39:24 UTC

User-Agent:       Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36
Build Identifier: 

The initial install of an operator into a namespace succeeds. However, when the same operator is installed into another namespace, the install plan can fail with a CRD "resource modified" error during the install.

Possibly some event updates the CRD during the install, causing the install plan to fail.


Reproducible: Sometimes

Steps to Reproduce:
1. Install an operator into a namespace (e.g. 3scale from OperatorHub). This installs the operator's CRDs onto the cluster.
2. Run a script that constantly updates one of the installed operator's CRDs, to mimic some event updating the CRD:
```
while true; do
    oc patch crd apimanagers.apps.3scale.net --type=json -p='[{"op" : "add", "path" : "/metadata/labels/test", "value": "test"}]'
    oc patch crd apimanagers.apps.3scale.net --type=json -p='[{"op" : "remove", "path" : "/metadata/labels/test"}]'
done
```
3. Install the same operator into a different namespace.
4. Inspect the cluster for a failed install plan.
5. If there is no failed install plan, uninstall and reinstall the operator until one occurs.

Actual Results:
The install plan sometimes fails with a "resource modified" error.

Expected Results:
The install plan should be retried if the resource is stale.
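
The expected behavior amounts to retrying an update when the API server reports a conflict. A minimal shell sketch of that pattern follows. It is not the OLM fix itself (that change was made inside OLM), and `retry_on_conflict` and `RETRY_DELAY` are hypothetical names introduced only for this illustration. It assumes the failing command prints the "the object has been modified" text that the Kubernetes API server includes in conflict errors.

```shell
# Hypothetical helper (not part of oc or OLM): rerun a command while it
# fails with an optimistic-concurrency conflict.
# Usage: retry_on_conflict MAX_ATTEMPTS COMMAND [ARGS...]
retry_on_conflict() {
  roc_max=$1
  shift
  roc_attempt=1
  while :; do
    # Capture stdout and stderr together so the error text can be inspected.
    if roc_output=$("$@" 2>&1); then
      printf '%s\n' "$roc_output"
      return 0
    else
      roc_status=$?
    fi
    case $roc_output in
      *"the object has been modified"*)
        ;;  # conflict: treat as transient and consider retrying
      *)
        # Any other error is reported immediately, without retrying.
        printf '%s\n' "$roc_output" >&2
        return "$roc_status"
        ;;
    esac
    if [ "$roc_attempt" -ge "$roc_max" ]; then
      # Out of attempts: surface the last conflict error.
      printf '%s\n' "$roc_output" >&2
      return "$roc_status"
    fi
    roc_attempt=$((roc_attempt + 1))
    # RETRY_DELAY (seconds) is an assumed knob; defaults to 1.
    sleep "${RETRY_DELAY:-1}"
  done
}
```

For a retry to be able to succeed, the wrapped command has to re-read the latest version of the object on each run; rerunning a request that carries a stale resourceVersion would conflict every time.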

This was reproduced on some RHOAM installs on OSD when installing user SSO, and a similar error was observed for the OpenShift route monitor operator on the same cluster:
https://issues.redhat.com/browse/MGDAPI-1098

Comment 1 Daniel Sover 2021-02-01 15:22:56 UTC

*** Bug 1919454 has been marked as a duplicate of this bug. ***

Comment 2 Kevin Rizza 2021-02-01 16:20:47 UTC

This seems like it could potentially be a race condition, or some error handling that should be improved in OLM. For now, I'm marking this with the UpcomingSprint label, and it will be investigated in a future sprint.

Comment 4 Ben Luddy 2021-02-25 04:53:53 UTC

*** Bug 1925113 has been marked as a duplicate of this bug. ***

Comment 5 Kevin Rizza 2021-03-25 18:54:46 UTC

Given that this has been triaged, we are just waiting on this to be prioritized. Marking as reviewed in sprint.

Comment 9 Ben Luddy 2021-06-23 16:10:59 UTC

*** Bug 1975353 has been marked as a duplicate of this bug. ***

Comment 10 Alexey Kazakov 2021-06-24 07:00:36 UTC

What is the status of this issue?
This looks like a pretty bad bug, and it causes serious issues in Dev Sandbox for OpenShift clusters :( Our operators often fail to update because of this bug.
Is there anything we can do to help prioritize this issue?

Comment 13 Bruno Andrade 2021-08-02 17:34:50 UTC

OLM version: 0.18.3
git commit: cf7140bf3c404454892c9c972b0d9e839a46f619
OCP: 4.9.0-0.nightly-2021-08-02-044755

1. Install an operator into a namespace (e.g. 3scale from OperatorHub). This installs the operator's CRDs onto the cluster.

oc get csv -n test-1           
NAME                  DISPLAY   VERSION   REPLACES   PHASE
etcdoperator.v0.9.4   etcd      0.9.4                Succeeded


2. Run some script that constantly updates the CRDs of the installed operator to mimic some event updating the CRD
```
while true; do
    oc patch crd etcdclusters.etcd.database.coreos.com --type=json -p='[{"op" : "add", "path" : "/metadata/labels/test", "value": "test"}]'
    oc patch crd etcdclusters.etcd.database.coreos.com --type=json -p='[{"op" : "remove", "path" : "/metadata/labels/test"}]'
done
```

customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
The request is invalid
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched
customresourcedefinition.apiextensions.k8s.io/etcdclusters.etcd.database.coreos.com patched

The requests fail when an operator is being installed.

3. Install the same operator into a different namespace 

oc get csv --all-namespaces                  
NAMESPACE                              NAME                  DISPLAY          VERSION   REPLACES   PHASE
openshift-operator-lifecycle-manager   packageserver         Package Server   0.18.3               Succeeded
test-1                                 etcdoperator.v0.9.4   etcd             0.9.4                Succeeded
test-2                                 etcdoperator.v0.9.4   etcd             0.9.4                Succeeded
test-3                                 etcdoperator.v0.9.4   etcd             0.9.4                Succeeded

4. Inspect for failed install plan 

oc get ip --all-namespaces  
NAMESPACE   NAME            CSV                   APPROVAL    APPROVED
test-1      install-bxcm2   etcdoperator.v0.9.4   Automatic   true
test-2      install-p6qjp   etcdoperator.v0.9.4   Automatic   true
test-3      install-qmv2h   etcdoperator.v0.9.4   Automatic   true
test-4      install-4z7qj   etcdoperator.v0.9.4   Automatic   true
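
The default table above shows approval state but not each install plan's phase. One way to check specifically for failed plans is to filter on `status.phase`, sketched below. `failed_install_plans` is a hypothetical helper name, and the commented `oc` invocation is an assumed usage that was not run as part of this verification.

```shell
# Hypothetical helper (not an oc subcommand): reads "NAMESPACE NAME PHASE"
# rows on stdin and prints namespace/name for every row whose phase is Failed.
failed_install_plans() {
  awk '$3 == "Failed" { print $1 "/" $2 }'
}

# Assumed usage against a live cluster (commented out; needs oc and a login):
# oc get installplan --all-namespaces --no-headers \
#   -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase \
#   | failed_install_plans
```

Empty output from the helper means no install plan in the input is in the Failed phase.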

LGTM, marking as VERIFIED

Comment 16 errata-xmlrpc 2021-10-18 17:29:03 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
