Bug 1918525

Summary: OLM enters infinite loop if Pending CSV replaces itself
Product: OpenShift Container Platform
Reporter: OpenShift BugZilla Robot <openshift-bugzilla-robot>
Component: OLM
Assignee: Ben Luddy <bluddy>
OLM sub component: OLM
QA Contact: Salvatore Colangelo <scolange>
Status: CLOSED ERRATA
Severity: high
Priority: high
CC: bluddy, vdinh
Version: 4.6
Keywords: Triaged
Target Milestone: ---   
Target Release: 4.6.z   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2021-02-08 13:51:26 UTC
Bug Depends On: 1916021

Description OpenShift BugZilla Robot 2021-01-20 23:17:04 UTC
+++ This bug was initially created as a clone of Bug #1916021 +++

Created attachment 1747236
cpu flame graph from olm process

Description of problem:

The following test in ./pkg/controller/operators/olm never terminates:

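package olm

import (
	"testing"

	"github.com/operator-framework/api/pkg/operators/v1alpha1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// TestGetReplacementChain never returns: the only CSV in the map names
// itself in Spec.Replaces, so the chain walk keeps revisiting "foo".
func TestGetReplacementChain(t *testing.T) {
	csv := &v1alpha1.ClusterServiceVersion{
		ObjectMeta: metav1.ObjectMeta{
			Name: "foo",
		},
		Spec: v1alpha1.ClusterServiceVersionSpec{
			Replaces: "foo",
		},
	}
	(&Operator{}).getReplacementChain(csv, map[string]*v1alpha1.ClusterServiceVersion{csv.GetName(): csv})
}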
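
For illustration only, here is a minimal sketch of how a replacement-chain walk can be made to terminate when a CSV replaces itself or when several CSVs form a longer cycle. This is not the fix that shipped in the erratum and it does not reproduce the real getReplacementChain. The csv type and the replacementChain function are hypothetical, pared-down stand-ins that only model the name and spec.replaces fields, and the walk only follows the replaces direction.

```go
package main

import "fmt"

// csv is a hypothetical stand-in for v1alpha1.ClusterServiceVersion.
// Only the two fields relevant to the loop are modeled.
type csv struct {
	Name     string
	Replaces string
}

// replacementChain follows Replaces links starting at start and returns the
// names visited in order. It stops when the chain ends, when a referenced
// CSV is not in the map, or when a name is seen a second time. The second
// return value reports whether a cycle was hit.
func replacementChain(start string, csvs map[string]csv) ([]string, bool) {
	seen := make(map[string]struct{})
	var chain []string
	name := start
	for name != "" {
		if _, ok := seen[name]; ok {
			// Already visited: the chain loops back on itself.
			return chain, true
		}
		c, ok := csvs[name]
		if !ok {
			// The replaced CSV does not exist in this namespace.
			return chain, false
		}
		seen[name] = struct{}{}
		chain = append(chain, name)
		name = c.Replaces
	}
	return chain, false
}

func main() {
	// Same shape as the failing test: "foo" replaces "foo".
	csvs := map[string]csv{"foo": {Name: "foo", Replaces: "foo"}}
	chain, cyclic := replacementChain("foo", csvs)
	fmt.Println(chain, cyclic) // prints "[foo] true"
}
```

Tracking visited names costs one set entry per CSV in the chain and bounds the walk by the number of CSVs in the namespace, whatever the replaces fields contain.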


Version-Release number of selected component (if applicable): 4.6.1


How reproducible: Always?


Steps to Reproduce:
1. Create a CSV that replaces itself (sample attached).

Actual results:

The olm-operator pod jumps to 100% CPU utilization and doesn't make progress reconciling the CSV. Even after deleting the CSV, the olm-operator pod has to be deleted in order to recover.

Expected results:

CSV reconciled as normal.

--- Additional comment from bluddy on 2021-01-13 23:54:52 UTC ---

Created attachment 1747237
sample bad CSV manifest

Comment 4 Salvatore Colangelo 2021-02-01 12:03:07 UTC
[scolange@scolange ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2021-01-30-211400   True        False         23m     Cluster version is 4.6.0-0.nightly-2021-01-30-211

[scolange@scolange ~]$ oc -n openshift-operator-lifecycle-manager exec catalog-operator-5f9bfcf948-dm25n -- olm --version
OLM version: 0.16.1
git commit: 4268b669a6f90423a4eea3d5bdcf6bf00af48a6d

[scolange@scolange .kube]$ oc create ns olm
namespace/olm created

1. Create an OperatorGroup

[scolange@scolange .kube]$ oc create -f operatorGroup.yaml 
operatorgroup.operators.coreos.com/default-og created

2. Create the CSV from the attached manifest and verify that it was created

[scolange@scolange .kube]$ oc create -f testing.yaml
clusterserviceversion.operators.coreos.com/packageserver created

3. Check whether the olm-operator CPU usage goes to 100%

[scolange@scolange .kube]$ oc get csv -n olm
NAME            DISPLAY          VERSION   REPLACES        PHASE
packageserver   Package Server   1.0.0     packageserver   Pending


kubectl -n openshift-operator-lifecycle-manager exec --stdin --tty olm-operator-5d865c694d-fjwjj -- /bin/bash

top - 12:00:09 up 23 min,  0 users,  load average: 1.76, 1.45, 0.94
Tasks:   3 total,   1 running,   2 sleeping,   0 stopped,   0 zombie
%Cpu(s): 37.6 us, 10.2 sy,  0.0 ni, 46.0 id,  0.3 wa,  2.1 hi,  3.9 si,  0.0 st
MiB Mem :  15025.6 total,   6455.2 free,   5547.2 used,   3023.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  10277.8 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                   
      1 1001      20   0 1709652 252460  32804 S   0.0   1.6   0:07.16 olm                                                       
     18 1001      20   0   12020   3160   2744 S   0.0   0.0   0:00.00 bash                                                      
     25 1001      20   0   49040   3828   3228 R   0.0   0.0   0:00.02 top 

4. Delete the CSV

[scolange@scolange .kube]$ oc delete csv packageserver -n olm
clusterserviceversion.operators.coreos.com "packageserver" deleted


LGTM

Comment 6 errata-xmlrpc 2021-02-08 13:51:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.6.16 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0308