Bug 1916021

Summary: OLM enters infinite loop if Pending CSV replaces itself
Product: OpenShift Container Platform Reporter: Ben Luddy <bluddy>
Component: OLM Assignee: Ben Luddy <bluddy>
OLM sub component: OLM QA Contact: Salvatore Colangelo <scolange>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: bluddy, braander, jiazha, ngirard, vdinh
Version: 4.6 Keywords: Triaged
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The operator specifies a skipRange that causes it to replace itself.
Consequence: The operator is stuck in an infinite loop as OLM attempts to update the operator with itself, and CPU usage spikes heavily as a result.
Fix: OLM now breaks out of the loop when this scenario occurs, which prevents the CPU hogging.
Result: The operator is no longer stuck in a loop due to a bad skipRange.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-02-24 15:53:14 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1918525    
Attachments:
Description Flags
cpu flame graph from olm process none
sample bad CSV manifest none

Description Ben Luddy 2021-01-13 23:53:51 UTC
Created attachment 1747236
cpu flame graph from olm process

Description of problem:

The following test in ./pkg/controller/operators/olm never terminates:

func TestGetReplacementChain(t *testing.T) {
	// A CSV whose Spec.Replaces points at its own name.
	csv := &v1alpha1.ClusterServiceVersion{
		ObjectMeta: metav1.ObjectMeta{
			Name: "foo",
		},
		Spec: v1alpha1.ClusterServiceVersionSpec{
			Replaces: "foo",
		},
	}
	// This call never returns: the chain walk keeps following foo -> foo.
	(&Operator{}).getReplacementChain(csv, map[string]*v1alpha1.ClusterServiceVersion{csv.GetName(): csv})
}
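
For illustration only, here is a minimal sketch of how a replacement-chain walk can be made to terminate by remembering which names it has already visited. This is not the actual OLM fix: the csv type and the replacementChain function are hypothetical stand-ins, and the walk direction and return values are simplified assumptions.

```go
package main

import "fmt"

// csv is a hypothetical, pared-down stand-in for
// v1alpha1.ClusterServiceVersion; only the two fields needed here are kept.
type csv struct {
	Name     string
	Replaces string
}

// replacementChain follows Replaces links starting at start and returns the
// names visited, in order. It returns an error as soon as a name is seen a
// second time, so a CSV that replaces itself (or any longer cycle) cannot
// keep the loop spinning forever.
func replacementChain(start string, csvs map[string]csv) ([]string, error) {
	visited := map[string]struct{}{}
	var chain []string
	current := start
	for {
		c, ok := csvs[current]
		if !ok {
			// The chain points at a CSV we do not know about; stop here.
			return chain, nil
		}
		if _, seen := visited[current]; seen {
			return chain, fmt.Errorf("replacement cycle detected at %q", current)
		}
		visited[current] = struct{}{}
		chain = append(chain, current)
		if c.Replaces == "" {
			// End of the chain: this CSV replaces nothing.
			return chain, nil
		}
		current = c.Replaces
	}
}
```

With the self-replacing CSV from the test above (Name "foo", Replaces "foo"), this sketch returns after one step with an error instead of looping.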


Version-Release number of selected component (if applicable): 4.6.1


How reproducible: Always?


Steps to Reproduce:
1. Create a CSV that replaces itself (sample attached).

Actual results:

The olm-operator pod jumps to 100% CPU utilization and doesn't make progress reconciling the CSV. Even after deleting the CSV, the olm-operator pod has to be deleted in order to recover.

Expected results:

CSV reconciled as normal.

Comment 1 Ben Luddy 2021-01-13 23:54:52 UTC
Created attachment 1747237
sample bad CSV manifest

Comment 3 Salvatore Colangelo 2021-01-25 21:23:59 UTC
[scolange@scolange BUG-1732914]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-01-22-134922   True        False  

[scolange@scolange BUG-1732914]$ oc -n openshift-operator-lifecycle-manager exec catalog-operator-6b79d4f799-t7vkc -- olm --version
OLM version: 0.17.0
git commit: b925df373dc9abe823193363a3a25b778114a811


1. Create an operatorGroup

[scolange@scolange .kube]$ cat operatorGroup.yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: default-og
  namespace: olm
spec:
  targetNamespaces:
  - olm 

[scolange@scolange .kube]$ oc create -f operatorGroup.yaml 
operatorgroup.operators.coreos.com/default-og created


2. Create the CSV from the attached manifest and verify it

[scolange@scolange .kube]$ oc create -f testing.yaml 
clusterserviceversion.operators.coreos.com/packageserver created


[scolange@scolange .kube]$ oc get csv -n olm
NAME            DISPLAY          VERSION   REPLACES        PHASE
packageserver   Package Server   1.0.0     packageserver   Pending


3. Verify whether the olm-operator CPU usage goes to 100%

[scolange@scolange .kube]$ kubectl -n openshift-operator-lifecycle-manager exec --stdin --tty olm-operator-8459bfb7d4-nbx28 -- /bin/bash
bash-4.4$ top

top - 21:19:04 up  1:30,  0 users,  load average: 1.29, 1.04, 1.02
Tasks:   3 total,   1 running,   2 sleeping,   0 stopped,   0 zombie
%Cpu(s): 13.7 us,  5.6 sy,  0.0 ni, 77.0 id,  0.1 wa,  1.8 hi,  1.8 si,  0.0 st
MiB Mem :  15016.3 total,   2062.8 free,   6460.9 used,   6492.7 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   8338.8 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
      1 1001      20   0 1710284 179336  35288 S   0.3   1.2   0:25.59 olm
     26 1001      20   0   12024   3068   2632 S   0.0   0.0   0:00.00 bash
     34 1001      20   0   49112   3924   3288 R   0.0   0.0   0:00.13 top


4. Delete the csv 

[scolange@scolange .kube]$ oc delete csv packageserver -n olm
clusterserviceversion.operators.coreos.com "packageserver" deleted

LGTM

Comment 6 errata-xmlrpc 2021-02-24 15:53:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633