Bug 1916021 - OLM enters infinite loop if Pending CSV replaces itself
Summary: OLM enters infinite loop if Pending CSV replaces itself
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Ben Luddy
QA Contact: Salvatore Colangelo
URL:
Whiteboard:
Depends On:
Blocks: 1918525
Reported: 2021-01-13 23:53 UTC by Ben Luddy
Modified: 2022-10-11 09:28 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The operator specifies a skipRange that makes its CSV replace itself. Consequence: OLM enters an infinite loop as it attempts to update the operator with itself, and CPU usage spikes heavily. Fix: OLM now breaks out of the loop when a CSV replacement chain contains a cycle, which prevents the CPU hogging. Result: The operator is no longer stuck in a loop because of a bad skipRange.
Clone Of:
Environment:
Last Closed: 2021-02-24 15:53:14 UTC
Target Upstream Version:
Embargoed:


Attachments
cpu flame graph from olm process (271.49 KB, image/png), 2021-01-13 23:53 UTC, Ben Luddy
sample bad CSV manifest (4.05 KB, text/plain), 2021-01-13 23:54 UTC, Ben Luddy


Links
System ID Private Priority Status Summary Last Updated
Github operator-framework operator-lifecycle-manager pull 1966 0 None closed Bug 1916021: Fix infinite loop when a CSV replacement chain contains a cycle. 2021-02-16 12:07:28 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:53:39 UTC

Description Ben Luddy 2021-01-13 23:53:51 UTC
Created attachment 1747236 [details]
cpu flame graph from olm process

Description of problem:

The following test in ./pkg/controller/operators/olm never terminates:

func TestGetReplacementChain(t *testing.T) {
	csv := &v1alpha1.ClusterServiceVersion{
		ObjectMeta: metav1.ObjectMeta{
			Name: "foo",
		},
		Spec: v1alpha1.ClusterServiceVersionSpec{
			Replaces: "foo",
		},
	}
	(&Operator{}).getReplacementChain(csv, map[string]*v1alpha1.ClusterServiceVersion{csv.GetName(): csv})
}
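
For illustration only, here is a minimal, self-contained sketch of one way a walk over "replaces" links can be made to terminate when the chain contains a cycle. It is not the code from the linked pull request and does not use the real OLM types: the replacementChain helper and the name-to-replaces map are hypothetical stand-ins for getReplacementChain and the map of v1alpha1.ClusterServiceVersion objects used in the test above.

```go
package main

import "fmt"

// replacementChain is a hypothetical helper (not OLM's getReplacementChain).
// It follows "replaces" links starting at start, where replaces maps a CSV
// name to the name it replaces, and stops as soon as a name repeats.
func replacementChain(start string, replaces map[string]string) []string {
	seen := map[string]bool{}
	var chain []string
	for name := start; name != ""; name = replaces[name] {
		if seen[name] {
			// Cycle detected (for example, a CSV that replaces itself):
			// stop instead of looping forever.
			break
		}
		seen[name] = true
		chain = append(chain, name)
	}
	return chain
}

func main() {
	// Same shape as the failing test: "foo" replaces "foo".
	fmt.Println(replacementChain("foo", map[string]string{"foo": "foo"})) // prints [foo]

	// An ordinary chain without a cycle is walked to its end.
	fmt.Println(replacementChain("a", map[string]string{"a": "b", "b": "c"})) // prints [a b c]
}
```

Tracking visited names costs one map entry per CSV in the chain, so the walk visits each name at most once regardless of how the replaces links are arranged.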


Version-Release number of selected component (if applicable): 4.6.1


How reproducible: Always?


Steps to Reproduce:
1. Create a CSV that replaces itself (sample attached).

Actual results:

The olm-operator pod jumps to 100% CPU utilization and doesn't make progress reconciling the CSV. Even after deleting the CSV, the olm-operator pod has to be deleted in order to recover.

Expected results:

CSV reconciled as normal.

Comment 1 Ben Luddy 2021-01-13 23:54:52 UTC
Created attachment 1747237 [details]
sample bad CSV manifest

Comment 3 Salvatore Colangelo 2021-01-25 21:23:59 UTC
[scolange@scolange BUG-1732914]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-01-22-134922   True        False  

[scolange@scolange BUG-1732914]$ oc -n openshift-operator-lifecycle-manager exec catalog-operator-6b79d4f799-t7vkc -- olm --version
OLM version: 0.17.0
git commit: b925df373dc9abe823193363a3a25b778114a811


1. Create an operatorGroup

[scolange@scolange .kube]$ cat operatorGroup.yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: default-og
  namespace: olm
spec:
  targetNamespaces:
  - olm 

[scolange@scolange .kube]$ oc create -f operatorGroup.yaml 
operatorgroup.operators.coreos.com/default-og created


2. Create the attached CSV and verify it

[scolange@scolange .kube]$ oc create -f testing.yaml 
clusterserviceversion.operators.coreos.com/packageserver created


[scolange@scolange .kube]$ oc get csv -n olm
NAME            DISPLAY          VERSION   REPLACES        PHASE
packageserver   Package Server   1.0.0     packageserver   Pending


3. Verify whether the CPU usage of the olm-operator goes to 100%

[scolange@scolange .kube]$ kubectl -n openshift-operator-lifecycle-manager exec --stdin --tty olm-operator-8459bfb7d4-nbx28 -- /bin/bash
bash-4.4$ top

top - 21:19:04 up  1:30,  0 users,  load average: 1.29, 1.04, 1.02
Tasks:   3 total,   1 running,   2 sleeping,   0 stopped,   0 zombie
%Cpu(s): 13.7 us,  5.6 sy,  0.0 ni, 77.0 id,  0.1 wa,  1.8 hi,  1.8 si,  0.0 st
MiB Mem :  15016.3 total,   2062.8 free,   6460.9 used,   6492.7 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   8338.8 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                     
      1 1001      20   0 1710284 179336  35288 S   0.3   1.2   0:25.59 olm                                                                                                                                         
     26 1001      20   0   12024   3068   2632 S   0.0   0.0   0:00.00 bash                                                                                                                                        
     34 1001      20   0   49112   3924   3288 R   0.0   0.0   0:00.13 top         


4. Delete the csv 

[scolange@scolange .kube]$ oc delete csv packageserver -n olm
clusterserviceversion.operators.coreos.com "packageserver" deleted

LGTM

Comment 6 errata-xmlrpc 2021-02-24 15:53:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

