Created attachment 1747236 [details]
cpu flame graph from olm process

Description of problem:

The following test in ./pkg/controller/operators/olm never terminates:

    func TestGetReplacementChain(t *testing.T) {
        csv := &v1alpha1.ClusterServiceVersion{
            ObjectMeta: metav1.ObjectMeta{
                Name: "foo",
            },
            Spec: v1alpha1.ClusterServiceVersionSpec{
                Replaces: "foo",
            },
        }

        (&Operator{}).getReplacementChain(csv, map[string]*v1alpha1.ClusterServiceVersion{csv.GetName(): csv})
    }

Version-Release number of selected component (if applicable):
4.6.1

How reproducible:
Always?

Steps to Reproduce:
1. Create a CSV that replaces itself (sample attached).

Actual results:
The olm-operator pod jumps to 100% CPU utilization and doesn't make progress reconciling the CSV. Even after deleting the CSV, the olm-operator pod has to be deleted in order to recover.

Expected results:
CSV reconciled as normal.
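For illustration only, here is a minimal sketch of how a walk over "replaces" links can be made to terminate when a CSV names itself (or any earlier CSV) as its predecessor. It is not the OLM implementation or the fix that shipped: the `csv` type and the `replacesChain` function are hypothetical, simplified stand-ins for `v1alpha1.ClusterServiceVersion` and `getReplacementChain`, and the only technique shown is a visited set.

```go
package main

import "fmt"

// csv is a hypothetical, simplified stand-in for v1alpha1.ClusterServiceVersion.
type csv struct {
	Name     string
	Replaces string
}

// replacesChain follows Replaces links starting at start and returns the
// names visited in order. It stops at the end of the chain, at a name that
// is missing from byName, or as soon as a name repeats, in which case
// cycle is true.
func replacesChain(start string, byName map[string]csv) (chain []string, cycle bool) {
	seen := map[string]bool{}
	for name := start; name != ""; {
		if seen[name] {
			// Already visited: the links loop back, so stop walking.
			return chain, true
		}
		c, ok := byName[name]
		if !ok {
			break
		}
		seen[name] = true
		chain = append(chain, name)
		name = c.Replaces
	}
	return chain, false
}

func main() {
	// The self-replacing CSV from the report: foo replaces foo.
	self := map[string]csv{"foo": {Name: "foo", Replaces: "foo"}}
	chain, cycle := replacesChain("foo", self)
	fmt.Println(chain, cycle) // prints "[foo] true"
}
```

With the visited set, the self-replacing case returns after one step and reports the cycle to the caller, which could then reject or fail the CSV rather than spin.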
Created attachment 1747237 [details] sample bad CSV manifest
[scolange@scolange BUG-1732914]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-01-22-134922   True        False

[scolange@scolange BUG-1732914]$ oc -n openshift-operator-lifecycle-manager exec catalog-operator-6b79d4f799-t7vkc -- olm --version
OLM version: 0.17.0
git commit: b925df373dc9abe823193363a3a25b778114a811

1. Create an OperatorGroup

[scolange@scolange .kube]$ cat operatorGroup.yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: default-og
  namespace: olm
spec:
  targetNamespaces:
  - olm

[scolange@scolange .kube]$ oc create -f operatorGroup.yaml
operatorgroup.operators.coreos.com/default-og created

2. Create the attached CSV and verify it

[scolange@scolange .kube]$ oc create -f testing.yaml
clusterserviceversion.operators.coreos.com/packageserver created

[scolange@scolange .kube]$ oc get csv -n olm
NAME            DISPLAY          VERSION   REPLACES        PHASE
packageserver   Package Server   1.0.0     packageserver   Pending

3. Verify whether the olm-operator CPU goes to 100%

[scolange@scolange .kube]$ kubectl -n openshift-operator-lifecycle-manager exec --stdin --tty olm-operator-8459bfb7d4-nbx28 -- /bin/bash
bash-4.4$ top
top - 21:19:04 up 1:30, 0 users, load average: 1.29, 1.04, 1.02
Tasks: 3 total, 1 running, 2 sleeping, 0 stopped, 0 zombie
%Cpu(s): 13.7 us, 5.6 sy, 0.0 ni, 77.0 id, 0.1 wa, 1.8 hi, 1.8 si, 0.0 st
MiB Mem : 15016.3 total, 2062.8 free, 6460.9 used, 6492.7 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 8338.8 avail Mem

PID  USER  PR  NI  VIRT     RES     SHR    S  %CPU  %MEM  TIME+    COMMAND
1    1001  20  0   1710284  179336  35288  S  0.3   1.2   0:25.59  olm
26   1001  20  0   12024    3068    2632   S  0.0   0.0   0:00.00  bash
34   1001  20  0   49112    3924    3288   R  0.0   0.0   0:00.13  top

4. Delete the CSV

[scolange@scolange .kube]$ oc delete csv packageserver -n olm
clusterserviceversion.operators.coreos.com "packageserver" deleted

LGTM
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633