Bug 1916021
Summary: OLM enters infinite loop if Pending CSV replaces itself
Product: OpenShift Container Platform
Reporter: Ben Luddy <bluddy>
Component: OLM
Assignee: Ben Luddy <bluddy>
OLM sub component: OLM
QA Contact: Salvatore Colangelo <scolange>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: medium
CC: bluddy, braander, jiazha, ngirard, vdinh
Version: 4.6
Keywords: Triaged
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The operator specifies a skipRange that causes it to replace itself.
Consequence: The operator is stuck in an infinite loop as OLM attempts to update the operator with itself. CPU usage spikes heavily because of this loop.
Fix: OLM now breaks out of the loop when this scenario occurs, which prevents CPU hogging.
Result: The operator is no longer stuck in a loop because of a bad skipRange.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-02-24 15:53:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1918525
Attachments:
Created attachment 1747237 [details]
sample bad CSV manifest
[scolange@scolange BUG-1732914]$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.7.0-0.nightly-2021-01-22-134922 True False
[scolange@scolange BUG-1732914]$ oc -n openshift-operator-lifecycle-manager exec catalog-operator-6b79d4f799-t7vkc -- olm --version
OLM version: 0.17.0
git commit: b925df373dc9abe823193363a3a25b778114a811

1. Create an OperatorGroup

[scolange@scolange .kube]$ cat operatorGroup.yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: default-og
  namespace: olm
spec:
  targetNamespaces:
  - olm
[scolange@scolange .kube]$ oc create -f operatorGroup.yaml
operatorgroup.operators.coreos.com/default-og created

2. Create the attached CSV and verify it

[scolange@scolange .kube]$ oc create -f testing.yaml
clusterserviceversion.operators.coreos.com/packageserver created
[scolange@scolange .kube]$ oc get csv -n olm
NAME DISPLAY VERSION REPLACES PHASE
packageserver Package Server 1.0.0 packageserver Pending

3. Verify whether the CPU usage of the olm operator goes to 100%

[scolange@scolange .kube]$ kubectl -n openshift-operator-lifecycle-manager exec --stdin --tty olm-operator-8459bfb7d4-nbx28 -- /bin/bash
bash-4.4$ top
top - 21:19:04 up 1:30, 0 users, load average: 1.29, 1.04, 1.02
Tasks: 3 total, 1 running, 2 sleeping, 0 stopped, 0 zombie
%Cpu(s): 13.7 us, 5.6 sy, 0.0 ni, 77.0 id, 0.1 wa, 1.8 hi, 1.8 si, 0.0 st
MiB Mem : 15016.3 total, 2062.8 free, 6460.9 used, 6492.7 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 8338.8 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 1001 20 0 1710284 179336 35288 S 0.3 1.2 0:25.59 olm
26 1001 20 0 12024 3068 2632 S 0.0 0.0 0:00.00 bash
34 1001 20 0 49112 3924 3288 R 0.0 0.0 0:00.13 top

4. Delete the CSV

[scolange@scolange .kube]$ oc delete csv packageserver -n olm
clusterserviceversion.operators.coreos.com "packageserver" deleted

LGTM

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
Created attachment 1747236 [details]
cpu flame graph from olm process

Description of problem:

The following test in ./pkg/controller/operators/olm never terminates:

func TestGetReplacementChain(t *testing.T) {
	csv := &v1alpha1.ClusterServiceVersion{
		ObjectMeta: metav1.ObjectMeta{
			Name: "foo",
		},
		Spec: v1alpha1.ClusterServiceVersionSpec{
			Replaces: "foo",
		},
	}
	(&Operator{}).getReplacementChain(csv, map[string]*v1alpha1.ClusterServiceVersion{csv.GetName(): csv})
}

Version-Release number of selected component (if applicable):
4.6.1

How reproducible:
Always?

Steps to Reproduce:
1. Create a CSV that replaces itself (sample attached).

Actual results:
The olm-operator pod jumps to 100% CPU utilization and doesn't make progress reconciling the CSV. Even after deleting the CSV, the olm-operator pod has to be deleted in order to recover.

Expected results:
CSV reconciled as normal.
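
For illustration only, here is a minimal sketch of how a replacement-chain walk can be made to terminate when a CSV replaces itself. It is not the OLM implementation or the actual fix: the `csv` type and the `replacementChain` function are hypothetical stand-ins, and the direction of the walk (following the `Replaces` field from the starting CSV) is an assumption. The idea is simply to remember which names have been visited and to stop as soon as one repeats.

```go
package main

// csv is a hypothetical, stripped-down stand-in for
// v1alpha1.ClusterServiceVersion; only the two fields that matter for
// the replacement chain are modeled here.
type csv struct {
	Name     string
	Replaces string
}

// replacementChain follows Replaces links starting at start and returns
// the names visited, in order. The walk stops when a link points to a
// CSV that is missing from csvs, or to one that was already visited.
// The second condition is what guarantees termination for
// self-replacing or otherwise cyclic CSVs. The boolean result reports
// whether such a cycle was detected.
func replacementChain(start *csv, csvs map[string]*csv) (chain []string, cyclic bool) {
	visited := map[string]struct{}{}
	for current := start; current != nil; {
		if _, seen := visited[current.Name]; seen {
			return chain, true
		}
		visited[current.Name] = struct{}{}
		chain = append(chain, current.Name)
		// A map lookup on a missing key yields nil, which ends the loop.
		current = csvs[current.Replaces]
	}
	return chain, false
}
```

With the input from the failing test above (a CSV named "foo" whose Replaces is also "foo"), this sketch returns a one-element chain and reports a cycle instead of spinning. A caller could then log the bad CSV or mark it as failed rather than retrying; what OLM actually does in that case is not stated in this report.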