Bug 1916021
| Summary: | OLM enters infinite loop if Pending CSV replaces itself | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ben Luddy <bluddy> | ||||||
| Component: | OLM | Assignee: | Ben Luddy <bluddy> | ||||||
| OLM sub component: | OLM | QA Contact: | Salvatore Colangelo <scolange> | ||||||
| Status: | CLOSED ERRATA | Docs Contact: | |||||||
| Severity: | medium | ||||||||
| Priority: | medium | CC: | bluddy, braander, jiazha, ngirard, vdinh | ||||||
| Version: | 4.6 | Keywords: | Triaged | ||||||
| Target Milestone: | --- | ||||||||
| Target Release: | 4.7.0 | ||||||||
| Hardware: | Unspecified | ||||||||
| OS: | Unspecified | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||||
| Doc Text: |
Cause: The operator specifies a skipRange that makes it replace itself.
Consequence: The operator is stuck in an infinite loop as OLM attempts to update the operator with itself, and CPU usage spikes heavily as a result.
Fix: OLM breaks out of the loop when this scenario occurs, which prevents the CPU hogging.
Result: The operator is no longer stuck in a loop because of a bad skipRange.
|
Story Points: | --- | ||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2021-02-24 15:53:14 UTC | Type: | Bug | ||||||
| Bug Blocks: | 1918525 | ||||||||
| Attachments: |
|
||||||||
Created attachment 1747237 [details]
sample bad CSV manifest
[scolange@scolange BUG-1732914]$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.7.0-0.nightly-2021-01-22-134922 True False
[scolange@scolange BUG-1732914]$ oc -n openshift-operator-lifecycle-manager exec catalog-operator-6b79d4f799-t7vkc -- olm --version
OLM version: 0.17.0
git commit: b925df373dc9abe823193363a3a25b778114a811
1. Create an operatorGroup
[scolange@scolange .kube]$ cat operatorGroup.yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: default-og
  namespace: olm
spec:
  targetNamespaces:
  - olm
[scolange@scolange .kube]$ oc create -f operatorGroup.yaml
operatorgroup.operators.coreos.com/default-og created
2. Create the CSV from the attachment and verify it
[scolange@scolange .kube]$ oc create -f testing.yaml
clusterserviceversion.operators.coreos.com/packageserver created
[scolange@scolange .kube]$ oc get csv -n olm
NAME DISPLAY VERSION REPLACES PHASE
packageserver Package Server 1.0.0 packageserver Pending
3. Verify whether the CPU usage of the olm-operator goes to 100%
[scolange@scolange .kube]$ kubectl -n openshift-operator-lifecycle-manager exec --stdin --tty olm-operator-8459bfb7d4-nbx28 -- /bin/bash
bash-4.4$ top
top - 21:19:04 up 1:30, 0 users, load average: 1.29, 1.04, 1.02
Tasks: 3 total, 1 running, 2 sleeping, 0 stopped, 0 zombie
%Cpu(s): 13.7 us, 5.6 sy, 0.0 ni, 77.0 id, 0.1 wa, 1.8 hi, 1.8 si, 0.0 st
MiB Mem : 15016.3 total, 2062.8 free, 6460.9 used, 6492.7 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 8338.8 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 1001 20 0 1710284 179336 35288 S 0.3 1.2 0:25.59 olm
26 1001 20 0 12024 3068 2632 S 0.0 0.0 0:00.00 bash
34 1001 20 0 49112 3924 3288 R 0.0 0.0 0:00.13 top
4. Delete the csv
[scolange@scolange .kube]$ oc delete csv packageserver -n olm
clusterserviceversion.operators.coreos.com "packageserver" deleted
LGTM
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
Created attachment 1747236 [details]
cpu flame graph from olm process

Description of problem:

The following test in ./pkg/controller/operators/olm never terminates:

    func TestGetReplacementChain(t *testing.T) {
    	csv := &v1alpha1.ClusterServiceVersion{
    		ObjectMeta: metav1.ObjectMeta{
    			Name: "foo",
    		},
    		Spec: v1alpha1.ClusterServiceVersionSpec{
    			Replaces: "foo",
    		},
    	}
    	(&Operator{}).getReplacementChain(csv, map[string]*v1alpha1.ClusterServiceVersion{csv.GetName(): csv})
    }

Version-Release number of selected component (if applicable):
4.6.1

How reproducible:
Always?

Steps to Reproduce:
1. Create a CSV that replaces itself (sample attached).

Actual results:
The olm-operator pod jumps to 100% CPU utilization and doesn't make progress reconciling the CSV. Even after deleting the CSV, the olm-operator pod has to be deleted in order to recover.

Expected results:
CSV reconciled as normal.
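To illustrate why a self-replacing CSV can hang a chain walk, and one way such a walk could be made to terminate, here is a minimal standalone sketch. It is not the actual OLM fix: the `csv` type and the `replacementChain` function are hypothetical simplifications, and the real `getReplacementChain` operates on `v1alpha1.ClusterServiceVersion` objects and may traverse the chain differently. The idea shown is only to remember which names have been visited and stop as soon as one repeats.

```go
package main

import "fmt"

// csv is a hypothetical, pared-down stand-in for a ClusterServiceVersion;
// only the two fields that matter for the walk are kept.
type csv struct {
	Name     string
	Replaces string
}

// replacementChain follows Replaces links starting at start and returns the
// names visited, in order. The walk stops when the chain ends, when the next
// CSV is not in the map, or when it reaches a name it has already visited.
// The second result reports whether such a cycle was detected.
func replacementChain(start *csv, all map[string]*csv) ([]string, bool) {
	seen := map[string]struct{}{}
	var chain []string
	for cur := start; cur != nil; cur = all[cur.Replaces] {
		if _, ok := seen[cur.Name]; ok {
			// Without this check, a CSV that replaces itself loops forever.
			return chain, true
		}
		seen[cur.Name] = struct{}{}
		chain = append(chain, cur.Name)
	}
	return chain, false
}

func main() {
	// Same shape as the failing test: "foo" replaces "foo".
	foo := &csv{Name: "foo", Replaces: "foo"}
	chain, cyclic := replacementChain(foo, map[string]*csv{foo.Name: foo})
	fmt.Println(chain, cyclic) // prints "[foo] true"
}
```

The visited set also covers longer cycles (for example, a replaces b and b replaces a), not just the self-replacement case reported here.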