Bug 2134768

Summary: olm-operator pod is consuming high CPU after ODF installation
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Manuel Gotin <manuel.gotin>
Component: odf-operator
Assignee: Nitin Goyal <nigoyal>
Status: CLOSED NOTABUG
QA Contact: Martin Bukatovic <mbukatov>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.12
CC: abusch, axel.busch, ebenahar, epasch, Holger.Wolf, jrivera, madam, manuel.gotin, muagarwa, ocs-bugs, odf-bz-bot, rishika.kedia, sostapov
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-10-26 07:33:02 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description Flags
olm-operator log none

Description Manuel Gotin 2022-10-14 08:41:16 UTC
Created attachment 1918018 [details]
olm-operator log

> Description of problem: 

After installing ODF 4.12 on a fresh OCP 4.12 cluster, we see very high CPU consumption by the olm-operator pod in the openshift-operator-lifecycle-manager namespace.
The issue is also observed with OCP 4.11 + ODF 4.11, on both s390x and x86.

After the ODF installation, the olm-operator reports a CPU usage of 700m-1200m (millicores), an increase of more than 500m:

----------------------------------------------------------------
$ oc adm top pods -n openshift-operator-lifecycle-manager

NAME                                      CPU(cores)   MEMORY(bytes)
catalog-operator-76d8bd4744-q7grv         1m           155Mi
olm-operator-59f6f9d47c-4gnh7             766m         257Mi
package-server-manager-845c445bcc-q9mh4   0m           26Mi
packageserver-7bc9dcb955-vfw54            3m           190Mi
packageserver-7bc9dcb955-wqpzj            6m           182Mi
----------------------------------------------------------------
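To make the observation above repeatable, here is a minimal Python sketch that parses text in the format printed by `oc adm top pods` and lists pods above a CPU threshold. The sample text is copied from this report; the function names and the 500m threshold are hypothetical choices for illustration, not part of any tooling.

```python
# Sample text copied from this report. In practice it would be captured from
# `oc adm top pods -n openshift-operator-lifecycle-manager`.
SAMPLE = """\
NAME                                      CPU(cores)   MEMORY(bytes)
catalog-operator-76d8bd4744-q7grv         1m           155Mi
olm-operator-59f6f9d47c-4gnh7             766m         257Mi
package-server-manager-845c445bcc-q9mh4   0m           26Mi
packageserver-7bc9dcb955-vfw54            3m           190Mi
packageserver-7bc9dcb955-wqpzj            6m           182Mi
"""


def parse_millicores(value: str) -> int:
    """Convert a CPU column value such as '766m' to millicores.

    Values without the 'm' suffix are treated as whole cores (an assumption;
    the output in this report only uses the 'm' form).
    """
    if value.endswith("m"):
        return int(value[:-1])
    return int(float(value) * 1000)


def pods_over_threshold(top_output: str, threshold_m: int) -> list[tuple[str, int]]:
    """Return (pod name, millicores) for every pod above threshold_m."""
    flagged = []
    for line in top_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        cpu = parse_millicores(fields[1])
        if cpu > threshold_m:
            flagged.append((fields[0], cpu))
    return flagged


if __name__ == "__main__":
    # 500m is a hypothetical threshold chosen for this example.
    for name, cpu in pods_over_threshold(SAMPLE, 500):
        print(f"{name}: {cpu}m")
```

With the sample above, only the olm-operator pod exceeds the threshold.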

The logs of the olm-operator reveal a sync issue:

----------------------------------------------------------------
$ oc logs olm-operator-59f6f9d47c-4gnh7 -n openshift-operator-lifecycle-manager

{"level":"error","ts":1665648524.4145186,"logger":"controllers.operator","msg":"Could not update Operator status","request":"/ocs-operator.openshift-storage","error":"Operation cannot be fulfilled on operators.operators.coreos.com \"ocs-operator.openshift-storage\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"}

E1013 09:04:29.346569       1 queueinformer_operator.go:290] sync {"update" "openshift-operator-lifecycle-manager/packageserver"} failed: could not update operatorgroups olm.providedAPIs annotation: Operation cannot be fulfilled on operatorgroups.operators.coreos.com "olm-operators": the object has been modified; please apply your changes to the latest version and try again
E1013 09:07:53.629534       1 queueinformer_operator.go:290] sync {"update" "openshift-operator-lifecycle-manager/packageserver"} failed: could not update operatorgroups olm.providedAPIs annotation: Operation cannot be fulfilled on operatorgroups.operators.coreos.com "olm-operators": the object has been modified; please apply your changes to the latest version and try again
E1013 09:21:30.863620       1 queueinformer_operator.go:290] sync {"update" "openshift-operator-lifecycle-manager/packageserver"} failed: could not update operatorgroups olm.providedAPIs annotation: Operation cannot be fulfilled on operatorgroups.operators.coreos.com "olm-operators": the object has been modified; please apply your changes to the latest version and try again

----------------------------------------------------------------
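When triaging a longer log, it can help to count the update conflicts per object. The following Python sketch does that for lines in the two formats shown above; the sample lines are taken from this report (with the `ts`, `logger`, `request` and `stacktrace` fields of the JSON entry omitted for brevity), and the regular expression and function name are illustrative assumptions rather than an established tool.

```python
import re
from collections import Counter

CONFLICT_MARKER = "the object has been modified"

# Captures the resource and the object name from
#   Operation cannot be fulfilled on <resource> "<name>"
# The quotes may be backslash-escaped when the message sits inside a JSON string.
TARGET_RE = re.compile(r'Operation cannot be fulfilled on (\S+) \\?"([^"\\]+)\\?"')

KLOG_SUFFIX = (
    '1 queueinformer_operator.go:290] sync {"update" '
    '"openshift-operator-lifecycle-manager/packageserver"} failed: could not update '
    "operatorgroups olm.providedAPIs annotation: Operation cannot be fulfilled on "
    'operatorgroups.operators.coreos.com "olm-operators": the object has been modified; '
    "please apply your changes to the latest version and try again"
)

SAMPLE_LINES = [
    # JSON entry from the report, several fields omitted.
    r'{"level":"error","msg":"Could not update Operator status","error":"Operation cannot be fulfilled on operators.operators.coreos.com \"ocs-operator.openshift-storage\": the object has been modified; please apply your changes to the latest version and try again"}',
    "E1013 09:04:29.346569       " + KLOG_SUFFIX,
    "E1013 09:07:53.629534       " + KLOG_SUFFIX,
    "E1013 09:21:30.863620       " + KLOG_SUFFIX,
]


def count_conflicts(log_lines):
    """Count conflict errors per (resource, object name)."""
    counts = Counter()
    for line in log_lines:
        if CONFLICT_MARKER not in line:
            continue
        match = TARGET_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    for (resource, name), n in count_conflicts(SAMPLE_LINES).most_common():
        print(f"{n:3d}  {resource}  {name}")
```

To run it against a real log, the output of `oc logs` could be saved to a file and its lines passed to `count_conflicts` in place of `SAMPLE_LINES`.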


> Version of all relevant components (if applicable):

4.12: We have observed this issue with OCP 4.12.0-ec.2 and odf-operator.4.12.0.
4.11: We have observed this issue with OCP 4.11.7/4.11.6 and odf-operator.v4.11.0 


> Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

The high CPU usage of the olm-operator has a severe impact on customers.
This usage translates into one of the CPUs being fully utilized.
This raises the cost of operating ODF on our platform immensely.
On IBM Z, the number of consumed CPUs is the key driver of hardware costs, and customers are extremely sensitive in that regard.

> Is there any workaround available to the best of your knowledge?

There is no known workaround.


> Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

1 


> Is this issue reproducible?

Yes, by installing OCP 4.11.7 (or 4.12.0-ec.2) on s390x or x86 and then installing ODF 4.11 (or ODF 4.12) on top of it.


> Can this issue reproduce from the UI?

-

> If this is a regression, please provide more details to justify this:

-

> Steps to Reproduce:

1. Install OCP 4.12.0-ec.2 on s390x or x86
2. Install ODF 4.12
3. Observe the CPU usage of the olm-operator pod


> Actual results:

The olm-operator consumes up to 1000m continuously, even after a seemingly successful installation and letting the cluster idle for hours/days.

> Expected results:

The olm-operator should not consume this much CPU.

> Additional info:

Please see the logs of the olm-operator for ODF 4.12 (s390x).

Comment 2 Manuel Gotin 2022-10-19 11:41:43 UTC
The high CPU usage of the olm-operator has a severe impact on customers.
This usage translates into one of the CPUs being fully utilized.
This raises the cost of operating ODF on our platform immensely.
On IBM Z, the number of consumed CPUs is the key driver of hardware costs, and customers are extremely sensitive in that regard.

Comment 3 Elad 2022-10-19 11:43:49 UTC
Proposing as a blocker for 4.12.0 based on comment #2

Comment 5 Axel Busch 2022-10-26 07:07:07 UTC
Can be closed -> considered in https://issues.redhat.com/browse/OCPBUGS-2556

Comment 6 Nitin Goyal 2022-10-26 07:33:02 UTC
Thanks for the confirmation, Axel. I really appreciate the quick response.

Closing the bug in ODF, as the fix is in OLM and Axel is working on it.