Bug 2134768 - olm-operator pod is consuming high CPU after ODF installation
Summary: olm-operator pod is consuming high CPU after ODF installation
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-operator
Version: 4.12
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Nitin Goyal
QA Contact: Martin Bukatovic
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-10-14 08:41 UTC by Manuel Gotin
Modified: 2023-08-09 17:00 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-10-26 07:33:02 UTC
Embargoed:


Attachments:
olm-operator log (770.39 KB, text/plain)
2022-10-14 08:41 UTC, Manuel Gotin

Description Manuel Gotin 2022-10-14 08:41:16 UTC
Created attachment 1918018 [details]
olm-operator log

> Description of problem: 

After installing ODF 4.12 on a fresh OCP 4.12 cluster, we observe very high CPU consumption by the olm-operator pod in the openshift-operator-lifecycle-manager namespace.
The same issue is also observed with OCP 4.11 + ODF 4.11, on both s390x and x86.

After the ODF installation, the olm-operator reports a CPU usage of 700-1200m (millicores), an increase of more than 500m:

----------------------------------------------------------------
$ oc adm top pods -n openshift-operator-lifecycle-manager

NAME                                      CPU(cores)   MEMORY(bytes)
catalog-operator-76d8bd4744-q7grv         1m           155Mi
olm-operator-59f6f9d47c-4gnh7             766m         257Mi
package-server-manager-845c445bcc-q9mh4   0m           26Mi
packageserver-7bc9dcb955-vfw54            3m           190Mi
packageserver-7bc9dcb955-wqpzj            6m           182Mi
----------------------------------------------------------------
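
To check for this condition repeatedly (for example, before and after the ODF installation), the table above can be filtered by script. The following is only a minimal sketch: the helper names and the 500m threshold are assumptions chosen for illustration, and the sample text is copied from the output above.

```python
# Minimal sketch: flag pods whose CPU usage in `oc adm top pods` output
# exceeds a threshold. Helper names and the threshold are illustrative
# assumptions, not part of any OpenShift tooling.

def parse_cpu_millicores(value: str) -> int:
    """Convert a CPU quantity such as '766m' or '2' to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    # A value without the 'm' suffix is expressed in whole cores.
    return int(float(value) * 1000)


def pods_above(top_output: str, threshold_m: int) -> list[tuple[str, int]]:
    """Return (pod name, millicores) for every pod above threshold_m."""
    hot = []
    for line in top_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        cpu = parse_cpu_millicores(fields[1])
        if cpu > threshold_m:
            hot.append((fields[0], cpu))
    return hot


SAMPLE = """\
NAME                                      CPU(cores)   MEMORY(bytes)
catalog-operator-76d8bd4744-q7grv         1m           155Mi
olm-operator-59f6f9d47c-4gnh7             766m         257Mi
package-server-manager-845c445bcc-q9mh4   0m           26Mi
packageserver-7bc9dcb955-vfw54            3m           190Mi
packageserver-7bc9dcb955-wqpzj            6m           182Mi
"""

print(pods_above(SAMPLE, 500))
# prints [('olm-operator-59f6f9d47c-4gnh7', 766)]
```

In practice the text would come from the `oc adm top pods` command shown above rather than from a hard-coded sample.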

The logs of the olm-operator reveal a sync issue:

----------------------------------------------------------------
$ oc logs olm-operator-59f6f9d47c-4gnh7 -n openshift-operator-lifecycle-manager

{"level":"error","ts":1665648524.4145186,"logger":"controllers.operator","msg":"Could not update Operator status","request":"/ocs-operator.openshift-storage","error":"Operation cannot be fulfilled on operators.operators.coreos.com \"ocs-operator.openshift-storage\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"}

E1013 09:04:29.346569       1 queueinformer_operator.go:290] sync {"update" "openshift-operator-lifecycle-manager/packageserver"} failed: could not update operatorgroups olm.providedAPIs annotation: Operation cannot be fulfilled on operatorgroups.operators.coreos.com "olm-operators": the object has been modified; please apply your changes to the latest version and try again
E1013 09:07:53.629534       1 queueinformer_operator.go:290] sync {"update" "openshift-operator-lifecycle-manager/packageserver"} failed: could not update operatorgroups olm.providedAPIs annotation: Operation cannot be fulfilled on operatorgroups.operators.coreos.com "olm-operators": the object has been modified; please apply your changes to the latest version and try again
E1013 09:21:30.863620       1 queueinformer_operator.go:290] sync {"update" "openshift-operator-lifecycle-manager/packageserver"} failed: could not update operatorgroups olm.providedAPIs annotation: Operation cannot be fulfilled on operatorgroups.operators.coreos.com "olm-operators": the object has been modified; please apply your changes to the latest version and try again

----------------------------------------------------------------
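
Both kinds of log line above report the Kubernetes update conflict "the object has been modified", which the API server returns when an update is based on a stale version of the object. To see which objects are affected and how often, the attached log can be tallied. This is a rough sketch only: the function name is hypothetical, the sample lines are shortened versions of the excerpts above, and it makes no claim about the root cause.

```python
# Rough sketch: count update-conflict errors per object in an olm-operator log.
# Handles both the JSON-formatted lines and the klog-style "E1013 ..." lines.
import json
import re
from collections import Counter

CONFLICT = re.compile(
    r'Operation cannot be fulfilled on (\S+) "([^"]+)": '
    r"the object has been modified"
)


def conflict_counts(log_text: str) -> Counter:
    """Return a Counter keyed by (resource, object name)."""
    counts = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        message = line
        if line.startswith("{"):
            # JSON log line: the conflict text is in the "error" field.
            try:
                message = json.loads(line).get("error", "")
            except json.JSONDecodeError:
                pass  # Not valid JSON after all; fall back to the raw line.
        match = CONFLICT.search(message)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Shortened sample lines based on the excerpts above.
SAMPLE = "\n".join([
    json.dumps({
        "level": "error",
        "msg": "Could not update Operator status",
        "error": 'Operation cannot be fulfilled on operators.operators.coreos.com '
                 '"ocs-operator.openshift-storage": the object has been modified; '
                 "please apply your changes to the latest version and try again",
    }),
    'E1013 09:04:29.346569       1 queueinformer_operator.go:290] sync failed: '
    'Operation cannot be fulfilled on operatorgroups.operators.coreos.com '
    '"olm-operators": the object has been modified; please apply your changes',
    'E1013 09:07:53.629534       1 queueinformer_operator.go:290] sync failed: '
    'Operation cannot be fulfilled on operatorgroups.operators.coreos.com '
    '"olm-operators": the object has been modified; please apply your changes',
])

for (resource, name), n in conflict_counts(SAMPLE).most_common():
    print(f"{n:4d}  {resource}  {name}")
```

A count that keeps growing for the same object while the cluster is idle would be consistent with the continuous CPU usage reported here.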


> Version of all relevant components (if applicable):

4.12: We have observed this issue with OCP 4.12.0-ec.2 and odf-operator.4.12.0.
4.11: We have observed this issue with OCP 4.11.7/4.11.6 and odf-operator.v4.11.0 


> Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

The high CPU usage of the olm-operator has a severe impact on customers.
This usage amounts to one of the CPUs being fully utilized.
This raises the cost of operating ODF on our platform immensely.
On IBM Z, the number of consumed CPUs is the key driver of hardware costs, and customers are extremely sensitive in that regard.

> Is there any workaround available to the best of your knowledge?

There is no known workaround.


> Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

1 


> Is this issue reproducible?

Yes, by installing OCP 4.11.7 (or 4.12.0-ec.2) on s390x or x86 and then installing ODF 4.11 (or ODF 4.12) on top of it.


> Can this issue be reproduced from the UI?

-

> If this is a regression, please provide more details to justify this:

-

> Steps to Reproduce:

1. Install OCP 4.12.0-ec.2 on s390x or x86
2. Install ODF 4.12
3. Observe the CPU usage of the olm-operator


> Actual results:

The olm-operator continuously consumes up to 1000m of CPU, even after a seemingly successful installation and after the cluster has been left idle for hours or days.

> Expected results:

The olm-operator should not consume this much CPU.

> Additional info:

Please see the logs of the olm-operator for ODF 4.12 (s390x).

Comment 2 Manuel Gotin 2022-10-19 11:41:43 UTC
The high CPU usage of the olm-operator has a severe impact on customers.
This usage amounts to one of the CPUs being fully utilized.
This raises the cost of operating ODF on our platform immensely.
On IBM Z, the number of consumed CPUs is the key driver of hardware costs, and customers are extremely sensitive in that regard.

Comment 3 Elad 2022-10-19 11:43:49 UTC
Proposing as a blocker for 4.12.0 based on comment #2

Comment 5 Axel Busch 2022-10-26 07:07:07 UTC
This bug can be closed; the issue is covered by https://issues.redhat.com/browse/OCPBUGS-2556

Comment 6 Nitin Goyal 2022-10-26 07:33:02 UTC
Thanks for the confirmation, Axel. I really appreciate the quick response.

Closing this bug in ODF, as the fix is in OLM and Axel is working on it.

