Bug 2275190 - Possible Memory Leak on the rook-ceph-operator Resource
Summary: Possible Memory Leak on the rook-ceph-operator Resource
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Santosh Pillai
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-04-15 21:38 UTC by Craig Wayman
Modified: 2024-11-23 04:25 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-07-23 15:15:22 UTC
Embargoed:



Description Craig Wayman 2024-04-15 21:38:34 UTC
Created attachment 2027074 [details]
April 3 - 4 Memory Graph

Description of problem (please be as detailed as possible and provide log snippets):

  This is the first time I've seen this. The customer noticed that the rook-ceph-operator pod's RAM usage increases at an excessive rate until the pod gets OOMKilled. We troubleshot the issue; the final troubleshooting step was giving the customer all new subscriptions after purging the Subscriptions/CSVs and re-installing the odf-operator on the same stable channel. However, the behavior is still present.

  During a monitoring phase, the customer submitted memory graphs for the pod that confirmed this behavior.
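
  (For reference, the pod's memory use can also be spot-checked on the cluster with something like the following; this assumes cluster monitoring/metrics is available:)

  $ oc adm top pod -n openshift-storage | grep rook-ceph-operator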


Version of all relevant components (if applicable):

OCP:

NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.15   True        False         23d     Cluster version is 4.14.15


ODF:

NAME                                      DISPLAY                                          VERSION        REPLACES                                  PHASE
mcg-operator.v4.14.5-rhodf                NooBaa Operator                                  4.14.5-rhodf   mcg-operator.v4.14.4-rhodf                Succeeded
ocs-operator.v4.14.5-rhodf                OpenShift Container Storage                      4.14.5-rhodf   ocs-operator.v4.14.4-rhodf                Succeeded
odf-csi-addons-operator.v4.14.5-rhodf     CSI Addons                                       4.14.5-rhodf   odf-csi-addons-operator.v4.14.4-rhodf     Succeeded
odf-operator.v4.14.5-rhodf                OpenShift Data Foundation                        4.14.5-rhodf   odf-operator.v4.14.4-rhodf                Succeeded

Ceph:

{
    "mon": {
        "ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)": 3
    },
    "mgr": {
        "ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)": 1
    },
    "osd": {
        "ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)": 12
    },
    "mds": {
        "ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)": 2
    },
    "rgw": {
        "ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)": 1
    },
    "overall": {
        "ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)": 19
    }
}
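
(For reference, the version information above can typically be reproduced with commands along these lines; the rook-ceph-tools toolbox deployment name is an assumption based on a default ODF install with the toolbox enabled:)

$ oc get clusterversion                    # OCP cluster version
$ oc get csv -n openshift-storage          # ODF operator CSVs
$ oc -n openshift-storage rsh deploy/rook-ceph-tools ceph versions   # Ceph daemon versions via the toolbox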



Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?

No, but the behavior is concerning, and depending on what Rook is attempting to accomplish, it may affect the cluster.


Is there any workaround available to the best of your knowledge?

No


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

3

Additional info:

  I will attach some memory graphs submitted by the customer.

Comment 6 Santosh Pillai 2024-04-23 05:51:48 UTC
Craig, I'm looking into this. I'll provide an update soon.

Comment 7 Santosh Pillai 2024-04-23 05:54:01 UTC
Is the customer still facing this issue?

Comment 10 Santosh Pillai 2024-04-23 16:19:09 UTC
Thanks for the update, Craig.

The must-gather attached to the case has an empty `previous logs` file, and the `current` logs do not provide any useful information.

Any chance I can get the Rook logs from just before the pod goes into CLBO due to memory? Those logs might provide some useful information.
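
(For reference, if the operator restarts again, the pre-restart logs can usually be captured before they are lost with something like the following; the pod name is a placeholder:)

$ oc -n openshift-storage logs <rook-ceph-operator-pod> --previous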

Comment 12 Craig Wayman 2024-04-24 14:55:48 UTC
Good Morning, 

  In addition to my previous comment c#11, the customer has submitted fresh rook-ceph-operator logs and conveyed the following information:

"rook-ceph-operator (rook-ceph-operator-844696dc89-rhwnf) got OOKilled/Evicted with the message The node was low on resource: memory. Threshold quantity: 100Mi, available: 46608Ki. Container rook-ceph-operator was using 83601708Ki, request is 0, has larger consumption of memory."

  I will attach the new set of operator logs for Engineering's review.

Regards,


Craig Wayman
TSE Red Hat OpenShift Data Foundations (ODF) 
Customer Experience and Engagement, NA

Comment 15 Santosh Pillai 2024-05-02 08:53:38 UTC
Hi Craig, 

The latest rook-ceph-operator logs do not provide much info on what might be causing this issue.
Could you provide some insight into the customer cluster, such as how many nodes there are and how many pods are running in total?
I don't think anything else would be required from the customer at this time.

There is also a similar upstream issue which indicates that the problem was resolved after upgrading to Rook 1.13.8 (https://github.com/rook/rook/issues/14051#issuecomment-2089769475), while ODF 4.14 appears to be using Rook 1.13.7.

Still no answers as to what might be causing this.
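
(For reference, the Rook version actually running can usually be confirmed from the operator's startup log line; this assumes the default openshift-storage namespace of an ODF install:)

$ oc -n openshift-storage logs deploy/rook-ceph-operator | grep -i "starting rook"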

Comment 19 Santosh Pillai 2024-05-08 04:26:45 UTC
Hi,
Although the upstream issue (https://github.com/rook/rook/issues/14051) was fixed after upgrading to 1.13.8, we still don't know the root cause. So we can't really suggest that the customer upgrade without an RCA, especially since the upstream branches might also diverge a bit from the downstream patch releases.

Comment 21 Santosh Pillai 2024-05-13 02:53:22 UTC
Craig, I'll let you know if anything else is needed. This might need more investigation, as it's not clear from the logs what might be causing this memory leak. One approach is to try to reproduce the behavior locally; I'll try that out this week.


(Moving this out of 4.16, as we have dev freeze on 14/5 and we are still investigating this BZ. Also, it's not a blocker for 4.16.)

Comment 28 Santosh Pillai 2024-06-10 05:08:10 UTC
Hi Craig

What is the latest update on this? Is the customer still facing the issue? Are they using the cron job? Any idea if they are planning to upgrade?

Comment 31 Red Hat Bugzilla 2024-11-23 04:25:04 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

