Created attachment 2027074 [details]
April 3 - 4 Memory Graph

Description of problem (please be detailed as possible and provide log snippets):

This is the first time I’ve seen this. The customer noticed the rook-ceph-operator pod’s memory usage increasing at an excessive rate until the pod gets OOMKilled. We troubleshot the issue; the final troubleshooting step was giving the customer all new subscriptions after purging the Subscriptions/CSVs and re-installing the odf-operator on the same stable channel. However, this behavior is still present. We had a monitoring phase during which the customer submitted memory graphs for the pod that corroborated this behavior.

Version of all relevant components (if applicable):

OCP:
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.15   True        False         23d     Cluster version is 4.14.15

ODF:
NAME                                    DISPLAY                       VERSION        REPLACES                                PHASE
mcg-operator.v4.14.5-rhodf              NooBaa Operator               4.14.5-rhodf   mcg-operator.v4.14.4-rhodf              Succeeded
ocs-operator.v4.14.5-rhodf              OpenShift Container Storage   4.14.5-rhodf   ocs-operator.v4.14.4-rhodf              Succeeded
odf-csi-addons-operator.v4.14.5-rhodf   CSI Addons                    4.14.5-rhodf   odf-csi-addons-operator.v4.14.4-rhodf   Succeeded
odf-operator.v4.14.5-rhodf              OpenShift Data Foundation     4.14.5-rhodf   odf-operator.v4.14.4-rhodf              Succeeded

Ceph:
{
    "mon": {
        "ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)": 3
    },
    "mgr": {
        "ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)": 1
    },
    "osd": {
        "ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)": 12
    },
    "mds": {
        "ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)": 2
    },
    "rgw": {
        "ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)": 1
    },
    "overall": {
        "ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)": 19
    }
}

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

No, but the behavior is concerning, and depending on what Rook is attempting to accomplish, this may affect the cluster.

Is there any workaround available to the best of your knowledge?

No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

3

Additional info:

I will attach some memory graphs submitted by the customer.
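To corroborate the customer-supplied graphs with in-cluster data, the operator pod’s memory can also be sampled directly. A minimal sketch, assuming the default openshift-storage namespace and the standard app=rook-ceph-operator label:

  # Point-in-time memory usage of the operator pod (metrics API)
  oc adm top pod -n openshift-storage -l app=rook-ceph-operator

  # Working-set memory over time, as a PromQL query against cluster monitoring:
  #   container_memory_working_set_bytes{namespace="openshift-storage", pod=~"rook-ceph-operator.*", container="rook-ceph-operator"}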
Craig, I'm looking into this. I'll provide an update soon.
Is the customer still facing this issue?
Thanks for the update, Craig. The must-gather attached to the case has an empty `previous logs` file, and the `current` logs do not provide any useful information. Any chance I can get the Rook operator logs from just before it goes into CLBO due to memory? Those logs might provide some useful information.
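If the pod is OOMKilled or evicted rather than crash-looping on its own, the previous-container logs can be cut off abruptly, which may be why the must-gather copy was empty. A minimal sketch for pulling them directly, assuming the default openshift-storage namespace (the pod name is a placeholder):

  # Logs from the previously terminated container instance
  oc logs -n openshift-storage <rook-ceph-operator-pod> --previous

  # Check the last termination reason / exit code (OOMKilled vs. evicted)
  oc describe pod -n openshift-storage <rook-ceph-operator-pod>

Note that if the pod was evicted and rescheduled under a new name, --previous will not help and the logs would have to be captured before the next eviction.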
Good Morning,

In addition to my previous comment c#11, the customer has submitted fresh rook-ceph-operator logs and conveyed the following information:

"rook-ceph-operator (rook-ceph-operator-844696dc89-rhwnf) got OOMKilled/Evicted with the message: The node was low on resource: memory. Threshold quantity: 100Mi, available: 46608Ki. Container rook-ceph-operator was using 83601708Ki, request is 0, has larger consumption of memory."

I will attach the new set of operator logs for Engineering's review.

Regards,

Craig Wayman
TSE Red Hat OpenShift Data Foundations (ODF)
Customer Experience and Engagement, NA
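One detail worth noting in the message above: the container is running with no memory request ("request is 0"), so once the node crosses its eviction threshold the kubelet treats the operator as a prime eviction candidate regardless of how much it is actually leaking. As an illustration only (the values are placeholders, and OLM/the CSV may reconcile manual deployment edits back), requests and limits could be set like this:

  # Illustrative only: placeholder values; OLM may revert direct deployment edits
  oc -n openshift-storage set resources deployment/rook-ceph-operator \
      --requests=memory=512Mi --limits=memory=2Gi

This would not fix a leak, but it would turn node-pressure evictions into container-level OOMKills with a predictable bound.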
Hi Craig,

The latest rook-ceph-operator logs do not provide much information on what might be causing this issue. Could you provide some details about the customer's cluster, such as how many nodes and how many pods are running in total? I don't think anything else would be required from the customer at this time.

There is also a similar upstream issue which indicates the problem was resolved after upgrading to Rook 1.13.8 (https://github.com/rook/rook/issues/14051#issuecomment-2089769475), and ODF 4.14 appears to be using Rook 1.13.7. Still no answer as to what might be causing this.
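For the node and pod counts, a quick way to gather them (a sketch, nothing cluster-specific assumed):

  # Total nodes and total pods across all namespaces
  oc get nodes --no-headers | wc -l
  oc get pods -A --no-headers | wc -l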
Hi,

Although the upstream issue (https://github.com/rook/rook/issues/14051) was resolved after upgrading to 1.13.8, we still don't know the root cause. We can't really suggest that the customer upgrade without an RCA, and the upstream branches may also diverge a bit from the downstream patch releases.
Craig, I'll let you know if anything else is needed. This needs more investigation, as it's not clear from the logs what is causing this memory leak. One option is to try to reproduce the behavior locally; I'll try that this week.

(Moving this out of 4.16 as we have dev freeze on 14/5 and we are still investigating this BZ. It's also not a blocker for 4.16.)
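While attempting a reproduction, the leak is easier to spot as a growth rate than as an absolute number. One option, assuming the cluster monitoring stack is available, is to watch the derivative of the operator's working set (PromQL):

  # Approximate growth rate (bytes/second) over the last hour
  deriv(container_memory_working_set_bytes{namespace="openshift-storage", pod=~"rook-ceph-operator.*", container="rook-ceph-operator"}[1h])

A steadily positive value across operator reconcile cycles would point to a genuine leak rather than a transient spike.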
Hi Craig,

What is the latest update on this? Is the customer still facing the issue? Are they using the cron job? Any idea if they are planning to upgrade?
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.