Bug 2262323 - [IBM Support] Both MDS in consistent CLBO due to OOM: !normal: cannot ephemeral random pin [inode seen [NEEDINFO]
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.12
Hardware: All
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Venky Shankar
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-02-01 20:04 UTC by Mike Hackett
Modified: 2024-09-12 16:54 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
vshankar: needinfo? (assingh)




Links:
Ceph Project Bug Tracker 64348 (last updated 2024-02-08 04:54:55 UTC)

Description Mike Hackett 2024-02-01 20:04:51 UTC
Description of problem (please be as detailed as possible and provide log snippets):
Both the active and primary MDS consistently enter CLBO (CrashLoopBackOff) due to OOM. We increased the default memory from 8GiB to 16GiB, but this did not resolve the issue. Memory consumption on the nodes is around 45% with 16GiB set for the MDSs.
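
For reference, on ODF an MDS memory increase of this kind is typically made along these lines; a minimal sketch, assuming the default StorageCluster name ocs-storagecluster in the openshift-storage namespace (names and values are illustrative, not the exact commands run on this cluster):

# Raise the MDS pod memory request/limit via the StorageCluster CR
oc -n openshift-storage patch storagecluster ocs-storagecluster --type merge \
  --patch '{"spec":{"resources":{"mds":{"limits":{"memory":"16Gi"},"requests":{"memory":"16Gi"}}}}}'

# Optionally cap the MDS cache itself (value in bytes; 8 GiB shown here)
ceph config set mds mds_cache_memory_limit 8589934592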

We set debug logging on the primary and secondary MDS to catch any issues during boot before the crash. During boot of the MDS after the OOM, we can see the following messages in the standby MDS:

!normal: cannot ephemeral random pin [inode 0x1000136d2ca [2,head] /volumes/csi/csi-vol-fe6bdadf-2e7a-4684-b340-f7779c93c974/f07968b8-5ab2-4033-8cfe-10326680dbb3/content_repository/361/1706680874886-7132521 auth v30188 s=0 n(v0 1=1+0) (iversion lock) 0x563159fff080]
2024-02-01T18:11:34.288+0000 7f4baf916700 20 mds.0.cache.dir(0x100000091ae) _fetched pos 6794 marker 'i' dname '1706680874282-7131497 [2,head]
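
For reference, MDS debug logging of this kind is typically raised with the standard ceph config commands; a sketch, the exact subsystems and levels used here may differ:

# Verbose MDS and messenger logging (very noisy; only for the capture window)
ceph config set mds debug_mds 20
ceph config set mds debug_ms 1

# Remove the overrides and restore defaults once logs are collected
ceph config rm mds debug_mds
ceph config rm mds debug_ms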

Working with Prashant Dhange, we reviewed mds_oft_prefetch_dirfrags, which was set to the default of false.
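
For reference, the setting can be confirmed with commands along these lines; a sketch, where <daemon-name> is a placeholder:

# Value stored in the cluster configuration database
ceph config get mds mds_oft_prefetch_dirfrags
# Value actually in effect on a running MDS daemon
ceph config show mds.<daemon-name> mds_oft_prefetch_dirfrags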

We have gathered the latest must-gather and debug logs from the primary and standby MDS for review.

We need assistance determining the cause of the OOM/memory consumption and a possible resolution.
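
For reference, MDS memory usage can typically be broken down with commands along these lines; a sketch, where <daemon-name> is a placeholder and the heap command assumes the daemon is built against tcmalloc:

# Allocator-level heap statistics
ceph tell mds.<daemon-name> heap stats

# Per-subsystem memory pools and cache state via the MDS admin socket
# (run from the node/pod hosting the MDS)
ceph daemon mds.<daemon-name> dump_mempools
ceph daemon mds.<daemon-name> cache status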

Version of all relevant components (if applicable):
ODF 4.12.2
RH Ceph 5.1z1

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes. Both MDS are down, impacting customer applications consuming file-based PVCs.

Is there any workaround available to the best of your knowledge?
No, we cannot get the MDS to stay up.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

4

Is this issue reproducible?
In the customer environment, yes.

Can this issue be reproduced from the UI?
No

If this is a regression, please provide more details to justify this:
Unaware of any regression.

Steps to Reproduce:
Issue hit in the customer's cluster.


Actual results:
MDS goes into OOM

Expected results:
MDS should not OOM

Additional info:

IBM Case number: TS015350570

Comment 25 Venky Shankar 2024-02-08 04:54:56 UTC
@all, tracker https://tracker.ceph.com/issues/64348 has been opened to investigate a possible memory leak in up:rejoin.

Will have someone work on it on priority. We should also be prepared with which debug logs to gather the next time a customer runs into this.
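
For reference, candidate data to capture on the next occurrence; a sketch, assuming the MDS is built against tcmalloc and <daemon-name> is a placeholder for the actual daemon id:

# Heap profiling around the up:rejoin phase
ceph tell mds.<daemon-name> heap start_profiler
ceph tell mds.<daemon-name> heap dump
ceph tell mds.<daemon-name> heap stats
ceph tell mds.<daemon-name> heap stop_profiler

# Verbose MDS logging through the rejoin window
ceph config set mds debug_mds 20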

