Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
This project is now read-only. Starting Monday, February 2, please use https://ibm-ceph.atlassian.net/ for all bug tracking.

Bug 1798284

Summary: clients failing to respond to cache pressure
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: CephFS
Version: 3.3
Hardware: All
OS: All
Priority: high
Severity: medium
Status: CLOSED ERRATA
Reporter: Manjunatha <mmanjuna>
Assignee: Jeff Layton <jlayton>
QA Contact: Yogesh Mane <ymane>
CC: ceph-eng-bugs, dang, dfuller, ffilz, gsitlani, hyelloji, jlayton, kkeithle, pdonnell, sweil, tserlin, vereddy
Target Milestone: ---
Target Release: 5.0
Whiteboard: NeedsCherrypick
Fixed In Version: ceph-15.2.4
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2021-08-30 08:23:40 UTC
Bug Depends On: 1849725

Comment 45 Daniel Gryniewicz 2020-04-09 13:10:28 UTC
The fact that L2 is empty generally means that reaping is happening.  In addition, if L2 is empty, reaping will try L1 as well, so it does its best to reap unused entries whenever the cache is over HWMark.  (But note, it will never reap entries while it's under HWMark, so you will in general have ~HWMark entries in a stable, busy system.)  Reaping tries the LRU end of each lane; if it finds nothing reapable there, it stops and a new entry is allocated instead.

Reaping is separate from the L2 vs. L1 distinction.  The levels of the LRU are intended to manage open global FDs (FDs used by NFSv3 or by anonymous access in NFSv4).  Entries in L1 may have an open global FD; entries in L2 do not.  LRU_Run_Interval determines how often we scan L1 to demote entries to L2 and close their global FDs.  The interval is variable: it becomes shorter when the number of open global FDs is above FD_LWMark, and much shorter still when it's above FD_HWMark.
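To make the adaptive interval concrete, here is a minimal sketch (not Ganesha's actual code; the function name, divisors, and watermark values are all illustrative) of how the delay before the next L1-to-L2 demotion scan might shrink as open global FDs climb past the watermarks:

```c
#include <assert.h>

/* Illustrative base interval; cf. LRU_Run_Interval (90s by default). */
#define BASE_INTERVAL 90

/* Pick the delay (seconds) before the next L1 -> L2 demotion scan,
 * based on how many global FDs are currently open.  The real scheduler
 * is more involved; this only models "shorter above FD_LWMark, much
 * shorter above FD_HWMark". */
static int next_run_interval(int open_fds, int fd_lwmark, int fd_hwmark)
{
    if (open_fds > fd_hwmark)
        return BASE_INTERVAL / 10;  /* much more aggressive scanning */
    if (open_fds > fd_lwmark)
        return BASE_INTERVAL / 3;   /* moderately more aggressive */
    return BASE_INTERVAL;           /* relaxed: scan every 90s */
}
```

Under this model a system sitting below FD_LWMark only demotes entries (and closes their global FDs) once per 90 seconds, which is relevant to the question in the next comment.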

All that said, entries can be reaped (and reused) if they're at the tail of any lane at either L2 or L1, and their refcount is only 1.
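The reaping rule described above can be sketched as follows. This is a simplified model, not nfs-ganesha's implementation; the structure and function names are hypothetical, and only the two rules from this comment are modelled: try the LRU end of each lane (L2 lanes before L1 lanes), and reap an entry only when its refcount is exactly 1:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified per-lane LRU queue. */
struct entry {
    int refcount;        /* 1 == only the LRU's own reference */
    struct entry *next;  /* toward the MRU end of the lane */
};

struct lane {
    struct entry *lru;   /* tail: least-recently-used entry */
};

/* Try to reap the entry at the LRU end of one lane.  An entry is
 * reusable only when its refcount is exactly 1; otherwise give up on
 * this lane immediately. */
static struct entry *try_reap_lane(struct lane *ln)
{
    struct entry *e = ln->lru;
    if (e == NULL || e->refcount != 1)
        return NULL;          /* nothing reapable here */
    ln->lru = e->next;        /* unlink from the lane */
    e->next = NULL;
    return e;
}

/* Scan L2 lanes first, then L1 lanes, taking the first reapable entry.
 * If nothing is found, the caller allocates a fresh entry instead. */
static struct entry *try_reap(struct lane *l2, struct lane *l1, size_t nlanes)
{
    for (size_t i = 0; i < nlanes; i++) {
        struct entry *e = try_reap_lane(&l2[i]);
        if (e != NULL)
            return e;
    }
    for (size_t i = 0; i < nlanes; i++) {
        struct entry *e = try_reap_lane(&l1[i]);
        if (e != NULL)
            return e;
    }
    return NULL;
}
```

Note that an entry pinned by an extra reference (refcount > 1) at a lane's tail blocks reaping from that lane entirely, which is why a refcount leak (see comment 48) would keep the cache from shrinking.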

Comment 46 Jeff Layton 2020-04-09 13:15:29 UTC
The LRU_Run_Interval is 90s, which seems like a _very_ long time -- long enough to really go way above the cache limits. What exactly does reaper_work_per_lane denote? If we're only trimming max 50 entries per lane and only every 90s, it seems plausible that we could go way over the hwmark.

Comment 47 Daniel Gryniewicz 2020-04-09 14:03:08 UTC
LRU_Run_Interval has nothing at all to do with reclaiming (reaping) entries.  It will not remove or free entries at all.  All it does is demote entries from L1 to L2, closing their global FD in the process.  It's about global FD management, not about entry management.

Entries are reaped via lru_try_reap_entry(), which is done on demand, every time a new entry is required.  Otherwise, entries are freed when they become invalid.
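The on-demand shape of this path can be sketched as below. This is an assumption-laden toy, not Ganesha's code: `lru_try_reap` and `get_new_entry` are hypothetical stand-ins for `lru_try_reap_entry()` and its caller, and the single-slot pool stands in for the per-lane LRU tails. The point it models is that there is no background thread freeing entries; reaping happens only when a new entry is requested:

```c
#include <assert.h>
#include <stdlib.h>

struct entry { int refcount; };

static struct entry *reap_pool = NULL;   /* stand-in for the LRU tails */

/* Hypothetical stand-in for lru_try_reap_entry(): hand back a reusable
 * old entry if one is available, otherwise NULL. */
static struct entry *lru_try_reap(void)
{
    struct entry *e = reap_pool;
    reap_pool = NULL;            /* at most one reusable entry here */
    return e;
}

/* Every request for a new cache entry first tries to reap an old one;
 * only if that fails is fresh memory allocated. */
static struct entry *get_new_entry(void)
{
    struct entry *e = lru_try_reap();    /* on-demand reap first */
    if (e == NULL) {
        e = malloc(sizeof(*e));          /* fall back to allocation */
        if (e == NULL)
            return NULL;
    }
    e->refcount = 1;
    return e;
}
```

Because reaping is driven purely by allocation demand, the cache's size is bounded by how reapable the tails are, not by any timer.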

Comment 48 Jeff Layton 2020-04-09 14:58:50 UTC
Ok, so that would seem to rule out my hypothesis that this is being caused by ganesha's inability to reap old entries fast enough to keep up with the ones being added. So, we're probably left with two possibilities:

1) there is a refcount leak in ganesha that's causing entries to remain pinned in the cache

2) the working set of open files is just _that_ large.

Manjunatha, can you ask them about their workload here? Approximately how many files would they have open at a given time?

Comment 67 errata-xmlrpc 2021-08-30 08:23:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 5.0 bug fix and enhancement), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3294