Bug 1551878
Summary: | [Ganesha] : Ganesha logs are flooded with "Futility Count Exceeded" messages. | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Ambarish <asoman>
Component: | nfs-ganesha | Assignee: | Daniel Gryniewicz <dang>
Status: | CLOSED ERRATA | QA Contact: | Manisha Saini <msaini>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | rhgs-3.4 | CC: | bharat064015, bturner, dang, ffilz, jthottan, kkeithle, mbenjamin, msaini, rhinduja, rhs-bugs, skoduri, storage-qa-internal
Target Milestone: | --- | |
Target Release: | RHGS 3.4.0 | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | nfs-ganesha-2.5.5-7 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2018-09-04 06:54:24 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1503137 | |
Description (Ambarish, 2018-03-06 04:15:59 UTC)
I suspect this is https://github.com/nfs-ganesha/nfs-ganesha/commit/5c2efa8f077fafa82023f5aec5e2c474c5ed2fdf

Is this reproducible enough to test this?

(In reply to Daniel Gryniewicz from comment #4)
> I suspect this is
> https://github.com/nfs-ganesha/nfs-ganesha/commit/5c2efa8f077fafa82023f5aec5e2c474c5ed2fdf
>
> Is this reproducible enough to test this?

I am unaware of the reproducibility ATM. I've hit it only once, during prolonged I/O (plus some failovers/failbacks).

I could consistently hit these log messages by running the POSIX compliance test suite on a v3 Ganesha mount.

Yeah, that fix is in 2.5.5, so it's not an FD leak. There are no known issues besides that one, so it's either legitimate (the workload is opening lots of files) or there's some new bug that we haven't seen before.

It may be this, by code inspection: https://review.gerrithub.io/404816 (Note: this fix is "correct", it just may not fix this issue.)

I wonder if the fix in comment #13 is actually enough. I can't, even by fiddling with settings and hard-coding some extremely low values, force Ganesha into futility. It can always keep up closing FDs. Can we get a build with that patch, and test again?

I believe the fix in comment #13 is correct and sufficient. Since it's merged, it should be backported now.

I am able to reproduce the issue by running the dbench test suite on a Ganesha v3 mount.

# rpm -qa | grep ganesha
glusterfs-ganesha-3.12.2-8.el7rhgs.x86_64
nfs-ganesha-2.5.5-5.el7rhgs.x86_64
nfs-ganesha-gluster-2.5.5-5.el7rhgs.x86_64

snippet:
============
23/04/2018 20:40:32 : epoch c0930000 : tettnang.lab.eng.blr.redhat.com : ganesha.nfsd-25229[cache_lru] lru_run :INODE LRU :CRIT :Futility count exceeded. The LRU thread is unable to make progress in reclaiming FDs. Disabling FD cache.
23/04/2018 20:40:32 : epoch c0930000 : tettnang.lab.eng.blr.redhat.com : ganesha.nfsd-25229[cache_lru] lru_run :INODE LRU :CRIT :Futility count exceeded. The LRU thread is unable to make progress in reclaiming FDs. Disabling FD cache.
23/04/2018 20:40:32 : epoch c0930000 : tettnang.lab.eng.blr.redhat.com : ganesha.nfsd-25229[cache_lru] lru_run :INODE LRU :CRIT :Futility count exceeded. The LRU thread is unable to make progress in reclaiming FDs. Disabling FD cache.
...
================
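For context on what the message means, here is a minimal sketch of the futility pattern discussed in this thread: the LRU background thread tries to reclaim FDs each pass, and after enough consecutive passes with no progress it declares futility and disables the FD cache. This is not the actual Ganesha source; the names (`lru_run_pass`, `FUTILITY_COUNT`) and the threshold value are assumptions for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define FUTILITY_COUNT 8        /* assumed threshold; illustrative only */

static int futility;            /* consecutive no-progress passes */
static bool fd_cache_enabled = true;

/* One pass of a hypothetical LRU reaper: if the open-FD count is above
 * the high-water mark and nothing could be closed, count it as a futile
 * pass; enough futile passes in a row disable the FD cache. */
static void lru_run_pass(size_t open_fd_count, size_t fds_hiwat, size_t closed)
{
    if (open_fd_count <= fds_hiwat || closed > 0) {
        futility = 0;           /* made progress; reset the counter */
        return;
    }

    if (++futility > FUTILITY_COUNT) {
        fd_cache_enabled = false;
        /* Pre-fix behaviour: this fires on *every* pass once the counter
         * trips, which is the flooding shown in the snippet above.  The
         * upstream change discussed below lowers it to a warning logged
         * only when the boundary is first crossed. */
        fprintf(stderr, "CRIT: Futility count exceeded. The LRU thread is "
                        "unable to make progress in reclaiming FDs. "
                        "Disabling FD cache.\n");
    }
}

int main(void)
{
    /* Simulate passes with the counts from the log above (open_fd_count
     * 4759 vs. fds_hiwat 3686) and no FDs reclaimable. */
    for (int i = 0; i < 12; i++)
        lru_run_pass(4759, 3686, 0);
    printf("FD cache enabled: %s\n", fd_cache_enabled ? "yes" : "no");
    return 0;
}
```

Note how the pre-fix logic logs on every pass once the counter trips, which is exactly the flooding reported here.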
Moving this BZ to assigned state.

Can you share your cache_inode settings and your dbench command line? I can't reproduce even with the worst possible cache_inode settings...

Okay, I can only get this when Entries_HWMark = 25000. If I don't set that, I cannot get any futility messages. I'm investigating whether there's any way around this, but my question is: why is this value set so low?

(In reply to Daniel Gryniewicz from comment #22)
> Okay, I can only get this when Entries_HWMark = 25000. If I don't set that,
> I cannot get any futility messages. I'm investigating whether there's any
> way around this, but my question is: why is this value set so low?

During the earlier stages of nfs-ganesha product development, there were a lot of memory-leak-related issues, and at that time setting a low value for Entries_HWMark helped decrease memory utilization. Later we found those memory leaks were not related to cache-inode entries, but we still kept that value downstream.

Okay, just for the record, this is not enough entries for a large, heavily loaded server. It will, at the very least, tank performance. If we can raise it, that would be good. This may not be fixable if that config value needs to be kept; I'm not sure yet, I'm still looking.
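For reference, a hedged sketch of the configuration and workload under discussion. The values are illustrative, not recommendations; on Ganesha 2.5 the setting lives in the CACHEINODE block (the equivalent block is named MDCACHE in later releases).

```
# ganesha.conf fragment -- illustrative only
CACHEINODE {
    # The low downstream setting discussed above; the upstream
    # default is far higher (100000 in the 2.5 series).
    Entries_HWMark = 25000;
}
```

A dbench run along these lines generates the kind of sustained metadata and FD load described in the reproduction; the mount path, client count, and duration are placeholders:

```
# 10 dbench clients for 10 minutes against an NFSv3 Ganesha mount
dbench -D /mnt/ganesha-v3 -t 600 10
```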
I've looked this over and discussed it with Frank, and we've decided that this message is no longer critical, and is poorly worded. I've submitted an upstream change to make the message more useful, lower the log level to WARN, and rate-limit it so it fires only when the futility boundary is crossed, rather than every time the thread runs after that. That change is here: https://review.gerrithub.io/#/c/ffilz/nfs-ganesha/+/410293

Hey, just suppressing these messages won't help much. As mentioned in comment #7, this issue is consistently reproducible with extended I/O runs (2+ days) on a 2.5.2 release. It's triggered when open_fd_count exceeds the allowed threshold. https://sourceforge.net/p/nfs-ganesha/mailman/message/36174489/

I even tried with this patch set: https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/391267

mdcache_lru_fds_available :INODE LRU :INFO :FDs above high water mark, waking LRU thread. open_fd_count: 4759, fds_hiwat: 3686
lru_run :INODE LRU :CRIT :Futility count exceeded. The LRU thread is unable to make progress in reclaiming FDs. Disabling FD cache.

I am running an NFSv3 mount, NFS-Ganesha release V2.5.2.

But that's the point: the message is misleading and no longer useful, so suppressing it is the fix, not a workaround. Ganesha is *designed* to handle being over the high-water mark, and it will recover eventually, but spamming this message is, in and of itself, a problem: it causes admins to think something is gravely wrong, and it fills up logging devices. The futility system has undergone radical changes over the last few major versions of Ganesha, and this message is no longer useful.

OK, so we raise the futility message only once. What about this one?

mdcache_lru_fds_available :INODE LRU :INFO :FDs above high water mark, waking LRU thread.

Already fixed by this: https://github.com/nfs-ganesha/nfs-ganesha/commit/500c754df3a46d77f88036e922424e49af1f9b65

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2610