Bug 1551878
Summary: | [Ganesha] : Ganesha logs are flooded with "Futility Count Exceeded" messages. | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Ambarish <asoman>
Component: | nfs-ganesha | Assignee: | Daniel Gryniewicz <dang>
Status: | CLOSED ERRATA | QA Contact: | Manisha Saini <msaini>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | rhgs-3.4 | CC: | bharat064015, bturner, dang, ffilz, jthottan, kkeithle, mbenjamin, msaini, rhinduja, rhs-bugs, skoduri, storage-qa-internal
Target Milestone: | --- | |
Target Release: | RHGS 3.4.0 | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | nfs-ganesha-2.5.5-7 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2018-09-04 06:54:24 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1503137 | |
Description (Ambarish, 2018-03-06 04:15:59 UTC)
I suspect this is https://github.com/nfs-ganesha/nfs-ganesha/commit/5c2efa8f077fafa82023f5aec5e2c474c5ed2fdf

Is this reproducible enough to test this?

(In reply to Daniel Gryniewicz from comment #4)
> I suspect this is
> https://github.com/nfs-ganesha/nfs-ganesha/commit/5c2efa8f077fafa82023f5aec5e2c474c5ed2fdf
>
> Is this reproducible enough to test this?

I am unaware of the reproducibility ATM. I've hit it only once, during prolonged I/O (plus some failovers/failbacks).

I could consistently hit these log messages by running the POSIX compliance test suite on a v3 Ganesha mount.

Yeah, that fix is in 2.5.5, so it's not an FD leak. There are no known issues besides that one, so it's either legitimate (the workload is opening lots of files) or there's some new bug that we haven't seen before.

It may be this, by code inspection: https://review.gerrithub.io/404816 (Note: this fix is "correct", it just may not fix this issue.)

I wonder if the fix in comment #13 is actually enough. I can't, even by fiddling with settings and hard-coding some extremely low values, force Ganesha into futility. It can always keep up closing FDs. Can we get a build with that patch, and test again?

I believe the fix in comment #13 is correct and sufficient. Since it's merged, it should be backported now.

I am able to reproduce the issue by running the dbench test suite on a Ganesha v3 mount.

# rpm -qa | grep ganesha
glusterfs-ganesha-3.12.2-8.el7rhgs.x86_64
nfs-ganesha-2.5.5-5.el7rhgs.x86_64
nfs-ganesha-gluster-2.5.5-5.el7rhgs.x86_64

snippet:
============
23/04/2018 20:40:32 : epoch c0930000 : tettnang.lab.eng.blr.redhat.com : ganesha.nfsd-25229[cache_lru] lru_run :INODE LRU :CRIT :Futility count exceeded. The LRU thread is unable to make progress in reclaiming FDs. Disabling FD cache.
23/04/2018 20:40:32 : epoch c0930000 : tettnang.lab.eng.blr.redhat.com : ganesha.nfsd-25229[cache_lru] lru_run :INODE LRU :CRIT :Futility count exceeded. The LRU thread is unable to make progress in reclaiming FDs. Disabling FD cache.
23/04/2018 20:40:32 : epoch c0930000 : tettnang.lab.eng.blr.redhat.com : ganesha.nfsd-25229[cache_lru] lru_run :INODE LRU :CRIT :Futility count exceeded. The LRU thread is unable to make progress in reclaiming FDs. Disabling FD cache.
...
================
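For context on what the message means, here is a minimal sketch of the futility pattern discussed in this thread: the LRU background thread tries to reclaim FDs each pass, and after enough consecutive passes with no progress it declares futility and disables the FD cache. This is not the actual Ganesha source; the names (`lru_run_pass`, `FUTILITY_COUNT`) and the threshold value are assumptions for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define FUTILITY_COUNT 8        /* assumed threshold; illustrative only */

static int futility;            /* consecutive no-progress passes */
static bool fd_cache_enabled = true;

/* One pass of a hypothetical LRU reaper: if the open-FD count is above
 * the high-water mark and nothing could be closed, count it as a futile
 * pass; enough futile passes in a row disable the FD cache. */
static void lru_run_pass(size_t open_fd_count, size_t fds_hiwat, size_t closed)
{
    if (open_fd_count <= fds_hiwat || closed > 0) {
        futility = 0;           /* made progress; reset the counter */
        return;
    }

    if (++futility > FUTILITY_COUNT) {
        fd_cache_enabled = false;
        /* Pre-fix behaviour: this fires on *every* pass once the counter
         * trips, which is the flooding shown in the snippet above.  The
         * upstream change discussed below lowers it to a warning logged
         * only when the boundary is first crossed. */
        fprintf(stderr, "CRIT: Futility count exceeded. The LRU thread is "
                        "unable to make progress in reclaiming FDs. "
                        "Disabling FD cache.\n");
    }
}

int main(void)
{
    /* Simulate passes with the counts from the log above (open_fd_count
     * 4759 vs. fds_hiwat 3686) and no FDs reclaimable. */
    for (int i = 0; i < 12; i++)
        lru_run_pass(4759, 3686, 0);
    printf("FD cache enabled: %s\n", fd_cache_enabled ? "yes" : "no");
    return 0;
}
```

Note how the pre-fix logic logs on every pass once the counter trips, which is exactly the flooding reported here.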
Moving this BZ to assigned state.

Can you share your cache_inode settings and your dbench command line? I can't reproduce even with the worst possible cache_inode settings...

Okay, I can only get this when Entries_HWMark = 25000. If I don't set that, I cannot get any futility messages. I'm investigating whether there's any way around this, but my question is: why is this value set so low?

(In reply to Daniel Gryniewicz from comment #22)
> Okay, I can only get this when Entries_HWMark = 25000. If I don't set that,
> I cannot get any futility messages. I'm investigating whether there's any
> way around this, but my question is: why is this value set so low?

During the earlier stages of nfs-ganesha product development, there were a lot of memory-leak-related issues, and at that time setting a low value for Entries_HWMark helped decrease memory utilization. Later we found those memory leaks were not related to cache-inode entries, but we still kept that value downstream.

Okay, just for the record, this is not enough entries for a large, heavily loaded server. It will, at the very least, tank performance. If we can raise it, that would be good. This may not be fixable if that config value needs to be kept; I'm not sure yet, I'm still looking.
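For reference, a hedged sketch of the configuration and workload under discussion. The values are illustrative, not recommendations; on Ganesha 2.5 the setting lives in the CACHEINODE block (the equivalent block is named MDCACHE in later releases).

```
# ganesha.conf fragment -- illustrative only
CACHEINODE {
    # The low downstream setting discussed above; the upstream
    # default is far higher (100000 in the 2.5 series).
    Entries_HWMark = 25000;
}
```

A dbench run along these lines generates the kind of sustained metadata and FD load described in the reproduction; the mount path, client count, and duration are placeholders:

```
# 10 dbench clients for 10 minutes against an NFSv3 Ganesha mount
dbench -D /mnt/ganesha-v3 -t 600 10
```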
I've looked this over and discussed it with Frank, and we've decided that this message is no longer critical, and is poorly worded. I've submitted an upstream change to make the message more useful, lower the log level to WARN, and rate-limit it so it fires only when the futility boundary is crossed, rather than every time the thread runs after that. That change is here: https://review.gerrithub.io/#/c/ffilz/nfs-ganesha/+/410293

Hey, just suppressing these messages won't help much. As mentioned in comment #7, this issue is consistently reproducible with extended I/O runs (2+ days) on a 2.5.2 release. It's triggered when open_fd_count exceeds the allowed threshold. https://sourceforge.net/p/nfs-ganesha/mailman/message/36174489/

I even tried with this patch set: https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/391267

mdcache_lru_fds_available :INODE LRU :INFO :FDs above high water mark, waking LRU thread. open_fd_count: 4759, fds_hiwat: 3686
lru_run :INODE LRU :CRIT :Futility count exceeded. The LRU thread is unable to make progress in reclaiming FDs. Disabling FD cache.

I am running an NFSv3 mount, NFS-Ganesha release V2.5.2.

But that's the point: the message is misleading and no longer useful, so suppressing it is the fix, not a workaround. Ganesha is *designed* to handle being over the high-water mark, and it will recover eventually, but spamming this message is, in and of itself, a problem: it causes admins to think something is gravely wrong, and it fills up logging devices. The futility system has undergone radical changes over the last few major versions of Ganesha, and this message is no longer useful.

OK, so we raise the futility message only once. What about this one?

mdcache_lru_fds_available :INODE LRU :INFO :FDs above high water mark, waking LRU thread.

Already fixed by this: https://github.com/nfs-ganesha/nfs-ganesha/commit/500c754df3a46d77f88036e922424e49af1f9b65

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2610