Description of problem:
While executing iozone against nfs-ganesha, I saw the nfs-ganesha process getting killed on each node, one after the other. On investigation I found that the nfs-ganesha process was OOM-killed:

ganesha.nfsd invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
ganesha.nfsd cpuset=/ mems_allowed=0
Pid: 19476, comm: ganesha.nfsd Not tainted 2.6.32-570.el6.x86_64 #1
Call Trace:
[<ffffffff810d6dd1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
[<ffffffff8112a5d0>] ? dump_header+0x90/0x1b0
[<ffffffff810d7e71>] ? cpuset_mems_allowed_intersects+0x21/0x30
[<ffffffff8112aa52>] ? oom_kill_process+0x82/0x2a0
[<ffffffff8112a991>] ? select_bad_process+0xe1/0x120
[<ffffffff8112ae90>] ? out_of_memory+0x220/0x3c0
[<ffffffff8113786c>] ? __alloc_pages_nodemask+0x93c/0x950
[<ffffffffa00671b4>] ? do_get_write_access+0x3b4/0x520 [jbd2]
[<ffffffff8117035a>] ? alloc_pages_vma+0x9a/0x150
[<ffffffff81152b1d>] ? handle_pte_fault+0x73d/0xb20
[<ffffffff8105e173>] ? __wake_up+0x53/0x70
[<ffffffff81153199>] ? handle_mm_fault+0x299/0x3d0
[<ffffffffa00a5e48>] ? __ext4_journal_stop+0x68/0xa0 [ext4]
[<ffffffff8104f156>] ? __do_page_fault+0x146/0x500
[<ffffffff81127ffa>] ? generic_file_buffered_write+0x1da/0x2e0
[<ffffffff8107e9b7>] ? current_fs_time+0x27/0x30
[<ffffffff8153e76e>] ? do_page_fault+0x3e/0xa0
[<ffffffff8153bb15>] ? page_fault+0x25/0x30
[<ffffffff81126b44>] ? file_read_actor+0x44/0x180
[<ffffffff81238cc1>] ? avc_has_perm+0x71/0x90
[<ffffffff81129056>] ? generic_file_aio_read+0x2d6/0x700
[<ffffffff8119191a>] ? do_sync_read+0xfa/0x140
[<ffffffff810a14b0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff8123e72b>] ? selinux_file_permission+0xfb/0x150
[<ffffffff81231496>] ? security_file_permission+0x16/0x20
[<ffffffff81192215>] ? vfs_read+0xb5/0x1a0
[<ffffffff81192fef>] ? fget_light_pos+0x3f/0x50
[<ffffffff81192561>] ? sys_read+0x51/0xb0
[<ffffffff810e88ae>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b

Version-Release number of selected component (if applicable):
glusterfs-3.7.1-7.el6rhs.x86_64
nfs-ganesha-2.2.0-3.el6rhs.x86_64

How reproducible:
Happened on all nodes.

Steps to Reproduce:
1. Post configuring nfs-ganesha, execute iozone over a vers=4 mount.

Actual results:

[root@nfs11 ~]# gluster volume info vol2

Volume Name: vol2
Type: Distributed-Replicate
Volume ID: 30ab7484-1480-46d5-8f83-4ab27199640d
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.46.8:/rhs/brick1/d1r12
Brick2: 10.70.46.27:/rhs/brick1/d1r22
Brick3: 10.70.46.25:/rhs/brick1/d2r12
Brick4: 10.70.46.29:/rhs/brick1/d2r22
Brick5: 10.70.46.8:/rhs/brick1/d3r12
Brick6: 10.70.46.27:/rhs/brick1/d3r22
Brick7: 10.70.46.25:/rhs/brick1/d4r12
Brick8: 10.70.46.29:/rhs/brick1/d4r22
Brick9: 10.70.46.8:/rhs/brick1/d5r12
Brick10: 10.70.46.27:/rhs/brick1/d5r22
Brick11: 10.70.46.25:/rhs/brick1/d6r12
Brick12: 10.70.46.29:/rhs/brick1/d6r22
Options Reconfigured:
performance.readdir-ahead: on
features.cache-invalidation: on
ganesha.enable: on
nfs.disable: on
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
nfs-ganesha: enable

[root@nfs11 ~]# gluster volume status vol2
Status of volume: vol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.8:/rhs/brick1/d1r12          49156     0          Y       2665
Brick 10.70.46.27:/rhs/brick1/d1r22         49156     0          Y       20958
Brick 10.70.46.25:/rhs/brick1/d2r12         49156     0          Y       17883
Brick 10.70.46.29:/rhs/brick1/d2r22         49155     0          Y       20935
Brick 10.70.46.8:/rhs/brick1/d3r12          49157     0          Y       2684
Brick 10.70.46.27:/rhs/brick1/d3r22         49157     0          Y       20977
Brick 10.70.46.25:/rhs/brick1/d4r12         49157     0          Y       17902
Brick 10.70.46.29:/rhs/brick1/d4r22         49156     0          Y       20954
Brick 10.70.46.8:/rhs/brick1/d5r12          49158     0          Y       2703
Brick 10.70.46.27:/rhs/brick1/d5r22         49158     0          Y       20996
Brick 10.70.46.25:/rhs/brick1/d6r12         49158     0          Y       17921
Brick 10.70.46.29:/rhs/brick1/d6r22         49157     0          Y       20973
Self-heal Daemon on localhost               N/A       N/A        Y       2341
Quota Daemon on localhost                   N/A       N/A        Y       2352
Self-heal Daemon on 10.70.46.25             N/A       N/A        Y       7479
Quota Daemon on 10.70.46.25                 N/A       N/A        Y       7484
Self-heal Daemon on 10.70.46.29             N/A       N/A        Y       9905
Quota Daemon on 10.70.46.29                 N/A       N/A        N       N/A
Self-heal Daemon on 10.70.46.22             N/A       N/A        Y       7660
Quota Daemon on 10.70.46.22                 N/A       N/A        Y       7668
Self-heal Daemon on 10.70.46.27             N/A       N/A        Y       10010
Quota Daemon on 10.70.46.27                 N/A       N/A        N       N/A

Task Status of Volume vol2
------------------------------------------------------------------------------
There are no active volume tasks

ganesha.log:

05/07/2015 16:55:46 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
05/07/2015 16:56:46 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE
05/07/2015 20:04:49 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] nfs4_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 60
05/07/2015 20:04:49 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] nfs4_start_grace :STATE :EVENT :NFS Server recovery event 5 nodeid -1 ip nfs14
05/07/2015 20:04:49 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] nfs4_load_recov_clids_nolock :CLIENT ID :EVENT :Recovery for nodeid -1 dir (/var/lib/nfs/ganesha/nfs14/v4recov)
05/07/2015 20:04:56 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
05/07/2015 20:05:56 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE
05/07/2015 21:07:28 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] nfs4_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 60
05/07/2015 21:07:28 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] nfs4_start_grace :STATE :EVENT :NFS Server recovery event 5 nodeid -1 ip nfs13
05/07/2015 21:07:28 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] nfs4_load_recov_clids_nolock :CLIENT ID :EVENT :Recovery for nodeid -1 dir (/var/lib/nfs/ganesha/nfs13/v4recov)
05/07/2015 21:07:36 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
05/07/2015 21:08:36 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE
05/07/2015 22:06:11 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] nfs4_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 60
05/07/2015 22:06:11 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] nfs4_start_grace :STATE :EVENT :NFS Server recovery event 5 nodeid -1 ip nfs12
05/07/2015 22:06:11 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] nfs4_load_recov_clids_nolock :CLIENT ID :EVENT :Recovery for nodeid -1 dir (/var/lib/nfs/ganesha/nfs12/v4recov)
05/07/2015 22:06:14 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[work-10] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
05/07/2015 22:06:23 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy. Not sending heartbeat
05/07/2015 22:06:24 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy. Not sending heartbeat
05/07/2015 22:06:25 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy. Not sending heartbeat
05/07/2015 22:06:26 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy. Not sending heartbeat
05/07/2015 22:06:27 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy. Not sending heartbeat
05/07/2015 22:06:47 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy. Not sending heartbeat
05/07/2015 22:06:54 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy. Not sending heartbeat
05/07/2015 22:06:58 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy. Not sending heartbeat
05/07/2015 22:07:17 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE

Expected results:
No OOM kill to be seen for the nfs-ganesha process.

Additional info:
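For anyone triaging on the nodes, an OOM kill of ganesha.nfsd can be confirmed with standard checks along these lines (a sketch; run on any affected server):

# Kernel log entries left by the OOM killer
dmesg | grep -iE 'oom|killed process'
grep -i 'killed process' /var/log/messages
# Current OOM badness score of the running ganesha process
cat /proc/$(pgrep -x ganesha.nfsd)/oom_score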
Created attachment 1048675 [details] sosreport of nfs11
Created attachment 1048679 [details] nfs11 ganesha.log
Created attachment 1048680 [details] dmesg log from nfs11
Created attachment 1048681 [details] nfs11 ganesha-gfapi.log
As per the discussion with Saurabh:
* iozone tests ran successfully on earlier builds/setups.
* The difference with this setup is that the quota feature is ON.
* In addition, it seems this was not hit with just a plain mount. A failover was triggered by manually killing the nfs-ganesha service around 15 minutes after the test started. Some time after that, nfs-ganesha processes were auto-killed due to OOM on all the other nodes as well.

We would like to run the test in the following scenarios and compare the results (a rough sketch of the scenario toggles follows below):
* plain nfs-ganesha mount
* with the quota feature on
* with a failover test in parallel
* with the quota feature on and a failover test in parallel
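As noted above, the scenario toggles amount to something like the following (volume name taken from this bug; the kill mirrors the manual failover described above):

# Toggle quota on the volume between runs
gluster volume quota vol2 enable
gluster volume quota vol2 disable
# To run the failover test in parallel, kill ganesha on one node mid-test
pkill -9 ganesha.nfsd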
I tried to reproduce the OOM, but in this attempt at a similar test I could not see the issue.
Please include details about the exact iozone command that was run, and how many clients were used.
I used "iozone -a" from one client only.
We have requested QE to re-test and reproduce the issue. There are also a few gfapi memory-leak issues being worked on upstream as part of bug 1295107. If we can reproduce this issue, we can apply those patches and retest.
The fixes merged upstream that can be taken for this bug are:
http://review.gluster.org/13125
http://review.gluster.org/13096
http://review.gluster.org/#/c/13232/
Here are a few of my findings/experiments done to determine the cache_inode limit.

From gdb and the code, the total size occupied by each inode (including context) in a distributed-replicate volume looks to be:

(gdb) p sizeof(struct cache_entry_t) + sizeof(struct glusterfs_handle) + sizeof(struct glfs_object) + sizeof(afr_inode_ctx_t) + sizeof(dht_inode_ctx_t) + sizeof(ioc_inode_t) + sizeof(struct md_cache) + sizeof(struct ios_stat) + sizeof(uint64_t) + sizeof(wb_inode_t) + sizeof(qr_inode_t)
$18 = 2296

i.e. around ~2K bytes. But when I monitored the process memory size while data was being written to newly created files, I saw varied results.

With nothing exported:

[root@dhcp43-58 ~]# showmount -e
Export list for dhcp43-58.lab.eng.blr.redhat.com:
[root@dhcp43-58 ~]# ps aux | grep ganesha
root     19044  0.2  0.4 1360752   8240 ?      Ssl  15:03   0:00 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT
root     19172  0.0  0.1  117000   2144 pts/0  S+   15:03   0:00 grep --color=auto ganesha
[root@dhcp43-58 ~]# cat /proc/19044/oom_score
2

Exporting a volume 'testvol':

[root@dhcp43-58 ~]# showmount -e
Export list for dhcp43-58.lab.eng.blr.redhat.com:
/testvol (everyone)
[root@dhcp43-58 ~]# cat /proc/19044/oom_score
14
[root@dhcp43-58 ~]# ps aux | grep ganesha
root     19044  0.3  3.0 1636920  61904 ?      Ssl  15:03   0:00 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT
root     19501  0.0  0.1  117000   2196 pts/0  S+   15:04   0:00 grep --color=auto ganesha

i.e. the RSS value increased by ~54M and the oom_score by 12.

Exporting another volume:

[root@dhcp43-58 ~]# showmount -e
Export list for dhcp43-58.lab.eng.blr.redhat.com:
/testvol (everyone)
/test (everyone)
[root@dhcp43-58 ~]# cat /proc/19044/oom_score
26
[root@dhcp43-58 ~]# ps aux | grep ganesha
root     19044  0.2  5.3 1750168 110440 ?      Ssl  15:03   0:00 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT
root     20347  0.0  0.1  117000   2244 pts/0  S+   15:06   0:00 grep --color=auto ganesha

>>> the RSS value increased by ~49M and the oom_score by 12.

And a third:

[root@dhcp43-58 ~]# showmount -e
Export list for dhcp43-58.lab.eng.blr.redhat.com:
/testvol (everyone)
/test (everyone)
/new1 (everyone)
[root@dhcp43-58 ~]# cat /proc/19044/oom_score
37
[root@dhcp43-58 ~]# ps aux | grep ganesha
root     19044  0.3  7.7 1863416 158888 ?      Ssl  15:03   0:01 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT
root     21308  0.0  0.1  117004   2228 pts/0  S+   15:08   0:00 grep --color=auto ganesha

>>> the RSS value increased by ~49M and the oom_score by 11.

So, leaving out the first volume, each volume export increased memory by ~49M on average.

[root@skoduri ~]# mount -t nfs 10.70.43.58:/testvol /mnt
[root@dhcp43-58 ~]# ps aux | grep ganesha
root     19044  0.3  7.7 1863416 158904 ?      Ssl  15:03   0:01 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT
root     22175  0.0  0.1  117004   2204 pts/0  S+   15:10   0:00 grep --color=auto ganesha

After writing into 100 files and reading the data back:

[root@skoduri mnt]# for i in `seq 1 100`; do echo 3 > /proc/sys/vm/drop_caches; dd if=/dev/zero of=/mnt/file$i bs=100K count=1 conv=sync & done
[root@skoduri mnt]# cat *
[root@dhcp43-58 ~]# ps aux | grep ganesha
root     19044  0.5  7.9 1863416 161756 ?      Ssl  15:03   0:03 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT
root     23486  0.0  0.1  117000   2096 pts/0  S+   15:13   0:00 grep --color=auto ganesha
[root@dhcp43-58 ~]# cat /proc/19044/oom_score
38

So on average each file increased memory by (161756-158904)/100 = ~28K. But creating a further 100 new files showed different values:

[root@dhcp43-58 ~]# ps aux | grep ganesha
root     19044  0.5  7.9 1863416 161768 ?      Ssl  15:03   0:04 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT
root     24589  0.0  0.1  117000   2132 pts/0  S+   15:16   0:00 grep --color=auto ganesha
[root@skoduri mnt]# for i in `seq 101 200`; do echo 3 > /proc/sys/vm/drop_caches; dd if=/dev/zero of=/mnt/file$i bs=100K count=1 conv=sync & done
[root@skoduri mnt]# cat *
[root@dhcp43-58 ~]# ps aux | grep ganesha
root     19044  0.6  8.2 1867512 169272 ?      Ssl  15:03   0:06 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT
root     25505  0.0  0.1  117000   2192 pts/0  R+   15:18   0:00 grep --color=auto ganesha

That is an average increase of (169272-161768)/100 = ~75K per file.

I have requested Manoj (mpillai) from the perf team to run longevity tests and monitor the nfs-ganesha process's memory usage. We would like to determine:
a) an optimal value for the cache_inode limit
b) whether there are any other major leaks we have missed.

From https://bugzilla.redhat.com/show_bug.cgi?id=1282669#c4:

Nov 13 14:53:26 sv-2000lvp88 kernel: Out of memory: Kill process 28790 (ganesha.nfsd) score 855 or sacrifice child
Nov 13 14:53:26 sv-2000lvp88 kernel: Killed process 28790 (ganesha.nfsd) total-vm:19276892kB, anon-rss:15252240kB, file-rss:0kB
Nov 15 02:35:27 sv-2000lvp87 kernel: Out of memory: Kill process 50329 (ganesha.nfsd) score 853 or sacrifice child
Nov 15 02:35:27 sv-2000lvp87 kernel: Killed process 50329 (ganesha.nfsd) total-vm:19244616kB, anon-rss:15307828kB, file-rss:0kB

It looks like on the customer setup nfs-ganesha had consumed around ~14GB and had an oom_score > 800. We may need to tweak the oom_score calculated for this process (if possible).
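For longevity runs like the ones requested above, a simple sampling loop of the kind used in these experiments can track both numbers over time (a sketch; the interval and log destination are arbitrary):

PID=$(pgrep -x ganesha.nfsd)
while sleep 60; do
    # VmRSS from /proc is the same RSS figure ps reports, in kB
    echo "$(date +%T) rss_kB=$(awk '/VmRSS/ {print $2}' /proc/$PID/status) oom_score=$(cat /proc/$PID/oom_score)"
done >> /var/tmp/ganesha-mem.log

As for tweaking the score, the knob the kernel exposes is oom_score_adj (visible in the trace in the description); e.g. 'echo -500 > /proc/$PID/oom_score_adj' makes ganesha a less likely OOM victim, though it does not fix the underlying growth.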
Kindly let me know when we will have a final build with all the required patches.
Based on comments 43 and 44, we are still waiting for the new build with the relevant fixes. Once we have it, we will start the verification process for the hotfix.
Verified this bug with the latest glusterfs and ganesha builds; the original issue, "OOM kill of nfs-ganesha process while executing iozone", is not reproducible. Tried the scenarios below:
* plain nfs-ganesha mount
* on a volume with quota enabled
* on a volume with quota and failover/failback

The customer bug (https://bugzilla.redhat.com/show_bug.cgi?id=1282669) and the ganesha memory-consumption issues will now be tracked as part of bug https://bugzilla.redhat.com/show_bug.cgi?id=1337867. For any hotfix-related requests, please update bug 1337867 with the details.

Based on the above explanation, marking this bug as Verified.
Soumya, could you review and sign off on the edited doc text?
Divya, the doc_text looks good to me. A few minor corrections are needed for the statement below:

>>> With this fix, gfapi immediately flushes the inodes in which NFS-ganesha releases and also to limit the number of entries nfs-ganesha can cache by adding configuration option to '/etc/ganesha/ganesha.conf' file.

Update: With this fix, gfapi immediately flushes the inodes which NFS-Ganesha releases; also, to limit the number of entries nfs-ganesha can cache, a new configuration option has been added to the '/etc/ganesha/ganesha.conf' file.
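For illustration only, such a cache-limit setting in '/etc/ganesha/ganesha.conf' would look along these lines (CACHEINODE/Entries_HWMark is the nfs-ganesha 2.x cache_inode configuration block; the value 50000 is an arbitrary example, not a recommendation from this bug):

cat >> /etc/ganesha/ganesha.conf <<'EOF'
CACHEINODE {
    # Upper watermark on cached inode entries; ganesha starts
    # reaping cached entries once this many are held.
    Entries_HWMark = 50000;
}
EOF

followed by a restart of the nfs-ganesha service for the option to take effect.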
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2016:1247