Description of problem:
While executing iozone against nfs-ganesha, I saw the nfs-ganesha process getting killed on each node, one after the other. On investigation I found that the nfs-ganesha process was OOM-killed:

ganesha.nfsd invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
ganesha.nfsd cpuset=/ mems_allowed=0
Pid: 19476, comm: ganesha.nfsd Not tainted 2.6.32-570.el6.x86_64 #1
Call Trace:
[<ffffffff810d6dd1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
[<ffffffff8112a5d0>] ? dump_header+0x90/0x1b0
[<ffffffff810d7e71>] ? cpuset_mems_allowed_intersects+0x21/0x30
[<ffffffff8112aa52>] ? oom_kill_process+0x82/0x2a0
[<ffffffff8112a991>] ? select_bad_process+0xe1/0x120
[<ffffffff8112ae90>] ? out_of_memory+0x220/0x3c0
[<ffffffff8113786c>] ? __alloc_pages_nodemask+0x93c/0x950
[<ffffffffa00671b4>] ? do_get_write_access+0x3b4/0x520 [jbd2]
[<ffffffff8117035a>] ? alloc_pages_vma+0x9a/0x150
[<ffffffff81152b1d>] ? handle_pte_fault+0x73d/0xb20
[<ffffffff8105e173>] ? __wake_up+0x53/0x70
[<ffffffff81153199>] ? handle_mm_fault+0x299/0x3d0
[<ffffffffa00a5e48>] ? __ext4_journal_stop+0x68/0xa0 [ext4]
[<ffffffff8104f156>] ? __do_page_fault+0x146/0x500
[<ffffffff81127ffa>] ? generic_file_buffered_write+0x1da/0x2e0
[<ffffffff8107e9b7>] ? current_fs_time+0x27/0x30
[<ffffffff8153e76e>] ? do_page_fault+0x3e/0xa0
[<ffffffff8153bb15>] ? page_fault+0x25/0x30
[<ffffffff81126b44>] ? file_read_actor+0x44/0x180
[<ffffffff81238cc1>] ? avc_has_perm+0x71/0x90
[<ffffffff81129056>] ? generic_file_aio_read+0x2d6/0x700
[<ffffffff8119191a>] ? do_sync_read+0xfa/0x140
[<ffffffff810a14b0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff8123e72b>] ? selinux_file_permission+0xfb/0x150
[<ffffffff81231496>] ? security_file_permission+0x16/0x20
[<ffffffff81192215>] ? vfs_read+0xb5/0x1a0
[<ffffffff81192fef>] ? fget_light_pos+0x3f/0x50
[<ffffffff81192561>] ? sys_read+0x51/0xb0
[<ffffffff810e88ae>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b

Version-Release number of selected component (if applicable):
glusterfs-3.7.1-7.el6rhs.x86_64
nfs-ganesha-2.2.0-3.el6rhs.x86_64

How reproducible:
Happened on all nodes.

Steps to Reproduce:
1. Post configuring nfs-ganesha, execute iozone over a vers=4 mount.

Actual results:

[root@nfs11 ~]# gluster volume info vol2

Volume Name: vol2
Type: Distributed-Replicate
Volume ID: 30ab7484-1480-46d5-8f83-4ab27199640d
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.46.8:/rhs/brick1/d1r12
Brick2: 10.70.46.27:/rhs/brick1/d1r22
Brick3: 10.70.46.25:/rhs/brick1/d2r12
Brick4: 10.70.46.29:/rhs/brick1/d2r22
Brick5: 10.70.46.8:/rhs/brick1/d3r12
Brick6: 10.70.46.27:/rhs/brick1/d3r22
Brick7: 10.70.46.25:/rhs/brick1/d4r12
Brick8: 10.70.46.29:/rhs/brick1/d4r22
Brick9: 10.70.46.8:/rhs/brick1/d5r12
Brick10: 10.70.46.27:/rhs/brick1/d5r22
Brick11: 10.70.46.25:/rhs/brick1/d6r12
Brick12: 10.70.46.29:/rhs/brick1/d6r22
Options Reconfigured:
performance.readdir-ahead: on
features.cache-invalidation: on
ganesha.enable: on
nfs.disable: on
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
nfs-ganesha: enable

[root@nfs11 ~]# gluster volume status vol2
Status of volume: vol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.8:/rhs/brick1/d1r12          49156     0          Y       2665
Brick 10.70.46.27:/rhs/brick1/d1r22         49156     0          Y       20958
Brick 10.70.46.25:/rhs/brick1/d2r12         49156     0          Y       17883
Brick 10.70.46.29:/rhs/brick1/d2r22         49155     0          Y       20935
Brick 10.70.46.8:/rhs/brick1/d3r12          49157     0          Y       2684
Brick 10.70.46.27:/rhs/brick1/d3r22         49157     0          Y       20977
Brick 10.70.46.25:/rhs/brick1/d4r12         49157     0          Y       17902
Brick 10.70.46.29:/rhs/brick1/d4r22         49156     0          Y       20954
Brick 10.70.46.8:/rhs/brick1/d5r12          49158     0          Y       2703
Brick 10.70.46.27:/rhs/brick1/d5r22         49158     0          Y       20996
Brick 10.70.46.25:/rhs/brick1/d6r12         49158     0          Y       17921
Brick 10.70.46.29:/rhs/brick1/d6r22         49157     0          Y       20973
Self-heal Daemon on localhost               N/A       N/A        Y       2341
Quota Daemon on localhost                   N/A       N/A        Y       2352
Self-heal Daemon on 10.70.46.25             N/A       N/A        Y       7479
Quota Daemon on 10.70.46.25                 N/A       N/A        Y       7484
Self-heal Daemon on 10.70.46.29             N/A       N/A        Y       9905
Quota Daemon on 10.70.46.29                 N/A       N/A        N       N/A
Self-heal Daemon on 10.70.46.22             N/A       N/A        Y       7660
Quota Daemon on 10.70.46.22                 N/A       N/A        Y       7668
Self-heal Daemon on 10.70.46.27             N/A       N/A        Y       10010
Quota Daemon on 10.70.46.27                 N/A       N/A        N       N/A

Task Status of Volume vol2
------------------------------------------------------------------------------
There are no active volume tasks

ganesha.log:

05/07/2015 16:55:46 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
05/07/2015 16:56:46 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE
05/07/2015 20:04:49 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] nfs4_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 60
05/07/2015 20:04:49 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] nfs4_start_grace :STATE :EVENT :NFS Server recovery event 5 nodeid -1 ip nfs14
05/07/2015 20:04:49 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] nfs4_load_recov_clids_nolock :CLIENT ID :EVENT :Recovery for nodeid -1 dir (/var/lib/nfs/ganesha/nfs14/v4recov)
05/07/2015 20:04:56 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
05/07/2015 20:05:56 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE
05/07/2015 21:07:28 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] nfs4_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 60
05/07/2015 21:07:28 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] nfs4_start_grace :STATE :EVENT :NFS Server recovery event 5 nodeid -1 ip nfs13
05/07/2015 21:07:28 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] nfs4_load_recov_clids_nolock :CLIENT ID :EVENT :Recovery for nodeid -1 dir (/var/lib/nfs/ganesha/nfs13/v4recov)
05/07/2015 21:07:36 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
05/07/2015 21:08:36 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE
05/07/2015 22:06:11 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] nfs4_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 60
05/07/2015 22:06:11 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] nfs4_start_grace :STATE :EVENT :NFS Server recovery event 5 nodeid -1 ip nfs12
05/07/2015 22:06:11 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] nfs4_load_recov_clids_nolock :CLIENT ID :EVENT :Recovery for nodeid -1 dir (/var/lib/nfs/ganesha/nfs12/v4recov)
05/07/2015 22:06:14 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[work-10] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE
05/07/2015 22:06:23 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy. Not sending heartbeat
05/07/2015 22:06:24 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy. Not sending heartbeat
05/07/2015 22:06:25 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy. Not sending heartbeat
05/07/2015 22:06:26 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy. Not sending heartbeat
05/07/2015 22:06:27 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy. Not sending heartbeat
05/07/2015 22:06:47 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy. Not sending heartbeat
05/07/2015 22:06:54 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy. Not sending heartbeat
05/07/2015 22:06:58 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy. Not sending heartbeat
05/07/2015 22:07:17 : epoch 55990a3a : nfs11.lab.eng.blr.redhat.com : ganesha.nfsd-2611[reaper] nfs_in_grace :STATE :EVENT :NFS Server Now NOT IN GRACE

Expected results:
No OOM kill to be seen for the nfs-ganesha process.

Additional info:
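For anyone triaging on the nodes, an OOM kill of ganesha.nfsd can be confirmed with standard checks along these lines (a sketch; run on any affected server):

# Kernel log entries left by the OOM killer
dmesg | grep -iE 'oom|killed process'
grep -i 'killed process' /var/log/messages
# Current OOM badness score of the running ganesha process
cat /proc/$(pgrep -x ganesha.nfsd)/oom_score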
Created attachment 1048675 [details] sosreport of nfs11
Created attachment 1048679 [details] nfs11 ganesha.log
Created attachment 1048680 [details] dmesg log from nfs11
Created attachment 1048681 [details] nfs11 ganesha-gfapi.log
As per the discussion with Saurabh:
* iozone tests ran successfully on earlier builds/setups.
* The difference with this setup is that the quota feature is ON.
* In addition, it seems this was not hit with just a plain mount. A failover was triggered by manually killing the nfs-ganesha service around 15 minutes after the test started. Some time after that, nfs-ganesha processes were auto-killed due to OOM on all the other nodes as well.

We would like to run the test in the following scenarios and compare the results (a rough sketch of the scenario toggles follows below):
* plain nfs-ganesha mount
* with the quota feature on
* with a failover test in parallel
* with the quota feature on and a failover test in parallel
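As noted above, the scenario toggles amount to something like the following (volume name taken from this bug; the kill mirrors the manual failover described above):

# Toggle quota on the volume between runs
gluster volume quota vol2 enable
gluster volume quota vol2 disable
# To run the failover test in parallel, kill ganesha on one node mid-test
pkill -9 ganesha.nfsd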
I tried to reproduce the OOM, but in this attempt at a similar test I could not see the issue.
Please include details about the exact iozone command that was run, and how many clients were used.
I used "iozone -a" from one client only.
We have requested QE to re-test and reproduce the issue. There are also a few gfapi memory-leak issues being worked on upstream as part of bug 1295107. If we can reproduce this issue, we can apply those patches and retest.
The fixes merged upstream that can be taken for this bug are:
http://review.gluster.org/13125
http://review.gluster.org/13096
http://review.gluster.org/#/c/13232/
Here are a few of my findings/experiments done to determine the cache_inode limit.

From gdb and the code, the total size occupied by each inode (including context) in a distributed-replicate volume looks to be:

(gdb) p sizeof(struct cache_entry_t) + sizeof(struct glusterfs_handle) + sizeof(struct glfs_object) + sizeof(afr_inode_ctx_t) + sizeof(dht_inode_ctx_t) + sizeof(ioc_inode_t) + sizeof(struct md_cache) + sizeof(struct ios_stat) + sizeof(uint64_t) + sizeof(wb_inode_t) + sizeof(qr_inode_t)
$18 = 2296

i.e. around ~2K bytes. But when I monitored the process memory size while data was being written to newly created files, I saw varied results.

With nothing exported:

[root@dhcp43-58 ~]# showmount -e
Export list for dhcp43-58.lab.eng.blr.redhat.com:
[root@dhcp43-58 ~]# ps aux | grep ganesha
root     19044  0.2  0.4 1360752   8240 ?      Ssl  15:03   0:00 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT
root     19172  0.0  0.1  117000   2144 pts/0  S+   15:03   0:00 grep --color=auto ganesha
[root@dhcp43-58 ~]# cat /proc/19044/oom_score
2

Exporting a volume 'testvol':

[root@dhcp43-58 ~]# showmount -e
Export list for dhcp43-58.lab.eng.blr.redhat.com:
/testvol (everyone)
[root@dhcp43-58 ~]# cat /proc/19044/oom_score
14
[root@dhcp43-58 ~]# ps aux | grep ganesha
root     19044  0.3  3.0 1636920  61904 ?      Ssl  15:03   0:00 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT
root     19501  0.0  0.1  117000   2196 pts/0  S+   15:04   0:00 grep --color=auto ganesha

i.e. the RSS value increased by ~54M and the oom_score by 12.

Exporting another volume:

[root@dhcp43-58 ~]# showmount -e
Export list for dhcp43-58.lab.eng.blr.redhat.com:
/testvol (everyone)
/test (everyone)
[root@dhcp43-58 ~]# cat /proc/19044/oom_score
26
[root@dhcp43-58 ~]# ps aux | grep ganesha
root     19044  0.2  5.3 1750168 110440 ?      Ssl  15:03   0:00 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT
root     20347  0.0  0.1  117000   2244 pts/0  S+   15:06   0:00 grep --color=auto ganesha

>>> the RSS value increased by ~49M and the oom_score by 12.

And a third:

[root@dhcp43-58 ~]# showmount -e
Export list for dhcp43-58.lab.eng.blr.redhat.com:
/testvol (everyone)
/test (everyone)
/new1 (everyone)
[root@dhcp43-58 ~]# cat /proc/19044/oom_score
37
[root@dhcp43-58 ~]# ps aux | grep ganesha
root     19044  0.3  7.7 1863416 158888 ?      Ssl  15:03   0:01 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT
root     21308  0.0  0.1  117004   2228 pts/0  S+   15:08   0:00 grep --color=auto ganesha

>>> the RSS value increased by ~49M and the oom_score by 11.

So, leaving out the first volume, each volume export increased memory by ~49M on average.

[root@skoduri ~]# mount -t nfs 10.70.43.58:/testvol /mnt
[root@dhcp43-58 ~]# ps aux | grep ganesha
root     19044  0.3  7.7 1863416 158904 ?      Ssl  15:03   0:01 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT
root     22175  0.0  0.1  117004   2204 pts/0  S+   15:10   0:00 grep --color=auto ganesha

After writing into 100 files and reading the data back:

[root@skoduri mnt]# for i in `seq 1 100`; do echo 3 > /proc/sys/vm/drop_caches; dd if=/dev/zero of=/mnt/file$i bs=100K count=1 conv=sync & done
[root@skoduri mnt]# cat *
[root@dhcp43-58 ~]# ps aux | grep ganesha
root     19044  0.5  7.9 1863416 161756 ?      Ssl  15:03   0:03 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT
root     23486  0.0  0.1  117000   2096 pts/0  S+   15:13   0:00 grep --color=auto ganesha
[root@dhcp43-58 ~]# cat /proc/19044/oom_score
38

So on average each file increased memory by (161756-158904)/100 = ~28K. But creating a further 100 new files showed different values:

[root@dhcp43-58 ~]# ps aux | grep ganesha
root     19044  0.5  7.9 1863416 161768 ?      Ssl  15:03   0:04 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT
root     24589  0.0  0.1  117000   2132 pts/0  S+   15:16   0:00 grep --color=auto ganesha
[root@skoduri mnt]# for i in `seq 101 200`; do echo 3 > /proc/sys/vm/drop_caches; dd if=/dev/zero of=/mnt/file$i bs=100K count=1 conv=sync & done
[root@skoduri mnt]# cat *
[root@dhcp43-58 ~]# ps aux | grep ganesha
root     19044  0.6  8.2 1867512 169272 ?      Ssl  15:03   0:06 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT
root     25505  0.0  0.1  117000   2192 pts/0  R+   15:18   0:00 grep --color=auto ganesha

That is an average increase of (169272-161768)/100 = ~75K per file.

I have requested Manoj (mpillai) from the perf team to run longevity tests and monitor the nfs-ganesha process's memory usage. We would like to determine:
a) an optimal value for the cache_inode limit
b) whether there are any other major leaks we have missed.

From https://bugzilla.redhat.com/show_bug.cgi?id=1282669#c4:

Nov 13 14:53:26 sv-2000lvp88 kernel: Out of memory: Kill process 28790 (ganesha.nfsd) score 855 or sacrifice child
Nov 13 14:53:26 sv-2000lvp88 kernel: Killed process 28790 (ganesha.nfsd) total-vm:19276892kB, anon-rss:15252240kB, file-rss:0kB
Nov 15 02:35:27 sv-2000lvp87 kernel: Out of memory: Kill process 50329 (ganesha.nfsd) score 853 or sacrifice child
Nov 15 02:35:27 sv-2000lvp87 kernel: Killed process 50329 (ganesha.nfsd) total-vm:19244616kB, anon-rss:15307828kB, file-rss:0kB

It looks like on the customer setup nfs-ganesha had consumed around ~14GB and had an oom_score > 800. We may need to tweak the oom_score calculated for this process (if possible).
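For longevity runs like the ones requested above, a simple sampling loop of the kind used in these experiments can track both numbers over time (a sketch; the interval and log destination are arbitrary):

PID=$(pgrep -x ganesha.nfsd)
while sleep 60; do
    # VmRSS from /proc is the same RSS figure ps reports, in kB
    echo "$(date +%T) rss_kB=$(awk '/VmRSS/ {print $2}' /proc/$PID/status) oom_score=$(cat /proc/$PID/oom_score)"
done >> /var/tmp/ganesha-mem.log

As for tweaking the score, the knob the kernel exposes is oom_score_adj (visible in the trace in the description); e.g. 'echo -500 > /proc/$PID/oom_score_adj' makes ganesha a less likely OOM victim, though it does not fix the underlying growth.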
Kindly let me know when we will have a final build with all the required patches.
Based on comments 43 and 44, we are still waiting for the new build with the relevant fixes. Once we have it, we will start the verification process for the hotfix.
Verified this bug with the latest glusterfs and ganesha builds; the original issue, "OOM kill of nfs-ganesha process while executing iozone", is not reproducible. Tried the scenarios below:
* plain nfs-ganesha mount
* on a volume with quota enabled
* on a volume with quota and failover/failback

The customer bug (https://bugzilla.redhat.com/show_bug.cgi?id=1282669) and the ganesha memory-consumption issues will now be tracked as part of bug https://bugzilla.redhat.com/show_bug.cgi?id=1337867. For any hotfix-related requests, please update bug 1337867 with the details.

Based on the above explanation, marking this bug as Verified.
Soumya, could you review and sign off on the edited doc text?
Divya, the doc_text looks good to me. A few minor corrections are needed for the statement below:

>>> With this fix, gfapi immediately flushes the inodes in which NFS-ganesha releases and also to limit the number of entries nfs-ganesha can cache by adding configuration option to '/etc/ganesha/ganesha.conf' file.

Update: With this fix, gfapi immediately flushes the inodes which NFS-Ganesha releases; also, to limit the number of entries nfs-ganesha can cache, a new configuration option has been added to the '/etc/ganesha/ganesha.conf' file.
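For illustration only, such a cache-limit setting in '/etc/ganesha/ganesha.conf' would look along these lines (CACHEINODE/Entries_HWMark is the nfs-ganesha 2.x cache_inode configuration block; the value 50000 is an arbitrary example, not a recommendation from this bug):

cat >> /etc/ganesha/ganesha.conf <<'EOF'
CACHEINODE {
    # Upper watermark on cached inode entries; ganesha starts
    # reaping cached entries once this many are held.
    Entries_HWMark = 50000;
}
EOF

followed by a restart of the nfs-ganesha service for the option to take effect.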
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2016:1247