Bug 976800

Summary: Running dbench results in leaked fds, leading to the OOM killer killing glusterfsd.
Product: [Community] GlusterFS
Component: replicate
Version: mainline
Reporter: Ravishankar N <ravishankar>
Assignee: Pranith Kumar K <pkarampu>
CC: gluster-bugs
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Type: Bug
Doc Type: Bug Fix
Fixed In Version: glusterfs-3.4.0
Last Closed: 2013-07-24 17:20:24 UTC
Clones: 977250, 985388
Bug Blocks: 977250, 985388

Description Ravishankar N 2013-06-21 13:40:26 UTC
Description of problem:
Running dbench on a distributed-replicate volume leaks fds on the server, which eventually drives the OOM killer to kill the brick process.

Output of dmesg on server:
========================================================================
<snip>
VFS: file-max limit 188568 reached
.
.
.
Out of memory: Kill process 12235 (glusterfsd) score 215 or sacrifice child
Killed process 12235, UID 0, (glusterfsd) total-vm:3138856kB, anon-rss:466728kB, file-rss:1028kB
glusterfsd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
glusterfsd cpuset=/ mems_allowed=0
Pid: 12333, comm: glusterfsd Not tainted 2.6.32-358.6.2.el6.x86_64 #1

</snip>
========================================================================
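The "file-max" message above means the kernel hit its system-wide ceiling on open file handles. A quick way to inspect that ceiling and the current allocation (standard procfs files, not from the original report):

    cat /proc/sys/fs/file-max    # system-wide limit (188568 above)
    cat /proc/sys/fs/file-nr     # allocated handles, unused, limit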


How reproducible:
Always

Steps to Reproduce:
1. Create a 2x2 distributed-replicate volume and FUSE-mount it.
2. On the mount point, run "dbench -s -F -S -x --one-byte-write-fix --stat-check 10".
3. Kill dbench after it has run for about 3 minutes.
4. On the server, check each brick process for fds held on deleted files (see the consolidated sketch after these steps):
   ls -l /proc/<pid_of_brick>/fd | grep deleted
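The steps above, consolidated into one shell sketch; the volume name, hostnames, and brick paths are illustrative placeholders, not taken from the original report:

    # 1. Create and FUSE-mount a 2x2 distributed-replicate volume
    #    (two servers, two bricks each; all names are placeholders).
    gluster volume create testvol replica 2 \
        server1:/bricks/b1 server2:/bricks/b1 \
        server1:/bricks/b2 server2:/bricks/b2
    gluster volume start testvol
    mkdir -p /mnt/glusterfs
    mount -t glusterfs server1:/testvol /mnt/glusterfs

    # 2-3. Generate load, then kill dbench after ~3 minutes.
    cd /mnt/glusterfs
    dbench -s -F -S -x --one-byte-write-fix --stat-check 10 &
    DBENCH_PID=$!
    sleep 180 && kill $DBENCH_PID

    # 4. On each server, count fds the bricks still hold on deleted files.
    for pid in $(pgrep glusterfsd); do
        echo "brick $pid: $(ls -l /proc/$pid/fd | grep -c deleted) deleted fds"
    done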


Actual results:
We can still see open fds on unlinked files even though dbench was killed. Also, if dbench is run to completion, some of the bricks get killed by the OOM killer (check with "ps aux | grep glusterfsd").
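To watch the leak grow while dbench runs, a rough one-liner (assumes the brick processes match "glusterfsd" in pgrep):

    # Sample each brick's count of deleted-but-open fds every 5 seconds;
    # on an affected build the counts keep climbing.
    watch -n 5 'for p in $(pgrep glusterfsd); do echo "$p: $(ls -l /proc/$p/fd | grep -c deleted)"; done'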

Expected results:
Once dbench is killed, the brick processes must not hold any open fds on unlinked files.

Additional info:
Bisected the leak to the following commit ID on upstream:
* 8909c28 - cluster/afr: fsync() guarantees POST-OP completion
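For reference, a bisection along these lines would reproduce that result; the good/bad endpoints below are illustrative, and each candidate build must be installed and exercised with the dbench reproducer above:

    git bisect start
    git bisect bad HEAD        # a build showing the fd leak
    git bisect good v3.3.0     # an assumed-good earlier tag
    # after testing each candidate build:
    git bisect good            # or "git bisect bad" if deleted fds pile up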

Comment 1 Anand Avati 2013-06-24 07:03:34 UTC
REVIEW: http://review.gluster.org/5248 (cluster/afr: Fix fd/memory leak on fsync) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)

Comment 2 Anand Avati 2013-06-24 16:45:57 UTC
COMMIT: http://review.gluster.org/5248 committed in master by Anand Avati (avati) 
------
commit 03f5172dd50b50988c65dd66e87a0d43e78a3810
Author: Pranith Kumar K <pkarampu>
Date:   Mon Jun 24 08:15:09 2013 +0530

    cluster/afr: Fix fd/memory leak on fsync
    
    Change-Id: I764883811e30ca9d9c249ad00b6762101083a2fe
    BUG: 976800
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/5248
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
    Reviewed-by: Jeff Darcy <jdarcy>