Description of problem:
=======================
Client glusterfs got killed with OOM messages. Was running plain file creation, hundreds in parallel, from the client.

Version-Release number of selected component (if applicable):
============================================================
[root@vertigo ~]# gluster --version
glusterfs 3.7.0 built on Jun 1 2015 07:14:51
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.
[root@vertigo ~]#

How reproducible:
=================
Seen once

Steps to Reproduce:
1. Create a 1x(8+3) disperse volume. Disable quota. Enable USS. (A setup sketch follows at the end of this report.)
2. Fuse mount on the client.
3. Create files with the command below:

for i in `seq 1 100`; do mkdir dir.$i ; for j in `seq 1 100`; do dd if=/dev/urandom of=dir.$i/testfile.$j bs=64k count=$j & done ; wait ; done

Actual results:
===============
OOM kill of glusterfs

Expected results:
================
No memory leaks

Additional info:
================
sosreports will be attached.
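For reference, a hedged sketch of the setup in steps 1 and 2; the volume name, server, and brick paths are placeholders, not taken from this report:

# Hypothetical setup sketch; testvol, server1 and /bricks/* are placeholders.
# 1x(8+3) disperse volume: 11 bricks total, redundancy 3.
gluster volume create testvol disperse 11 redundancy 3 \
    server1:/bricks/b{1..11}/brick force   # "force" only needed for bricks on the root partition
gluster volume start testvol
gluster volume quota testvol disable       # quota is off by default on a new volume
gluster volume set testvol features.uss enable

# Step 2: fuse mount on the client.
mount -t glusterfs server1:/testvol /mnt/testvol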
Created attachment 1033606 [details] sosreport of client
This is seen even when USS is off. Brought down 2 of the bricks in a 4+2 disperse volume, started a Linux kernel untar, and brought them back up. The untar hung and glusterfs got killed.
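A sketch of one common way to take bricks down and bring them back for a test like this; the volume name and PIDs are placeholders, not taken from this comment:

# Hypothetical sketch: take bricks offline, run the workload, bring them back.
gluster volume status testvol           # note the PIDs of the bricks to kill
kill -9 <brick-pid-1> <brick-pid-2>     # take 2 of the 6 bricks down
# ... run the Linux kernel untar on the fuse mount ...
gluster volume start testvol force      # restart the killed bricks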
I feel this is a blocker. Please mark it blocker+.
With the fix for bug 1227649, i.e. https://code.engineering.redhat.com/gerrit/49909, I am able to run the test given in the bug description without any OOM kills. The cause of the leaks is stale lock structures, which also hold references on the inodes; those references accumulate and eventually lead to the death of the mount.
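One way to observe this kind of leak on a live client, as a sketch: send SIGUSR1 to the fuse client to produce a statedump and watch the inode table grow across successive dumps. This assumes the default statedump path /var/run/gluster and that the only glusterfs process on the box is the fuse mount; entry names in the dump can vary by version.

# Sketch: generate a client statedump and count active inode-table entries.
pid=$(pgrep -x glusterfs | head -n1)
kill -USR1 "$pid"    # asks the client to write a statedump
sleep 2
# Steadily growing active-inode counts across dumps point at leaked refs.
grep -c 'itable.active' /var/run/gluster/glusterdump."$pid".dump.*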
Verified this on 3.7.1-3 and didn't see the issue. Marking this as fixed.
Ran iozone on 10 files simultaneously and saw the memory leak. glusterfs is getting killed with OOM messages. Re-opening the bug. This is on 3.7.1-4.

[root@rhs-client29 iozone]# 12
Error reading block 587
Error reading block 505
Error reading block 888, fd= 3 Filename testfile.7 Read returned -1
Seeked to 796 Reclen = 4096
Error reading block 562
Error reading block 941, fd= 3 Filename testfile.8 Read returned -1
Seeked to 678 Reclen = 4096
Error reading block 576
Can not fdopen temp file: testfile.3 107
Can not fdopen temp file: testfile.9 107
fdopen: Transport endpoint is not connected
read: Software caused connection abort
fdopen: Transport endpoint is not connected
read: Software caused connection abort
Can not fdopen temp file: testfile.2 107
read: Software caused connection abort
read: Transport endpoint is not connected
read: Software caused connection abort
fdopen: Transport endpoint is not connected
Can not fdopen temp file: testfile.1 107
fdopen: Transport endpoint is not connected
read: Software caused connection abort

dmesg output:
Out of memory: Kill process 4169 (glusterfs) score 925 or sacrifice child
Killed process 4169, UID 0, (glusterfs) total-vm:19552852kB, anon-rss:7615896kB, file-rss:8kB
[root@rhs-client29 iozone]#
Can you please provide sosreports and more details of the system (resources etc.) from when the crash happened? Additionally, providing the exact iozone command line used would help.
As per Bhaskar, this is not re-creatable on 3.7.1-4. He will close it if it works fine with 3.7.1-5 as well.
Moving it to ON_QA based on comment #9.
The command I used is:

for i in `seq 1 10`; do /opt/iozone3_430/src/current/iozone -az -i0 -i1 & done

The client is a physical machine with 8 GB RAM. I failed to collect the sosreport when the crash happened; I will collect one if I see this again on the latest build.
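Should it recur, a small monitoring sketch like the one below could capture the memory growth even without a sosreport; it assumes the only glusterfs process on the client is the fuse mount, and the log path is arbitrary:

# Sketch: log the fuse client's RSS every 10 seconds while the iozone load runs.
pid=$(pgrep -x glusterfs | head -n1)
while kill -0 "$pid" 2>/dev/null; do
    echo "$(date +%s) $(awk '/VmRSS/ {print $2 " kB"}' /proc/$pid/status)" \
        >> /var/tmp/glusterfs-rss.log
    sleep 10
done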
Bhaskar, I need the following information:

1) Is this bug intermittent?
2) When this issue happens, are a lot of self-heals triggered on the mount? In other words, do you see a lot of failures in the brick logs?

The only possibility I see for this is the mount triggering too many heals, leading to the OOM issue. We probably need rate-limiting as a fix for this.

Pranith
Pranith,
1. No, it is reproducible with iozone consistently.
2. I haven't observed this. Need to check.
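A sketch of how (2) could be checked, should it recur; testvol is a placeholder volume name and the brick log path is the usual /var/log/glusterfs default on the servers:

# Sketch: check for pending self-heals and brick-side failures.
gluster volume heal testvol info        # pending heals as seen by the cluster

# On each server, look for recent errors/warnings in the brick logs.
grep -E '\] E \[|\] W \[' /var/log/glusterfs/bricks/*.log | tail -n 50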
Verified this on the 3.7.1-7 build and didn't see any OOM kills. Marking this as fixed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-1495.html