Bug 1227197 - Disperse volume : Memory leak in client glusterfs
Summary: Disperse volume : Memory leak in client glusterfs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: disperse
Version: rhgs-3.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.1.0
Assignee: Pranith Kumar K
QA Contact: Bhaskarakiran
URL:
Whiteboard:
Depends On:
Blocks: 1202842 1223636 1230612
 
Reported: 2015-06-02 06:33 UTC by Bhaskarakiran
Modified: 2016-11-23 23:12 UTC
CC List: 11 users

Fixed In Version: glusterfs-3.7.1-7.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-07-29 04:54:50 UTC
Embargoed:


Attachments
sosreport of client (8.18 MB, application/x-xz)
2015-06-02 06:37 UTC, Bhaskarakiran


Links
System ID: Red Hat Product Errata RHSA-2015:1495
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Gluster Storage 3.1 update
Last Updated: 2015-07-29 08:26:26 UTC

Description Bhaskarakiran 2015-06-02 06:33:08 UTC
Description of problem:
=======================

The client glusterfs process got killed with OOM messages.
This happened while creating plain files in parallel (100s at a time) from the client.


Version-Release number of selected component (if applicable):
============================================================
[root@vertigo ~]# gluster --version
glusterfs 3.7.0 built on Jun  1 2015 07:14:51
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.
[root@vertigo ~]# 


How reproducible:
=================
Seen once

Steps to Reproduce:
1. Create a 1x(8+3) disperse volume, disable quota, and enable USS (see the command sketch after these steps)
2. Fuse mount on the client 
3. Create files with the below command :

for i in `seq  1 100`; do mkdir dir.$i ; for j in `seq 1 100`; do dd if=/dev/urandom of=dir.$i/testfile.$j bs=64k count=$j & done ; wait ; done
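
For reference, steps 1 and 2 would look roughly like this (a sketch only; the volume name, hostnames, and brick paths are placeholders, not taken from this report):

# Step 1: create a 1x(8+3) disperse volume, then disable quota and enable USS
gluster volume create testvol disperse-data 8 redundancy 3 server{1..11}:/bricks/brick1/testvol
gluster volume start testvol
gluster volume quota testvol disable
gluster volume set testvol features.uss enable
# Step 2: fuse mount on the client
mount -t glusterfs server1:/testvol /mnt/testvol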

Actual results:
===============
OOM kill of glusterfs

Expected results:
================
No memory leaks

Additional info:
================
sosreports will be attached.

Comment 2 Bhaskarakiran 2015-06-02 06:37:31 UTC
Created attachment 1033606 [details]
sosreport of client

Comment 3 Bhaskarakiran 2015-06-03 05:45:04 UTC
This is seen even when USS is off.

Brought down 2 of the bricks in a 4+2 volume, started a Linux kernel untar, and then brought the bricks back up. The untar hung and glusterfs got killed.
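
For reference, a rough sketch of how the bricks can be taken down and brought back for this kind of test (the volume name is a placeholder; the exact brick PIDs come from volume status):

# Find the brick PIDs and kill two of them to simulate brick failures
gluster volume status testvol
kill <brick-pid-1> <brick-pid-2>
# ... run the Linux untar on the fuse mount while the bricks are down ...
# Bring the killed bricks back; this also triggers self-heal of data written in the meantime
gluster volume start testvol force
gluster volume heal testvol info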

Comment 4 Pranith Kumar K 2015-06-04 03:21:28 UTC
I feel this is blocker. Please mark it blocker+.

Comment 5 Pranith Kumar K 2015-06-08 13:07:29 UTC
With the fix for 1227649, i.e. https://code.engineering.redhat.com/gerrit/49909, I am able to run the test given in the bug description without any OOM killers. The reason for the leaks is stale lock structures, which also hold refs on inodes; these accumulate and eventually lead to the death of the mount.
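
For anyone trying to confirm this kind of leak on a client, a statedump is one way to watch the inode and lock counts grow (a sketch; the dump directory and section names can differ between builds):

# SIGUSR1 makes the client glusterfs process write a statedump,
# by default under /var/run/gluster/glusterdump.<pid>.dump.<timestamp>
kill -USR1 <client-glusterfs-pid>
# Check the active inode count of the fuse inode table and the number of lock entries;
# repeating this while the workload runs should show both climbing if the leak is present
grep active_size /var/run/gluster/glusterdump.<pid>.dump.*
grep -c inodelk /var/run/gluster/glusterdump.<pid>.dump.*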

Comment 6 Bhaskarakiran 2015-06-18 05:29:43 UTC
Verified this on 3.7.1-3 and didn't see the issue. Marking this as fixed.

Comment 7 Bhaskarakiran 2015-06-22 12:03:16 UTC
Ran iozone on 10 files simultaneously and saw the memory leak. Glusterfs is getting killed with OOM messages. Re-opening the bug. This is on 3.7.1-4.

[root@rhs-client29 iozone]#         12
Error reading block 587

Error reading block 505

Error reading block 888, fd= 3 Filename testfile.7 Read returned -1

Seeked to 796 Reclen = 4096 

Error reading block 562

Error reading block 941, fd= 3 Filename testfile.8 Read returned -1

Seeked to 678 Reclen = 4096 

Error reading block 576

Can not fdopen temp file: testfile.3 107

Can not fdopen temp file: testfile.9 107
fdopen: Transport endpoint is not connected
read: Software caused connection abort
fdopen: Transport endpoint is not connected
read: Software caused connection abort

Can not fdopen temp file: testfile.2 107
read: Software caused connection abort
read: Transport endpoint is not connected
read: Software caused connection abort
fdopen: Transport endpoint is not connected

Can not fdopen temp file: testfile.1 107
fdopen: Transport endpoint is not connected
read: Software caused connection abort

dmesg output:

Out of memory: Kill process 4169 (glusterfs) score 925 or sacrifice child
Killed process 4169, UID 0, (glusterfs) total-vm:19552852kB, anon-rss:7615896kB, file-rss:8kB
[root@rhs-client29 iozone]#

Comment 8 Vijay Bellur 2015-06-24 17:24:09 UTC
Can you please provide sosreports and more details about the system's resources at the time of the crash? Providing the exact iozone command line would also help.

Comment 9 Pranith Kumar K 2015-06-25 06:52:34 UTC
As per Bhaskar, this is not re-creatable in 3.7.1-4. He will close it if it works fine with 3.7.1-5 as well.

Comment 10 Nagaprasad Sathyanarayana 2015-06-26 02:08:42 UTC
Moving it to ON_QA based on comment #9.

Comment 11 Bhaskarakiran 2015-06-26 10:29:00 UTC
The command I used is:

for i in `seq 1 10`; do /opt/iozone3_430/src/current/iozone -az -i0 -i1 & done

The client is a physical machine with 8 GB RAM. I failed to collect the sosreport when the crash happened; I will if I see this again on the latest build.
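
If it helps the next run, a simple way to watch the client's memory while iozone is running (assuming a single glusterfs client process on the box):

# Sample the resident set size of the fuse client every 30 seconds
while true; do grep VmRSS /proc/$(pidof glusterfs)/status; sleep 30; done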

Comment 13 Pranith Kumar K 2015-06-27 11:08:35 UTC
Bhaskar,
     I need the following information:
1) Is this bug intermittent?
2) When this issue happens, are a lot of self-heals triggered on the mount? In other words, do you see a lot of failures in the brick logs?

The only possibility I see for this is that the mount triggers too many heals, leading to the OOM issue. We probably need rate-limiting as a fix for this.

Pranith
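
For point (2) above, a quick way to check whether heals are piling up during the run (the volume name is a placeholder):

# Pending self-heal entries as reported by the servers
gluster volume heal testvol info | grep -i 'number of entries'
# Errors hitting the bricks while the workload runs
grep -c ' E ' /var/log/glusterfs/bricks/*.log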

Comment 14 Bhaskarakiran 2015-06-29 06:36:08 UTC
Pranith,

1. No, it is reproducible with iozone consistently.
2. I haven't observed this; I need to check.

Comment 15 Bhaskarakiran 2015-07-05 11:23:05 UTC
Verified this on the 3.7.1-7 build and didn't see the OOM killers. Marking this as fixed.

Comment 16 errata-xmlrpc 2015-07-29 04:54:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html

