Bug 1227197
| Field | Value |
|---|---|
| Summary | Disperse volume: Memory leak in client glusterfs |
| Product | [Red Hat Storage] Red Hat Gluster Storage |
| Reporter | Bhaskarakiran <byarlaga> |
| Component | disperse |
| Assignee | Pranith Kumar K <pkarampu> |
| Status | CLOSED ERRATA |
| QA Contact | Bhaskarakiran <byarlaga> |
| Severity | high |
| Priority | high |
| Version | rhgs-3.1 |
| CC | annair, asrivast, byarlaga, mzywusko, nsathyan, pkarampu, rcyriac, rhs-bugs, storage-qa-internal, vagarwal, vbellur |
| Target Release | RHGS 3.1.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | glusterfs-3.7.1-7.el6 |
| Doc Type | Bug Fix |
| Type | Bug |
| Last Closed | 2015-07-29 04:54:50 UTC |
| Bug Blocks | 1202842, 1223636, 1230612 |
Description
Bhaskarakiran, 2015-06-02 06:33:08 UTC

Created attachment 1033606 [details]: sosreport of client
This is seen even when USS is off. Brought down 2 of the bricks in a 4+2 disperse volume, started a linux kernel untar, and brought the bricks back up. The untar hung and the glusterfs client process got killed.

I feel this is a blocker. Please mark it blocker+.

With the fix for bug 1227649, i.e. https://code.engineering.redhat.com/gerrit/49909, I am able to run the test given in the bug description without any OOM kills. The reason for the leaks is stale lock structures, which also hold references on inodes; these accumulate and eventually lead to the death of the mount.

Verified this on 3.7.1-3 and didn't see the issue. Marking this as fixed.

Ran iozone on 10 files simultaneously and saw the memory leak again. glusterfs is getting killed with OOM messages. Re-opening the bug. This is on 3.7.1-4:

    [root@rhs-client29 iozone]# 12
    Error reading block 587
    Error reading block 505
    Error reading block 888, fd= 3
    Filename testfile.7  Read returned -1
    Seeked to 796 Reclen = 4096
    Error reading block 562
    Error reading block 941, fd= 3
    Filename testfile.8  Read returned -1
    Seeked to 678 Reclen = 4096
    Error reading block 576
    Can not fdopen temp file: testfile.3 107
    Can not fdopen temp file: testfile.9 107
    fdopen: Transport endpoint is not connected
    read: Software caused connection abort
    fdopen: Transport endpoint is not connected
    read: Software caused connection abort
    Can not fdopen temp file: testfile.2 107
    read: Software caused connection abort
    read: Transport endpoint is not connected
    read: Software caused connection abort
    fdopen: Transport endpoint is not connected
    Can not fdopen temp file: testfile.1 107
    fdopen: Transport endpoint is not connected
    read: Software caused connection abort

dmesg output:

    Out of memory: Kill process 4169 (glusterfs) score 925 or sacrifice child
    Killed process 4169, UID 0, (glusterfs) total-vm:19552852kB, anon-rss:7615896kB, file-rss:8kB

Can you please provide sosreports and more details of the system in terms of resources etc. when the crash happened? Additionally, providing the exact command line used with iozone would help.

As per Bhaskar, this is not re-creatable in 3.7.1-4. He will close it if it is working fine with 3.7.1-5 as well.

Moving it to ON_QA based on comment #9.

The command I used is:

    for i in `seq 1 10`; do /opt/iozone3_430/src/current/iozone -az -i0 -i1 & done

and the client is a physical machine with 8 GB RAM. I failed to collect the sosreport while the crash happened. I will if I see this again on the latest build.

Bhaskar, I need the following information:
1) Is this bug intermittent?
2) When this issue happens, are a lot of self-heals triggered on the mount? In other words, do you see a lot of failures in the brick logs?

The only possibility I see for this is that the mount triggers too many heals, leading to the OOM issue. We probably need rate-limiting as a fix for this.

Pranith

Pranith,
1. No, reproducible with iozone consistently.
2. I haven't observed this. Need to check.

Verified this on the 3.7.1-7 build and didn't see the OOM kills. Marking this as fixed.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html
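For reference, the sketch below pulls together the two reproduction scenarios described in the comments (untar with two bricks down in a 4+2 disperse volume, and ten parallel iozone runs). It is a minimal illustration, not the exact QA procedure: the volume name `testvol`, the mount point `/mnt/testvol`, the tarball path, and the brick-matching patterns are all assumptions; only the iozone loop is taken verbatim from the report.

```bash
#!/bin/bash
# Sketch of the reproduction flow described in this bug.
# Assumed names: volume "testvol", FUSE mount /mnt/testvol, kernel tarball path.

VOL=testvol
MNT=/mnt/testvol

# Scenario 1: linux untar while two bricks of the 4+2 disperse volume are down.
# Kill two brick processes (the brick-path patterns are placeholders).
pkill -f "glusterfsd.*${VOL}.*brick5" || true
pkill -f "glusterfsd.*${VOL}.*brick6" || true

# Start the untar on the FUSE mount while the bricks are down.
tar -xf /root/linux-kernel.tar.xz -C "$MNT" &
TAR_PID=$!

# Let the untar make some progress, then bring the killed bricks back;
# "start force" respawns any brick processes that are not running.
sleep 30
gluster volume start "$VOL" force
wait "$TAR_PID"

# Scenario 2: ten parallel iozone runs, as given in the comments.
cd "$MNT"
for i in $(seq 1 10); do
    /opt/iozone3_430/src/current/iozone -az -i0 -i1 &
done
wait
```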
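Since the root cause cited in the fix discussion is stale lock structures holding inode references on the client, one way to watch for this kind of growth during a run is to track the client's RSS and take periodic statedumps (glusterfs writes a statedump when it receives SIGUSR1). The sketch below is an assumption-laden helper, not part of the original report: the default statedump directory `/var/run/gluster`, the `glusterdump.<pid>.dump.*` file naming, the `active_size`/`lru_size` field names, and the 60-second interval are all assumed.

```bash
#!/bin/bash
# Leak-watch sketch (assumptions: a single glusterfs FUSE client on this host,
# statedumps land in /var/run/gluster, gawk is available for strftime).

PID=$(pgrep -o -x glusterfs)      # oldest glusterfs process, assumed to be the client
DUMPDIR=/var/run/gluster

while kill -0 "$PID" 2>/dev/null; do
    # Resident memory of the client, from /proc.
    awk '/VmRSS/ {print strftime("%F %T"), $2, $3}' "/proc/$PID/status"

    # Ask the client for a statedump; glusterfs writes one on SIGUSR1.
    kill -USR1 "$PID"
    sleep 5

    # Rough growth indicators: inode-table counters in the latest dump
    # (field names assumed; adjust the pattern to the actual dump contents).
    latest=$(ls -t "$DUMPDIR"/glusterdump."$PID".dump.* 2>/dev/null | head -1)
    [ -n "$latest" ] && grep -E 'active_size|lru_size' "$latest" | head

    sleep 60
done
```

If the inode-table counters and RSS climb steadily under a constant workload, that matches the stale-lock/inode-ref pattern described in the fix comment; on a build containing the referenced fix they would be expected to level off.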