Bug 1247221 - glusterfsd dies with OOM after a simple find executed on one volume
Product: GlusterFS
Classification: Community
Component: core
Hardware: x86_64 Linux
Severity: urgent
Assigned To: Raghavendra G
Whiteboard: Triaged
Depends On:
Reported: 2015-07-27 11:00 EDT by mbienek
Modified: 2016-09-20 02:30 EDT (History)
7 users

See Also:
Fixed In Version: v3.7.3
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments
OOM error, statedump, some logs from the server (1.21 MB, application/x-tar)
2015-07-27 11:00 EDT, mbienek

Description mbienek 2015-07-27 11:00:31 EDT
Created attachment 1056643 [details]
OOM error, statedump, some logs from the server

Description of problem:

- When executing a simple 'find . -type f' on a volume with around 600 directories and 8000 files, the gluster server's CPU and memory usage explodes and the process finally dies with an OOM kill.

[ 9496.724134] Out of memory: Kill process 10376 (glusterfsd) score 565 or sacrifice child
[ 9496.725518] Killed process 10376 (glusterfsd) total-vm:25838340kB, anon-rss:1737572kB, file-rss:0kB

Version of GlusterFS package installed:

on Ubuntu Trusty 14.04.2:  
3.13.0-58-generic #97-Ubuntu SMP Wed Jul 8 02:56:15 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

GlusterFS Cluster Information:
- Number of volumes: 10
- Volume on which the particular issue is seen: 1 
- Type of volumes: Replicated
- Output of gluster volume info
  Volume Name: ebayk_kftp
  Type: Replicate
  Volume ID: 11c2ee66-a186-4136-b577-f23c9c34c500
  Status: Started
  Number of Bricks: 1 x 3 = 3
  Transport-type: tcp
  Brick1: glustercg47-1:/data/ebayk_kftp
  Brick2: glustercg47-2:/data/ebayk_kftp
  Brick3: glustercg47-3:/data/ebayk_kftp
  Options Reconfigured:
  nfs.disable: On
  features.quota-deem-statfs: on
  features.inode-quota: on
  features.quota: on
  auth.allow: 10.38.*,10.46.*,10.47.*
  performance.readdir-ahead: on

Output of gluster volume status

Get the statedump of the volume with the problem
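For reference, a statedump of the affected volume can be taken with the gluster CLI. A minimal sketch, assuming the volume name from this report and Gluster's default dump directory /var/run/gluster (both may differ on other setups):

```shell
# Take a statedump of the brick processes for the affected volume.
# VOL comes from this report; the dump directory is the Gluster default.
VOL=ebayk_kftp
if command -v gluster >/dev/null 2>&1; then
    gluster volume statedump "$VOL"
    # Dump files are named <brick-path>.<pid>.dump.<timestamp>
    ls -t /var/run/gluster/*.dump.* 2>/dev/null | head -n 3
else
    echo "gluster CLI not installed on this host"
fi
```

The dump files contain per-translator memory accounting, which is what the developers use below to track down which allocation is leaking.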

Client Information: 
- OS Type: Debian
- Mount type: glusterfs (fstab options: _netdev,defaults 0 0)
- OS Version:  Wheezy 7.8 

Version-Release number of selected component (if applicable):
glusterfs-server_3.7.2-11437551431_amd64.deb on Ubuntu Trusty 14.04.2:
3.13.0-58-generic #97-Ubuntu SMP Wed Jul 8 02:56:15 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:

Steps to Reproduce:
1. Start the volume
2. Run 'find . -type f' on a client mount
3. After some time the first gluster node dies because of OOM
4. The volume does not come back online
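The steps above can be sketched as a shell session; the mount point below is hypothetical and will differ per setup:

```shell
# Reproduction sketch; MNT is a hypothetical client mount point
# for the ebayk_kftp volume.
MNT=${MNT:-/mnt/ebayk_kftp}
# Walk every file on the mounted volume, as in the report:
if [ -d "$MNT" ]; then
    find "$MNT" -type f > /dev/null
fi
# On the server, sample the brick daemon's resident memory
# (glusterfsd is the brick process that gets OOM-killed here):
for pid in $(pgrep -x glusterfsd || true); do
    ps -o pid=,rss=,cmd= -p "$pid"
done
```

Sampling the RSS while the find runs makes the steady memory growth visible well before the kernel's OOM killer steps in.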

Actual results:
The glusterfsd brick process dies because of OOM.

Expected results:
'find . -type f' completes and server memory usage stays stable.

Additional info:

There are 3 gluster nodes running on two ESX hosts with SSD disks as a storage pool.
The problem happens when each VM has only 1 CPU and 1 GB of RAM, but it also happens with 8 CPUs and 16-32 GB of RAM per VM.
Comment 1 Atin Mukherjee 2015-07-27 23:51:43 EDT
This looks like a brick process OOM killed, not glusterd. Could you confirm?
Comment 2 mbienek 2015-07-28 03:31:09 EDT

From what I can see it is the glusterfsd process:

output from ps before I ran the test:
root     11400  0.1 18.6 1838788 382844 ?      Ssl  Jul27   1:47 /usr/sbin/glusterfsd -s glustercg47-1 --volfile-id ebayk_kftp.glustercg47-1.data-ebayk_kftp -p /var/lib/glusterd/vols/ebayk_kftp/run/glustercg47-1-data-ebayk_kftp.pid -S /var/run/gluster/b3ab78d53ad126540462707510c617ca.socket --brick-name /data/ebayk_kftp -l /var/log/glusterfs/bricks/data-ebayk_kftp.log --xlator-option *-posix.glusterd-uuid=1473642e-57ce-48c2-83a5-2ef7cf3ffcc8 --brick-port 49159 --xlator-option ebayk_kftp-server.listen-port=49159

output from dmesg, after the process got killed by OS: 
[71127.204056] [ 7416]  0  7416   109022      154      63     7590    0 glusterfs
[71127.204058] [11400]  0 11400  3613894   427577    6819     6178    0 glusterfsd
[71127.204060] [11419]  0 11419   240928    13136     118    12316    0 glusterfs
[71127.204061] [11428]  0 11428    88779     7693      64     6052    0 glusterfs
[71127.204063] [14002]   104 14002     5714       59      15  0       0 pickup
[71127.204064] [16846]   510 16846     1852       35       9  0       0 iostat
[71127.204066] Out of memory: Kill process 11400 (glusterfsd) score 551 or sacrifice child
[71127.206009] Killed process 11400 (glusterfsd) total-vm:14455576kB, anon-rss:1710308kB, file-rss:0kB
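For quick triage, the resident set at kill time can be pulled straight out of that dmesg line. A small sketch; the log line is copied verbatim from the output above:

```shell
# Extract anon-rss from the OOM-killer line and convert it to MiB.
line='Killed process 11400 (glusterfsd) total-vm:14455576kB, anon-rss:1710308kB, file-rss:0kB'
rss_kb=$(printf '%s\n' "$line" | sed -n 's/.*anon-rss:\([0-9]*\)kB.*/\1/p')
rss_mib=$((rss_kb / 1024))
echo "glusterfsd anon RSS at kill time: ${rss_mib} MiB"   # prints 1670 MiB
```

Note that total-vm (~14 GB here, ~25 GB in the first kill) far exceeds the anon RSS, which matches an allocator holding on to many small leaked allocations rather than one large mapping.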
Comment 3 Pranith Kumar K 2015-07-28 08:51:16 EDT
This looks a lot like the memory leaks you fixed in quota. Could you please provide the patches that fixed the issue in this comment?

hi mbienek@ebay.com,
      Thanks for taking the time to log the bug. I believe the fixes should be available in the next release which should go out this week. It would be great if you could confirm those patches fix the issue for you.

Comment 4 mbienek 2015-07-28 09:13:06 EDT

thx for the info, so I'll wait for the next release. 
I'll keep you updated:) 

Comment 6 Vijaikumar Mallikarjuna 2015-07-29 04:52:47 EDT
Hi mbienek@ebay.com,

Could you please try your test with glusterfs-3.7.3 and see if the issue happens again?

glusterfs-3.7.3 was released on 28-07-2015.

Comment 7 mbienek 2015-07-29 07:21:30 EDT

After an upgrade to 3.7.3 and a reboot of the nodes (one by one), the problem looks to be fixed. I have tried 'find . -type f' on a couple of clients at the same time and the memory usage on the cluster is stable. No failed bricks so far :)


Comment 8 Pranith Kumar K 2015-07-29 07:25:47 EDT
hi Marcin,
      Thanks for verifying the bug. We are going to move the bug to VERIFIED state based on your inputs.

