Bug 1247221 - glusterfsd dies with OOM after a simple find executed on one volume
Status: VERIFIED
Product: GlusterFS
Classification: Community
Component: core
Version: mainline
Hardware: x86_64 Linux
Priority: urgent
Severity: urgent
Assigned To: Raghavendra G
Keywords: Triaged
Reported: 2015-07-27 11:00 EDT by mbienek
Modified: 2016-09-20 02:30 EDT
CC: 7 users

Fixed In Version: v3.7.3
Doc Type: Bug Fix
Type: Bug

Attachments
OOM error, statedump, some logs from the server (1.21 MB, application/x-tar)
2015-07-27 11:00 EDT, mbienek

Description mbienek 2015-07-27 11:00:31 EDT
Created attachment 1056643 [details]
OOM error,  statedump, some logs from the server

Description of problem:

- When executing a simple 'find . -type f' on a volume with around 600 dirs and 8000 files, the gluster server explodes in CPU and memory usage and finally dies with an OOM kill.

[ 9496.724134] Out of memory: Kill process 10376 (glusterfsd) score 565 or sacrifice child
[ 9496.725518] Killed process 10376 (glusterfsd) total-vm:25838340kB, anon-rss:1737572kB, file-rss:0kB


Version of GlusterFS package installed:
glusterfs-server_3.7.2-11437551431_amd64 

on Ubuntu Trusty 14.04.2:  
3.13.0-58-generic #97-Ubuntu SMP Wed Jul 8 02:56:15 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux


GlusterFS Cluster Information:
- Number of volumes: 10
- Volume on which the particular issue is seen: 1 
- Type of volumes: Replicated
- Output of gluster volume info
Volume Name: ebayk_kftp
Type: Replicate
Volume ID: 11c2ee66-a186-4136-b577-f23c9c34c500
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: glustercg47-1:/data/ebayk_kftp
Brick2: glustercg47-2:/data/ebayk_kftp
Brick3: glustercg47-3:/data/ebayk_kftp
Options Reconfigured:
nfs.disable: On
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
auth.allow: 10.38.*,10.46.*,10.47.*
performance.readdir-ahead: on

Output of gluster volume status
Attached

Get the statedump of the volume with the problem
Attached 
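
For reference, a brick statedump like the attached one can be generated with the stock gluster CLI (volume name taken from the info above; the default dump location may vary by build):

# writes one dump file per brick process, typically under /var/run/gluster/
gluster volume statedump ebayk_kftp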


Client Information: 
- OS Type: Debian
- Mount type: glusterfs (fstab options: _netdev,defaults 0 0)
- OS Version:  Wheezy 7.8 


Version-Release number of selected component (if applicable):
glusterfs-server_3.7.2-11437551431_amd64.deb on Ubuntu Trusty 14.04.2:
3.13.0-58-generic #97-Ubuntu SMP Wed Jul 8 02:56:15 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:

Steps to Reproduce:
1. Start the volume.
2. Run 'find . -type f' on a client mount of the volume (see the sketch after this list).
3. After some time the 1st gluster node dies because of an OOM kill.
4. The volume will not come back online.
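
A rough sketch of how this can be driven from a client; the actual client mount path is not given in this report, so /mnt/ebayk_kftp is only an assumption:

# mount the volume from any of the three nodes, then walk it
mount -t glusterfs glustercg47-1:/ebayk_kftp /mnt/ebayk_kftp
cd /mnt/ebayk_kftp
find . -type f > /dev/null   # ~600 dirs / ~8000 files are enough to trigger the brick-side memory growth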

Actual results:
The brick process dies because of an OOM kill.

Expected results:
The find completes and the brick process's memory usage stays stable (no OOM kill).

Additional info:

There are 3 gluster nodes running on two ESX hosts with SSD disks as the storage pool.
The problem happens when only 1 CPU and 1 GB of RAM are configured for every VM, but it also happens with 8 CPUs and 16-32 GB of RAM.
Comment 1 Atin Mukherjee 2015-07-27 23:51:43 EDT
This looks like a brick process getting OOM-killed, not glusterd. Could you confirm?
Comment 2 mbienek 2015-07-28 03:31:09 EDT
Hello, 

from what I can see it is the glusterfsd process: 

output from ps before I ran the test:
... 
root     11400  0.1 18.6 1838788 382844 ?      Ssl  Jul27   1:47 /usr/sbin/glusterfsd -s glustercg47-1 --volfile-id ebayk_kftp.glustercg47-1.data-ebayk_kftp -p /var/lib/glusterd/vols/ebayk_kftp/run/glustercg47-1-data-ebayk_kftp.pid -S /var/run/gluster/b3ab78d53ad126540462707510c617ca.socket --brick-name /data/ebayk_kftp -l /var/log/glusterfs/bricks/data-ebayk_kftp.log --xlator-option *-posix.glusterd-uuid=1473642e-57ce-48c2-83a5-2ef7cf3ffcc8 --brick-port 49159 --xlator-option ebayk_kftp-server.listen-port=49159
...

output from dmesg, after the process got killed by OS: 
...
[71127.204056] [ 7416]  0  7416   109022      154      63     7590    0 glusterfs
[71127.204058] [11400]  0 11400  3613894   427577    6819     6178    0 glusterfsd
[71127.204060] [11419]  0 11419   240928    13136     118    12316    0 glusterfs
[71127.204061] [11428]  0 11428    88779     7693      64     6052    0 glusterfs
[71127.204063] [14002]   104 14002     5714       59      15  0       0 pickup
[71127.204064] [16846]   510 16846     1852       35       9  0       0 iostat
[71127.204066] Out of memory: Kill process 11400 (glusterfsd) score 551 or sacrifice child
[71127.206009] Killed process 11400 (glusterfsd) total-vm:14455576kB, anon-rss:1710308kB, file-rss:0kB
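
For anyone triaging the same question, the OOM-killed PID can be matched against the brick pid file from the ps command line above to confirm it is the glusterfsd brick daemon (a generic check, not part of the original report):

dmesg | grep -i 'killed process'                                              # PID picked by the OOM killer
cat /var/lib/glusterd/vols/ebayk_kftp/run/glustercg47-1-data-ebayk_kftp.pid   # PID of this brick's glusterfsd
pgrep -a glusterd                                                             # management daemon PIDs, for comparison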
Comment 3 Pranith Kumar K 2015-07-28 08:51:16 EDT
Vijai,
      This looks a lot like the memory leaks you fixed in quota. Could you please provide the patches that fixed the issue in this comment?

hi mbienek@ebay.com,
      Thanks for taking the time to log the bug. I believe the fixes should be available in the next release which should go out this week. It would be great if you could confirm those patches fix the issue for you.

Pranith
Comment 4 mbienek 2015-07-28 09:13:06 EDT
Hi, 

thx for the info, so I'll wait for the next release. 
I'll keep you updated:) 

BR, 
Marcin
Comment 6 Vijaikumar Mallikarjuna 2015-07-29 04:52:47 EDT
Hi mbienek@ebay.com,

Could you please try your test with glusterfs-3.7.3 and see if the issue happens again?

glusterfs-3.7.3 was released on 28-07-2015.


Thanks,
Vijay
Comment 7 mbienek 2015-07-29 07:21:30 EDT
Hi, 

after an upgrade to 3.7.3 and a reboot of the nodes (one by one), the problem looks to be fixed. I have tried out 'find . -type f' on a couple of clients at the same time and the memory usage on the cluster is stable. No failed bricks so far :)
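
For anyone re-running this verification, brick memory can be watched during the find with standard procps tools (a generic sketch, not the exact commands used here):

watch -n 5 "ps -C glusterfsd -o pid,rss,vsz,pcpu,args --sort=-rss"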

Thanks! 

BR, 
Marcin
Comment 8 Pranith Kumar K 2015-07-29 07:25:47 EDT
hi Marcin,
      Thanks for verifying the bug. We are going to move the bug to VERIFIED state based on your inputs.

Pranith
