Bug 1247221 - glusterfsd dies with OOM after a simple find executed on one volume
Summary: glusterfsd dies with OOM after a simple find executed on one volume
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: core
Version: mainline
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Raghavendra G
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-07-27 15:00 UTC by mbienek
Modified: 2018-10-08 09:54 UTC
CC List: 8 users

Fixed In Version: v3.7.3
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-08 09:54:14 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
OOM error, statedump, some logs from the server (1.21 MB, application/x-tar)
2015-07-27 15:00 UTC, mbienek

Description mbienek 2015-07-27 15:00:31 UTC
Created attachment 1056643 [details]
OOM error, statedump, some logs from the server

Description of problem:

- When executing a simple 'find . -type f' on a volume with around 600 directories and 8000 files, the gluster server's CPU and memory usage explodes and the glusterfsd process finally dies with an OOM kill.

[ 9496.724134] Out of memory: Kill process 10376 (glusterfsd) score 565 or sacrifice child
[ 9496.725518] Killed process 10376 (glusterfsd) total-vm:25838340kB, anon-rss:1737572kB, file-rss:0kB


Version of GlusterFS package installed:
glusterfs-server_3.7.2-11437551431_amd64 

on Ubuntu Trusty 14.04.2:  
3.13.0-58-generic #97-Ubuntu SMP Wed Jul 8 02:56:15 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux


GlusterFS Cluster Information:
- Number of volumes: 10
- Volume on which the particular issue is seen: 1 
- Type of volumes: Replicated
- Output of gluster volume info
Volume Name: ebayk_kftp
Type: Replicate
Volume ID: 11c2ee66-a186-4136-b577-f23c9c34c500
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: glustercg47-1:/data/ebayk_kftp
Brick2: glustercg47-2:/data/ebayk_kftp
Brick3: glustercg47-3:/data/ebayk_kftp
Options Reconfigured:
nfs.disable: On
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
auth.allow: 10.38.*,10.46.*,10.47.*
performance.readdir-ahead: on
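
For context, option sets like the one above are normally produced through the gluster CLI. A minimal sketch of the quota-related part (the exact invocations were not included in this report, so treat them as assumptions):

  # enable directory quota accounting on the volume
  gluster volume quota ebayk_kftp enable
  # enable inode/object quota (available since 3.7)
  gluster volume inode-quota ebayk_kftp enable
  # report quota-adjusted sizes in df/statfs output
  gluster volume set ebayk_kftp features.quota-deem-statfs on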

Output of gluster volume status: Attached

Statedump of the volume with the problem: Attached
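
For reference, a brick statedump like the attached one is usually generated on the server with the gluster CLI. A sketch (the output directory may differ if server.statedump-path was changed):

  # ask the brick processes of the volume to dump their state
  gluster volume statedump ebayk_kftp
  # the dump files land in the statedump directory, /var/run/gluster by default
  ls -l /var/run/gluster/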


Client Information: 
- OS Type: Debian
- Mount type: glusterfs (fstab options: _netdev,defaults 0 0)
- OS Version:  Wheezy 7.8 


Version-Release number of selected component (if applicable):
glusterfs-server_3.7.2-11437551431_amd64.deb on Ubuntu Trusty 14.04.2:
3.13.0-58-generic #97-Ubuntu SMP Wed Jul 8 02:56:15 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:

Steps to Reproduce:
1. Start the volume 
2. Run 'find . -type f' on a client mount of the volume (see the sketch below).
3. After some time the first gluster node dies due to OOM.
4. The volume does not come back online.
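
A minimal client-side reproduction sketch, assuming a mount point of /mnt/ebayk_kftp (mount point and server name are illustrative):

  # mount the replicated volume over FUSE
  mount -t glusterfs glustercg47-1:/ebayk_kftp /mnt/ebayk_kftp
  # walk the whole tree; on the affected setup this drives brick memory up until the OOM killer fires
  cd /mnt/ebayk_kftp && find . -type f > /dev/null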

Actual results:
The glusterfsd brick process is killed by the kernel OOM killer.

Expected results:
The find completes while brick memory usage stays stable; no OOM kill.

Additional info:

There are 3 gluster nodes running on two ESX hosts with SSD disks as the storage pool.
The problem happens when every VM is configured with only 1 CPU and 1 GB of RAM, but it also happens with 8 CPUs and 16-32 GB of RAM per VM.

Comment 1 Atin Mukherjee 2015-07-28 03:51:43 UTC
This looks like the brick process being OOM-killed, not glusterd. Could you confirm?

Comment 2 mbienek 2015-07-28 07:31:09 UTC
Hello, 

From what I can see, it is the glusterfsd process:

output from ps before I ran the test:
... 
root     11400  0.1 18.6 1838788 382844 ?      Ssl  Jul27   1:47 /usr/sbin/glusterfsd -s glustercg47-1 --volfile-id ebayk_kftp.glustercg47-1.data-ebayk_kftp -p /var/lib/glusterd/vols/ebayk_kftp/run/glustercg47-1-data-ebayk_kftp.pid -S /var/run/gluster/b3ab78d53ad126540462707510c617ca.socket --brick-name /data/ebayk_kftp -l /var/log/glusterfs/bricks/data-ebayk_kftp.log --xlator-option *-posix.glusterd-uuid=1473642e-57ce-48c2-83a5-2ef7cf3ffcc8 --brick-port 49159 --xlator-option ebayk_kftp-server.listen-port=49159
...

output from dmesg, after the process got killed by OS: 
...
[71127.204056] [ 7416]  0  7416   109022      154      63     7590    0 glusterfs
[71127.204058] [11400]  0 11400  3613894   427577    6819     6178    0 glusterfsd
[71127.204060] [11419]  0 11419   240928    13136     118    12316    0 glusterfs
[71127.204061] [11428]  0 11428    88779     7693      64     6052    0 glusterfs
[71127.204063] [14002]   104 14002     5714       59      15  0       0 pickup
[71127.204064] [16846]   510 16846     1852       35       9  0       0 iostat
[71127.204066] Out of memory: Kill process 11400 (glusterfsd) score 551 or sacrifice child
[71127.206009] Killed process 11400 (glusterfsd) total-vm:14455576kB, anon-rss:1710308kB, file-rss:0kB
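
One way to cross-check that the killed PID belongs to the brick process rather than glusterd (a sketch, run before the brick gets killed):

  # lists each brick of the volume together with the PID of its glusterfsd process
  gluster volume status ebayk_kftp
  # the glusterd management daemon runs as a separate process with its own PID
  pidof glusterd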

Comment 3 Pranith Kumar K 2015-07-28 12:51:16 UTC
Vijai,
      This looks a lot like the memory leaks you fixed in quota. Could you please provide the patches that fixed the issue in this comment?

hi mbienek,
      Thanks for taking the time to log the bug. I believe the fixes should be available in the next release, which should go out this week. It would be great if you could confirm that those patches fix the issue for you.

Pranith

Comment 4 mbienek 2015-07-28 13:13:06 UTC
Hi, 

Thanks for the info; I'll wait for the next release.
I'll keep you updated :)

BR, 
Marcin

Comment 6 Vijaikumar Mallikarjuna 2015-07-29 08:52:47 UTC
Hi mbienek,

Could you please try your test with glusterfs-3.7.3 and see if the issue happens again?

glusterfs-3.7.3 was released on 28-07-2015.


Thanks,
Vijay

Comment 7 mbienek 2015-07-29 11:21:30 UTC
Hi, 

After the upgrade to 3.7.3 and a reboot of the nodes (one by one), the problem appears to be fixed. I have tried 'find . -type f' on a couple of clients at the same time and the memory usage on the cluster is stable. No failed bricks so far :)

Thanks! 

BR, 
Marcin
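
For anyone re-running this verification, brick memory can be watched on the server while the find is running. A simple sketch (interval and output fields are arbitrary choices):

  # print PID, resident and virtual memory of every glusterfsd brick process every 5 seconds
  watch -n 5 'ps -C glusterfsd -o pid,rss,vsz,args'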

Comment 8 Pranith Kumar K 2015-07-29 11:25:47 UTC
hi Marcin,
      Thanks for verifying the bug. We are going to move the bug to the VERIFIED state based on your input.

Pranith

Comment 11 Amar Tumballi 2018-10-08 09:54:14 UTC
Closing as CURRENTRELEASE (fixed in v3.7.3) as per comment #8.

