Created attachment 1056643
OOM error, statedump, some logs from the server

Description of problem:
- When executing a simple 'find . -type f' on a volume with around 600 dirs and 8000 files, gluster-server explodes with CPU and memory usage and finally dies with an OOM.

[ 9496.724134] Out of memory: Kill process 10376 (glusterfsd) score 565 or sacrifice child
[ 9496.725518] Killed process 10376 (glusterfsd) total-vm:25838340kB, anon-rss:1737572kB, file-rss:0kB

Version of GlusterFS package installed:
glusterfs-server_3.7.2-11437551431_amd64 on Ubuntu Trusty 14.04.2: 3.13.0-58-generic #97-Ubuntu SMP Wed Jul 8 02:56:15 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

GlusterFS Cluster Information:
- Number of volumes: 10
- Volume on which the particular issue is seen: 1
- Type of volumes: Replicated
- Output of gluster volume info:

Volume Name: ebayk_kftp
Type: Replicate
Volume ID: 11c2ee66-a186-4136-b577-f23c9c34c500
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: glustercg47-1:/data/ebayk_kftp
Brick2: glustercg47-2:/data/ebayk_kftp
Brick3: glustercg47-3:/data/ebayk_kftp
Options Reconfigured:
nfs.disable: On
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
auth.allow: 10.38.*,10.46.*,10.47.*
performance.readdir-ahead: on

- Output of gluster volume status: Attached
- Get the statedump of the volume with the problem: Attached

Client Information:
- OS Type: Debian
- Mount type: glusterfs _netdev,defaults 0 0
- OS Version: Wheezy 7.8

Version-Release number of selected component (if applicable):
glusterfs-server_3.7.2-11437551431_amd64.deb on Ubuntu Trusty 14.04.2: 3.13.0-58-generic #97-Ubuntu SMP Wed Jul 8 02:56:15 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:

Steps to Reproduce:
1. Start the volume
2. Run 'find . -type f'
3. After some time the 1st gluster node will die because of OOM
4. The volume will not go online

Actual results:
Dies because of OOM

Expected results:

Additional info:
There are 3 gluster nodes running on two ESX hosts with SSD disks as a storage pool. The problem happens when there is only 1 CPU and 1 GB of RAM configured for every VM, but it also happens when there are 8 CPUs and 16 - 32 GB of RAM configured.
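To quantify the memory growth while 'find . -type f' runs, the resident set size of the brick process can be sampled on the server node. The following is a minimal sketch and not part of the original report: the helper names `parse_vm_rss_kb` and `read_rss_kb` are hypothetical, and it assumes a Linux /proc/<pid>/status file that contains a `VmRSS:` line.

```python
import re
from typing import Optional


def parse_vm_rss_kb(status_text: str) -> Optional[int]:
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kb(pid: int) -> Optional[int]:
    """Read the current resident set size (kB) of a running process."""
    # Assumption: Linux procfs is mounted at /proc.
    with open(f"/proc/{pid}/status") as handle:
        return parse_vm_rss_kb(handle.read())
```

Calling `read_rss_kb` with the glusterfsd PID every few seconds during the find run would show whether the brick's memory keeps climbing or levels off.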
This looks like a brick process being OOM-killed, not glusterd. Could you confirm?
Hello,

from what I can see it is the glusterfsd process.

Output from ps before I ran the test:
...
root 11400 0.1 18.6 1838788 382844 ? Ssl Jul27 1:47 /usr/sbin/glusterfsd -s glustercg47-1 --volfile-id ebayk_kftp.glustercg47-1.data-ebayk_kftp -p /var/lib/glusterd/vols/ebayk_kftp/run/glustercg47-1-data-ebayk_kftp.pid -S /var/run/gluster/b3ab78d53ad126540462707510c617ca.socket --brick-name /data/ebayk_kftp -l /var/log/glusterfs/bricks/data-ebayk_kftp.log --xlator-option *-posix.glusterd-uuid=1473642e-57ce-48c2-83a5-2ef7cf3ffcc8 --brick-port 49159 --xlator-option ebayk_kftp-server.listen-port=49159
...

Output from dmesg, after the process got killed by the OS:
...
[71127.204056] [ 7416] 0 7416 109022 154 63 7590 0 glusterfs
[71127.204058] [11400] 0 11400 3613894 427577 6819 6178 0 glusterfsd
[71127.204060] [11419] 0 11419 240928 13136 118 12316 0 glusterfs
[71127.204061] [11428] 0 11428 88779 7693 64 6052 0 glusterfs
[71127.204063] [14002] 104 14002 5714 59 15 0 0 pickup
[71127.204064] [16846] 510 16846 1852 35 9 0 0 iostat
[71127.204066] Out of memory: Kill process 11400 (glusterfsd) score 551 or sacrifice child
[71127.206009] Killed process 11400 (glusterfsd) total-vm:14455576kB, anon-rss:1710308kB, file-rss:0kB
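The dmesg output can be cross-checked mechanically. The sketch below is illustrative only: `parse_killed_line` and `pages_to_kb` are hypothetical helper names, and it assumes that the fourth and fifth numeric columns of the per-process table are total_vm and rss counted in pages of 4 KiB, as is usual on x86_64. Under that assumption, the table row for PID 11400 agrees with the "Killed process" line.

```python
import re

PAGE_KB = 4  # assumption: 4 KiB pages, the usual size on x86_64

KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)


def parse_killed_line(line: str) -> dict:
    """Extract PID, process name and memory figures from an OOM kill line."""
    match = KILLED_RE.search(line)
    if match is None:
        raise ValueError("not an OOM 'Killed process' line")
    return {
        "pid": int(match["pid"]),
        "name": match["name"],
        "total_vm_kb": int(match["total_vm"]),
        "anon_rss_kb": int(match["anon_rss"]),
    }


def pages_to_kb(pages: int) -> int:
    """Convert a page count from the OOM process table to kB."""
    return pages * PAGE_KB
```

With the values above, 3613894 pages correspond to 14455576 kB (the reported total-vm) and 427577 pages to 1710308 kB (the reported anon-rss), which confirms that the killed process is the glusterfsd brick with PID 11400 seen in the ps output.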
Vijai,
This looks a lot like the memory leaks you fixed in quota. Could you please provide the patches that fixed the issue in this comment?

hi mbienek,
Thanks for taking the time to log the bug. I believe the fixes should be available in the next release which should go out this week. It would be great if you could confirm those patches fix the issue for you.

Pranith
Hi, thx for the info, so I'll wait for the next release. I'll keep you updated:) BR, Marcin
The patches below fix the issue in glusterfs-3.7.3:
http://review.gluster.org/#/c/11361/
http://review.gluster.org/#/c/11522/
http://review.gluster.org/#/c/11526/
http://review.gluster.org/#/c/11457/
http://review.gluster.org/#/c/11499/
http://review.gluster.org/#/c/11700/
Hi mbienek,

Could you please try your test with glusterfs-3.7.3 and see if the issue happens again? glusterfs-3.7.3 was released on 28-07-2015.

Thanks,
Vijay
Hi,

after an upgrade to 3.7.3 and a reboot of the nodes (one by one), the problem looks to be fixed. I have tried out the 'find . -type f' on a couple of clients at the same time and the memory usage on the cluster is stable. No failed bricks so far :)

Thanks!

BR, Marcin
hi Marcin, Thanks for verifying the bug. We are going to move the bug to VERIFIED state based on your inputs. Pranith
As per the comment #8