Description of problem:
I am running glusterfs 3.4.2 on linux kernel version 220.127.116.11 on two x86_64 boards with 16 GB of RAM each. I have several gluster file systems (close to 10) in two-way replicated mode containing around 4 GB of data in aggregate.
Sometimes, following a reboot of the boards, I observe that the glustershd memory percentage in top output rises above 50% (over 8 GB), causing problems when trying to run other key processes.
Version-Release number of selected component (if applicable):
linux kernel 18.104.22.168
Intermittent. Our systems reboot very frequently, and during testing we often format our disks to clean out the bricks and then add them back. So there is quite a lot of 'uncontrolled' self-heal going on on our systems.
Steps to Reproduce:
1. Remove all the bricks on one of the servers from all replicated volumes.
2. Erase the logical volumes that comprise these bricks.
3. Re-create the bricks and add them back to the replicated volumes, causing a massive heal of data.
Sometimes, maybe around once in 20-30 times, glustershd memory usage exceeds 50% (8 GB), causing other applications to fail to spawn or to terminate abruptly. The workaround is to kill glustershd and then restart /etc/init.d/glusterd so that glustershd is spawned again.
We would expect the memory usage to stay under a reasonable ceiling, say, 20%?
Please note that this bug is specifically about high memory consumption by the glusterfs self-heal daemon. I am aware that several other bugs exist in bugzilla covering generic high memory consumption by glusterfs daemons, or specific ones such as those pertaining to gluster nfs.
I took a statedump and found that the process is leaking the 'path' of the events stored in the circular buffers it uses to remember the last 1024 entries that were healed, failed to heal, or are in split-brain.
http://review.gluster.org/4790 has the fix, which lets the circular buffer take a cleanup function for freeing the data stored in it.
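To illustrate the kind of leak and fix described above, here is a minimal sketch of a fixed-size circular buffer that accepts a cleanup callback. It is not the actual libglusterfs code: the names ring_t, ring_new, ring_add, ring_free, shd_event_t, shd_event_new, shd_event_destroy and the events_destroyed counter are all hypothetical and exist only for this example. Without the callback, every overwritten slot would leak the event and its heap-allocated path.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical event record; the heap-allocated path is what would leak. */
typedef struct {
    char *path;
} shd_event_t;

/* Cleanup callback type supplied by the owner of the stored data. */
typedef void (*ring_destroy_fn)(void *data);

typedef struct {
    void          **slots;
    size_t          capacity;
    size_t          next;     /* index of the slot written next */
    ring_destroy_fn destroy;  /* may be NULL: buffer then frees nothing */
} ring_t;

ring_t *ring_new(size_t capacity, ring_destroy_fn destroy)
{
    ring_t *rb;

    if (capacity == 0)
        return NULL;
    rb = calloc(1, sizeof(*rb));
    if (!rb)
        return NULL;
    rb->slots = calloc(capacity, sizeof(void *));
    if (!rb->slots) {
        free(rb);
        return NULL;
    }
    rb->capacity = capacity;
    rb->destroy = destroy;
    return rb;
}

/* Store data, releasing whatever entry is about to be overwritten. */
void ring_add(ring_t *rb, void *data)
{
    void *old;

    assert(rb != NULL);
    old = rb->slots[rb->next];
    if (old && rb->destroy)
        rb->destroy(old);
    rb->slots[rb->next] = data;
    rb->next = (rb->next + 1) % rb->capacity;
}

/* Release every remaining entry, then the buffer itself. */
void ring_free(ring_t *rb)
{
    size_t i;

    if (!rb)
        return;
    for (i = 0; i < rb->capacity; i++)
        if (rb->slots[i] && rb->destroy)
            rb->destroy(rb->slots[i]);
    free(rb->slots);
    free(rb);
}

/* Counter used only to make the example observable. */
int events_destroyed = 0;

/* Frees the path as well as the event, which is the step the leak skipped. */
void shd_event_destroy(void *data)
{
    shd_event_t *ev = data;

    if (!ev)
        return;
    free(ev->path);
    free(ev);
    events_destroyed++;
}

shd_event_t *shd_event_new(const char *path)
{
    size_t len = strlen(path) + 1;
    shd_event_t *ev = calloc(1, sizeof(*ev));

    if (!ev)
        return NULL;
    ev->path = malloc(len);
    if (!ev->path) {
        free(ev);
        return NULL;
    }
    memcpy(ev->path, path, len);
    return ev;
}
```

In this sketch, a buffer created with ring_new(1024, shd_event_destroy) would hold at most 1024 events at a time, so memory stays bounded no matter how many entries are healed.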
Found one more 'dict' leak in metadata self-heal. This leak is present even in 3.5.x. Will be cloning this bug. Thanks a lot, Anirban, for raising the issue.
The 'dict' leak I mentioned above exists only in 3.5.x, it seems. So the only leak in 3.4.2 is the one mentioned in comment-1.
REVIEW: http://review.gluster.org/8541 (cluster/afr: Fix memory leak of file-path in self-heal-daemon) posted (#1) for review on release-3.4 by Pranith Kumar Karampuri (email@example.com)
REVIEW: http://review.gluster.org/8541 (cluster/afr: Fix memory leak of file-path in self-heal-daemon) posted (#2) for review on release-3.4 by Pranith Kumar Karampuri (firstname.lastname@example.org)
COMMIT: http://review.gluster.org/8541 committed in release-3.4 by Kaleb KEITHLEY (email@example.com)
Author: Pranith Kumar K <firstname.lastname@example.org>
Date: Tue Aug 26 12:59:47 2014 +0530
cluster/afr: Fix memory leak of file-path in self-heal-daemon
Backport of http://review.gluster.org/4790
Note: Only the part which fixes the memory leak is backported
shd event has path which needs to be freed as part of circular buffer cleanup.
This patch introduces the functionality so that self-heal-daemon can use it.
Signed-off-by: Pranith Kumar K <email@example.com>
Tested-by: Gluster Build System <firstname.lastname@example.org>
Reviewed-by: Ravishankar N <email@example.com>
Reviewed-by: Kaleb KEITHLEY <firstname.lastname@example.org>