Today we found one of our six gluster nodes in a state where a glusterfs process was consuming most of memory:

USER  PID    %CPU %MEM VSZ       RSS      TTY STAT START TIME    COMMAND
root  31918  24.1 95.4 104567776 78867272 ?   Ssl  Mar15 7606:14 \
  /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p \
  /var/lib/glusterd/glustershd/run/glustershd.pid -l \
  /var/log/glusterfs/glustershd.log -S \
  /var/run/9f67821258ca7bb33117f9c9ec46e8d3.socket --xlator-option \
  *replicate*.node-uuid=d05444b0-6034-403e-a9f9-59a7a9428d0e

At first I tried to stop gluster with "service glusterd stop", but that failed. So I had to kill that process with "kill -TERM 31918", and then kill all the other gluster processes before "service glusterd start" would work properly. I don't know what caused this.

Info:

Volume Name: pbench
Type: Distributed-Replicate
Volume ID: 688b4f86-9868-4fab-ab0e-7341404c762d
Status: Started
Snap Volume: no
Number of Bricks: 12 x 3 = 36
Transport-type: tcp
Bricks:
Brick1: gprfs001-b-10ge:/brick/pbench0-brick/pbench
Brick2: gprfs009-b-10ge:/brick/pbench0-brick/pbench
Brick3: gprfs011-b-10ge:/brick/pbench0-brick/pbench
Brick4: gprfs002-b-10ge:/brick/pbench0-brick/pbench
Brick5: gprfs010-b-10ge:/brick/pbench0-brick/pbench
Brick6: gprfs012-b-10ge:/brick/pbench0-brick/pbench
Brick7: gprfs001-b-10ge:/brick/pbench1-brick/pbench
Brick8: gprfs009-b-10ge:/brick/pbench1-brick/pbench
Brick9: gprfs011-b-10ge:/brick/pbench1-brick/pbench
Brick10: gprfs002-b-10ge:/brick/pbench1-brick/pbench.1
Brick11: gprfs010-b-10ge:/brick/pbench1-brick/pbench
Brick12: gprfs012-b-10ge:/brick/pbench1-brick/pbench
Brick13: gprfs001-b-10ge:/brick/pbench2-brick/pbench
Brick14: gprfs009-b-10ge:/brick/pbench2-brick/pbench
Brick15: gprfs011-b-10ge:/brick/pbench2-brick/pbench
Brick16: gprfs002-b-10ge:/brick/pbench2-brick/pbench
Brick17: gprfs010-b-10ge:/brick/pbench2-brick/pbench
Brick18: gprfs012-b-10ge:/brick/pbench2-brick/pbench
Brick19: gprfs001-b-10ge:/brick/pbench3-brick/pbench
Brick20: gprfs009-b-10ge:/brick/pbench3-brick/pbench
Brick21: gprfs011-b-10ge:/brick/pbench3-brick/pbench
Brick22: gprfs002-b-10ge:/brick/pbench3-brick/pbench
Brick23: gprfs010-b-10ge:/brick/pbench3-brick/pbench
Brick24: gprfs012-b-10ge:/brick/pbench3-brick/pbench
Brick25: gprfs001-b-10ge:/brick/pbench4.1-brick/pbench
Brick26: gprfs009-b-10ge:/brick/pbench4-brick/pbench
Brick27: gprfs011-b-10ge:/brick/pbench4-brick/pbench
Brick28: gprfs002-b-10ge:/brick/pbench4-brick/pbench
Brick29: gprfs010-b-10ge:/brick/pbench4-brick/pbench
Brick30: gprfs012-b-10ge:/brick/pbench4-brick/pbench
Brick31: gprfs001-b-10ge:/brick/pbench5-brick/pbench
Brick32: gprfs009-b-10ge:/brick/pbench5-brick/pbench
Brick33: gprfs011-b-10ge:/brick/pbench5-brick/pbench
Brick34: gprfs002-b-10ge:/brick/pbench5-brick/pbench
Brick35: gprfs010-b-10ge:/brick/pbench5-brick/pbench
Brick36: gprfs012-b-10ge:/brick/pbench5-brick/pbench
Options Reconfigured:
diagnostics.brick-sys-log-level: CRITICAL
performance.readdir-ahead: on
performance.io-cache: off
performance.stat-prefetch: on
cluster.lookup-unhashed: off
client.event-threads: 8
cluster.read-hash-mode: 2
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256

All hardware boxes in the cluster (6 in all) are 2-socket, 12-core, 96 GB systems, each with twelve 1 TB disks lashed together in pairs to create 6 bricks per host, for a total of 36 bricks.
This is for RHGS 3.0.4, on RHEL 6.6.
This does NOT appear to be related to https://bugzilla.redhat.com/show_bug.cgi?id=1247221, as no find operations were being performed on the local disks.
Changing the component to AFR, as it's the self-heal daemon that is consuming this amount of memory.
After restarting gluster on that host, its memory use has started growing again; it is at 16 GB now. How do I safely restart gluster at this time to avoid the memory growth causing a problem?
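One way to record where the memory is going before restarting is to capture a statedump of the self-heal daemon: glusterfs processes write a statedump when sent SIGUSR1, by default under /var/run/gluster/. Below is a minimal sketch; the `statedump_cmd` helper is my own (not a gluster tool), and it only prints the signal command rather than running it, using the glustershd pidfile path shown in the ps output above.

```shell
#!/bin/sh
# Sketch only: print (do not run) the signal command that asks a glusterfs
# process to write a statedump. SIGUSR1 is the statedump trigger; the dump
# lands under /var/run/gluster/ by default.
statedump_cmd() {
    pidfile="$1"
    if [ -r "$pidfile" ]; then
        echo "kill -USR1 $(cat "$pidfile")"
    else
        echo "pidfile not readable: $pidfile" >&2
        return 1
    fi
}

# Pidfile path for glustershd, taken from the ps output above:
# statedump_cmd /var/lib/glusterd/glustershd/run/glustershd.pid
```

Taking a dump before and another after some growth gives two snapshots whose memusage sections can be diffed to see which allocation type is climbing.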
Created attachment 1145815 [details]
First statedump

Attaching statedump #1
Created attachment 1145816 [details]
Second statedump
Note that the memory leaks seem to be stemming from gf_strdup:

[cluster/replicate.pbench-replicate-0 - usage-type 40 memusage]
type=gf_common_mt_strdup
size=2918102943
num_allocs=32222745
max_size=2918102943

[cluster/replicate.pbench-replicate-4 - usage-type 40 memusage]
type=gf_common_mt_strdup
size=2917406121
num_allocs=32216871
max_size=2917406121

[cluster/replicate.pbench-replicate-8 - usage-type 40 memusage]
type=gf_common_mt_strdup
size=2134711491
num_allocs=9109248
max_size=2134711786
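To spot the worst offenders in a large statedump, the per-type memusage records can be sorted by allocated bytes. A small sketch (the `top_allocs` helper is my own, assuming the `type=`/`size=` line layout shown in the excerpt above):

```shell
#!/bin/sh
# Sort a statedump's memusage records by allocated bytes, largest first.
# Each record has a "type=<allocator>" line followed by a "size=<bytes>"
# line; the anchored /^size=/ deliberately skips the "max_size=" line.
top_allocs() {
    awk -F= '
        /^type=/ { t = $2 }
        /^size=/ { printf "%s %s\n", $2, t }
    ' "$1" | sort -rn | head -5
}

# Usage: top_allocs /var/run/gluster/<statedump-file>
```

Run against the attached dumps, this should surface the gf_common_mt_strdup entries quoted above at or near the top.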
Peter,

In 3.0 the self-heal daemon used afr-v1, whereas from 3.1 onwards it uses afr-v2, so the code is completely different. Since we are not going to make any more releases on 3.0.x, I am closing this bug for now. Please feel free to re-open it, or open a new bug, if you face the same issue on 3.1.

Pranith