Description of problem:
On a 3 node containerized gluster cluster with brick multiplexing enabled, with 500 volumes created, started and mounted, memory consumption on the gluster node seems to be slowly rising without any IO operation being run on any of the volumes. The glusterfsd process seems to be consuming 60% of memory, i.e., 28 GB of the 48 GB of available memory. Although it is not clear whether there is actually a leak, I am filing this bug so dev can check if there is one. I've collected statedumps for one of the volumes with a gap of 2 days. I'll be attaching them shortly.

How reproducible:
Yet to try

Steps to Reproduce:
1. create a 3 node containerized gluster cluster
2. enable brick multiplexing - cluster.brick-multiplex on
3. create 500 volumes and monitor memory consumption of the glusterfsd process
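For reference, a minimal shell sketch of the steps above; the hostnames (node1..node3), brick paths, and the replica-3 layout are placeholder assumptions, not the exact setup used here:

  # enable brick multiplexing cluster-wide
  gluster volume set all cluster.brick-multiplex on

  # create and start 500 volumes
  for i in $(seq 1 500); do
      gluster volume create vol$i replica 3 \
          node1:/bricks/vol$i node2:/bricks/vol$i node3:/bricks/vol$i force
      gluster volume start vol$i
  done

  # watch resident memory of the multiplexed brick process
  ps -C glusterfsd -o pid,rss,vsz,cmd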
The memleak issue seems to be a legitimate one. When IO was started and ran for a while, memory consumption increased and stayed at the same level even after IO was stopped.
I've taken a statedump for one of the volumes once again after running IOs and attached it.
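For anyone reproducing this, a brick statedump can be triggered from the gluster CLI; by default the dump files land under /var/run/gluster/ (the filename pattern shown is approximate and may differ by version):

  gluster volume statedump vol1
  # e.g. /var/run/gluster/bricks-vol1.<pid>.dump.<timestamp>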
Looking at the differences between the statedumps, these two stand out:

  protocol/server.vol1-server gf_common_mt_inode_ctx: 4000 -> 54000
  protocol/server.vol1-server gf_common_mt_strdup:   16007 -> 66007

So, exactly 50K of each, both from protocol/server. This seems consistent with a memory leak when clients reconnect, if they do so many times, which raises two questions.

(1) Where *exactly* is the leak (or possibly two leaks)?

(2) Why do clients keep reconnecting?

The answer to the second question, unfortunately, might be that our network layer simply isn't capable of handling that many connections, creating queue effects that cause clients to time out. Can you check for that in the client logs? Or maybe for a consistent interval between disconnect/reconnect cycles?

Also, have you checked whether this happens *without* multiplexing, given the same rate of reconnections? I have a strong suspicion that it would, and that the leak has been latent for a long time until multiplexing made it visible.
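A quick sketch of how to pull those counters out of two statedumps and to look for reconnect cycles in a client log; the dump filenames and mount-log path are placeholders, and the exact log strings may vary by version:

  # compare allocation counts for the suspect memory types between two dumps
  grep -A5 'usage-type gf_common_mt_inode_ctx' dump.before dump.after | grep num_allocs
  grep -A5 'usage-type gf_common_mt_strdup'    dump.before dump.after | grep num_allocs

  # look for repeated disconnect/reconnect cycles and their timestamps
  grep -Ei 'disconnected|connected to' /var/log/glusterfs/<mount-log>.log | tail -n 40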
Hi Jeff,

Do you think one way to mitigate problem (2) mentioned in comment 5 could be implementing https://github.com/gluster/glusterfs/issues/151 ?
This bug is reported against a version of Gluster that is no longer maintained (or has been EOL'd). See https://www.gluster.org/release-schedule/ for the versions currently maintained. As a result, this bug is being closed. If the bug persists on a maintained version of gluster or against the mainline gluster repository, please request that it be reopened and the Version field be marked appropriately.
Clearing stale needinfos.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.