Bug 1467986 - possible memory leak in glusterfsd with multiplexing
Summary: possible memory leak in glusterfsd with multiplexing
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: core
Version: mainline
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: ---
Assignee: Mohammed Rafi KC
QA Contact:
URL:
Whiteboard:
Depends On: 1426291
Blocks: 1457936
 
Reported: 2017-07-05 17:48 UTC by Mohammed Rafi KC
Modified: 2017-10-26 14:35 UTC
CC: 9 users

Fixed In Version: glusterfs-3.12.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1426291
Environment:
Last Closed: 2017-09-05 17:35:44 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Mohammed Rafi KC 2017-07-05 17:48:57 UTC
+++ This bug was initially created as a clone of Bug #1426291 +++

Description of problem:
On a 3-node containerized gluster cluster with brick multiplexing enabled and 500 volumes created, started, and mounted, memory consumption on the gluster node seems to be slowly rising without any IO being run on any of the volumes. The glusterfsd process seems to be consuming 60% of memory, i.e., 28 GB of the 48 GB of available memory.

Although it is not clear whether there is actually a leak, filing this bug so developers can check if there is one.

I've collected statedumps for one of the volumes with a gap of 2 days. I'll be attaching them shortly.

How reproducible:
Yet to try

Steps to Reproduce:
1. Create a 3-node containerized gluster cluster
2. Enable brick multiplexing (cluster.brick-multiplex on)
3. Create 500 volumes and monitor memory consumption of the glusterfsd process (a scripted version is sketched below)
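
A rough scripted version of these steps, for reference only; the replica-3 layout, the host1/host2/host3 names, and the /bricks/... paths are placeholders for whatever the actual setup uses:

# enable brick multiplexing cluster-wide
gluster volume set all cluster.brick-multiplex on

# create and start 500 volumes
for i in $(seq 1 500); do
    gluster volume create vol$i replica 3 \
        host1:/bricks/vol$i host2:/bricks/vol$i host3:/bricks/vol$i force
    gluster volume start vol$i
done

# sample memory use of the multiplexed brick process every 5 minutes
while true; do
    date
    ps -C glusterfsd -o pid,rss,vsz,comm
    sleep 300
done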

--- Additional comment from krishnaram Karthick on 2017-02-23 10:58:09 EST ---

Sharing setup details:

10.70.47.29 - root/aplo
10.70.47.31 - root/aplo
10.70.46.128 - root/aplo

Gluster is run as a container on each of these nodes.

[root@dhcp47-29 ~]# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
5a9cb8a8f68f        712fe8824a9e        "/usr/sbin/init"    6 days ago          Up 6 days                               glusternode1
[root@dhcp47-29 ~]# docker exec -it 5a9cb8a8f68f /bin/bash
[root@dhcp47-29 /]# gluster pool list
UUID					Hostname                         	State
5f71e939-a65f-457b-b178-41d652d4e104	dhcp46-128.lab.eng.blr.redhat.com	Connected 
682ee5bc-9a22-4376-a15f-baa34ec30532	10.70.47.31                      	Connected 
68036246-08a0-4c27-8b5f-e4636d5c141b	localhost                        	Connected

--- Additional comment from krishnaram Karthick on 2017-02-23 11:02:19 EST ---

sosreports are available here --> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1426291/

*.dump1 was collected initially and *.dump2 was collected after a gap of 2 days.

--- Additional comment from krishnaram Karthick on 2017-03-02 08:40:09 EST ---

The memory leak seems to be a legitimate one. When IO was started and run for a while, memory consumption increased and stayed at the same level even after the IO was stopped.

--- Additional comment from krishnaram Karthick on 2017-03-02 23:23:03 EST ---

I've taken a statedump for one of the volumes once again after running IO, and attached it.

--- Additional comment from Jeff Darcy on 2017-03-09 14:25:56 EST ---

Looking at the differences between the statedumps, these two stand out:

   protocol/server.vol1-server gf_common_mt_inode_ctx: 4000 -> 54000
   protocol/server.vol1-server gf_common_mt_strdup: 16007 -> 66007

So, an increase of exactly 50K in each, both from protocol/server. This seems consistent with a memory leak when clients reconnect, if they have done so that many times, which raises two questions.

(1) Where *exactly* is the leak (or possibly two leaks)?

(2) Why do clients keep reconnecting?

The answer to the second question, unfortunately, might be that our network layer simply isn't capable of handling that many connections, creating queue effects that cause clients to time out.  Can you check for that in the client logs?  Or maybe for a consistent interval between disconnect/reconnect cycles?  Also, have you checked whether this happens *without* multiplexing, given the same rate of reconnections?  I have a strong suspicion that it would, and that the leak has been latent for a long time until multiplexing made it visible.
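
For reference, one way to pull these counters out of the dumps being compared (a sketch only: the statedump directory, typically /var/run/gluster, and the *.dump1/*.dump2 file names mentioned above depend on the setup; each memory-accounting section in a statedump carries a num_allocs counter):

# trigger a fresh statedump for one volume; files land in the statedump
# directory on each brick node
gluster volume statedump vol1

# compare allocation counts for the two suspect memory types across the
# two dumps collected two days apart
for type in gf_common_mt_inode_ctx gf_common_mt_strdup; do
    echo "== $type =="
    grep -A5 "usage-type $type " *.dump1 *.dump2 | grep num_allocs
done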

--- Additional comment from Atin Mukherjee on 2017-06-05 11:19:47 EDT ---

Hi Jeff,

Do you think one way to mitigate problem 2 mentioned in comment 5 could be implementing https://github.com/gluster/glusterfs/issues/151?

Comment 1 Worker Ant 2017-07-05 17:50:19 UTC
REVIEW: https://review.gluster.org/17709 (mgtm/core : implement sha hash function for volfile check) posted (#1) for review on master by mohammed rafi  kc (rkavunga)

Comment 2 Worker Ant 2017-07-05 19:14:24 UTC
REVIEW: https://review.gluster.org/17709 (mgtm/core : implement sha hash function for volfile check) posted (#2) for review on master by mohammed rafi  kc (rkavunga)

Comment 3 Worker Ant 2017-07-05 21:05:46 UTC
REVIEW: https://review.gluster.org/17709 (mgtm/core : implement sha hash function for volfile check) posted (#3) for review on master by mohammed rafi  kc (rkavunga)

Comment 4 Worker Ant 2017-07-06 09:03:28 UTC
REVIEW: https://review.gluster.org/17709 (mgtm/core : implement sha hash function for volfile check) posted (#4) for review on master by mohammed rafi  kc (rkavunga)

Comment 5 Worker Ant 2017-07-06 09:09:22 UTC
REVIEW: https://review.gluster.org/17709 (glusterfs/mgmt : implement sha hash function for volfile check) posted (#5) for review on master by mohammed rafi  kc (rkavunga)

Comment 6 Worker Ant 2017-07-06 09:48:03 UTC
REVIEW: https://review.gluster.org/17709 (mgtm/core : implement sha hash function for volfile check) posted (#6) for review on master by mohammed rafi  kc (rkavunga)

Comment 7 Worker Ant 2017-07-06 12:10:02 UTC
REVIEW: https://review.gluster.org/17709 (mgtm/core : implement sha hash function for volfile check) posted (#7) for review on master by mohammed rafi  kc (rkavunga)

Comment 8 Worker Ant 2017-07-06 15:21:04 UTC
REVIEW: https://review.gluster.org/17709 (mgtm/core : use sha hash function for volfile check) posted (#8) for review on master by mohammed rafi  kc (rkavunga)

Comment 9 Worker Ant 2017-07-06 20:02:14 UTC
REVIEW: https://review.gluster.org/17709 (mgtm/core : use sha hash function for volfile check) posted (#9) for review on master by mohammed rafi  kc (rkavunga)

Comment 10 Worker Ant 2017-07-06 20:06:47 UTC
REVIEW: https://review.gluster.org/17709 (mgtm/core : use sha hash function for volfile check) posted (#10) for review on master by mohammed rafi  kc (rkavunga)

Comment 11 Worker Ant 2017-07-07 07:02:30 UTC
REVIEW: https://review.gluster.org/17709 (mgtm/core : use sha hash function for volfile check) posted (#11) for review on master by mohammed rafi  kc (rkavunga)

Comment 12 Worker Ant 2017-07-10 05:07:15 UTC
COMMIT: https://review.gluster.org/17709 committed in master by Pranith Kumar Karampuri (pkarampu) 
------
commit f2f3d74c835b68ad9ec63ec112870829a823a1fb
Author: Mohammed Rafi KC <rkavunga>
Date:   Thu Jul 6 13:26:42 2017 +0530

    mgtm/core : use sha hash function for volfile check
    
    We are storing the entire volfile and using this to check for
    volfile changes. With brick multiplexing there will be a lot of
    graphs per process, which increases the memory footprint of the
    process. So instead of storing the entire graph we can store a
    sha256 hash of it and compare hashes to see whether the volfile
    changed or not.
    
    Also, with brick multiplexing, direct comparison of the volfile
    is not correct. There are two problems.
    
    Problem 1:
    
    We are currently storing one single graph (the last updated
    volfile), whereas what we need is the entire graph with all
    attached bricks.
    
    If we fix this issue, we hit the second problem.
    
    Problem 2:
    With multiplexing we have a graph that contains multiple bricks.
    But what we check as part of reconfigure is a comparison of the
    entire graph against one single graph, which will always fail.
    
    Solution:
    We create a list in glusterfs_ctx_t that stores the sha256 hash
    of each individual brick graph. When a graph change happens we
    compare the stored hash with the current hash. If the hashes
    match, there is no need to reconfigure. Otherwise we first do
    the reconfigure and then update the hash.
    
    For now, gfapi has not been changed this way, meaning that when
    a gfapi volfile fetch or reconfigure happens, we still store the
    entire graph and compare it in memory.
    
    This is fine, because libgfapi will not load brick graphs. But
    changing libgfapi the same way would make the code similar in
    both glusterfsd-mgmt and api, and would also help reduce some
    memory.
    
    Change-Id: I9df917a771a52b95622ab8f63af34ec390163a77
    BUG: 1467986
    Signed-off-by: Mohammed Rafi KC <rkavunga>
    Reviewed-on: https://review.gluster.org/17709
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Atin Mukherjee <amukherj>
    Reviewed-by: Amar Tumballi <amarts>
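
A rough illustration of the idea outside the daemon (the actual change keeps the hashes in a list on glusterfs_ctx_t and works on in-memory graphs; the state directory and file names below are invented for the example): instead of retaining the full text of each brick volfile for comparison, keep only its sha256 and compare hashes on the next fetch.

# toy model of the volfile check: hash-compare instead of full-text compare
STATE_DIR=/tmp/volfile-hashes           # hypothetical location
mkdir -p "$STATE_DIR"

check_volfile() {                       # $1 = brick id, $2 = fetched volfile
    new=$(sha256sum "$2" | awk '{print $1}')
    old=$(cat "$STATE_DIR/$1.sha256" 2>/dev/null)
    if [ "$new" = "$old" ]; then
        echo "$1: volfile unchanged, skipping reconfigure"
    else
        echo "$1: volfile changed, reconfiguring"
        # as in the commit message: reconfigure first, then record the hash
        printf '%s\n' "$new" > "$STATE_DIR/$1.sha256"
    fi
}

# example: check_volfile vol1-brick1 /path/to/fetched/volfile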

Comment 13 Shyamsundar 2017-09-05 17:35:44 UTC
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.12.0, please open a new bug report.

glusterfs-3.12.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-September/000082.html
[2] https://www.gluster.org/pipermail/gluster-users/

