Bug 1311377 - Memory leak in glusterd
Summary: Memory leak in glusterd
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 3.7.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: urgent
Target Milestone: ---
Assignee: Gaurav Kumar Garg
QA Contact:
URL:
Whiteboard:
Depends On: 1287517
Blocks:
 
Reported: 2016-02-24 05:46 UTC by Gaurav Kumar Garg
Modified: 2016-06-05 23:38 UTC
CC List: 7 users

Fixed In Version: glusterfs-3.7.9
Doc Type: Bug Fix
Doc Text:
Clone Of: 1287517
Environment:
Last Closed: 2016-04-19 07:20:19 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Gaurav Kumar Garg 2016-02-24 05:46:31 UTC
+++ This bug was initially created as a clone of Bug #1287517 +++

Description of problem:
I have a deployment with 3 nodes and 6 bricks.
Tried changing the glusterfs version, but the problem persists.
The glusterd process's memory consumption is increasing constantly.
Using top to check its RSS values.
On a fresh installation started with 27M.
After 3 hours it was 71M.
Today is 515M.
The attached Valgrind report shows some "possibly lost" records.
Tried echo 2 > /proc/sys/vm/drop_caches, but it had no effect on memory consumption.
Consumption seems to increase when the self-heal daemon is stopped.
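
A minimal way to run the checks described above (assumed commands, not quoted from the report):

# sample the glusterd resident set size (in KiB)
ps -o rss= -p "$(pidof glusterd)"

# drop reclaimable kernel caches (dentries/inodes); this cannot release
# memory leaked on the glusterd heap, which is why it had no effect here
echo 2 > /proc/sys/vm/drop_caches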

Version-Release number of selected component (if applicable):
Tried with 3.6.2, 3.7.5 and 3.7.6

How reproducible:
Install a 3 node cluster.
glusterd on the storage master keeps increasing its memory consumption.

Additional info:

--- Additional comment from Atin Mukherjee on 2015-12-02 04:52:08 EST ---

(In reply to kmak from comment #0)
> Created attachment 1101371 [details]
> valgrind report
> 
> Description of problem:
> I have a deployment with 3 nodes and 6 bricks.
> Tried to change glusterfs version but problem persists.
What command did you run here? Could you please take a statedump of the glusterd process by executing 'kill -USR1 $(pidof glusterd)'? It will generate a file in /var/run/gluster. Also please attach *glusterd.log and cmd_history.log.

> glusterd process memory consumption is increasing constantly.
> Using top to check its RSS values.
> On a fresh installation started with 27M.
> After 3 hours it was 71M.
> Today is 515M.
> Valgrind file attaches shows some possibly lost records.
> Tried to echo 2 at /proc/sys/vm/drop_caches but no effect in memory
> consumption.
> Consumption seems to increase if self-heal daemon is closed.
> 
> Version-Release number of selected component (if applicable):
> Tried with 3.6.2, 3.7.5 and 3.7.6
> 
> How reproducible:
> Install a 3 node cluster.
> glusterd in storage master is increasing memory consumption.
> 
> Additional info:

--- Additional comment from kmak on 2015-12-02 05:46 EST ---

Logs and statedump of glusterd

--- Additional comment from kmak on 2015-12-02 05:47:04 EST ---

Thanks for the fast response.

--- Additional comment from kmak on 2015-12-04 03:24:10 EST ---

Do we have any updates on this?

--- Additional comment from Atin Mukherjee on 2015-12-06 23:03:40 EST ---

(In reply to kmak from comment #4)
> Do we have any updates on this?

You should hear something from us in a couple of days. We haven't had a chance to look at it yet.

--- Additional comment from Vijay Bellur on 2015-12-09 09:44:14 EST ---

REVIEW: http://review.gluster.org/12927 (glusterd: fixing few memory leak in glusterd) posted (#1) for review on master by Gaurav Kumar Garg (ggarg)

--- Additional comment from Gaurav Kumar Garg on 2015-12-10 02:24:22 EST ---

Hi kmak,

The statedump + log files you attached are in a binary file; it is not in a human-readable format. Could you attach the correct statedump + log files?

One more thing:

Did you check the memory leak of glusterd, of glusterfsd, or of both?

Could you tell us how you measured this memory leak?

--- Additional comment from kmak on 2015-12-10 02:45 EST ---



--- Additional comment from kmak on 2015-12-10 02:47 EST ---



--- Additional comment from kmak on 2015-12-10 02:47 EST ---



--- Additional comment from kmak on 2015-12-10 02:48 EST ---



--- Additional comment from kmak on 2015-12-10 02:50:39 EST ---

Hello, 
Yes sorry I uploaded them initially as a zip file.
Checking was done through top, process was glusterd.
glusterfs processes were using a stable amount of memory if I remember correctly.
Is the review mentioned above (http://review.gluster.org/12927) related to my bug?

--- Additional comment from Vijay Bellur on 2015-12-10 06:53:10 EST ---

REVIEW: http://review.gluster.org/12927 (glusterd: fixing few memory leak in glusterd) posted (#2) for review on master by Gaurav Kumar Garg (ggarg)

--- Additional comment from Gaurav Kumar Garg on 2015-12-10 07:06:22 EST ---

Hi kmak,

Yes, there is definitely a memory leak, but it is on the order of KB, not MB. I created 6 distributed volumes on a 2-node cluster and, after executing 1000 gluster commands, I saw only about a 10 KB leak. I am wondering how the leak reaches MBs on your system.

I will analyse this issue further.

--- Additional comment from Roman Tereshonkov on 2015-12-10 07:52:30 EST ---

I observed that the memory leak can be amplified by running continuously:
for ((N=0;N<2000;N++)) ; do gluster volume status all ; done

This brings the leak rate to the order of 1 MB/min.


In addition, with some hard-coded instrumentation I saw cases where the dict_t refcount was -1.

--- Additional comment from Gaurav Kumar Garg on 2015-12-10 08:09:40 EST ---

I have just executed the loop from 0 to 2000 and saw about a 22 KB memory leak. I will check that 22 KB leak further; I am not sure how your system can leak 1 MB/min. Will check it further.

--- Additional comment from  on 2015-12-10 08:27:05 EST ---

Hi! 

I have the same problem on my Debian (Jessie) system!

I have tested my cluster (Distributed-Replicated, 2x2 brick) with the following bash script, run from 5 clients simultaneously:

#!/bin/bash

while true;
do

        for (( c=1; c<=100; c++ ))
        do
                dd if=/dev/zero of=/mnt/vol_www/960b_${HOSTNAME}_$c.txt bs=960 count=1
        done

        sleep 3

        rm -f /mnt/vol_www/960b_$HOSTNAME*

        sleep 3

done

The consumed memory grows very quickly!
(More than 1MB/min)

--- Additional comment from  on 2015-12-10 09:32:56 EST ---

Hi!

I have an update!

I did "echo 2 > /proc/sys/vm/drop_caches" and the memory reduced back to "normal" level!

Also tried something similar to comment 15 (gluster volume status)!

My test was:

on console 1: watch -n 0.1 'gluster volume status'
on console 2: watch -n 0.1 'gluster volume status vol_one clients'

It was running for 15-20 minutes!

After that I did "echo 2 > /proc/sys/vm/drop_caches" again, but
the memory did not reduce back to the "normal" level this time!

--- Additional comment from kmak on 2015-12-15 05:45:04 EST ---

Do you have any findings on this? 
Can you tell us what is the best way to debug it?

--- Additional comment from Atin Mukherjee on 2015-12-15 07:25:02 EST ---

(In reply to kmak from comment #19)
> Do you have any findings on this? 
> Can you tell us what is the best way to debug it?

We haven't found the RCA of this yet, as it is not reproducible at our end. We have identified a few places in the code that could lead to small memory leaks, but nothing as severe as what you have reported.

One more follow-up question: you mentioned that you tried to change the version without success. Did you mean that you tried to downgrade?

--- Additional comment from kmak on 2015-12-15 09:56:16 EST ---

We tried the same configuration with GlusterFS 3.6.2, 3.7.5 and 3.7.6.
So, do you think that this could be a configuration issue?
We surely have a lot of network activity and a lot of nodes.

--- Additional comment from Atin Mukherjee on 2015-12-15 22:39:11 EST ---

(In reply to kmak from comment #21)
> We tried the same configuration with GlusterFS 3.6.2, 3.7.5 and 3.7.6.
> So, do you think that this could be a configuration issue?
> We surely have a lot of network activity and a lot of nodes.

If you mean that you are hitting the memory leak across different versions, then I agree with you that the configuration could be a suspect. Can you upload the complete sosreport of the node where you see the memory leak? Are you sure you have not modified any configuration in glusterd.vol?

--- Additional comment from Roman Tereshonkov on 2015-12-16 04:01:51 EST ---

The problem is reproducible on glusterfs 3.7.6 using a simple two-node cluster.
The proposed fix http://review.gluster.org/#/c/12927/ has no effect.

The setup:
1. any two nodes with network communication (node1, node2);
2. glusterfs 3.7.6 (./configure --prefix=/usr --libdir="/usr/lib64" --sysconfdir=/etc --localstatedir=/var && make && make install)
3. rdma is not used (is it critical or not ?). glusterd.vol: option transport-type socket
4. simple volume configuration for nodes:
     mkdir -p /brick/brick1
     gluster peer probe node1
     gluster volume create MYVOL replica 2 node1:/brick/brick1 node2:/brick/brick1 force
     gluster volume start MYVOL
     mount -t glusterfs node1:MYVOL /mnt/test (indeed I do not know if mounting is needed or not to see the bug, not tested yet)
5. for ((N=0;N<20000;N++)) ; do gluster volume status all ; done
6. "top" to observe glusterd RSS memory growing (and it grows in both nodes)


NB: If you use only one node with this setup, the problem is not reproduced.
    It looks like two nodes communicating over the network is the important precondition.
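
A simple way to quantify step 6 above (an assumed helper, not part of the original report): sample the glusterd RSS once a minute in a second terminal while the loop from step 5 runs.

while true; do
        echo "$(date +%T) $(ps -o rss= -p "$(pidof glusterd)") KiB"
        sleep 60
done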

--- Additional comment from kmak on 2015-12-17 04:32:41 EST ---

Hello,
the problem is verified from my side as well (my colleague Roman's comment 23 is also valid), on my laptop (RHEL 7.2) with two VMs (node1, node2).

Description of my setup:
1) Create two nodes connected to a common network.
2) Create two extra partitions, format them to ext4 and mount them to /data/brick/gv0/gv in both servers
3) Download glusterfs repo from here: http://download.gluster.org/pub/gluster/glusterfs/LATEST/EPEL.repo/glusterfs-epel.repo & install glusterfs-server
4) gluster peer probe node1/node2 to both nodes
5) gluster volume create gv0 replica 2 node1:/data/brick/gv0/gv node2:/data/brick/gv0/gv 
6) gluster volume start gv0
7) Run "gluster volume status all" x 2000 times every 30sec
8) See glusterd memory (RSS taken from ps aux) grow constantly at about 1 MB/min

I will attach a result log with the first runs.
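
The repeated runs in step 7 can be scripted as a single loop (an assumed equivalent, reading step 7 as 2000 runs spaced 30 seconds apart):

for ((i=0; i<2000; i++)); do
        gluster volume status all
        sleep 30
done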

--- Additional comment from kmak on 2015-12-17 04:35 EST ---



--- Additional comment from Roman Tereshonkov on 2015-12-22 09:42 EST ---

Fix glusterd 3.7.6 memory leaks in the serialized buffer.

Memory is allocated by the dict_allocate_and_serialize function and the
pointer is assigned to the req.dict.dict_val variable.
On exit this memory has to be freed.

These leaks can be observed by running:
gluster volume status all
The leaks are on the order of ~2 KB per command run.
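
A schematic of the leak pattern being fixed (a simplified sketch, not the literal patch; the callers are the gd_syncop_mgmt_v3_lock/unlock functions named in the commit message quoted later in this report, and the request type and op_ctx names here are assumptions for illustration):

        gd1_mgmt_v3_lock_req  req = {{0},};
        int                   ret = -1;

        /* serializes op_ctx into a freshly allocated buffer and stores the
           pointer in req.dict.dict_val */
        ret = dict_allocate_and_serialize (op_ctx, &req.dict.dict_val,
                                           &req.dict.dict_len);
        if (ret)
                goto out;

        /* ... submit the RPC request ... */

out:
        /* the fix: free the serialized buffer on every exit path; without
           this each lock/unlock call leaks one serialized dict (~2 KB) */
        GF_FREE (req.dict.dict_val);
        return ret;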

--- Additional comment from Roman Tereshonkov on 2015-12-22 09:52:29 EST ---

The attached patch fixes the problem partially.
It reduces the memory leaks by about 5 times.
I did not observe any visible leaks on the host sending the "gluster volume status all" command,
but the receiving host still has some small leaks.

I suspect the problem is somewhere in the glusterd-op-sm.c file.
The volume status command initiates two transactions.
For each transaction the glusterd_set_txn_opinfo function is called and memory is allocated:
opinfo_obj = GF_CALLOC (1, sizeof(glusterd_txn_opinfo_obj), gf_common_mt_txn_opinfo_obj_t);
On finishing, the glusterd_clear_txn_opinfo function is called to free the allocated memory, but this happens only once, for the second transaction.
So the memory allocated for the first transaction is never freed.

Can anybody check this and review?

--- Additional comment from Gaurav Kumar Garg on 2015-12-22 13:22:09 EST ---

Hi Roman,

Thanks for your efforts.

I will check it and get back to you.

~Gaurav

--- Additional comment from Gaurav Kumar Garg on 2015-12-23 02:37:08 EST ---

Hi Roman,

Thank you for this finding. That actually solves the memory leak problem. I have tested it and found that there are only a few bytes of memory leaked after this patch.

Regarding your suspicion about glusterd-op-sm.c: volume status does initiate a transaction and call the glusterd_set_txn_opinfo function, and glusterd_clear_txn_opinfo is called the same number of times.

--- Additional comment from Vijay Bellur on 2015-12-23 03:46:47 EST ---

REVIEW: http://review.gluster.org/12927 (glusterd: fixing few memory leak in glusterd) posted (#3) for review on master by Gaurav Kumar Garg (ggarg)

--- Additional comment from Roman Tereshonkov on 2015-12-23 03:48:04 EST ---

> regarding your suspension in glusterd-op-sm.c file volume status is initiate transaction and call glusterd_set_txn_opinfo function and it call the glusterd_clear_txn_opinfo same time.

According to the logs, glusterd_clear_txn_opinfo is not called for the first transaction.
This chunk of log was created on the local host by running volume status on the remote host.
As you can see, "cleared opinfo" appears only for the second transaction.
What about your case? Do you see two "cleared opinfo" entries in your log?

[2015-12-22 11:09:01.321481] D [MSGID: 0] [glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully set opinfo for transaction ID : e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4
[2015-12-22 11:09:01.328820] D [MSGID: 0] [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got opinfo for transaction ID : e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4
[2015-12-22 11:09:01.343665] D [MSGID: 0] [glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully set opinfo for transaction ID : e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4
[2015-12-22 11:09:01.356197] D [MSGID: 0] [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got opinfo for transaction ID : e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4
[2015-12-22 11:09:01.366702] D [MSGID: 0] [glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully set opinfo for transaction ID : e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4
[2015-12-22 11:09:01.371448] D [MSGID: 0] [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got opinfo for transaction ID : e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4
[2015-12-22 11:09:01.382440] D [MSGID: 0] [glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully set opinfo for transaction ID : e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4
[2015-12-22 11:09:01.433093] D [MSGID: 0] [glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully set opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.440280] D [MSGID: 0] [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.452016] D [MSGID: 0] [glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully set opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.468647] D [MSGID: 0] [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.469808] D [MSGID: 0] [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.472555] D [MSGID: 0] [glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully set opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.477988] D [MSGID: 0] [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.482711] D [MSGID: 0] [glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully set opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.484789] D [MSGID: 0] [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.496658] D [MSGID: 0] [glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully set opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.504998] D [MSGID: 0] [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.509757] D [MSGID: 0] [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.511833] D [MSGID: 0] [glusterd-op-sm.c:360:glusterd_clear_txn_opinfo] 0-management: Successfully cleared opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556

--- Additional comment from Vijay Bellur on 2015-12-23 03:54:43 EST ---

REVIEW: http://review.gluster.org/12927 (glusterd: fixing few memory leak in glusterd) posted (#4) for review on master by Gaurav Kumar Garg (ggarg)

--- Additional comment from Gaurav Kumar Garg on 2015-12-23 04:18:31 EST ---

(In reply to Roman Tereshonkov from comment #31)
> > regarding your suspension in glusterd-op-sm.c file volume status is initiate transaction and call glusterd_set_txn_opinfo function and it call the glusterd_clear_txn_opinfo same time.
> 
> According to the logs glusterd_clear_txn_opinfo is not called for the first
> transacation.
> This chunk of log created on local host by running volume status on remote
> host.
> As you can see "cleared opinfo" is called only for the second transaction.
> What is your case. Do you see two "cleared opinfo" in your log?
> 
> [2015-12-22 11:09:01.321481] D [MSGID: 0]
> [glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully
> set opinfo for transaction ID : e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4
> [2015-12-22 11:09:01.328820] D [MSGID: 0]
> [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully
> got opinfo for transaction ID : e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4
> [2015-12-22 11:09:01.343665] D [MSGID: 0]
> [glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully
> set opinfo for transaction ID : e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4
> [2015-12-22 11:09:01.356197] D [MSGID: 0]
> [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully
> got opinfo for transaction ID : e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4
> [2015-12-22 11:09:01.366702] D [MSGID: 0]
> [glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully
> set opinfo for transaction ID : e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4
> [2015-12-22 11:09:01.371448] D [MSGID: 0]
> [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully
> got opinfo for transaction ID : e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4
> [2015-12-22 11:09:01.382440] D [MSGID: 0]
> [glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully
> set opinfo for transaction ID : e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4
> [2015-12-22 11:09:01.433093] D [MSGID: 0]
> [glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully
> set opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
> [2015-12-22 11:09:01.440280] D [MSGID: 0]
> [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully
> got opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
> [2015-12-22 11:09:01.452016] D [MSGID: 0]
> [glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully
> set opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
> [2015-12-22 11:09:01.468647] D [MSGID: 0]
> [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully
> got opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
> [2015-12-22 11:09:01.469808] D [MSGID: 0]
> [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully
> got opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
> [2015-12-22 11:09:01.472555] D [MSGID: 0]
> [glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully
> set opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
> [2015-12-22 11:09:01.477988] D [MSGID: 0]
> [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully
> got opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
> [2015-12-22 11:09:01.482711] D [MSGID: 0]
> [glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully
> set opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
> [2015-12-22 11:09:01.484789] D [MSGID: 0]
> [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully
> got opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
> [2015-12-22 11:09:01.496658] D [MSGID: 0]
> [glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully
> set opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
> [2015-12-22 11:09:01.504998] D [MSGID: 0]
> [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully
> got opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
> [2015-12-22 11:09:01.509757] D [MSGID: 0]
> [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully
> got opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
> [2015-12-22 11:09:01.511833] D [MSGID: 0]
> [glusterd-op-sm.c:360:glusterd_clear_txn_opinfo] 0-management: Successfully
> cleared opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556

Hi Roman,

In my test case I was using gdb. I attached the glusterd process in gdb, put a breakpoint in the two functions you mentioned (glusterd_set_txn_opinfo and glusterd_clear_txn_opinfo), and executed the gluster volume status command. I saw that both functions were called the same number of times.

--- Additional comment from Vijay Bellur on 2015-12-23 04:27:24 EST ---

REVIEW: http://review.gluster.org/12927 (glusterd: fixing few memory leak in glusterd) posted (#5) for review on master by Gaurav Kumar Garg (ggarg)

--- Additional comment from Roman Tereshonkov on 2015-12-23 04:42:06 EST ---

> i my test case was using gdb. i just put glusterd process in gdb and put a break point in these two function that you have mentioned (glusterd_set_txn_opinfo and glusterd_clear_txn_opinfo) and executed gluster volume status command. i saw that both function called same number of time.


If it is not confidential, can you attach the chunk of the debug log /var/log/glusterfs/etc-glusterfs-glusterd.vol.log (--log-level=DEBUG) from both the local and remote hosts for one run of the command "gluster volume status"?

--- Additional comment from Gaurav Kumar Garg on 2015-12-23 07:29:43 EST ---

Hi Roman,

Yes, I still see some memory leak on the remote side; there is no memory leak on the originator node side. The RCA seems to be that set_txn_opinfo is called more times than clear_txn_opinfo. Will update you about it soon.

--- Additional comment from Gaurav Kumar Garg on 2015-12-24 01:28:27 EST ---

Hi Roman,

There is one update regarding this.

If, from the glusterd logs, you are inferring that memory is leaked simply because glusterd_set_txn_opinfo is called many more times than glusterd_clear_txn_opinfo, that is wrong. It is true that glusterd_set_txn_opinfo is called more often on the remote node than glusterd_clear_txn_opinfo, but that does not mean it allocates memory every time. From the function (glusterd_set_txn_opinfo) you can see that it allocates memory only if ret is non-zero (meaning it was unable to get the txn opinfo from the dictionary), and that happens only once, at the start of a transaction.

If you apply the http://review.gluster.org/12927 patch on all the nodes in the cluster, then you will see only a few bytes of memory leak.

I am checking further why those few bytes still leak. Will update you.

--- Additional comment from Roman Tereshonkov on 2015-12-28 04:06:28 EST ---

About http://review.gluster.org/#/c/12927/

Why have you moved GF_FREE (req.dict.dict_val) before "out:"?
If an error occurs after GD_ALLOC_COPY_UUID, the allocated memory leaks.
In the patch I proposed, it was after the "out:" label.

--- Additional comment from Vijay Bellur on 2015-12-29 01:42:14 EST ---

REVIEW: http://review.gluster.org/12927 (glusterd: fixing few memory leak in glusterd) posted (#6) for review on master by Gaurav Kumar Garg (ggarg)

--- Additional comment from Roman Tereshonkov on 2015-12-29 07:35:23 EST ---

> if by looking into glusterd log's you are saying that memory allocated by glusterd_set_txn_opinfo call too many number of time compare to glusterd_clear_txn_opinfo then its wrong. its true that glusterd_set_txn_opinfo calling more number of time at remote node compare to glusterd_clear_txn_opinfo but it doesn't mean that every time its allocate memory. from the function (glusterd_set_txn_opinfo) you can see that if ret is non zero (means its unable to get txn opinfo from the dictionary) then only it will allocate memory and it happen only one's at the starting of transaction. 


By transaction I mean a sequence of commands with the same transaction id.

As I mentioned earlier every run of "volume status all" generates two different transactions with different transaction ids.
glusterd_set_txn_opinfo is called with a "transaction id" as argument and allocates an opinfo whenever there is no opinfo for that transaction in the dictionary.
glusterd_clear_txn_opinfo is called with a "transaction id" as argument and frees the opinfo with that id from the glusterd_txn_opinfo dictionary.

In our case the glusterd_clear_txn_opinfo function is called only once with "transaction id" corresponding to the second transaction.

As proof, you can add "dict_dump_to_log(priv->glusterd_txn_opinfo);" at the end of the glusterd_clear_txn_opinfo function to see the contents of the dictionary.
You will see that after every run of "volume status all" the glusterd_txn_opinfo dictionary grows, keeping the never-freed opinfo entries identified by each run's first transaction id.
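
A minimal sketch of that instrumentation (placement is schematic; the exact spot inside glusterd-op-sm.c is an assumption):

/* at the end of glusterd_clear_txn_opinfo (), just before returning, dump
   the opinfo dictionary so any stale entries become visible in the log */
dict_dump_to_log (priv->glusterd_txn_opinfo);

With this in place, repeated runs of "gluster volume status all" should show the dictionary retaining one extra entry per run: the opinfo of each run's first transaction.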


By the way, if you run "volume status VOLNAME" with an existing VOLNAME, only one transaction is generated and there appear to be no memory leaks.

--- Additional comment from Roman Tereshonkov on 2015-12-30 06:29:56 EST ---

It looks like it is a functional problem.

The command "gluster volume status all" generates a series of transactions identified by different ids.
The very first one fetches the volume names known to the remote host's glusterfs server.
The other transactions are issued per volume name and include locking/unlocking phases.
It is the unlocking phase that marks the end of a transaction, at which point all allocated resources are to be freed.
There is no code path that calls the glusterd_clear_txn_opinfo function except the unlock or unlock_all paths.
Thus, there is no way for the very first transaction to know when it has ended, so glusterd_clear_txn_opinfo is never called for it and its opinfo is never freed.

This is my analysis; it needs to be confirmed by you.
Logically it is now a different issue from the one we have already fixed.

Should I open a new bug for this functional problem?

Do you see any simple way to fix it?

--- Additional comment from Gaurav Kumar Garg on 2015-12-30 06:37:05 EST ---

Hi Roman, 

Yes, it is a functional problem with the "gluster volume status all" command, but it needs to be fixed. Your analysis is correct. Will update you about it. No need to raise a separate bug for this as of now.

--- Additional comment from kmak on 2016-02-02 09:23:02 EST ---

Hello,
Is there any update for this bug?
Have you implemented any fix for it, and if so, in which version?
Thanks

--- Additional comment from Gaurav Kumar Garg on 2016-02-04 05:26:19 EST ---

Hi Kmak,

Patch http://review.gluster.org/12927 fixes most of the memory leak, but some leak still remains; it seems to be in our op-sm framework. This patch is not yet available in the latest GlusterFS release.

--- Additional comment from Vijay Bellur on 2016-02-22 06:27:38 EST ---

REVIEW: http://review.gluster.org/12927 (glusterd: fixing few memory leak in glusterd) posted (#7) for review on master by Gaurav Kumar Garg (ggarg)

--- Additional comment from Roman Tereshonkov on 2016-02-22 06:33:39 EST ---

Has that two-line memory leak fix patch been accepted into upstream 3.7?

What is the status of the memory leak caused by the functional issue? Is there any progress on fixing the "gluster volume status all" issue?

--- Additional comment from Gaurav Kumar Garg on 2016-02-23 01:52:44 EST ---

Hi Roman,

The patch is still in review. It has not been merged upstream yet.

--- Additional comment from Vijay Bellur on 2016-02-23 23:19:15 EST ---

COMMIT: http://review.gluster.org/12927 committed in master by Jeff Darcy (jdarcy) 
------
commit e38bf1bdeda3c7a89be3193ad62a72b9139358dd
Author: Gaurav Kumar Garg <garg.gaurav52>
Date:   Wed Dec 9 20:12:17 2015 +0530

    glusterd: fixing few memory leak in glusterd
    
    Current glusterd code base having memory leak. This is because of
    memory allocate by dict_allocate_and_serialize function in
    "gd_syncop_mgmt_v3_lock" and "gd_syncop_mgmt_v3_unlock"
    function is not freeing up meory upon exit.
    
    Fix is to free the memory after exit of the above function.
    
    Thanx Carlos and Roman for finding out the issue and fix.
    
    Change-Id: Id67aa794c84969830ca7ea8c2374f80c64d7a639
    BUG: 1287517
    Signed-off-by: Gaurav Kumar Garg <ggarg>
    Signed-off-by: Carlos Chinea <carlos.chinea>
    Signed-off-by: Roman Tereshonkov <roman.tereshonkov>
    Reviewed-on: http://review.gluster.org/12927
    Smoke: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.com>
    Reviewed-by: Jeff Darcy <jdarcy>

Comment 1 Vijay Bellur 2016-02-24 05:48:04 UTC
REVIEW: http://review.gluster.org/13503 (glusterd: fixing few memory leak in glusterd) posted (#1) for review on release-3.7 by Gaurav Kumar Garg (ggarg)

Comment 2 Vijay Bellur 2016-02-29 09:04:43 UTC
REVIEW: http://review.gluster.org/13503 (glusterd: fixing few memory leak in glusterd) posted (#2) for review on release-3.7 by Gaurav Kumar Garg (ggarg)

Comment 3 Vijay Bellur 2016-03-03 07:30:28 UTC
REVIEW: http://review.gluster.org/13503 (glusterd: fixing few memory leak in glusterd) posted (#3) for review on release-3.7 by Gaurav Kumar Garg (ggarg)

Comment 4 Vijay Bellur 2016-03-04 04:25:13 UTC
REVIEW: http://review.gluster.org/13503 (glusterd: fixing few memory leak in glusterd) posted (#4) for review on release-3.7 by Atin Mukherjee (amukherj)

Comment 5 Vijay Bellur 2016-03-07 06:43:25 UTC
REVIEW: http://review.gluster.org/13503 (glusterd: fixing few memory leak in glusterd) posted (#5) for review on release-3.7 by Gaurav Kumar Garg (ggarg)

Comment 6 Vijay Bellur 2016-03-08 04:01:28 UTC
REVIEW: http://review.gluster.org/13503 (glusterd: fixing few memory leak in glusterd) posted (#6) for review on release-3.7 by Gaurav Kumar Garg (ggarg)

Comment 7 Vijay Bellur 2016-03-09 07:30:26 UTC
REVIEW: http://review.gluster.org/13503 (glusterd: fixing few memory leak in glusterd) posted (#7) for review on release-3.7 by Gaurav Kumar Garg (ggarg)

Comment 8 Vijay Bellur 2016-03-09 19:32:30 UTC
REVIEW: http://review.gluster.org/13503 (glusterd: fixing few memory leak in glusterd) posted (#8) for review on release-3.7 by Vijay Bellur (vbellur)

Comment 9 Vijay Bellur 2016-03-10 15:20:09 UTC
COMMIT: http://review.gluster.org/13503 committed in release-3.7 by Vijay Bellur (vbellur) 
------
commit af7428490253421484e61409ddb77b653c4a5e63
Author: Gaurav Kumar Garg <garg.gaurav52>
Date:   Wed Dec 9 20:12:17 2015 +0530

    glusterd: fixing few memory leak in glusterd
    
    This patch is backport of: http://review.gluster.org/#/c/12927/
    
    Current glusterd code base having memory leak. This is because of
    memory allocate by dict_allocate_and_serialize function in
    "gd_syncop_mgmt_v3_lock" and "gd_syncop_mgmt_v3_unlock"
    function is not freeing up memory upon exit.
    
    Fix is to free the memory after exit of the above function.
    
    Thanks Carlos and Roman for finding out the issue and fix.
    
     >> Change-Id: Id67aa794c84969830ca7ea8c2374f80c64d7a639
     >> BUG: 1287517
     >> Signed-off-by: Gaurav Kumar Garg <ggarg>
     >> Signed-off-by: Carlos Chinea <carlos.chinea>
     >> Signed-off-by: Roman Tereshonkov <roman.tereshonkov>
     >> Reviewed-on: http://review.gluster.org/12927
     >> Smoke: Gluster Build System <jenkins.com>
     >> NetBSD-regression: NetBSD Build System <jenkins.org>
     >> CentOS-regression: Gluster Build System <jenkins.com>
     >> Reviewed-by: Jeff Darcy <jdarcy>
    
    Change-Id: Id67aa794c84969830ca7ea8c2374f80c64d7a639
    BUG: 1311377
    Signed-off-by: Gaurav Kumar Garg <ggarg>
    (cherry picked from commit e38bf1bdeda3c7a89be3193ad62a72b9139358dd)
    Reviewed-on: http://review.gluster.org/13503
    Smoke: Gluster Build System <jenkins.com>
    Reviewed-by: Vijay Bellur <vbellur>
    Tested-by: Vijay Bellur <vbellur>
    CentOS-regression: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>

Comment 10 Kaushal 2016-04-19 07:20:19 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.9, please open a new bug report.

glusterfs-3.7.9 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/gluster-users/2016-March/025922.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user

