Created attachment 1101371 [details]
valgrind report

Description of problem:
I have a deployment with 3 nodes and 6 bricks. I tried changing the glusterfs version, but the problem persists. The glusterd process's memory consumption increases constantly. I am using top to check its RSS value: on a fresh installation it started at 27M, after 3 hours it was 71M, and today it is 515M. The attached valgrind report shows some "possibly lost" records. I tried "echo 2 > /proc/sys/vm/drop_caches" but it had no effect on memory consumption. Consumption seems to increase if the self-heal daemon is stopped.

Version-Release number of selected component (if applicable):
Tried with 3.6.2, 3.7.5 and 3.7.6

How reproducible:
Install a 3 node cluster. glusterd on the storage master keeps increasing its memory consumption.

Additional info:
(In reply to kmak from comment #0)
> Created attachment 1101371 [details]
> valgrind report
>
> Description of problem:
> I have a deployment with 3 nodes and 6 bricks. I tried changing the
> glusterfs version, but the problem persists.

What command did you run here?

Can you please take a statedump of the glusterd process by running 'kill -USR1 $(pidof glusterd)'? It will generate a file under /var/run/gluster. Please also attach *glusterd.log and cmd_history.log.

> The glusterd process's memory consumption increases constantly.
> I am using top to check its RSS value: on a fresh installation it started
> at 27M, after 3 hours it was 71M, and today it is 515M.
> The attached valgrind report shows some "possibly lost" records.
> I tried "echo 2 > /proc/sys/vm/drop_caches" but it had no effect on memory
> consumption. Consumption seems to increase if the self-heal daemon is
> stopped.
>
> Version-Release number of selected component (if applicable):
> Tried with 3.6.2, 3.7.5 and 3.7.6
>
> How reproducible:
> Install a 3 node cluster. glusterd on the storage master keeps increasing
> its memory consumption.
>
> Additional info:
Created attachment 1101412 [details]
statedump + log files

Logs and statedump of glusterd
Thanks for the fast response.
Do we have any updates on this?
(In reply to kmak from comment #4)
> Do we have any updates on this?

We haven't had a chance to look at it yet; you should hear something from us in a couple of days.
REVIEW: http://review.gluster.org/12927 (glusterd: fixing few memory leak in glusterd) posted (#1) for review on master by Gaurav Kumar Garg (ggarg)
Hi kmak,

The "statedump + log files" file you attached is a binary file; it is not in a human-readable format. Could you attach the statedump and log files in a readable form? One more thing: did you observe the memory leak in glusterd, glusterfsd, or both? Could you also tell us how you measured the leak?
Created attachment 1104242 [details] cmd_history.log
Created attachment 1104243 [details] command history log
Created attachment 1104244 [details] etc-glusterfs-glusterd.log
Created attachment 1104245 [details] glusterdump
Hello,

Yes, sorry, I initially uploaded them as a zip file. The checking was done with top, and the process was glusterd. The glusterfs processes were using a stable amount of memory, if I remember correctly.

Is the review mentioned above (http://review.gluster.org/12927) related to my bug?
REVIEW: http://review.gluster.org/12927 (glusterd: fixing few memory leak in glusterd) posted (#2) for review on master by Gaurav Kumar Garg (ggarg)
Hi kmak,

Yes, there is a memory leak for sure, but what I see is in KB, not MB. I created 6 distributed volumes on a 2-node cluster, and after executing 1000 gluster commands I saw only about 10KB of leaked memory. I am wondering how the leak can be in the MB range on your system. I will analyse this issue further.
I observed that the memory leak can be accelerated by running continuously:

for ((N=0;N<2000;N++)) ; do gluster volume status all ; done

which makes the leak rate on the order of 1M/min. In addition, with some ad-hoc instrumentation I saw cases where the dict_t refcount was -1.
I have just executed the loop from 0 to 2000 and saw a 22KB memory leak. I will look further into that 22KB leak; I don't yet see how your system could leak at 1M/min. I will keep checking.
Hi! I have the same problem on my Debian (Jessie) system! I have tested my cluster (Distributed-Replicated, 2x2 brick) with the following bash script, run from 5 clients simultaneously:

#!/bin/bash
while true; do
    for (( c=1; c<=100; c++ ))
    do
        dd if=/dev/zero of=/mnt/vol_www/960b_${HOSTNAME}_$c.txt bs=960 count=1
    done
    sleep 3
    rm -f /mnt/vol_www/960b_$HOSTNAME*
    sleep 3
done

The consumed memory grows very quickly! (More than 1MB/min)
Hi! I have an update. I did "echo 2 > /proc/sys/vm/drop_caches" and the memory dropped back to the "normal" level. I also tried something similar to comment 15 (gluster volume status). My test was:

on console 1: watch -n 0.1 'gluster volume status'
on console 2: watch -n 0.1 'gluster volume status vol_one clients'

It was running for 15-20 minutes. After that I did "echo 2 > /proc/sys/vm/drop_caches" again, but this time the memory did not drop back to the "normal" level.
Do you have any findings on this? Can you tell us what is the best way to debug it?
(In reply to kmak from comment #19)
> Do you have any findings on this?
> Can you tell us what is the best way to debug it?

We haven't found the RCA yet, as the problem is not reproducible at our end. We have identified a few places in the code that could lead to small memory leaks, but nothing as severe as what you have reported. One more follow-up question: you mentioned that you tried to change the version. Did you mean that you tried to downgrade?
We tried the same configuration with GlusterFS 3.6.2, 3.7.5 and 3.7.6. So, do you think that this could be a configuration issue? We surely have a lot of network activity and a lot of nodes.
(In reply to kmak from comment #21)
> We tried the same configuration with GlusterFS 3.6.2, 3.7.5 and 3.7.6.
> So, do you think that this could be a configuration issue?
> We surely have a lot of network activity and a lot of nodes.

If you are hitting the memory leak across different versions, then I agree with you that the configuration could be a suspect. Can you upload the complete sosreport of the node where you see the memory leak? Are you sure that you have not modified any configuration in glusterd.vol?
The problem is reproducible on glusterfs 3.7.6 using a simple two-node cluster. The proposed fix http://review.gluster.org/#/c/12927/ has no influence.

The setup:
1. Any two nodes with network connectivity (node1, node2).
2. glusterfs 3.7.6 (./configure --prefix=/usr --libdir="/usr/lib64" --sysconfdir=/etc --localstatedir=/var && make && make install)
3. rdma is not used (is this critical or not?). glusterd.vol: option transport-type socket
4. Simple volume configuration for the nodes:
   mkdir -p /brick/brick1
   gluster peer probe node1
   gluster volume create MYVOL replica 2 node1:/brick/brick1 node2:/brick/brick1 force
   gluster volume start MYVOL
   mount -t glusterfs node1:MYVOL /mnt/test
   (I do not actually know whether mounting is needed to see the bug; not tested yet)
5. for ((N=0;N<20000;N++)) ; do gluster volume status all ; done
6. Use "top" to observe glusterd RSS memory growing (it grows on both nodes).

NB: If you use only one node with this setup, the problem is not reproduced. It looks like two nodes communicating over the network is the important precondition.
Hello,

The problem is verified from my side as well (my colleague Roman's comment 23 is also valid), on my laptop (RHEL 7.2) with two VMs (node1, node2).

Description of my setup:
1) Create two nodes connected to a common network.
2) Create two extra partitions, format them as ext4 and mount them at /data/brick/gv0/gv on both servers.
3) Download the glusterfs repo from http://download.gluster.org/pub/gluster/glusterfs/LATEST/EPEL.repo/glusterfs-epel.repo and install glusterfs-server.
4) gluster peer probe node1/node2 on both nodes.
5) gluster volume create gv0 replica 2 node1:/data/brick/gv0/gv node2:/data/brick/gv0/gv
6) gluster volume start gv0
7) Run "gluster volume status all" 2000 times every 30 sec.
8) See glusterd memory (RSS taken from ps aux) grow constantly at 1MB/min.

I will attach a results log with the first runs.
Created attachment 1106654 [details] Results file showing glusterd memory increase
Created attachment 1108655 [details]
Fix glusterd 3.7.6 memory leaks in the serialized buffer.

Fix glusterd 3.7.6 memory leaks in the serialized buffer. Memory is allocated by the dict_allocate_and_serialize function and the pointer is assigned to the req.dict.dict_val variable. On exit this memory has to be freed. These leaks can be observed with the command: gluster volume status all. The order of the leak is ~2K per command run.
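To make the ownership rule behind this fix concrete, here is a minimal standalone sketch, not glusterd code: serialize_dict, send_request and do_volume_status are invented stand-ins for dict_allocate_and_serialize, the RPC submit path and its caller. The point is that the serializer allocates the buffer and hands ownership to the caller, so the caller must free it on every exit path, error paths included.

#include <stdlib.h>
#include <string.h>

/* Stand-in for dict_allocate_and_serialize(): allocates *buf internally
 * and hands ownership to the caller. */
static int
serialize_dict (const char *payload, char **buf, unsigned int *len)
{
        *len = strlen (payload) + 1;
        *buf = malloc (*len);
        if (*buf == NULL)
                return -1;
        memcpy (*buf, payload, *len);
        return 0;
}

/* Stand-in for submitting the RPC request. */
static int
send_request (const char *buf, unsigned int len)
{
        return (buf && len) ? 0 : -1;
}

static int
do_volume_status (const char *payload)
{
        char         *dict_val = NULL;   /* plays the role of req.dict.dict_val */
        unsigned int  dict_len = 0;
        int           ret      = -1;

        ret = serialize_dict (payload, &dict_val, &dict_len);
        if (ret)
                goto out;

        ret = send_request (dict_val, dict_len);

out:
        /* The fix: release the serialized buffer on *every* exit path,
         * not only on success. */
        free (dict_val);
        return ret;
}

int
main (void)
{
        return do_volume_status ("volume status all");
}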
The attached patch fixes the problem partially. It reduces the memory leaks by about 5 times. I did not observe any visible leaks on the host sending the "gluster volume status all" command, but the receiving host still has some small leaks.

I suspect the problem is somewhere in the glusterd-op-sm.c file. The volume status command initiates two transactions. For each transaction the glusterd_set_txn_opinfo function is called and memory is allocated:

opinfo_obj = GF_CALLOC (1, sizeof(glusterd_txn_opinfo_obj), gf_common_mt_txn_opinfo_obj_t);

On finishing, the glusterd_clear_txn_opinfo function is called to free the allocated memory, but that happens only once, for the second transaction. So the memory allocated for the first transaction is never freed. Can anybody check this and review?
Hi Roman,

Thanks for your efforts. I will check it and get back to you.

~Gaurav
Hi Roman,

Thank you for this finding. That does solve the memory leak problem; I have tested it and found that only a few bytes of memory leak remain after this patch. Regarding your suspicion in the glusterd-op-sm.c file: volume status initiates a transaction and calls the glusterd_set_txn_opinfo function, and glusterd_clear_txn_opinfo is called the same number of times.
REVIEW: http://review.gluster.org/12927 (glusterd: fixing few memory leak in glusterd) posted (#3) for review on master by Gaurav Kumar Garg (ggarg)
> Regarding your suspicion in the glusterd-op-sm.c file: volume status
> initiates a transaction and calls the glusterd_set_txn_opinfo function,
> and glusterd_clear_txn_opinfo is called the same number of times.

According to the logs, glusterd_clear_txn_opinfo is not called for the first transaction. This chunk of log was created on the local host by running volume status on the remote host. As you can see, "cleared opinfo" appears only for the second transaction. What is your case? Do you see two "cleared opinfo" entries in your log?

[2015-12-22 11:09:01.321481] D [MSGID: 0] [glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully set opinfo for transaction ID : e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4
[2015-12-22 11:09:01.328820] D [MSGID: 0] [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got opinfo for transaction ID : e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4
[2015-12-22 11:09:01.343665] D [MSGID: 0] [glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully set opinfo for transaction ID : e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4
[2015-12-22 11:09:01.356197] D [MSGID: 0] [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got opinfo for transaction ID : e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4
[2015-12-22 11:09:01.366702] D [MSGID: 0] [glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully set opinfo for transaction ID : e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4
[2015-12-22 11:09:01.371448] D [MSGID: 0] [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got opinfo for transaction ID : e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4
[2015-12-22 11:09:01.382440] D [MSGID: 0] [glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully set opinfo for transaction ID : e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4
[2015-12-22 11:09:01.433093] D [MSGID: 0] [glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully set opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.440280] D [MSGID: 0] [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.452016] D [MSGID: 0] [glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully set opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.468647] D [MSGID: 0] [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.469808] D [MSGID: 0] [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.472555] D [MSGID: 0] [glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully set opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.477988] D [MSGID: 0] [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.482711] D [MSGID: 0] [glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully set opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.484789] D [MSGID: 0] [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.496658] D [MSGID: 0] [glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully set opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.504998] D [MSGID: 0] [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.509757] D [MSGID: 0] [glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.511833] D [MSGID: 0] [glusterd-op-sm.c:360:glusterd_clear_txn_opinfo] 0-management: Successfully cleared opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
REVIEW: http://review.gluster.org/12927 (glusterd: fixing few memory leak in glusterd) posted (#4) for review on master by Gaurav Kumar Garg (ggarg)
(In reply to Roman Tereshonkov from comment #31)
> According to the logs, glusterd_clear_txn_opinfo is not called for the
> first transaction. This chunk of log was created on the local host by
> running volume status on the remote host. As you can see, "cleared opinfo"
> appears only for the second transaction. What is your case? Do you see two
> "cleared opinfo" entries in your log?

Hi Roman,

My test was done using gdb. I attached gdb to the glusterd process, put breakpoints on the two functions you mentioned (glusterd_set_txn_opinfo and glusterd_clear_txn_opinfo) and executed the gluster volume status command. I saw that both functions were called the same number of times.
REVIEW: http://review.gluster.org/12927 (glusterd: fixing few memory leak in glusterd) posted (#5) for review on master by Gaurav Kumar Garg (ggarg)
> My test was done using gdb. I attached gdb to the glusterd process, put
> breakpoints on the two functions you mentioned (glusterd_set_txn_opinfo
> and glusterd_clear_txn_opinfo) and executed the gluster volume status
> command. I saw that both functions were called the same number of times.

If it is not confidential, can you attach the chunk of the debug log /var/log/glusterfs/etc-glusterfs-glusterd.vol.log (--log-level=DEBUG) from both the local and remote hosts covering one run of the command "gluster volume status"?
Hi Roman,

Yes, I still see some memory leak on the remote side; there is no memory leak on the originator node. The RCA seems to be that set_txn_opinfo is called more times than clear_txn_opinfo. I will update you about it soon.
Hi Roman,

One update regarding this. If, by looking at the glusterd logs, you are concluding that glusterd_set_txn_opinfo allocates memory more times than glusterd_clear_txn_opinfo frees it, that is not quite right. It is true that glusterd_set_txn_opinfo is called more times on the remote node than glusterd_clear_txn_opinfo, but that does not mean it allocates memory every time. From the function (glusterd_set_txn_opinfo) you can see that it allocates memory only when ret is non-zero (i.e. it was unable to get the txn opinfo from the dictionary), and that happens only once, at the start of a transaction. If you apply the http://review.gluster.org/12927 patch on all the nodes in the cluster, you will see only a few bytes of memory leak. I am checking further why those few bytes still leak; I will update you.
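To illustrate that point with a standalone sketch (invented names and a toy single-slot table, not the actual glusterd_set_txn_opinfo code): repeated 'set' calls for a transaction id that is already stored only update the entry, so the allocation count stays at one per transaction no matter how often 'set' is called.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for the txn-opinfo dictionary: one slot is enough to show
 * the point for a single transaction. */
static char *stored_txn_id;
static int   stored_opinfo;
static int   allocations;

static void
set_txn_opinfo (const char *txn_id, int opinfo)
{
        if (stored_txn_id && strcmp (stored_txn_id, txn_id) == 0) {
                stored_opinfo = opinfo;          /* found: update only      */
                return;
        }
        free (stored_txn_id);                    /* drop any previous entry */
        stored_txn_id = strdup (txn_id);         /* not found: allocate     */
        stored_opinfo = opinfo;
        allocations++;
}

int
main (void)
{
        set_txn_opinfo ("e52c9b6e", 1);   /* allocates              */
        set_txn_opinfo ("e52c9b6e", 2);   /* updates, no allocation */
        set_txn_opinfo ("e52c9b6e", 3);   /* updates, no allocation */
        printf ("set called 3 times, allocations = %d\n", allocations);
        free (stored_txn_id);
        return 0;
}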
About http://review.gluster.org/#/c/12927/

Why have you moved GF_FREE (req.dict.dict_val) before "out:"? If an error occurs after GD_ALLOC_COPY_UUID, the allocated memory leaks. In the patch I proposed it was after the "out:" label.
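For illustration, a small standalone contrast of the two placements under discussion (invented names; do_step stands in for any call that can fail after the buffer has been allocated): with the free placed before the "out:" label, an error that jumps to "out:" after the allocation skips it, while placing the free after "out:" covers every path.

#include <stdlib.h>
#include <string.h>

static int
do_step (int fail)
{
        return fail ? -1 : 0;
}

/* Leaky variant: the free sits before the "out:" label, so the error
 * path jumps straight past it. */
static int
leaky (int fail)
{
        char *buf = strdup ("serialized dict");
        int   ret = do_step (fail);

        if (ret)
                goto out;          /* skips the free below on error */

        free (buf);
out:
        return ret;
}

/* Fixed variant: every path falls through "out:", so the buffer is
 * always released. */
static int
fixed (int fail)
{
        char *buf = strdup ("serialized dict");
        int   ret = do_step (fail);

        if (ret)
                goto out;

        ret = do_step (0);         /* further work on the success path */
out:
        free (buf);
        return ret;
}

int
main (void)
{
        leaky (1);   /* leaks buf */
        fixed (1);   /* frees buf */
        return 0;
}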
REVIEW: http://review.gluster.org/12927 (glusterd: fixing few memory leak in glusterd) posted (#6) for review on master by Gaurav Kumar Garg (ggarg)
> If, by looking at the glusterd logs, you are concluding that
> glusterd_set_txn_opinfo allocates memory more times than
> glusterd_clear_txn_opinfo frees it, that is not quite right. It is true
> that glusterd_set_txn_opinfo is called more times on the remote node than
> glusterd_clear_txn_opinfo, but that does not mean it allocates memory
> every time. From the function (glusterd_set_txn_opinfo) you can see that
> it allocates memory only when ret is non-zero (i.e. it was unable to get
> the txn opinfo from the dictionary), and that happens only once, at the
> start of a transaction.

By transaction I mean a sequence of commands with the same transaction id. As I mentioned earlier, every run of "volume status all" generates two transactions with different transaction ids. glusterd_set_txn_opinfo is called with the transaction id as an argument and allocates an opinfo whenever there is no opinfo for that transaction in the dictionary. glusterd_clear_txn_opinfo is called with the transaction id as an argument and frees the opinfo with that id from the glusterd_txn_opinfo dictionary.

In our case the glusterd_clear_txn_opinfo function is called only once, with the transaction id corresponding to the second transaction. As proof, you can add "dict_dump_to_log(priv->glusterd_txn_opinfo);" at the end of the glusterd_clear_txn_opinfo function to see the content of the dictionary. You will see that after every run of "volume status all" the glusterd_txn_opinfo dictionary grows, keeping the never-freed opinfo identified by each run's first transaction id.

By the way, if you run "volume status VOLNAME" with an existing VOLNAME, only one transaction is generated and there appear to be no memory leaks.
It looks like this is a functional problem. The command "gluster volume status all" generates a series of transactions identified by different ids. The very first one fetches the volume names known to the remote glusterfs server. The other transactions are issued per volume name and include locking/unlocking phases, and it is the unlocking phase that marks the end of a transaction, at which point all allocated resources are to be freed. There is no code path that calls the glusterd_clear_txn_opinfo function other than the unlock or unlock_all paths. Thus, there is no way for the very first transaction to know when it has ended, so glusterd_clear_txn_opinfo is never called for it and its opinfo is never freed.

This is my analysis, which needs to be confirmed by you. Logically it is now a different issue from what we have already fixed. Should I open a new bug for this functional problem? Do you see any simple way to fix it?
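To make the analysis above concrete, here is a standalone toy model, with invented names and data structures rather than glusterd code: each "status all" run creates one opinfo entry for the initial volume-name transaction plus one per volume, but only the per-volume transactions go through the unlock path that clears their entry, so one entry accumulates on every run.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy opinfo table keyed by a transaction-id string. */
struct txn {
        char       *id;
        struct txn *next;
};

static struct txn *txn_table;

static void
set_txn_opinfo (const char *id)
{
        struct txn *t;

        for (t = txn_table; t; t = t->next)
                if (strcmp (t->id, id) == 0)
                        return;            /* already present: no allocation */

        t = calloc (1, sizeof (*t));
        t->id = strdup (id);
        t->next = txn_table;
        txn_table = t;
}

static void
clear_txn_opinfo (const char *id)
{
        struct txn **pp = &txn_table;

        while (*pp) {
                if (strcmp ((*pp)->id, id) == 0) {
                        struct txn *t = *pp;
                        *pp = t->next;
                        free (t->id);
                        free (t);
                        return;
                }
                pp = &(*pp)->next;
        }
}

static int
table_size (void)
{
        int         n = 0;
        struct txn *t;

        for (t = txn_table; t; t = t->next)
                n++;
        return n;
}

/* One "volume status all" against a cluster with nvols volumes. */
static void
volume_status_all (int run, int nvols)
{
        char id[64];
        int  v;

        /* transaction 0: fetch the list of volume names; it never reaches
         * the unlock path, so its opinfo is never cleared */
        snprintf (id, sizeof (id), "run%d-names", run);
        set_txn_opinfo (id);

        /* one locked/unlocked transaction per volume: cleared on unlock */
        for (v = 0; v < nvols; v++) {
                snprintf (id, sizeof (id), "run%d-vol%d", run, v);
                set_txn_opinfo (id);
                clear_txn_opinfo (id);
        }
}

int
main (void)
{
        int run;

        for (run = 0; run < 5; run++)
                volume_status_all (run, 2);

        /* prints 5: one leftover opinfo entry per "status all" run */
        printf ("entries left in opinfo table: %d\n", table_size ());
        return 0;
}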
Hi Roman,

Yes, it is a functional problem with the "gluster volume status all" command, but it needs to be fixed, and your analysis is correct. I will update you about it. There is no need to raise a separate bug for this as of now.
Hello,

Is there any update on this bug? Have you implemented any fix for it, and in which version?

Thanks
Hi kmak,

Patch http://review.gluster.org/12927 fixes most of the memory leak, but some leakage remains; it seems to be in our op-sm framework. This patch is not yet available in the latest GlusterFS release.
REVIEW: http://review.gluster.org/12927 (glusterd: fixing few memory leak in glusterd) posted (#7) for review on master by Gaurav Kumar Garg (ggarg)
Has that two-line memory leak fix patch been accepted into upstream 3.7? What is the status of the memory leak caused by the functional issue? Any progress on fixing the "gluster volume status all" issue?
Hi Roman,

The patch is still in review; it has not been merged to upstream yet.
COMMIT: http://review.gluster.org/12927 committed in master by Jeff Darcy (jdarcy)
------
commit e38bf1bdeda3c7a89be3193ad62a72b9139358dd
Author: Gaurav Kumar Garg <garg.gaurav52>
Date:   Wed Dec 9 20:12:17 2015 +0530

    glusterd: fixing few memory leak in glusterd

    Current glusterd code base having memory leak. This is because of
    memory allocate by dict_allocate_and_serialize function in
    "gd_syncop_mgmt_v3_lock" and "gd_syncop_mgmt_v3_unlock" function
    is not freeing up meory upon exit. Fix is to free the memory
    after exit of the above function.

    Thanx Carlos and Roman for finding out the issue and fix.

    Change-Id: Id67aa794c84969830ca7ea8c2374f80c64d7a639
    BUG: 1287517
    Signed-off-by: Gaurav Kumar Garg <ggarg>
    Signed-off-by: Carlos Chinea <carlos.chinea>
    Signed-off-by: Roman Tereshonkov <roman.tereshonkov>
    Reviewed-on: http://review.gluster.org/12927
    Smoke: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.com>
    Reviewed-by: Jeff Darcy <jdarcy>
Thanks. What is the status of fixing the functional problem in the interface? I mean the "gluster volume status all" issue.
Hi Roman,

I will check out that part too.

Thanks,
~Gaurav
REVIEW: http://review.gluster.org/13660 (glusterd: fixing few memory leak in glusterd) posted (#1) for review on master by Gaurav Kumar Garg (ggarg)
COMMIT: http://review.gluster.org/13660 committed in master by Jeff Darcy (jdarcy)
------
commit a007fdf549260d0b146184fa85ca7029560db8c5
Author: Gaurav Kumar Garg <garg.gaurav52>
Date:   Thu Mar 10 07:51:48 2016 +0530

    glusterd: fixing few memory leak in glusterd

    While freeing memory currently glusterd is not freeing correct
    memory. this might result in some serious situation.
    With this fix glusterd will free correct memory location.

    Change-Id: Ide9c33a2ec5822b560e9e2dfcb6a0b442fc97047
    BUG: 1287517
    Signed-off-by: Gaurav Kumar Garg <ggarg>
    Reviewed-on: http://review.gluster.org/13660
    Smoke: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Vijay Bellur <vbellur>
    CentOS-regression: Gluster Build System <jenkins.com>
REVIEW: http://review.gluster.org/13758 (glusterd: fixing few memory leak in glusterd, at remote node) posted (#1) for review on master by Gaurav Kumar Garg (ggarg)
REVIEW: http://review.gluster.org/13758 (glusterd: fixing few memory leak in glusterd, at remote node) posted (#2) for review on master by Gaurav Kumar Garg (ggarg)
REVIEW: http://review.gluster.org/13758 (glusterd: fixing few memory leak in glusterd, at remote node) posted (#3) for review on master by Gaurav Kumar Garg (ggarg)
REVIEW: http://review.gluster.org/13758 (glusterd: fixing few memory leak in glusterd, at remote node) posted (#4) for review on master by Gaurav Kumar Garg (ggarg)
REVIEW: http://review.gluster.org/13758 (glusterd: fixing few memory leak in glusterd, at remote node) posted (#5) for review on master by Gaurav Kumar Garg (ggarg)
REVIEW: http://review.gluster.org/13758 (glusterd: fixing few memory leak in glusterd, at remote node) posted (#6) for review on master by Gaurav Kumar Garg (ggarg)
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.8.0, please open a new bug report.

glusterfs-3.8.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://blog.gluster.org/2016/06/glusterfs-3-8-released/
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user