| Summary: | Possible memory leak, memory consumption is not reduced even after rm -rf | ||
|---|---|---|---|
| Product: | Red Hat Gluster Storage | Reporter: | Nag Pavan Chilakam <nchilaka> |
| Component: | core | Assignee: | Mohit Agrawal <moagrawa> |
| Status: | CLOSED NOTABUG | QA Contact: | Prasanth <pprakash> |
| Severity: | low | Docs Contact: | |
| Priority: | medium | ||
| Version: | rhgs-3.2 | CC: | moagrawa, pprakash, rhs-bugs, sasundar, sheggodu |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-06-30 10:48:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | 1647277 | ||
| Bug Blocks: | |||
Description
Nag Pavan Chilakam
2016-10-19 12:03:09 UTC
In this way the complete memory can be consumed, eventually leading to a crash of the mount process. When I started the test, the resident memory (RES) was at 51604 KB, and on the first lookup (after creating 1 lakh, i.e. 100,000, files) it went to 190932 KB. Even after the delete it finally stood at 293652 KB. Also note that when I re-issue lookups after larger time gaps, say about 15 minutes, the resident memory shoots up again; this too could be a problem that needs addressing.

From my analysis, I could see the memory usage increase as the files get created, but when the files are removed, md-cache cleans up its cache. However, the memory usage (as shown in top) does not reduce greatly. This is the case with md-cache enabled or disabled. There surely is some leak, but it is not caused by md-cache. We need to debug it further to identify which component is consuming the remaining memory; from a first look I could not find it from the statedump.

As mentioned in Comment #3, I could reproduce the leak, but it is seen without md-cache as well. Could you please confirm?

I agree this can be seen even with md-cache. However, shouldn't we be clearing the cache at least with md-cache enabled when upcalls are triggered? Can't we leverage that intelligence?

(In reply to nchilaka from comment #5)
> I agree this can be seen even with md-cache
> However, shouldn't we be clearing the cache atleast with md-cache enabled
> when upcalls are triggered. Can't we leverage that intelligence?

md-cache already clears the cache that it allocated, as part of unlink. We do not require upcall to clear the cache on unlink in any component, as it is on the same mount. I guess this is a trivial leak; not sure in which component.

Changing the summary, as it may not have to do with md-cache, based on the above comments.

Requires re-testing with the latest release, as lots of memory leak fixes have gone in from 3.2 to now.

As mentioned in the previous comments, it is not related to md-cache, hence changing the component.

(In reply to Poornima G from comment #13)
> Requires re-testing with the latest release, as lots of memory leaks have
> gone in from 3.2 to now.

Retested on the 3.4.2 (3.12.2-29) build; the problem still exists.

    [root@dhcp35-64 ~]# cat test.log
    below was taken while writes were going on
    Fri Nov 23 20:22:35 IST 2018
    13456 root  20  0  642756 182528  4140 S  0.0  4.7   5:52.45 glusterfs
    Below was taken after doing a find * and ls -lRt
    Sat Nov 24 21:29:52 IST 2018
    13456 root  20  0  839364 390532  4156 S  0.0 10.1  11:05.12 glusterfs
    now going to do rm -rf
    Sat Nov 24 21:41:23 IST 2018
    rm -rf complete and filesystem empty
    Sat Nov 24 21:41:23 IST 2018
    13456 root  20  0  810692 365248  4204 S  0.0  9.4  12:55.24 glusterfs
    #### rechecking after about 15min
    Sat Nov 24 21:58:16 IST 2018
    13456 root  20  0  810692 365248  4204 S  0.0  9.4  12:55.28 glusterfs
    #### rechecking after about 15min
    Sat Nov 24 21:58:19 IST 2018
    13456 root  20  0  810692 365248  4204 S  0.0  9.4  12:55.28 glusterfs

Need a test with the 3.4.4 release / 3.5.0 builds, mainly because we now have the FUSE inode garbage-collection feature.

sosreports and client statedumps @ http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1386658/reproducer-on-rhgs350-comment19/client/dhcp47-147.lab.eng.blr.redhat.com/

Hi Nag,

Are we still seeing the issue in 3.5.1?

Thanks,
Mohit Agrawal

(In reply to Mohit Agrawal from comment #22)
> Hi Nag,
>
> Are we still seeing the issue in 3.5.1?
>
> Thanks,
> Mohit Agrawal

Hi Mohit, yes, I saw it in 3.5.1 too.
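The exchange above turns on whether md-cache should rely on upcall notifications to invalidate its entries, and the leak was compared with md-cache enabled and disabled. For reference, a minimal sketch of the volume options commonly used to toggle that behaviour; the volume name testvol and the timeout values are assumptions for illustration, not taken from this bug.

```bash
# Sketch: md-cache / upcall invalidation tuning (volume name "testvol" is an assumption)

# Disable the md-cache translator entirely (stat-prefetch is the option that controls it):
gluster volume set testvol performance.stat-prefetch off

# Or keep md-cache enabled and let upcall notifications invalidate cached metadata:
gluster volume set testvol features.cache-invalidation on
gluster volume set testvol features.cache-invalidation-timeout 600
gluster volume set testvol performance.cache-invalidation on
gluster volume set testvol performance.md-cache-timeout 600
```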
This issue is not reproducible with RHGS 3.5.4 on RHEL 7. Validation was also done on RHEL 8 based RHGS 3.5.4.

Based on these facts, closing this bug.
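A minimal sketch of the kind of measurement loop behind the test.log output quoted above: it records the resident set size (RSS) of the FUSE client at each phase (create, lookup, rm -rf, idle) and requests a client statedump with SIGUSR1. The mount point, file count, and file names are assumptions for illustration, not the reporter's exact reproducer.

```bash
#!/bin/bash
# Sketch only: mount point, file count and names are assumptions, not the original test.
MNT=/mnt/testvol                                   # FUSE mount point (assumption)
COUNT=100000                                       # roughly the "1 lakh" files from the description
PID=$(pgrep -f "glusterfs.*${MNT}" | head -n1)     # PID of the client mount process

rss() {   # print a timestamp and the client's resident set size in kB
    echo "$(date)  RSS(kB)=$(awk '/VmRSS/ {print $2}' /proc/"$PID"/status)"
}
dump() {  # request a client statedump (written under /var/run/gluster by default)
    kill -USR1 "$PID"
}

rss                                                # baseline
for i in $(seq 1 "$COUNT"); do : > "$MNT/f$i"; done
rss; dump                                          # after creating the files
find "$MNT" >/dev/null; ls -lRt "$MNT" >/dev/null
rss; dump                                          # after lookups
rm -rf "${MNT:?}"/*
rss; dump                                          # after rm -rf
sleep 900
rss                                                # after ~15 minutes idle
```

If RSS returns close to the baseline after the rm -rf and the idle period, the leak is not reproducible on that build; the statedumps taken at each phase can then be compared for per-translator memory accounting.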