Bug 1729108

Summary: Memory leak in glusterfsd process
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Nithya Balachandran <nbalacha>
Component: core
Assignee: Nithya Balachandran <nbalacha>
Status: CLOSED ERRATA
QA Contact: Upasana <ubansal>
Severity: high
Priority: high
Version: rhgs-3.5
CC: abhishpaliwal, amukherj, bugs, ccalhoun, dyocum, hgowtham, liguang.lee6, nbalacha, nchilaka, pasik, pdhange, rhs-bugs, ryan, sburke, storage-qa-internal, vdas
Target Release: RHGS 3.5.0   
Hardware: Unspecified   
OS: Linux   
Fixed In Version: glusterfs-6.0-8
Doc Type: No Doc Update
Clone Of: 1729107
Cloned To: 1759366
Last Closed: 2019-10-30 12:22:15 UTC
Bug Depends On: 1718734, 1729107, 1768726, 1768742    
Bug Blocks: 1696809, 1759366    

Description Nithya Balachandran 2019-07-11 12:27:20 UTC
+++ This bug was initially created as a clone of Bug #1729107 +++

+++ This bug was initially created as a clone of Bug #1718734 +++

Description of problem:

We are seeing a memory leak in the glusterfsd process when writing and deleting a specific file at regular intervals.

Version-Release number of selected component (if applicable): Glusterfs 5.4


How reproducible:

Here are the setup details and the test we are running:


One client, two gluster servers.
The client writes and deletes one file every 15 minutes using the script test_v4.15.sh.

IP addresses and paths:
Server side:
128.224.98.157 /gluster/gv0/
128.224.98.159 /gluster/gv0/

Client side:
128.224.98.160 /gluster_mount/

Server side:
gluster volume create gv0 replica 2 128.224.98.157:/gluster/gv0/ 128.224.98.159:/gluster/gv0/ force
gluster volume start gv0

root@128:/tmp/brick/gv0# gluster volume info

Volume Name: gv0
Type: Replicate
Volume ID: 7105a475-5929-4d60-ba23-be57445d97b5
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 128.224.98.157:/gluster/gv0
Brick2: 128.224.98.159:/gluster/gv0
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

exec script: ./ps_mem.py -p 605 -w 61 > log
root@128:/# ./ps_mem.py -p 605
Private + Shared = RAM used Program
23668.0 KiB + 1188.0 KiB = 24856.0 KiB glusterfsd
---------------------------------
24856.0 KiB
=================================


Client side:
mount -t glusterfs -o acl -o resolve-gids 128.224.98.157:gv0 /gluster_mount


We are using the below script to write and delete the file.

test_v4.15.sh
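
(The actual script is attached to this bug. For reference, here is a minimal sketch of the write/delete loop it is described as performing; the mount path, file name and file size below are assumptions, not values taken from the attachment.)

#!/bin/bash
# Sketch of the reproducer: one write/delete cycle every 15 minutes.
MOUNT=/gluster_mount        # client-side mount point (assumption)
FILE="$MOUNT/testfile"      # file name (assumption)
while true; do
    dd if=/dev/urandom of="$FILE" bs=1M count=10 2>/dev/null   # file size is an assumption
    rm -f "$FILE"
    sleep 900               # wait 15 minutes before the next cycle
done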

We also use the below script to monitor the memory increase while the above script is running in the background.

ps_mem.py

I am attaching the script files as well as the results obtained after testing this scenario.


Actual results: A memory leak is present.


Expected results: There should be no memory leak.


Additional info: Please see the attached file for more details; we are also attaching the statedumps.
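
(For reference, a typical way to capture such statedumps, assuming the volume name gv0 used above; this may not be exactly how the attached dumps were taken.)

# Run on a server node; dumps the state of the brick processes for volume gv0.
gluster volume statedump gv0
# By default the dump files are written on each brick node under /var/run/gluster
# (the directory can be changed via the server.statedump-path volume option).
ls /var/run/gluster/*.dump.*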

--- Additional comment from Abhishek on 2019-06-10 06:24:50 UTC ---

Attached are some statedumps taken on the gluster fs server: an initial dump, then one from an hour or so later, then one from another 3 hours or so after that. I believe we have looked at the statedumps before and not seen any evidence there of what is going wrong, but please double-check this.

I have a system running gluster 5.4 set up in replicate mode: an active and a passive server, and one client that has mounted the gluster volume and simply writes and deletes a file on it every 15 minutes. That is all that is going on.

What we see is that the memory usage of the glusterfsd process is increasing slowly in a linear fashion. I am running a python script every minute to log the memory usage, and then plot the result on a graph. I attach the graph showing glusterfsd private, shared and total memory usage over time (some 78 days of running). I also attach two screenshots from 'top' taken at various stages.

This is the graph during the one file every 15 minutes write test:

Please see the attachment for image.png



And BTW, if we 'hammer' the gluster volume with file writes/deletes in a much faster fashion, ie many files written/deleted every second, or even every minute, we see that the glusterfsd memory usage increases only for a very short period, then it levels off and stays level forever at around 35MB total. So there is clearly something different happening in the 'slow' file access case where the total is at nearly 200MB and still increasing. 

If we run Valgrind, we see that memory allocations are freed up when the process ends, but the problem is that this will be on a system where gluster is up and running all the time. So there seems to be a problem where memory is dynamically allocated each time there is a write/read on the gluster volume, but it is not dynamically freed at runtime. The worry is that at some point glusterfsd will completely use up all the memory on the system; it might take a long time, but this is not acceptable.
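
(For reference, one way to run the brick process in the foreground under Valgrind; this is only a sketch, not necessarily the exact invocation used here. The glusterfsd arguments mirror the command line glusterd generates, as shown in the ps output further down.)

# Stop the brick first, then restart it in the foreground under valgrind.
valgrind --leak-check=full --show-leak-kinds=all \
         --log-file=/tmp/glusterfsd-valgrind.%p.log \
         /usr/sbin/glusterfsd -N -s board0 \
         --volfile-id gv0.board0.export-sdb1-brick \
         --brick-name /export/sdb1/brick \
         -l /var/log/glusterfs/bricks/export-sdb1-brick.log
# Append the remaining --xlator-option/--brick-port arguments from the ps
# output if the brick needs them to come up exactly as glusterd starts it.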

My steps are here:

root@board0:/tmp# gluster --version
glusterfs 5.4

root@board0:/tmp# gluster volume status gv0
Status of volume: gv0
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick board0:/export/sdb1/brick 49152 0 Y 1702
Brick board1:/export/sdb1/brick 49152 0 Y 1652
Self-heal Daemon on localhost N/A N/A Y 1725
Self-heal Daemon on board1 N/A N/A Y 1675
Task Status of Volume gv0
------------------------------------------------------------------------------
There are no active volume tasks

root@board0:/tmp# jobs
[1]+ Running ./ps_mem.py -w 61 > /tmp/ps_mem.log & (wd: ~)

root@board0:/tmp# ps -ef | grep gluster
root 1608 1 0 May08 ? 00:00:04 /usr/sbin/glusterd -p /var/run/glusterd.pid
root 1702 1 0 May08 ? 00:00:14 /usr/sbin/glusterfsd -s board0 --volfile-id gv0.board0.export-sdb1-brick -p /var/run/gluster/vols/gv0/board0-export-sdb1-brick.pid -S /var/run/gluster/6c09da8ec6e017c8.socket --brick-name /export/sdb1/brick -l /var/log/glusterfs/bricks/export-sdb1-brick.log --xlator-option *-posix.glusterd-uuid=336dc4a8-1371-4366-b2f9-003c35e12ca1 --process-name brick --brick-port 49152 --xlator-option gv0-server.listen-port=49152
root 1725 1 0 May08 ? 00:00:03 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/7ae70daf2745f7d4.socket --xlator-option replicate.node-uuid=336dc4a8-1371-4366-b2f9-003c35e12ca1 --process-name glustershd
root 3115 1241 0 03:00 ttyS0 00:00:00 grep gluster 

This is the cmd used to create the gluster volume:

gluster volume create gv0 replica 2 board0:/export/sdb1/brick board1:/export/sdb1/brick

And on client I do like:

mount -t glusterfs board0:gv0 /mnt

and then just run the one-file-every-15-minutes test:

./test_v4.sh

To get the data, I run like this after some time:

grep glusterfsd ps_mem.log | awk '{ print $1 "," $4 "," $7 }' > gluster54-glusterfsd.csv

Then plot the points in Excel.
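
(As an aside, the same CSV can be plotted without Excel; a minimal sketch assuming gnuplot is available, using the private,shared,total column order produced by the awk command above.)

gnuplot -e "set datafile separator ','; set terminal png size 1024,480; set output 'glusterfsd-mem.png'; set ylabel 'KiB'; plot 'gluster54-glusterfsd.csv' using 0:1 with lines title 'private', '' using 0:2 with lines title 'shared', '' using 0:3 with lines title 'total'"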

--- Additional comment from Abhishek on 2019-06-12 14:54:43 UTC ---

Hi,

As requested, we have tested the same scenario on Glusterfs 6.1 and the memory leak is present there as well.

Please check the "statedumps log for 6.1" attachment for more details.

--- Additional comment from Abhishek on 2019-06-14 10:15:53 UTC ---

Hi Team,

is there any update on this?

Regards,
Abhishek

--- Additional comment from hari gowtham on 2019-06-14 11:17:32 UTC ---

(In reply to Abhishek from comment #9)
> Hi Team,
> 
> is there any update on this?
> 
> Regards,
> Abhishek

Hi Abhishek,

I'll look into it.
This will take some time. 
Will update once I have made some progress.

Regards,
Hari.

--- Additional comment from  on 2019-06-19 16:02:37 UTC ---

We are also seeing this issue, with memory slowly increasing over time until all system resources (SWAP and RAM) are exhausted.

--- Additional comment from Worker Ant on 2019-06-21 05:25:29 UTC ---

REVIEW: https://review.gluster.org/22918 (Detach iot_worker to release its resources) posted (#1) for review on master by Liguang Li

--- Additional comment from Worker Ant on 2019-07-11 12:25:16 UTC ---

REVIEW: https://review.gluster.org/22918 (Detach iot_worker to release its resources) posted (#2) for review on master by N Balachandran

Comment 19 errata-xmlrpc 2019-10-30 12:22:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:3249