Bug 1549473

Summary: possible memleak in glusterfsd process with brick multiplexing on
Product: [Community] GlusterFS
Reporter: Mohit Agrawal <moagrawa>
Component: core
Assignee: Mohit Agrawal <moagrawa>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: high
Docs Contact:
Priority: high
Version: 3.12
CC: amukherj, bmekala, bugs, kramdoss, nbalacha, nchilaka, pprakash, rcyriac, rhinduja, rhs-bugs, storage-qa-internal, vbellur
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: brick-multiplexing
Fixed In Version: glusterfs-3.12.8
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1535281
Environment:
Last Closed: 2018-04-24 06:53:38 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1535281, 1544090
Bug Blocks:

Description Mohit Agrawal 2018-02-27 08:00:12 UTC
+++ This bug was initially created as a clone of Bug #1535281 +++

Description of problem:
With brick multiplexing on, when volume creation and deletion were run continuously for ~12 hours, the glusterfsd process on each of the three nodes consumed close to 14 GB of memory with only a single volume in the system. This is quite high.

Please note that the heketidb volume is not deleted at any point during the test, and hence the same brick process remains running throughout the test.

Version-Release number of selected component (if applicable):
sh-4.2# rpm -qa  | grep 'gluster'
glusterfs-libs-3.8.4-54.el7rhgs.x86_64
glusterfs-3.8.4-54.el7rhgs.x86_64
glusterfs-api-3.8.4-54.el7rhgs.x86_64
glusterfs-cli-3.8.4-54.el7rhgs.x86_64
glusterfs-fuse-3.8.4-54.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-54.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-54.el7rhgs.x86_64
glusterfs-server-3.8.4-54.el7rhgs.x86_64
gluster-block-0.2.1-14.el7rhgs.x86_64


How reproducible:
Always

Steps to Reproduce:
1. On a CNS setup, run the following script for 12 hours.

while true; do
    for i in {1..5}; do
        heketi-cli volume create --size=1
    done
    heketi-cli volume list | awk '{print $1}' | cut -c 4- >> vollist
    while read i; do
        heketi-cli volume delete $i
        sleep 2
    done < vollist
    rm vollist
done

Actual results:
The glusterfsd process consumes ~14 GB with 1 volume.

Expected results:
Typically, glusterfsd would consume < 1 GB for a volume.

Additional info:

--- Additional comment from Red Hat Bugzilla Rules Engine on 2018-01-16 22:20:26 EST ---

This bug is automatically being proposed for the release of Red Hat Gluster Storage 3 under active development and open for bug fixes, by setting the release flag 'rhgs-3.4.0' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from krishnaram Karthick on 2018-01-16 23:15:35 EST ---

logs are available here --> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1535281/

--- Additional comment from Atin Mukherjee on 2018-01-16 23:28:54 EST ---

A probable RCA:

When a brick instance is detached from a brick process, the individual xlators and their respective memory allocations should be freed. Even though some of the xlators do have their destructor functions (fini()) in place, they are not invoked AFAIK. So when a volume is deleted, which eventually detaches its respective brick instance(s) from the existing brick process, the xlators and their allocated memory are not freed up, and after this many detach operations the leaks add up to something quite significant.

The effort to make every fini() handler of the respective xlators work properly might be quite significant, and we'd definitely need to assess it, as with brick multiplexing the impact here is quite severe.

I've assigned this bug to Mohit to begin estimating the effort required here. I believe a lot of collaboration and effort will be required from the owners of the individual xlators.
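
To make the RCA above concrete, here is a minimal, hypothetical C sketch of the cleanup a brick detach should perform: walk the brick's xlator chain and invoke each destructor so its allocations are returned. The struct layout, field names, and cleanup_brick_graph() are simplified stand-ins for illustration, not the actual glusterfs xlator API; if such a loop never runs, every volume create/delete cycle leaves the detached graph's memory behind in the long-lived brick process.

/* Hypothetical sketch only: simplified stand-ins for glusterfs'
 * xlator structures, not the real headers or API. */
#include <stdio.h>
#include <stdlib.h>

typedef struct xlator {
    const char *name;
    void *private_data;              /* per-xlator allocations           */
    void (*fini)(struct xlator *);   /* optional destructor (fini())     */
    struct xlator *next;             /* next xlator in the brick graph   */
} xlator_t;

static void
sample_fini(xlator_t *xl)
{
    /* xlator-specific teardown: release whatever init allocated */
    free(xl->private_data);
    xl->private_data = NULL;
    printf("freed %s\n", xl->name);
}

/* What a brick detach should do: run every xlator's destructor.
 * If this loop is skipped, each volume delete leaks the detached
 * graph's allocations inside the long-lived multiplexed process. */
static void
cleanup_brick_graph(xlator_t *top)
{
    for (xlator_t *xl = top; xl != NULL; xl = xl->next)
        if (xl->fini)
            xl->fini(xl);
}

int
main(void)
{
    xlator_t posix  = { "posix",  malloc(1024), sample_fini, NULL };
    xlator_t server = { "server", malloc(1024), sample_fini, &posix };

    cleanup_brick_graph(&server);    /* simulate one brick detach */
    return 0;
}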

--- Additional comment from Prasanth on 2018-01-17 17:47:54 EST ---

Karthick, please have a clone of this BZ created against CNS for tracking purpose and propose it for the next immediate release.

--- Additional comment from krishnaram Karthick on 2018-01-18 01:09:11 EST ---

(In reply to Prasanth from comment #4)
> Karthick, please have a clone of this BZ created against CNS for tracking
> purpose and propose it for the next immediate release.

Done.

--- Additional comment from nchilaka on 2018-01-19 05:08:51 EST ---

Added this case (as mentioned in the description) to the RHGS brickmux (non-containerized) test plan.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2018-01-24 00:09:13 EST ---

This bug is automatically being provided 'pm_ack+' for the release flag 'rhgs-3.4.0', having been appropriately marked for the release, and having been provided ACK from Development and QE.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2018-01-24 07:26:33 EST ---

Since this bug has been approved for the RHGS 3.4.0 release of Red Hat Gluster Storage 3, through release flag 'rhgs-3.4.0+', and through the Internal Whiteboard entry of '3.4.0', the Target Release is being automatically set to 'RHGS 3.4.0'.

Comment 1 Worker Ant 2018-03-05 06:25:16 UTC
REVIEW: https://review.gluster.org/19666 (glusterfsd: Memleak in glusterfsd process while brick mux is on) posted (#1) for review on release-3.12 by MOHIT AGRAWAL

Comment 2 Worker Ant 2018-04-06 12:47:59 UTC
COMMIT: https://review.gluster.org/19666 committed in release-3.12 by "jiffin tony Thottan" <jthottan> with a commit message- glusterfsd: Memleak in glusterfsd process while brick mux is on

Problem: At the time of stopping the volume while brick multiplexing is
         enabled, memory is not cleaned up from all server-side xlators.

Solution: To clean up memory for all server-side xlators, call fini
          in glusterfs_handle_terminate after sending the GF_EVENT_CLEANUP
          notification to the top xlator.

> BUG: 1544090
> Signed-off-by: Mohit Agrawal <moagrawa>
> (cherry picked from commit 7c3cc485054e4ede1efb358552135b432fb7047a)

>Note: Ran all test cases in a separate build (https://review.gluster.org/19574)
>      with the same patch after forcefully enabling brick mux; all test cases
>      passed.

BUG: 1549473
Signed-off-by: Mohit Agrawal <moagrawa>
Change-Id: Ia10dc7f2605aa50f2b90b3fe4eb380ba9299e2fc
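
The ordering described in the commit message above can be sketched as follows. This is a simplified, hypothetical outline for illustration only (the types, the placeholder event id, and handle_terminate_sketch() are stand-ins, not the actual glusterfsd code): the top xlator is first notified so the brick graph can wind down, and only then is fini() invoked on each server-side xlator to release its memory.

/* Hypothetical outline of the fix's ordering, with simplified
 * stand-in types -- not the real glusterfsd implementation. */
#include <stddef.h>
#include <stdio.h>

enum { GF_EVENT_CLEANUP_SKETCH = 1 };   /* placeholder event id for the sketch */

typedef struct xlator xlator_t;
struct xlator {
    const char *name;
    int (*notify)(xlator_t *this, int event);  /* event handler           */
    void (*fini)(xlator_t *this);              /* per-xlator destructor   */
    xlator_t *next;                            /* next server-side xlator */
};

static int
print_notify(xlator_t *this, int event)
{
    printf("%s received event %d\n", this->name, event);
    return 0;
}

static void
print_fini(xlator_t *this)
{
    printf("%s fini()\n", this->name);
}

/* Step 1: notify the top xlator so the brick graph can wind down.
 * Step 2: call fini() on every server-side xlator so its memory is
 * actually released instead of lingering in the brick process. */
static void
handle_terminate_sketch(xlator_t *top)
{
    if (top->notify)
        top->notify(top, GF_EVENT_CLEANUP_SKETCH);

    for (xlator_t *xl = top; xl != NULL; xl = xl->next)
        if (xl->fini)
            xl->fini(xl);
}

int
main(void)
{
    xlator_t io_stats = { "io-stats", print_notify, print_fini, NULL };
    xlator_t server   = { "server",   print_notify, print_fini, &io_stats };

    handle_terminate_sketch(&server);   /* cleanup event first, then fini */
    return 0;
}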

Comment 3 Jiffin 2018-04-24 06:53:38 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.12.8, please open a new bug report.

glusterfs-3.12.8 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-devel/2018-April/054749.html
[2] https://www.gluster.org/pipermail/gluster-users/