Bug 1316178

Summary: changelog/rpc: Memory leak- rpc_clnt_t object is never freed
Product: [Community] GlusterFS Reporter: Kotresh HR <khiremat>
Component: rpcAssignee: Kotresh HR <khiremat>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: mainlineCC: bugs
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1359363 (view as bug list) Environment:
Last Closed: 2017-03-08 09:19:22 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1359363, 1359364    

Description Kotresh HR 2016-03-09 15:11:59 UTC
Description of problem:
changelog rpc-clnt object is never freed and the memory is leaked.
Changelog should unref the rpc-clnt object after it's usage on DISCONNECT,
since the proper ref and unref of rpc-clnt object was not built up in rpc
layer, this was not done.

Version-Release number of selected component (if applicable):
3.7.9

How reproducible:
Always

Steps to Reproduce:
1. Setup geo-replication 
2. stop and start geo-replication multiple times will leak rpc_clnt object on brick process.


Actual results:
Memory leak

Expected results:
Process should not leak memory

Additional info:

Comment 1 Vijay Bellur 2016-03-09 15:40:52 UTC
REVIEW: http://review.gluster.org/13658 (changelog/rpc: Fix rpc_clnt_t mem leaks) posted (#1) for review on master by Kotresh HR (khiremat)

Comment 2 Vijay Bellur 2016-03-15 12:46:33 UTC
REVIEW: http://review.gluster.org/13658 (changelog/rpc: Fix rpc_clnt_t mem leaks) posted (#2) for review on master by Kotresh HR (khiremat)

Comment 3 Vijay Bellur 2016-03-20 12:59:14 UTC
REVIEW: http://review.gluster.org/13658 (changelog/rpc: Fix rpc_clnt_t mem leaks) posted (#3) for review on master by Kotresh HR (khiremat)

Comment 4 Mike McCune 2016-03-28 22:53:35 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions

Comment 5 Vijay Bellur 2016-07-11 09:18:33 UTC
REVIEW: http://review.gluster.org/13658 (changelog/rpc: Fix rpc_clnt_t mem leaks) posted (#4) for review on master by Kotresh HR (khiremat)

Comment 6 Vijay Bellur 2016-07-13 12:25:25 UTC
REVIEW: http://review.gluster.org/13658 (changelog/rpc: Fix rpc_clnt_t mem leaks) posted (#5) for review on master by Kotresh HR (khiremat)

Comment 7 Vijay Bellur 2016-07-15 15:08:44 UTC
REVIEW: http://review.gluster.org/13658 (changelog/rpc: Fix rpc_clnt_t mem leaks) posted (#6) for review on master by Kotresh HR (khiremat)

Comment 8 Vijay Bellur 2016-07-21 07:12:42 UTC
REVIEW: http://review.gluster.org/13658 (changelog/rpc: Fix rpc_clnt_t mem leaks) posted (#7) for review on master by Kotresh HR (khiremat)

Comment 9 Vijay Bellur 2016-07-21 12:53:18 UTC
REVIEW: http://review.gluster.org/13658 (changelog/rpc: Fix rpc_clnt_t mem leaks) posted (#8) for review on master by Kotresh HR (khiremat)

Comment 10 Vijay Bellur 2016-07-22 06:05:28 UTC
REVIEW: http://review.gluster.org/13658 (changelog/rpc: Fix rpc_clnt_t mem leaks) posted (#9) for review on master by Kotresh HR (khiremat)

Comment 11 Vijay Bellur 2016-07-22 15:12:55 UTC
COMMIT: http://review.gluster.org/13658 committed in master by Jeff Darcy (jdarcy) 
------
commit 637ce9e2e27e9f598a4a6c5a04cd339efaa62076
Author: Kotresh HR <khiremat>
Date:   Mon Mar 7 11:45:07 2016 +0530

    changelog/rpc: Fix rpc_clnt_t mem leaks
    
    PROBLEM:
       1. Freeing up rpc_clnt object might lead to crashes. Well,
          it was not a necessity to free rpc-clnt object till now
          because all the existing use cases needs to reconnect
          back on disconnects. Hence timer code was not taking
          ref on rpc-clnt object.
    
          Glusterd had some use-cases that led to crash due to
          ping-timer and they fixed only those code paths that
          involve ping-timer.
    
          Now, since changelog has an use-case where rpc-clnt
          need to be freed up, we need to fix timer code to take
          refs
    
       2. In changelog, because of issue 1, only mydata was being
          freed which is incorrect. And there are races where
          rpc-clnt object would access the freed mydata which
          would lead to crashes.
    
          Since changelog xlator resides on brick side and is long
          living process, if multiple libgfchangelog consumers
          register to changelog and disconnect/reconnect mulitple
          times, it would result in leak of 'rpc-clnt' object
          for every connect/disconnect.
    
    SOLUTION:
       1. Handle ref/unref of 'rpc_clnt' structure in timer
          functions properly.
       2. In changelog, unref 'rpc_clnt' in RPC_CLNT_DISCONNECT
          after disabling timers and free mydata on RPC_CLNT_DESTROY.
    
    RPC SETUP IN CHANGELOG:
       1. changelog xlator initiates rpc server say 'changelog_rpc_server'
       2. libgfchangelog initiates one rpc server say 'libgfchangelog_rpc_server'
       3. libgfchangelog initiates rpc client and connects to 'changelog_rpc_server'
       4. In return changelog_rpc_server initiates a rpc client and connects back
          to 'libgfchangelog_rpc_server'
    
    REF/UNREF HANDLING IN TIMER FUNCTIONS:
    Let's say rpc clnt refcount = 1
       1. Take the ref before reigstering callback to timer queue
               >>>>  rpc_clnt_ref (say ref count becomes = 2)
       2. Register a callback to timer say 'callback1'
       3. If register fails:
               >>>> rpc_clnt_unref (ref count = 1)
       4. On timer expiration, 'callback1' gets called. So unref rpc clnt at the end
          in 'callback1'. This is corresponding to ref taken in step 1
               >>>> rpc_clnt_unref (ref count = 1)
       5. The cycle from step-1 to step-4 continues....until timer cancel event happens
       6. timer cancel of say 'callback1'
               If timer cancel fails:
                     Do nothing, Step-4 would have unrefd
               If timer cancel succeeds:
                     >>>> rpc_clnt_unref (ref count = 1)
    
    Change-Id: I91389bc511b8b1a17824941970ee8d2c29a74a09
    BUG: 1316178
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: http://review.gluster.org/13658
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Raghavendra G <rgowdapp>

Comment 12 Kotresh HR 2017-03-08 09:19:22 UTC
v3.10.0 Contains the Fix