Bug 1179136

Summary: glusterd: Gluster rebalance status returns failure
Product: [Community] GlusterFS
Component: glusterd
Version: 3.6.0
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Reporter: Atin Mukherjee <amukherj>
Assignee: Atin Mukherjee <amukherj>
CC: alex.smith, amukherj, bugs, david.macdonald, ggarg, gluster-bugs, kaushal, lmohanty, ndevos, nlevinki, nsathyan, rabhat, rhs-bugs, rhsc-qe-bugs, sasundar, ssampat, vbellur
Doc Type: Bug Fix
Type: Bug
Clone Of: 1154635
Last Closed: 2016-01-08 09:18:34 UTC
Bug Depends On: 1130158, 1154635
Bug Blocks: 1184460

Comment 1 Anand Avati 2015-01-06 09:55:31 UTC
REVIEW: http://review.gluster.org/9393 (glusterd : release cluster wide locks in op-sm during failures) posted (#1) for review on release-3.6 by Atin Mukherjee (amukherj)

Comment 2 Anand Avati 2015-01-11 11:57:31 UTC
REVIEW: http://review.gluster.org/9393 (glusterd : release cluster wide locks in op-sm during failures) posted (#2) for review on release-3.6 by Atin Mukherjee (amukherj)

Comment 3 Anand Avati 2015-03-04 07:31:12 UTC
COMMIT: http://review.gluster.org/9393 committed in release-3.6 by Raghavendra Bhat (raghavendra) 
------
commit b646678334f4fab78883ecc1b993ec0cb1b49aba
Author: Atin Mukherjee <amukherj>
Date:   Mon Oct 27 12:12:03 2014 +0530

    glusterd : release cluster wide locks in op-sm during failures
    
    The glusterd op-sm infrastructure has loopholes in how it handles error cases
    in the locking/unlocking phases; these can leave stale locks behind, which in
    turn block further transactions from going through (a minimal sketch of the
    error-path release pattern appears after this comment).
    
    This patch still does not handle every possible unlocking error case, as the
    framework has neither a retry mechanism nor a lock timeout. For example, if
    unlocking fails on one of the peers, the cluster-wide lock is not released
    and no further transaction can be made until the originator node, or the
    node where unlocking failed, is restarted.
    
    The following test cases were executed (with the help of gdb) after applying
    this patch:
    
    * RPC times out in lock cbk
    * Decoding of RPC response in lock cbk fails
    * RPC response is received from unknown peer in lock cbk
    * Setting peerinfo in the dictionary fails while sending the lock request
      for the first peer in the list
    * Setting peerinfo in the dictionary fails while sending the lock request
      for the other peers
    * Lock RPC could not be sent to peers
    
    For all of the above test cases, the success criterion is that no stale
    locks are left behind.
    Patch link : http://review.gluster.org/9012
    
    Change-Id: Ia1550341c31005c7850ee1b2697161c9ca04b01a
    BUG: 1179136
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: http://review.gluster.org/9012
    Reviewed-by: Krishnan Parthasarathi <kparthas>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Kaushal M <kaushal>
    Reviewed-on: http://review.gluster.org/9393
    Reviewed-by: Raghavendra Bhat <raghavendra>
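
The commit above enforces one simple rule: every failure path in the lock-phase callbacks must release the cluster-wide lock (or inject an unlock event into the op state machine) instead of returning early. The sketch below illustrates that pattern in isolation; it is not glusterd source, and the names cluster_lock_t, op_sm_inject_event, and lock_cbk are hypothetical stand-ins for the real op-sm types and events.

    /*
     * Minimal sketch (not glusterd code): every failure path in a lock-phase
     * RPC callback injects an unlock event so the cluster-wide lock is
     * released rather than left stale.  All names here are hypothetical.
     */
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
            bool held;              /* is the cluster-wide lock currently held? */
            char owner[64];         /* node that acquired it */
    } cluster_lock_t;

    typedef enum { EVENT_LOCK_ACQUIRED, EVENT_UNLOCK } op_event_t;

    static cluster_lock_t cluster_lock = { .held = true, .owner = "originator" };

    /* Hypothetical stand-in for injecting an event into the op state machine. */
    static void
    op_sm_inject_event (op_event_t ev)
    {
            if (ev == EVENT_UNLOCK) {
                    cluster_lock.held = false;  /* release; never leave a stale lock */
                    printf ("cluster-wide lock released\n");
            } else {
                    printf ("lock held on all peers, transaction may proceed\n");
            }
    }

    /*
     * Lock-phase RPC callback.  'rpc_failed' models any of the failure cases
     * listed in the commit: RPC timeout, undecodable response, response from
     * an unknown peer, dict set failure, or failure to send the lock request.
     */
    static int
    lock_cbk (bool rpc_failed)
    {
            if (rpc_failed) {
                    /* The pre-patch behaviour effectively skipped this step on
                     * some paths, leaving cluster_lock.held true forever. */
                    op_sm_inject_event (EVENT_UNLOCK);
                    return -1;
            }

            op_sm_inject_event (EVENT_LOCK_ACQUIRED);
            return 0;
    }

    int
    main (void)
    {
            lock_cbk (true);    /* simulated failure: the lock must be released */
            lock_cbk (false);   /* successful path */
            return 0;
    }

Run standalone, the first call models any of the failure cases exercised with gdb in the commit and prints that the lock was released; the absence of a lingering held lock corresponds to the "no stale locks" success criterion above.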