REVIEW: http://review.gluster.org/8945 (glusterd: error handling in op-sm) posted (#1) for review on master by Atin Mukherjee (amukherj)
REVIEW: http://review.gluster.org/8945 (glusterd: error handling in op-sm) posted (#2) for review on master by Atin Mukherjee (amukherj)
REVIEW: http://review.gluster.org/9012 (glusterd : release locks in op-sm for allmost all possible cases) posted (#1) for review on master by Atin Mukherjee (amukherj)
REVIEW: http://review.gluster.org/9012 (glusterd : release cluster wide locks in op-sm during failures) posted (#2) for review on master by Atin Mukherjee (amukherj)
REVIEW: http://review.gluster.org/9043 (glusterd: Porting rebalance command to mgmt_v3 framework) posted (#1) for review on master by Avra Sengupta (asengupt)
REVIEW: http://review.gluster.org/9012 (glusterd : release cluster wide locks in op-sm during failures) posted (#3) for review on master by Atin Mukherjee (amukherj)
REVIEW: http://review.gluster.org/9012 (glusterd : release cluster wide locks in op-sm during failures) posted (#4) for review on master by Atin Mukherjee (amukherj)
REVIEW: http://review.gluster.org/9012 (glusterd : release cluster wide locks in op-sm during failures) posted (#5) for review on master by Atin Mukherjee (amukherj)
REVIEW: http://review.gluster.org/9043 (glusterd: Porting rebalance command to mgmt_v3 framework) posted (#2) for review on master by Avra Sengupta (asengupt)
COMMIT: http://review.gluster.org/9012 committed in master by Kaushal M (kaushal) ------

commit 97ccd45fb66a63c0b2436a0245dfb9490e2941b7
Author: Atin Mukherjee <amukherj>
Date:   Mon Oct 27 12:12:03 2014 +0530

glusterd : release cluster wide locks in op-sm during failures

The glusterd op-sm infrastructure has loopholes in its handling of error cases during the locking/unlocking phases, which end up leaving stale locks that block further transactions. This patch still does not cover every unlocking error case, since the framework has neither a retry mechanism nor a lock timeout. For example, if unlocking fails on one of the peers, the cluster-wide lock is not released and no further transaction can proceed until the originator node (or the node where unlocking failed) is restarted.

The following test cases were executed (with the help of gdb) after applying this patch:
* RPC times out in lock cbk
* Decoding of the RPC response in lock cbk fails
* RPC response is received from an unknown peer in lock cbk
* Setting peerinfo in the dictionary fails while sending the lock request for the first peer in the list
* Setting peerinfo in the dictionary fails while sending the lock request for the other peers
* Lock RPC could not be sent to peers

For all the above test cases the success criterion is that no stale locks remain.

Change-Id: Ia1550341c31005c7850ee1b2697161c9ca04b01a
BUG: 1154635
Signed-off-by: Atin Mukherjee <amukherj>
Reviewed-on: http://review.gluster.org/9012
Reviewed-by: Krishnan Parthasarathi <kparthas>
Tested-by: Gluster Build System <jenkins.com>
Reviewed-by: Kaushal M <kaushal>
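The pattern the patch describes (sketched here, not the actual diff) is that any failure in the lock phase must feed an error back into the op-sm so the locks already acquired are released, rather than returning early and leaving them held. Below is a minimal C sketch of that idea; the function and variable names are hypothetical, and the real callbacks in xlators/mgmt/glusterd/src/glusterd-op-sm.c carry far more context.

```c
/*
 * Illustrative sketch only: opsm_lock_cbk and the surrounding names are
 * hypothetical, not the actual glusterd code.
 */
#include <stdio.h>

typedef enum { OPSM_OK = 0, OPSM_ERR = -1 } opsm_status_t;

/* Callback for the cluster-wide lock RPC sent to one peer. */
opsm_status_t
opsm_lock_cbk (int rpc_status, const char *peer, int *locked_peers)
{
        if (rpc_status != 0) {
                /*
                 * RPC timed out, the response could not be decoded, or it
                 * came from an unknown peer: do not bail out silently.
                 * Report failure so the caller can inject a reject event
                 * into the op-sm and release the locks already taken,
                 * instead of leaving them stale.
                 */
                fprintf (stderr, "lock failed on peer %s (err=%d), "
                         "releasing cluster-wide locks\n", peer, rpc_status);
                return OPSM_ERR;
        }

        (*locked_peers)++;
        return OPSM_OK;
}
```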
Still seeing this after backporting your patch to 3.6.1:

[2014-11-28 16:08:38.353520] E [glusterd-utils.c:148:glusterd_lock] 0-management: Unable to get lock for uuid: 1e1b0cd2-208e-4984-8735-2c3d45df93cf, lock held by: 1e1b0cd2-208e-4984-8735-2c3d45df93cf

This was the commit I patched in: https://github.com/gluster/glusterfs/commit/97ccd45fb66a63c0b2436a0245dfb9490e2941b7

Alex.
After doing a few tests today, it seems the problem reoccurs if you run a command on two or more hosts at the same time (I was using `gluster v status`), so I can now replicate this at will. The patch certainly reduced the time it took for the problem to appear. I end up seeing the issue quite often because I have nagios checks that parse gluster output to ensure bricks are online.

Alex.
Currently the volume status command is safeguarded by a cluster-wide lock, so if multiple volume status commands are run at the same time, one of them will definitely fail. We could, however, consider making it lock-less; it is certainly a candidate for enhancement.
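For background on why two simultaneous commands collide: the cluster-wide lock is a single-owner lock keyed by a node UUID, and a transaction that finds it already held is rejected with exactly the "Unable to get lock for uuid" error shown above. Here is a minimal sketch of that idea, assuming libuuid; the real code is glusterd_lock() (and its unlock counterpart) in xlators/mgmt/glusterd/src/glusterd-utils.c and stores the holder in glusterd's private context.

```c
/* Minimal single-owner cluster lock sketch, assuming libuuid (-luuid). */
#include <stdio.h>
#include <uuid/uuid.h>

static uuid_t lock_owner;               /* all-zero means "not held" */

int
cluster_lock (uuid_t uuid)
{
        if (!uuid_is_null (lock_owner)) {
                char held[37], want[37];
                uuid_unparse (lock_owner, held);
                uuid_unparse (uuid, want);
                /* The failure logged above: a second transaction (e.g.
                 * another `gluster v status`) finds the lock taken. */
                fprintf (stderr, "Unable to get lock for uuid: %s, "
                         "lock held by: %s\n", want, held);
                return -1;
        }
        uuid_copy (lock_owner, uuid);
        return 0;
}

int
cluster_unlock (uuid_t uuid)
{
        if (uuid_compare (lock_owner, uuid) != 0)
                return -1;              /* only the holder may unlock */
        uuid_clear (lock_owner);
        return 0;
}
```

If the unlock path is skipped on an error (the loophole the patch above closes), lock_owner is never cleared and every later transaction fails the same way until glusterd is restarted.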
That's fair enough. I should add, in case it wasn't clear, that two nodes both attempting to take the lock at the same time locks gluster up permanently (until glusterd is reloaded). For now I have changed my checks to run from a single node and haven't seen the issue reoccur. Thanks
This bug is being closed because a release that should address the reported issue has been made available. If the problem is still not fixed with glusterfs-3.7.0, please open a new bug report.

glusterfs-3.7.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/10939
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user