Bug 1303325

Summary: Glusterd not releasing cli lock for RHGS 3.0.4 and op-version 2
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Chris Blum <cblum>
Component: coreAssignee: Gaurav Kumar Garg <ggarg>
Status: CLOSED NOTABUG QA Contact: Anoop <annair>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: rhgs-3.0CC: ggarg, nsathyan, rhs-bugs, smohan, storage-qa-internal
Target Milestone: ---Keywords: ZStream
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-02-01 09:21:05 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Chris Blum 2016-01-30 19:54:38 UTC
Description of problem:
It seems like if RHGS 3.0.4 is run with op-version 2 it is easily possible to cause a situation where the cli lock is not released.
Therefore no additional cli commands can be executed on the storage pool until the glusterd process that allegedly has the lock is restarted 

Version-Release number of selected component (if applicable):
RHGS 3.0.4
gluster --version is 3.6.0.53 (includes customer specific hotfixes)

How reproducible:
always

Steps to Reproduce:
1) Execute 'gluster vol status' on all nodes in parallel
  --> All commands will fail
2) Lock will be held by one glusterd process, but even this glusterd process will be unable to execute any cli commands

Actual results:
Lock is allegedly held by one glusterd process, but this one doesn't know about it himself

Expected results:
Lock should be released

Additional info:

glusterd log file with debug logging enabled:
[2016-01-30 18:54:34.475812] I [MSGID: 106492] [glusterd-handler.c:2706:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: b3e6c376-c288-4c5e-88e1-7698a2d87e7b
[2016-01-30 18:54:34.478629] I [MSGID: 106502] [glusterd-handler.c:2751:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2016-01-30 18:54:34.482365] I [MSGID: 106490] [glusterd-handler.c:2530:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: e0240641-8246-41a8-818d-96dcd0b33684
[2016-01-30 18:54:34.482699] I [MSGID: 106493] [glusterd-handler.c:3771:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to rhgs2 (0), ret: 0
[2016-01-30 18:54:34.489565] I [MSGID: 106492] [glusterd-handler.c:2706:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: e0240641-8246-41a8-818d-96dcd0b33684
[2016-01-30 18:54:34.489634] I [MSGID: 106502] [glusterd-handler.c:2751:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2016-01-30 18:54:34.496387] I [MSGID: 106493] [glusterd-rpc-ops.c:692:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: f00f9da2-14ce-43da-b55f-16c0d1efc594
[2016-01-30 18:54:34.496541] I [MSGID: 106492] [glusterd-handler.c:2706:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: e0240641-8246-41a8-818d-96dcd0b33684
[2016-01-30 18:54:34.496583] I [MSGID: 106502] [glusterd-handler.c:2751:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2016-01-30 18:54:34.503255] I [MSGID: 106492] [glusterd-handler.c:2706:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: e0240641-8246-41a8-818d-96dcd0b33684
[2016-01-30 18:54:34.503315] I [MSGID: 106502] [glusterd-handler.c:2751:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2016-01-30 18:54:34.509304] I [MSGID: 106493] [glusterd-rpc-ops.c:692:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: b3e6c376-c288-4c5e-88e1-7698a2d87e7b
[2016-01-30 18:54:34.526280] I [MSGID: 106492] [glusterd-handler.c:2706:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: e0240641-8246-41a8-818d-96dcd0b33684
[2016-01-30 18:54:34.526507] I [MSGID: 106502] [glusterd-handler.c:2751:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2016-01-30 18:54:34.534227] I [MSGID: 106493] [glusterd-rpc-ops.c:478:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: e0240641-8246-41a8-818d-96dcd0b33684, host: rhgs2, port: 0
[2016-01-30 18:54:34.537575] I [MSGID: 106492] [glusterd-handler.c:2706:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: e0240641-8246-41a8-818d-96dcd0b33684
[2016-01-30 18:54:34.537636] I [MSGID: 106502] [glusterd-handler.c:2751:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2016-01-30 18:54:34.544401] I [MSGID: 106492] [glusterd-handler.c:2706:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: f00f9da2-14ce-43da-b55f-16c0d1efc594
[2016-01-30 18:54:34.547102] I [MSGID: 106502] [glusterd-handler.c:2751:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2016-01-30 18:54:34.550798] I [MSGID: 106493] [glusterd-rpc-ops.c:692:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: f00f9da2-14ce-43da-b55f-16c0d1efc594
[2016-01-30 18:54:34.551713] I [MSGID: 106493] [glusterd-rpc-ops.c:692:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: e0240641-8246-41a8-818d-96dcd0b33684
[2016-01-30 18:54:34.570773] I [MSGID: 106493] [glusterd-rpc-ops.c:692:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: b3e6c376-c288-4c5e-88e1-7698a2d87e7b
[2016-01-30 18:54:34.584747] I [MSGID: 106492] [glusterd-handler.c:2706:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: e0240641-8246-41a8-818d-96dcd0b33684
[2016-01-30 18:54:34.584858] I [MSGID: 106502] [glusterd-handler.c:2751:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2016-01-30 18:54:34.601728] I [MSGID: 106492] [glusterd-handler.c:2706:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: b3e6c376-c288-4c5e-88e1-7698a2d87e7b
[2016-01-30 18:54:34.604335] I [MSGID: 106502] [glusterd-handler.c:2751:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2016-01-30 18:56:38.066038] E [MSGID: 106155] [glusterd-utils.c:165:glusterd_lock] 0-management: Unable to get lock for uuid: 04dce505-1c50-41e9-8a64-e4eff53796ee, lock held by: 04dce505-1c50-41e9-8a64-e4eff53796ee
[2016-01-30 18:56:38.066062] E [MSGID: 106155] [glusterd-syncop.c:1783:gd_sync_task_begin] 0-management: Unable to acquire lock
[2016-01-30 18:58:02.134248] E [MSGID: 106155] [glusterd-utils.c:165:glusterd_lock] 0-management: Unable to get lock for uuid: 04dce505-1c50-41e9-8a64-e4eff53796ee, lock held by: 04dce505-1c50-41e9-8a64-e4eff53796ee
[2016-01-30 18:58:02.134250] E [MSGID: 106155] [glusterd-syncop.c:1783:gd_sync_task_begin] 0-management: Unable to acquire lock
[2016-01-30 19:04:45.571516] E [rpc-clnt.c:201:call_bail] 0-management: bailing out frame type(glusterd mgmt) op(--(4)) xid = 0xb sent = 2016-01-30 18:54:37.976364. timeout = 600 for 192.168.37.200:24007
[2016-01-30 19:04:45.571627] E [MSGID: 106153] [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Commit failed on RHGS1. Please check log file for details.
[2016-01-30 19:14:46.634224] E [rpc-clnt.c:201:call_bail] 0-management: bailing out frame type(glusterd mgmt) op(--(2)) xid = 0xc sent = 2016-01-30 19:04:45.571845. timeout = 600 for 192.168.37.102:24007
[2016-01-30 19:14:46.634304] E [MSGID: 106153] [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Unlocking failed on rhgs4. Please check log file for details.
[2016-01-30 19:14:46.634360] E [rpc-clnt.c:201:call_bail] 0-management: bailing out frame type(glusterd mgmt) op(--(2)) xid = 0xc sent = 2016-01-30 19:04:45.571878. timeout = 600 for 192.168.37.200:24007
[2016-01-30 19:14:46.634387] E [MSGID: 106153] [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Unlocking failed on RHGS1. Please check log file for details.
[2016-01-30 19:14:46.634423] E [MSGID: 106152] [glusterd-syncop.c:1562:gd_unlock_op_phase] 0-management: Failed to unlock on some peer(s)
[2016-01-30 19:14:46.634547] E [glusterd-op-sm.c:349:glusterd_clear_txn_opinfo] (-->/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(glusterd_op_begin_synctask+0x3b) [0x7fe5e696da5b] -->/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(gd_sync_task_begin+0x18a) [0x7fe5e696d20a] -->/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(glusterd_clear_txn_opinfo+0xd7) [0x7fe5e68dfab7] ) 0-management: Unable to get transaction opinfo for transaction ID : 00000000-0000-0000-0000-000000000000
[2016-01-30 19:14:46.634566] E [MSGID: 106154] [glusterd-syncop.c:1908:gd_sync_task_begin] 0-management: Unable to clear transaction's opinfo for transaction ID : 00000000-0000-0000-0000-000000000000
[2016-01-30 19:14:46.634605] I [socket.c:3436:socket_submit_reply] 0-socket.management: not connected (priv->connected = -1)
[2016-01-30 19:14:46.634615] E [rpcsvc.c:1312:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x1, Program: GlusterD svc cli, ProgVers: 2, Proc: 27) to rpc-transport (socket.management)
[2016-01-30 19:14:46.634630] E [MSGID: 106430] [glusterd-utils.c:400:glusterd_submit_reply] 0-glusterd: Reply submission failed
[2016-01-30 19:15:17.385249] E [MSGID: 106153] [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Locking failed on rhgs4. Please check log file for details.
[2016-01-30 19:15:17.385607] E [MSGID: 106153] [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Locking failed on RHGS1. Please check log file for details.
[2016-01-30 19:15:17.385984] E [MSGID: 106151] [glusterd-syncop.c:1858:gd_sync_task_begin] 0-management: Locking Peers Failed.
[2016-01-30 19:15:27.267040] E [MSGID: 106153] [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Locking failed on rhgs4. Please check log file for details.
[2016-01-30 19:15:27.267212] E [MSGID: 106153] [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Locking failed on RHGS1. Please check log file for details.
[2016-01-30 19:15:27.267840] E [MSGID: 106151] [glusterd-syncop.c:1858:gd_sync_task_begin] 0-management: Locking Peers Failed.
[2016-01-30 19:15:43.484237] E [MSGID: 106153] [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Locking failed on rhgs4. Please check log file for details.
[2016-01-30 19:15:43.484334] E [MSGID: 106153] [glusterd-syncop.c:113:gd_collate_errors] 0-glusterd: Locking failed on RHGS1. Please check log file for details.
[2016-01-30 19:15:43.484398] E [MSGID: 106151] [glusterd-syncop.c:1858:gd_sync_task_begin] 0-management: Locking Peers Failed.

Comment 1 Gaurav Kumar Garg 2016-01-31 06:50:41 UTC
In RHS 2.* release we was having cluster-wide lock means one transaction can issue at one time. But in RHGS-3* release we have introduced Volume-wide lock you can see https://bugzilla.redhat.com/show_bug.cgi?id=1011470 bug and commit id:  97ce783de326b51fcba65737f07db2c314d1e218  in RHGS 3* downstream and http://review.gluster.org/#/c/5994/ patch in upstream.

Will update further.

Comment 3 Chris Blum 2016-01-31 23:30:49 UTC
we could avoid further locking right now by trying to avoid any parallel CLI command executions. This has worked now for ~22h.

Comment 4 Gaurav Kumar Garg 2016-02-01 09:21:05 UTC
This is a expected behavior. I don't think its a bug. Closing this bug. Feel free to reopen this bug if anybody have issue.