+++ This bug was initially created as a clone of Bug #843003 +++

Description of problem:

Suppose there is a cluster with a large number of files and directories. When a "gluster volume status" command is executed, the request is sent to the glusterfsd processes to obtain the data. If glusterfsd does not respond within 30 minutes, the frame that sent the request gets bailed out in glusterd, but the rejection is not sent back to the source glusterd that originated the command. As a result, the cluster lock held by the source glusterd is never released, and subsequent operations fail because they cannot acquire the lock.

Source glusterd logs:

[2012-07-05 17:52:44.211023] I [glusterd-op-sm.c:2039:glusterd_op_ac_send_stage_op] 0-glusterd: Sent op req to 3 peers
[2012-07-05 17:52:44.211237] I [glusterd-rpc-ops.c:880:glusterd3_1_stage_op_cbk] 0-glusterd: Received ACC from uuid: 92d50993-15a5-42af-92f3-a6ff7cfddd43
[2012-07-05 17:52:44.211284] I [glusterd-rpc-ops.c:880:glusterd3_1_stage_op_cbk] 0-glusterd: Received ACC from uuid: 0b17d7cf-c86a-4d82-929b-efb1ca2e331c
[2012-07-05 17:52:44.211314] I [glusterd-rpc-ops.c:880:glusterd3_1_stage_op_cbk] 0-glusterd: Received ACC from uuid: c6876b56-9729-4a98-8eea-fc9293cf92b0
[2012-07-05 17:52:44.216790] I [glusterd-op-sm.c:2384:glusterd_op_ac_send_commit_op] 0-management: Sent op req to 3 peers
[2012-07-05 17:52:44.223543] I [glusterd-rpc-ops.c:1316:glusterd3_1_commit_op_cbk] 0-glusterd: Received ACC from uuid: 0b17d7cf-c86a-4d82-929b-efb1ca2e331c
[2012-07-05 17:52:44.225221] I [glusterd-rpc-ops.c:1316:glusterd3_1_commit_op_cbk] 0-glusterd: Received ACC from uuid: c6876b56-9729-4a98-8eea-fc9293cf92b0
[2012-07-05 17:54:44.335825] I [glusterd-handler.c:2646:glusterd_handle_status_volume] 0-management: Received status volume req for volume new
[2012-07-05 17:54:44.335879] E [glusterd-utils.c:277:glusterd_lock] 0-glusterd: Unable to get lock for uuid: adfa231a-d8e0-4d6b-bc11-ad29b987ace4, lock held by: adfa231a-d8e0-4d6b-bc11-ad29b987ace4
[2012-07-05 17:54:44.335901] E [glusterd-handler.c:453:glusterd_op_txn_begin] 0-management: Unable to acquire local lock, ret: -1

Destination glusterd logs (call bail happened here):

[2012-07-05 18:11:47.816910] E [rpc-clnt.c:208:call_bail] 0-management: bailing out frame type(brick operations) op(--(4)) xid = 0x1x sent = 2012-07-05 17:41:46.943808. timeout = 1800
[2012-07-05 23:37:15.490400] I [glusterd-handler.c:497:glusterd_handle_cluster_lock] 0-glusterd: Received LOCK from uuid: adfa231a-d8e0-4d6b-bc11-ad29b987ace4
[2012-07-05 23:37:15.490438] I [glusterd-utils.c:285:glusterd_lock] 0-glusterd: Cluster lock held by adfa231a-d8e0-4d6b-bc11-ad29b987ace4
[2012-07-05 23:37:15.490472] I [glusterd-handler.c:1315:glusterd_op_lock_send_resp] 0-glusterd: Responded, ret: 0
[2012-07-05 23:37:15.490906] I [glusterd-handler.c:542:glusterd_req_ctx_create] 0-glusterd: Received op from uuid: adfa231a-d8e0-4d6b-bc11-ad29b987ace4
[2012-07-05 23:37:15.490969] I [glusterd-handler.c:1417:glusterd_op_stage_send_resp] 0-glusterd: Responded to stage, ret: 0
[2012-07-05 23:37:15.492447] I [glusterd-handler.c:542:glusterd_req_ctx_create] 0-glusterd: Received op from uuid: adfa231a-d8e0-4d6b-bc11-ad29b987ace4
[2012-07-05 23:42:50.244253] E [glusterd-utils.c:277:glusterd_lock] 0-glusterd: Unable to get lock for uuid: 92d50993-15a5-42af-92f3-a6ff7cfddd43, lock held by: adfa231a-d8e0-4d6b-bc11-ad29b987ace4
[2012-07-05 23:42:50.244303] E [glusterd-handler.c:453:glusterd_op_txn_begin] 0-management: Unable to acquire local lock, ret: -1
[2012-07-05 23:42:57.927386] I [glusterd-handler.c:2646:glusterd_handle_status_volume] 0-management: Received status volume req for volume new
[2012-07-05 23:42:57.927436] E [glusterd-utils.c:277:glusterd_lock] 0-glusterd: Unable to get lock for uuid: 92d50993-15a5-42af-92f3-a6ff7cfddd43, lock held by: adfa231a-d8e0-4d6b-bc11-ad29b987ace4
[2012-07-05 23:42:57.927457] E [glusterd-handler.c:453:glusterd_op_txn_begin] 0-management: Unable to acquire local lock, ret: -1

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Commit 6930c69 (rpc: Reduce frame-timeout for glusterd connections) has been accepted in upstream/master for this issue. Reviewed at http://review.gluster.com/3803
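For comparison, the analogous client-side RPC bail-out interval is tunable as a volume option; the commit above lowers the equivalent timeout for glusterd's own connections. An illustration only (VOLNAME is a placeholder):

```shell
# network.frame-timeout: seconds before an unanswered client RPC frame
# is declared dead and bailed out (default 1800).
gluster volume set VOLNAME network.frame-timeout 600
```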
Do we have any test steps to verify this bug?
I did the steps mentioned with reference to BZ https://bugzilla.redhat.com/show_bug.cgi?id=866758:

1. Created a replica volume with 2 bricks, from 2 different RHS nodes
2. Started the volume and FUSE-mounted it
3. Powered down one of the VMs abruptly
4. The "gluster volume status" command responded after 10 minutes
5. No lock was held on the cluster after 10 minutes

Considering this a valid test for checking the call bail of a frame, as consulted with kaushal, moving it to VERIFIED.

Verified with RHS 2.1 - glusterfs-3.4.0.17rhs-1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1262.html