Bug 844682 - call_bail of a frame in glusterd might lead to stale locks in the cluster
Summary: call_bail of a frame in glusterd might lead to stale locks in the cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterd
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Kaushal
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On: 843003
Blocks: 844803
 
Reported: 2012-07-31 11:54 UTC by Vijay Bellur
Modified: 2016-05-11 13:49 UTC
CC List: 6 users

Fixed In Version: glusterfs-3.4.0qa5-1
Doc Type: Bug Fix
Doc Text:
Clone Of: 843003
: 844803 (view as bug list)
Environment:
Last Closed: 2013-09-23 22:38:56 UTC
Embargoed:



Description Vijay Bellur 2012-07-31 11:54:13 UTC
+++ This bug was initially created as a clone of Bug #843003 +++

Description of problem:

Suppose there is a cluster with a large number of files and directories, and the gluster volume status command is executed; the request is sent to the glusterfsd processes to obtain the data. If a glusterfsd does not respond within 30 minutes, the frame that sent the request gets bailed out in glusterd, but no rejection is sent back to the source glusterd that originated the command. As a result, the cluster lock held by the source glusterd is never released, and subsequent operations fail because they cannot acquire the lock.
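
To make the failure mode concrete, here is a minimal, self-contained C sketch (not glusterd source; all names are illustrative). It contrasts a bail that silently drops the timed-out frame with one that unwinds the frame's callback with an error, which is what would allow a negative response to reach the originating glusterd and release the cluster lock. Note that the actual upstream fix (comment 2) instead shortens the frame timeout for glusterd connections.

/*
 * Simplified illustration of the stale-lock scenario (NOT glusterd source).
 * A saved frame represents an outstanding brick-op request that call_bail()
 * times out after FRAME_TIMEOUT seconds.
 */
#include <stdio.h>
#include <time.h>

#define FRAME_TIMEOUT 1800          /* seconds; matches "timeout = 1800" in the log */

struct saved_frame {
    time_t sent_at;                 /* when the brick-op request was sent */
    int  (*cbk)(int op_ret);        /* callback that unwinds into the op machinery */
    int    bailed;
};

static int cluster_lock_held = 1;   /* lock taken when the transaction began */

/* Callback that would finish the transaction and answer the originator. */
static int brick_op_cbk(int op_ret)
{
    if (op_ret != 0)
        printf("brick op failed: respond to originator and release the lock\n");
    cluster_lock_held = 0;          /* lock released on success or failure */
    return 0;
}

/* Buggy behaviour: the timed-out frame is dropped, its callback never runs,
 * so no response goes back and the lock stays held forever. */
static void call_bail_silent(struct saved_frame *frame, time_t now)
{
    if (now - frame->sent_at >= FRAME_TIMEOUT)
        frame->bailed = 1;
}

/* Desired behaviour: a bailed frame is unwound with an error so the
 * transaction can be aborted and the cluster lock dropped. */
static void call_bail_propagating(struct saved_frame *frame, time_t now)
{
    if (now - frame->sent_at >= FRAME_TIMEOUT) {
        frame->bailed = 1;
        frame->cbk(-1);             /* propagate the timeout as a failure */
    }
}

int main(void)
{
    struct saved_frame frame = { .sent_at = 0, .cbk = brick_op_cbk, .bailed = 0 };
    time_t now = FRAME_TIMEOUT;     /* pretend 30 minutes have elapsed */

    call_bail_silent(&frame, now);
    printf("silent bail:      cluster lock held = %d\n", cluster_lock_held);

    cluster_lock_held = 1;          /* reset for the second run */
    call_bail_propagating(&frame, now);
    printf("propagating bail: cluster lock held = %d\n", cluster_lock_held);
    return 0;
}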

Source glusterd logs:

[2012-07-05 17:52:44.211023] I [glusterd-op-sm.c:2039:glusterd_op_ac_send_stage_op] 0-glusterd: Sent op req to 3 peers
[2012-07-05 17:52:44.211237] I [glusterd-rpc-ops.c:880:glusterd3_1_stage_op_cbk] 0-glusterd: Received ACC from uuid: 92d50993-15a5-42af-92f3-a6ff7cfddd43
[2012-07-05 17:52:44.211284] I [glusterd-rpc-ops.c:880:glusterd3_1_stage_op_cbk] 0-glusterd: Received ACC from uuid: 0b17d7cf-c86a-4d82-929b-efb1ca2e331c
[2012-07-05 17:52:44.211314] I [glusterd-rpc-ops.c:880:glusterd3_1_stage_op_cbk] 0-glusterd: Received ACC from uuid: c6876b56-9729-4a98-8eea-fc9293cf92b0
[2012-07-05 17:52:44.216790] I [glusterd-op-sm.c:2384:glusterd_op_ac_send_commit_op] 0-management: Sent op req to 3 peers
[2012-07-05 17:52:44.223543] I [glusterd-rpc-ops.c:1316:glusterd3_1_commit_op_cbk] 0-glusterd: Received ACC from uuid: 0b17d7cf-c86a-4d82-929b-efb1ca2e331c
[2012-07-05 17:52:44.225221] I [glusterd-rpc-ops.c:1316:glusterd3_1_commit_op_cbk] 0-glusterd: Received ACC from uuid: c6876b56-9729-4a98-8eea-fc9293cf92b0
[2012-07-05 17:54:44.335825] I [glusterd-handler.c:2646:glusterd_handle_status_volume] 0-management: Received status volume req for volume new
[2012-07-05 17:54:44.335879] E [glusterd-utils.c:277:glusterd_lock] 0-glusterd: Unable to get lock for uuid: adfa231a-d8e0-4d6b-bc11-ad29b987ace4, lock held by: adfa231a-d8e0-4d6b-bc11-ad29b987ace4
[2012-07-05 17:54:44.335901] E [glusterd-handler.c:453:glusterd_op_txn_begin] 0-management: Unable to acquire local lock, ret: -1

Destination glusterd logs: (call bail happened here)

[2012-07-05 18:11:47.816910] E [rpc-clnt.c:208:call_bail] 0-management: bailing out frame type(brick operations) op(--(4)) xid = 0x1x sent = 2012-07-05 17:41:46.943808. timeout = 1800

[2012-07-05 23:37:15.490400] I [glusterd-handler.c:497:glusterd_handle_cluster_lock] 0-glusterd: Received LOCK from uuid: adfa231a-d8e0-4d6b-bc11-ad29b987ace4
[2012-07-05 23:37:15.490438] I [glusterd-utils.c:285:glusterd_lock] 0-glusterd: Cluster lock held by adfa231a-d8e0-4d6b-bc11-ad29b987ace4
[2012-07-05 23:37:15.490472] I [glusterd-handler.c:1315:glusterd_op_lock_send_resp] 0-glusterd: Responded, ret: 0
[2012-07-05 23:37:15.490906] I [glusterd-handler.c:542:glusterd_req_ctx_create] 0-glusterd: Received op from uuid: adfa231a-d8e0-4d6b-bc11-ad29b987ace4
[2012-07-05 23:37:15.490969] I [glusterd-handler.c:1417:glusterd_op_stage_send_resp] 0-glusterd: Responded to stage, ret: 0
[2012-07-05 23:37:15.492447] I [glusterd-handler.c:542:glusterd_req_ctx_create] 0-glusterd: Received op from uuid: adfa231a-d8e0-4d6b-bc11-ad29b987ace4
[2012-07-05 23:42:50.244253] E [glusterd-utils.c:277:glusterd_lock] 0-glusterd: Unable to get lock for uuid: 92d50993-15a5-42af-92f3-a6ff7cfddd43, lock held by: adfa231a-d8e0-4d6b-bc11-ad29b987ace4
[2012-07-05 23:42:50.244303] E [glusterd-handler.c:453:glusterd_op_txn_begin] 0-management: Unable to acquire local lock, ret: -1
[2012-07-05 23:42:57.927386] I [glusterd-handler.c:2646:glusterd_handle_status_volume] 0-management: Received status volume req for volume new
[2012-07-05 23:42:57.927436] E [glusterd-utils.c:277:glusterd_lock] 0-glusterd: Unable to get lock for uuid: 92d50993-15a5-42af-92f3-a6ff7cfddd43, lock held by: adfa231a-d8e0-4d6b-bc11-ad29b987ace4
[2012-07-05 23:42:57.927457] E [glusterd-handler.c:453:glusterd_op_txn_begin] 0-management: Unable to acquire local lock, ret: -1


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 2 Kaushal 2012-10-09 09:07:46 UTC
Commit 6930c69 (rpc: Reduce frame-timeout for glusterd connections) has been accepted in upstream/master for this issue. Reviewed at http://review.gluster.com/3803
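
For illustration, a minimal sketch of the idea behind the fix (hypothetical names, not the actual glusterfs rpc-clnt API): management connections get a much shorter frame-timeout than the 1800-second default seen in the bail log above, so a hung peer makes the pending transaction fail, and its cluster lock get dropped, within a bounded window. The ~10-minute response observed in comment 4 suggests the reduced value is on the order of 600 seconds; the exact number is in the commit.

/*
 * Hypothetical sketch of a per-connection frame-timeout choice.
 * Names are illustrative only, not the glusterfs API.
 */
#include <stdio.h>

#define DEFAULT_FRAME_TIMEOUT  1800   /* 30 min: default rpc frame-timeout        */
#define GLUSTERD_FRAME_TIMEOUT  600   /* ~10 min: assumed shorter timeout for     */
                                      /* glusterd connections (see comment 4)     */

struct rpc_conn_opts {
    const char *name;
    int frame_timeout;                /* seconds before an outstanding frame bails */
};

/* Pick the timeout based on what kind of peer the connection talks to. */
static struct rpc_conn_opts make_conn_opts(const char *name, int is_mgmt_conn)
{
    struct rpc_conn_opts opts = {
        .name          = name,
        .frame_timeout = is_mgmt_conn ? GLUSTERD_FRAME_TIMEOUT
                                      : DEFAULT_FRAME_TIMEOUT,
    };
    return opts;
}

int main(void)
{
    struct rpc_conn_opts mgmt  = make_conn_opts("glusterd-peer", 1);
    struct rpc_conn_opts brick = make_conn_opts("client-to-brick", 0);

    printf("%s: frame-timeout = %ds\n", mgmt.name, mgmt.frame_timeout);
    printf("%s: frame-timeout = %ds\n", brick.name, brick.frame_timeout);
    return 0;
}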

Comment 3 SATHEESARAN 2013-07-23 19:56:40 UTC
Do we have any test steps to verify this bug?

Comment 4 SATHEESARAN 2013-08-07 12:01:52 UTC
I followed the steps mentioned in BZ https://bugzilla.redhat.com/show_bug.cgi?id=866758.

1. Created a replica volume with 2 bricks from 2 different RHS nodes
2. Started the volume and FUSE mounted it
3. Abruptly powered down one of the VMs
4. The "gluster volume status" command responded after 10 minutes
5. No lock remained held on the cluster after 10 minutes

As consulted with Kaushal, this is considered a valid test for checking call_bail of a frame, so moving to VERIFIED.

Verified with RHS 2.1 - glusterfs-3.4.0.17rhs-1

Comment 5 Scott Haines 2013-09-23 22:38:56 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html

