Bug 844682 - call_bail of a frame in glusterd might lead to stale locks in the cluster
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterd
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assigned To: Kaushal
QA Contact: SATHEESARAN
Depends On: 843003
Blocks: 844803
Reported: 2012-07-31 07:54 EDT by Vijay Bellur
Modified: 2016-05-11 09:49 EDT
CC List: 6 users

See Also:
Fixed In Version: glusterfs-3.4.0qa5-1
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 843003
Clones: 844803
Environment:
Last Closed: 2013-09-23 18:38:56 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments: None
Description Vijay Bellur 2012-07-31 07:54:13 EDT
+++ This bug was initially created as a clone of Bug #843003 +++

Description of problem:

Suppose there is a cluster with a large number of files and directories. When a gluster volume status command is executed, the request is sent to the glusterfsd processes to obtain the data. If a glusterfsd process does not respond within 30 minutes, the frame that sent the request is bailed out in glusterd. However, the rejection is not sent back to the source glusterd that originated the command. As a result, the cluster lock held by the source glusterd is never released, and subsequent operations fail because they cannot acquire the lock.
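
The following is a minimal, self-contained sketch of the failure mode, not GlusterFS code: every name in it (mgmt_frame_t, cluster_lock_acquire, call_bail_buggy, and so on) is hypothetical. The originator takes the cluster lock and registers a callback that releases it; the buggy bail path simply drops the frame without invoking that callback, so the lock is never released, whereas unwinding the callback with an error on bail lets the cleanup run.

/* Minimal, hypothetical sketch of the stale-lock failure mode (not GlusterFS code). */
#include <stdbool.h>
#include <stdio.h>

typedef void (*op_cbk_t)(int op_ret);   /* callback the originator waits on */

typedef struct {
    op_cbk_t cbk;        /* completion callback */
    bool     replied;    /* has the callback been invoked? */
} mgmt_frame_t;

static char cluster_lock_owner[64] = "";   /* empty string == unlocked */

static bool cluster_lock_acquire(const char *uuid) {
    if (cluster_lock_owner[0] != '\0') {
        printf("Unable to get lock for uuid: %s, lock held by: %s\n",
               uuid, cluster_lock_owner);
        return false;
    }
    snprintf(cluster_lock_owner, sizeof(cluster_lock_owner), "%s", uuid);
    return true;
}

static void cluster_lock_release(void) { cluster_lock_owner[0] = '\0'; }

/* Originator's callback: releases the cluster lock whether the op
 * succeeded or failed -- but only if it is actually invoked. */
static void status_op_cbk(int op_ret) {
    printf("op completed, ret=%d; releasing cluster lock\n", op_ret);
    cluster_lock_release();
}

/* Buggy bail: the frame is simply dropped, the callback never fires. */
static void call_bail_buggy(mgmt_frame_t *frame) {
    (void)frame;
    printf("bailing out frame (buggy: no error is unwound)\n");
}

/* Correct bail: unwind the callback with an error so cleanup still runs. */
static void call_bail_fixed(mgmt_frame_t *frame) {
    printf("bailing out frame, unwinding with -1\n");
    if (!frame->replied) {
        frame->replied = true;
        frame->cbk(-1);
    }
}

int main(void) {
    const char *uuid = "adfa231a-d8e0-4d6b-bc11-ad29b987ace4";

    /* 1. "gluster volume status" begins: originator takes the cluster lock. */
    cluster_lock_acquire(uuid);
    mgmt_frame_t frame = { .cbk = status_op_cbk, .replied = false };

    /* 2. The brick never answers; after frame-timeout the frame is bailed. */
    call_bail_buggy(&frame);          /* lock is now stale ...            */
    cluster_lock_acquire(uuid);       /* ... so every later txn fails     */

    /* With the error properly unwound, the lock would have been released: */
    call_bail_fixed(&frame);
    if (cluster_lock_acquire(uuid))
        printf("lock re-acquired after clean unwind\n");
    return 0;
}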

Source glusterd logs:

[2012-07-05 17:52:44.211023] I [glusterd-op-sm.c:2039:glusterd_op_ac_send_stage_op] 0-glusterd: Sent op req to 3 peers
[2012-07-05 17:52:44.211237] I [glusterd-rpc-ops.c:880:glusterd3_1_stage_op_cbk] 0-glusterd: Received ACC from uuid: 92d50993-15a5-42af-92f3-a6ff7cfddd43
[2012-07-05 17:52:44.211284] I [glusterd-rpc-ops.c:880:glusterd3_1_stage_op_cbk] 0-glusterd: Received ACC from uuid: 0b17d7cf-c86a-4d82-929b-efb1ca2e331c
[2012-07-05 17:52:44.211314] I [glusterd-rpc-ops.c:880:glusterd3_1_stage_op_cbk] 0-glusterd: Received ACC from uuid: c6876b56-9729-4a98-8eea-fc9293cf92b0
[2012-07-05 17:52:44.216790] I [glusterd-op-sm.c:2384:glusterd_op_ac_send_commit_op] 0-management: Sent op req to 3 peers
[2012-07-05 17:52:44.223543] I [glusterd-rpc-ops.c:1316:glusterd3_1_commit_op_cbk] 0-glusterd: Received ACC from uuid: 0b17d7cf-c86a-4d82-929b-efb1ca2e331c
[2012-07-05 17:52:44.225221] I [glusterd-rpc-ops.c:1316:glusterd3_1_commit_op_cbk] 0-glusterd: Received ACC from uuid: c6876b56-9729-4a98-8eea-fc9293cf92b0
[2012-07-05 17:54:44.335825] I [glusterd-handler.c:2646:glusterd_handle_status_volume] 0-management: Received status volume req for volume new
[2012-07-05 17:54:44.335879] E [glusterd-utils.c:277:glusterd_lock] 0-glusterd: Unable to get lock for uuid: adfa231a-d8e0-4d6b-bc11-ad29b987ace4, lock held by: adfa231a-d8e0-4d6b-bc11-ad29b987ace4
[2012-07-05 17:54:44.335901] E [glusterd-handler.c:453:glusterd_op_txn_begin] 0-management: Unable to acquire local lock, ret: -1

Destination glusterd logs: (call bail happened here)

[2012-07-05 18:11:47.816910] E [rpc-clnt.c:208:call_bail] 0-management: bailing out frame type(brick operations) op(--(4)) xid = 0x1x sent = 2012-07-05 17:41:46.943808. timeout = 1800

[2012-07-05 23:37:15.490400] I [glusterd-handler.c:497:glusterd_handle_cluster_lock] 0-glusterd: Received LOCK from uuid: adfa231a-d8e0-4d6b-bc11-ad29b987ace4
[2012-07-05 23:37:15.490438] I [glusterd-utils.c:285:glusterd_lock] 0-glusterd: Cluster lock held by adfa231a-d8e0-4d6b-bc11-ad29b987ace4
[2012-07-05 23:37:15.490472] I [glusterd-handler.c:1315:glusterd_op_lock_send_resp] 0-glusterd: Responded, ret: 0
[2012-07-05 23:37:15.490906] I [glusterd-handler.c:542:glusterd_req_ctx_create] 0-glusterd: Received op from uuid: adfa231a-d8e0-4d6b-bc11-ad29b987ace4
[2012-07-05 23:37:15.490969] I [glusterd-handler.c:1417:glusterd_op_stage_send_resp] 0-glusterd: Responded to stage, ret: 0
[2012-07-05 23:37:15.492447] I [glusterd-handler.c:542:glusterd_req_ctx_create] 0-glusterd: Received op from uuid: adfa231a-d8e0-4d6b-bc11-ad29b987ace4
[2012-07-05 23:42:50.244253] E [glusterd-utils.c:277:glusterd_lock] 0-glusterd: Unable to get lock for uuid: 92d50993-15a5-42af-92f3-a6ff7cfddd43, lock held by: adfa231a-d8e0-4d6b-bc11-ad29b987ace4
[2012-07-05 23:42:50.244303] E [glusterd-handler.c:453:glusterd_op_txn_begin] 0-management: Unable to acquire local lock, ret: -1
[2012-07-05 23:42:57.927386] I [glusterd-handler.c:2646:glusterd_handle_status_volume] 0-management: Received status volume req for volume new
[2012-07-05 23:42:57.927436] E [glusterd-utils.c:277:glusterd_lock] 0-glusterd: Unable to get lock for uuid: 92d50993-15a5-42af-92f3-a6ff7cfddd43, lock held by: adfa231a-d8e0-4d6b-bc11-ad29b987ace4
[2012-07-05 23:42:57.927457] E [glusterd-handler.c:453:glusterd_op_txn_begin] 0-management: Unable to acquire local lock, ret: -1


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:
Comment 2 Kaushal 2012-10-09 05:07:46 EDT
Commit 6930c69 (rpc: Reduce frame-timeout for glusterd connections) has been accepted in upstream/master for this issue. Reviewed at http://review.gluster.com/3803
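
For illustration only, the effect of that change can be modelled as lowering the per-connection frame timeout used for glusterd's brick connections, so a non-responsive brick causes the pending frame to be bailed out much sooner, which bounds how long a management transaction (and its cluster lock) can stay pending. The sketch below is hypothetical, not the committed code; the 600-second value is an assumption chosen only because the verification in comment 4 saw a response after roughly 10 minutes.

/* Hypothetical sketch of what a per-connection frame timeout does in the
 * toy model above: once "now - sent" exceeds the timeout, the frame is
 * bailed out.  The 600-second value is an assumption, not taken from the
 * commit itself. */
#include <stdio.h>
#include <time.h>

typedef struct {
    time_t sent;           /* when the request was submitted */
    int    timeout_secs;   /* per-connection frame timeout   */
    int    bailed;         /* has call_bail already fired?   */
} toy_frame_t;

/* Returns 1 if the frame was bailed out on this check. */
static int maybe_bail(toy_frame_t *f, time_t now) {
    if (!f->bailed && now - f->sent >= f->timeout_secs) {
        f->bailed = 1;
        printf("bailing out frame sent %lds ago (timeout = %d)\n",
               (long)(now - f->sent), f->timeout_secs);
        return 1;
    }
    return 0;
}

int main(void) {
    time_t now = time(NULL);
    /* The same stuck request under the old and the reduced timeout. */
    toy_frame_t old_cfg = { .sent = now - 1200, .timeout_secs = 1800, .bailed = 0 };
    toy_frame_t new_cfg = { .sent = now - 1200, .timeout_secs = 600,  .bailed = 0 };

    /* 20 minutes after the request: only the reduced timeout has given up. */
    printf("old timeout bailed: %d\n", maybe_bail(&old_cfg, now));
    printf("new timeout bailed: %d\n", maybe_bail(&new_cfg, now));
    return 0;
}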
Comment 3 SATHEESARAN 2013-07-23 15:56:40 EDT
Do we have any test steps to verify this bug?
Comment 4 SATHEESARAN 2013-08-07 08:01:52 EDT
I followed the steps described in BZ https://bugzilla.redhat.com/show_bug.cgi?id=866758:

1. Created a replica volume with 2 bricks from 2 different RHS nodes
2. Started the volume and fuse-mounted it
3. Abruptly powered down one of the VMs
4. The "gluster volume status" command responded after 10 minutes
5. No locks were held on the cluster after 10 minutes

As discussed with Kaushal, this is considered a valid test for the call_bail of a frame, so moving this bug to VERIFIED.

Verified with RHS 2.1 - glusterfs-3.4.0.17rhs-1
Comment 5 Scott Haines 2013-09-23 18:38:56 EDT
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html
