Bug 770513 - [glusterfs-3.3.0qa18]: gluster volume stop and start made application hang
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact:
URL:
Whiteboard:
Duplicates: 770554
Depends On:
Blocks: 817967
Reported: 2011-12-27 07:40 UTC by Raghavendra Bhat
Modified: 2013-07-24 17:41 UTC (History)

Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-07-24 17:41:58 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: glusterfs-3.3.0qa40
Embargoed:



Description Raghavendra Bhat 2011-12-27 07:40:51 UTC
Description of problem:
On a 2x2 distributed-replicate volume, mounted a FUSE client and started the sanity tests. Immediately stopped the volume and restarted it. The sanity test (running lftest from the LTP test suite) hung.
Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:
Statedump information ===>

[global.callpool.stack.2]
uid=0
gid=0
pid=1666
unique=114637
op=WRITE
type=1
cnt=6

[global.callpool.stack.2.frame.1]
ref_count=1
translator=mirror-write-behind
complete=0

[global.callpool.stack.2.frame.2]
ref_count=0
translator=mirror-client-2
complete=1
parent=mirror-replicate-1
wind_from=afr_getxattr
wind_to=children[call_child]->fops->getxattr
unwind_from=client3_1_getxattr_cbk
unwind_to=afr_getxattr_cbk

[global.callpool.stack.2.frame.3]
ref_count=0
translator=mirror-replicate-1
complete=1
parent=mirror-dht
wind_from=dht_getxattr
wind_to=subvol->fops->getxattr
unwind_from=afr_getxattr_cbk
unwind_to=dht_getxattr_cbk

[global.callpool.stack.2.frame.4]
ref_count=0
translator=mirror-replicate-0
complete=0
parent=mirror-dht
wind_from=dht_getxattr
wind_to=subvol->fops->getxattr
unwind_to=dht_getxattr_cbk

In afr_getxattr we initialize the local structure and collect the return value in op_ret.
AFR_LOCAL_INIT, which initializes the local structure, returns a value less than zero on failure (in fact, the negative of the corresponding errno). But we unwind only if op_ret is -1, so a failure with any other negative value skips the unwind. This leads to lost frames and a hung application.
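The failure mode described above can be illustrated with a small standalone C sketch. This is not the GlusterFS source: local_init_sketch, unwind_sketch, getxattr_buggy and getxattr_fixed are hypothetical stand-ins, and the "fixed" variant shows only one plausible way to handle the error (normalising op_ret to -1 before unwinding), not necessarily what the merged change does.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for AFR_LOCAL_INIT: returns 0 on success, otherwise
 * the negated errno of the failure (for example -ENOMEM), as the report says. */
static int local_init_sketch(int fail_errno)
{
    return fail_errno ? -fail_errno : 0;
}

/* Number of replies sent back up the stack; every fop must produce one. */
static int replies;

/* Hypothetical stand-in for unwinding the frame to the parent translator. */
static void unwind_sketch(int op_ret, int op_errno)
{
    assert(op_ret == 0 || (op_ret == -1 && op_errno > 0));
    replies++;
}

/* Pattern described in the report: the error path unwinds only for -1. */
static void getxattr_buggy(int fail_errno)
{
    int op_errno = 0;
    int op_ret = local_init_sketch(fail_errno);

    if (op_ret < 0) {
        op_errno = -op_ret;
        goto out;
    }
    unwind_sketch(0, 0); /* stands in for the wind whose callback unwinds */
    return;
out:
    if (op_ret == -1) /* skipped when op_ret is, say, -ENOMEM: frame is lost */
        unwind_sketch(op_ret, op_errno);
}

/* One possible correction (an assumption): treat any negative value as a
 * failure, normalise it to -1, and always unwind on the error path. */
static void getxattr_fixed(int fail_errno)
{
    int op_errno = 0;
    int op_ret = local_init_sketch(fail_errno);

    if (op_ret < 0) {
        op_errno = -op_ret;
        op_ret = -1;
        goto out;
    }
    unwind_sketch(0, 0);
    return;
out:
    unwind_sketch(op_ret, op_errno);
}
```

Calling getxattr_buggy(ENOMEM) leaves replies unchanged, which corresponds to the frame that never completes in the statedump, while getxattr_fixed(ENOMEM) sends exactly one error reply.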

Comment 1 Pranith Kumar K 2011-12-29 05:27:43 UTC
*** Bug 770554 has been marked as a duplicate of this bug. ***

Comment 2 Anand Avati 2011-12-29 06:07:27 UTC
CHANGE: http://review.gluster.com/2539 (cluster/afr: Handle error cases in local init) merged in master by Vijay Bellur (vijay)

Comment 3 Raghavendra Bhat 2012-05-10 06:44:19 UTC
Repeated the same test, i.e. stopping and immediately starting the volume while the LTP tests were running from the sanity tests, and the tests did not hang.

