Bug 770513

Summary: [glusterfs-3.3.0qa18]: gluster volume stop and start made application hang
Product: [Community] GlusterFS Reporter: Raghavendra Bhat <rabhat>
Component: replicate Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: mainline CC: gluster-bugs
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-3.4.0 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-07-24 17:41:58 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: glusterfs-3.3.0qa40 Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 817967    

Description Raghavendra Bhat 2011-12-27 07:40:51 UTC
Description of problem:
On a 2x2 distributed-replicate volume, mounted a FUSE client and started the sanity tests. Immediately stopped the volume and restarted it. The sanity test (running lftest from the LTP test suite) hung.
Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:
Statedump information ===>

[global.callpool.stack.2]
uid=0
gid=0
pid=1666
unique=114637
op=WRITE
type=1
cnt=6

[global.callpool.stack.2.frame.1]
ref_count=1
translator=mirror-write-behind
complete=0

[global.callpool.stack.2.frame.2]
ref_count=0
translator=mirror-client-2
complete=1
parent=mirror-replicate-1
wind_from=afr_getxattr
wind_to=children[call_child]->fops->getxattr
unwind_from=client3_1_getxattr_cbk
unwind_to=afr_getxattr_cbk

[global.callpool.stack.2.frame.3]
ref_count=0
translator=mirror-replicate-1
complete=1
parent=mirror-dht
wind_from=dht_getxattr
wind_to=subvol->fops->getxattr
unwind_from=afr_getxattr_cbk
unwind_to=dht_getxattr_cbk

[global.callpool.stack.2.frame.4]
ref_count=0
translator=mirror-replicate-0
complete=0
parent=mirror-dht
wind_from=dht_getxattr
wind_to=subvol->fops->getxattr
unwind_to=dht_getxattr_cbk

In afr_getxattr we initialize the local structure and collect the return value in op_ret.
AFR_LOCAL_INIT, which initializes the local structure, returns a value less than zero (in fact, the negative of the corresponding errno) on failure. But we unwind only if op_ret is -1, so any other negative value leads to a lost frame and a hung application.
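The mismatch described above can be illustrated with a minimal, standalone C sketch. It is not the GlusterFS source: local_init_sketch, unwinds_on_minus_one, unwinds_on_negative and demo_lost_frame are hypothetical names, and ENOTCONN is only an assumed example of an errno that an init failure might report. The "< 0" check is one plausible shape for a check that does not lose the frame, not necessarily what the merged change does.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for AFR_LOCAL_INIT: returns 0 on success,
 * or the negated errno on failure (ENOTCONN is an assumed example). */
int local_init_sketch(bool child_up)
{
    return child_up ? 0 : -ENOTCONN;
}

/* The check as described in the report: only -1 is treated as failure. */
bool unwinds_on_minus_one(int op_ret)
{
    return op_ret == -1;
}

/* Assumed alternative: any negative value is treated as failure. */
bool unwinds_on_negative(int op_ret)
{
    return op_ret < 0;
}

/* Walks through the failure case once; aborts if the sketch misbehaves. */
void demo_lost_frame(void)
{
    int op_ret = local_init_sketch(false);  /* negated errno, not -1 */

    assert(op_ret < 0 && op_ret != -1);
    assert(!unwinds_on_minus_one(op_ret));  /* no unwind: frame is lost */
    assert(unwinds_on_negative(op_ret));    /* unwind happens: caller gets an error */
}
```

With the "== -1" check, a failed init yields neither a wind nor an unwind, which matches the frame for mirror-replicate-0 in the statedump that stays at complete=0.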

Comment 1 Pranith Kumar K 2011-12-29 05:27:43 UTC
*** Bug 770554 has been marked as a duplicate of this bug. ***

Comment 2 Anand Avati 2011-12-29 06:07:27 UTC
CHANGE: http://review.gluster.com/2539 (cluster/afr: Handle error cases in local init) merged in master by Vijay Bellur (vijay)

Comment 3 Raghavendra Bhat 2012-05-10 06:44:19 UTC
Repeated the same test, i.e. stopping and immediately starting the volume while the LTP tests were being run from the sanity tests, and the tests did not hang.