Description of problem:
On a 2x2 distributed-replicate volume, mounted a FUSE client and started sanity tests. Immediately stopped the volume and restarted it. The sanity test (executing lftest from the LTP test suite) hung.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

Statedump information ===>

[global.callpool.stack.2]
uid=0
gid=0
pid=1666
unique=114637
op=WRITE
type=1
cnt=6

[global.callpool.stack.2.frame.1]
ref_count=1
translator=mirror-write-behind
complete=0

[global.callpool.stack.2.frame.2]
ref_count=0
translator=mirror-client-2
complete=1
parent=mirror-replicate-1
wind_from=afr_getxattr
wind_to=children[call_child]->fops->getxattr
unwind_from=client3_1_getxattr_cbk
unwind_to=afr_getxattr_cbk

[global.callpool.stack.2.frame.3]
ref_count=0
translator=mirror-replicate-1
complete=1
parent=mirror-dht
wind_from=dht_getxattr
wind_to=subvol->fops->getxattr
unwind_from=afr_getxattr_cbk
unwind_to=dht_getxattr_cbk

[global.callpool.stack.2.frame.4]
ref_count=0
translator=mirror-replicate-0
complete=0
parent=mirror-dht
wind_from=dht_getxattr
wind_to=subvol->fops->getxattr
unwind_to=dht_getxattr_cbk

In afr_getxattr we initialize the local structure and collect the return value in op_ret. AFR_LOCAL_INIT, which initializes the local structure, returns a value less than zero (in fact, the negative of the corresponding errno) on failure. But we unwind only if op_ret is exactly -1, so such failures lead to lost frames and the application hangs.
*** Bug 770554 has been marked as a duplicate of this bug. ***
CHANGE: http://review.gluster.com/2539 (cluster/afr: Handle error cases in local init) merged in master by Vijay Bellur (vijay)
Repeated the same test, i.e. stopping and immediately starting the volume while the LTP tests from the sanity suite were running, and the tests did not hang.