Bug 765188 (GLUSTER-3456)

Summary: [b6e3e9c480be4226925b51c5e9ee0c368aa94a6d]: client hanging
Product: [Community] GlusterFS
Component: replicate
Reporter: Raghavendra Bhat <rabhat>
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: medium
Version: pre-release
CC: gluster-bugs
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Verified Versions: glusterfs-3.3beta

Description Raghavendra Bhat 2011-08-21 18:33:07 UTC
After the crash reported in bug 3455 (where one client crashed while the other client kept running), I tried to remove the contents with rm -rf, but it hung. This is what the statedump shows:


[global.callpool.stack.1]
global.callpool.stack.1.uid=0
global.callpool.stack.1.gid=0
global.callpool.stack.1.pid=17773
global.callpool.stack.1.unique=1072429
global.callpool.stack.1.op=LOOKUP
global.callpool.stack.1.type=1
global.callpool.stack.1.cnt=3

[global.callpool.stack.1.frame.1]
global.callpool.stack.1.frame.1.ref_count=1
global.callpool.stack.1.frame.1.translator=fuse
global.callpool.stack.1.frame.1.complete=0

[global.callpool.stack.1.frame.2]
global.callpool.stack.1.frame.2.ref_count=0
global.callpool.stack.1.frame.2.translator=mirror-stat-prefetch
global.callpool.stack.1.frame.2.complete=0
global.callpool.stack.1.frame.2.parent=mirror
global.callpool.stack.1.frame.2.wind_from=io_stats_lookup
global.callpool.stack.1.frame.2.wind_to=FIRST_CHILD(this)->fops->lookup
global.callpool.stack.1.frame.2.unwind_to=io_stats_lookup_cbk

[global.callpool.stack.1.frame.3]
global.callpool.stack.1.frame.3.ref_count=1
global.callpool.stack.1.frame.3.translator=mirror
global.callpool.stack.1.frame.3.complete=0
global.callpool.stack.1.frame.3.parent=fuse
global.callpool.stack.1.frame.3.wind_from=fuse_lookup_resume
global.callpool.stack.1.frame.3.wind_to=xl->fops->lookup
global.callpool.stack.1.frame.3.unwind_to=fuse_lookup_cbk

Many more lookups are still hung (not just in stat-prefetch but also in afr).


[global.callpool.stack.4]
global.callpool.stack.4.uid=0
global.callpool.stack.4.gid=0
global.callpool.stack.4.pid=17324
global.callpool.stack.4.unique=1072029
global.callpool.stack.4.op=LOOKUP
global.callpool.stack.4.type=1
global.callpool.stack.4.cnt=10

[global.callpool.stack.4.frame.1]
global.callpool.stack.4.frame.1.ref_count=1
global.callpool.stack.4.frame.1.translator=fuse
global.callpool.stack.4.frame.1.complete=0

[global.callpool.stack.4.frame.2]
global.callpool.stack.4.frame.2.ref_count=0
global.callpool.stack.4.frame.2.translator=mirror-client-1
global.callpool.stack.4.frame.2.complete=1
global.callpool.stack.4.frame.2.parent=mirror-replicate-0
global.callpool.stack.4.frame.2.wind_from=afr_lookup
global.callpool.stack.4.frame.2.wind_to=priv->children[i]->fops->lookup
global.callpool.stack.4.frame.2.unwind_from=client3_1_lookup_cbk
global.callpool.stack.4.frame.2.unwind_to=afr_lookup_cbk

[global.callpool.stack.4.frame.3]
global.callpool.stack.4.frame.3.ref_count=0
global.callpool.stack.4.frame.3.translator=mirror-client-0
global.callpool.stack.4.frame.3.complete=1
global.callpool.stack.4.frame.3.parent=mirror-replicate-0
global.callpool.stack.4.frame.3.wind_from=afr_lookup
global.callpool.stack.4.frame.3.wind_to=priv->children[i]->fops->lookup
global.callpool.stack.4.frame.3.unwind_from=client3_1_lookup_cbk
global.callpool.stack.4.frame.3.unwind_to=afr_lookup_cbk

[global.callpool.stack.4.frame.4]
global.callpool.stack.4.frame.4.ref_count=0
global.callpool.stack.4.frame.4.translator=mirror-replicate-0
global.callpool.stack.4.frame.4.complete=0
global.callpool.stack.4.frame.4.parent=mirror-write-behind
global.callpool.stack.4.frame.4.wind_from=default_lookup
global.callpool.stack.4.frame.4.wind_to=FIRST_CHILD(this)->fops->lookup
global.callpool.stack.4.frame.4.unwind_to=default_lookup_cbk

[global.callpool.stack.4.frame.5]
global.callpool.stack.4.frame.5.ref_count=1
global.callpool.stack.4.frame.5.translator=mirror-write-behind
global.callpool.stack.4.frame.5.complete=0
global.callpool.stack.4.frame.5.parent=mirror-read-ahead
global.callpool.stack.4.frame.5.wind_from=default_lookup
global.callpool.stack.4.frame.5.wind_to=FIRST_CHILD(this)->fops->lookup
global.callpool.stack.4.frame.5.unwind_to=default_lookup_cbk

[global.callpool.stack.4.frame.6]
global.callpool.stack.4.frame.6.ref_count=1
global.callpool.stack.4.frame.6.translator=mirror-read-ahead
global.callpool.stack.4.frame.6.complete=0
global.callpool.stack.4.frame.6.parent=mirror-io-cache
global.callpool.stack.4.frame.6.wind_from=ioc_lookup
global.callpool.stack.4.frame.6.wind_to=FIRST_CHILD (this)->fops->lookup
global.callpool.stack.4.frame.6.unwind_to=ioc_lookup_cbk

[global.callpool.stack.4.frame.7]
global.callpool.stack.4.frame.7.ref_count=1
global.callpool.stack.4.frame.7.translator=mirror-io-cache
global.callpool.stack.4.frame.7.complete=0
global.callpool.stack.4.frame.7.parent=mirror-quick-read
global.callpool.stack.4.frame.7.wind_from=qr_lookup
global.callpool.stack.4.frame.7.wind_to=FIRST_CHILD(this)->fops->lookup
global.callpool.stack.4.frame.7.unwind_to=qr_lookup_cbk
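
When reading a dump like this, the frames with complete=0 are the ones still waiting for an unwind, and the deepest incomplete frame names the translator where the fop is stuck. As a rough aid, here is a minimal sketch (plain Python, a hypothetical helper rather than any shipped GlusterFS tool) that scans a statedump and lists the in-flight frames:

import re
import sys
from collections import defaultdict

def incomplete_frames(path):
    # Collect key=value attributes per (stack, frame) pair.
    frames = defaultdict(dict)
    pat = re.compile(r"^global\.callpool\.stack\.(\d+)\.frame\.(\d+)\.(\w+)=(.*)$")
    with open(path) as dump:
        for line in dump:
            m = pat.match(line.strip())
            if m:
                stack, frame, key, value = m.groups()
                frames[(int(stack), int(frame))][key] = value
    # Report every frame still waiting for its unwind.
    for (stack, frame), attrs in sorted(frames.items()):
        if attrs.get("complete") == "0":
            print("stack %d frame %d: %s (wound from %s)"
                  % (stack, frame,
                     attrs.get("translator", "?"),
                     attrs.get("wind_from", "-")))

if __name__ == "__main__":
    incomplete_frames(sys.argv[1])

Against the dump above it would flag, among others, the fuse, mirror, and mirror-replicate-0 frames, consistent with lookups stuck in the replicate (afr) layer.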

Comment 1 Anand Avati 2011-08-22 06:41:20 UTC
CHANGE: http://review.gluster.com/294 (Change-Id: I66362a3087a635fb7b759d7836a1f6564a6a7fc9) merged in master by Vijay Bellur (vijay)

Comment 2 Raghavendra Bhat 2011-08-29 04:44:58 UTC
The problem was that, earlier, we used to send flush first on the source and then on all the sinks. But before sending the flush to the source, we would already have cleared the pending xattrs of the sinks, so all the sinks would also have become sources.

As a result, flush was wound only to the sources; the stack wind to the sink (for the flush fop) never happened, even though we were expecting two unwinds before continuing. The client would hang because the missing stack unwind never arrived.

Now we keep both the sources and the sinks in the success array and issue the stack wind of flush exactly once for every source and every sink, so all the expected unwinds arrive.

Thus, flush is now sent on both the source and the sinks, and the hang is no longer seen.
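
To make the call-count mismatch concrete, here is a minimal sketch (plain Python with hypothetical names, not the actual afr C code) of why winding to fewer children than the frame expects leaves the fop incomplete:

# Minimal model of the call-count mismatch described above.

def flush(children, wind_mask):
    # The frame waits for one unwind per child.
    call_count = len(children)
    for child, selected in zip(children, wind_mask):
        if selected:
            # Stand-in for STACK_WIND: the child's callback fires and
            # decrements the count of outstanding unwinds.
            call_count -= 1
    # The fop completes (and the client unblocks) only at zero.
    return call_count == 0

children = ["mirror-client-0", "mirror-client-1"]

# Buggy behaviour: the sink's pending xattrs were cleared before flush,
# so flush is wound only to the source -- one unwind never arrives.
print(flush(children, wind_mask=[True, False]))   # False -> client hangs

# Fixed behaviour: sources and sinks are both kept in the success array,
# so every child gets exactly one wind and one unwind.
print(flush(children, wind_mask=[True, True]))    # True -> fop completes

The fix amounts to deriving the set of children to wind to and the expected unwind count from the same success array, so the two can never disagree.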