Bug 682951

Summary: GFS2: umount stuck on gfs2_gl_hash_clear
Product: Red Hat Enterprise Linux 6
Reporter: Nate Straz <nstraz>
Component: kernel
Assignee: Steve Whitehouse <swhiteho>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: high
Docs Contact:
Priority: high
Version: 6.1
CC: adas, bmarzins, rpeterso, rwheeler
Target Milestone: rc
Keywords: Regression, TestBlocker
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: kernel-2.6.32-125.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 803384 (view as bug list)
Environment:
Last Closed: 2011-05-23 20:43:15 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 635041
Attachments:
  trace logs from all buzz nodes during umount hang (flags: none)
  Proposed fix (upstream) (flags: none)
  Proposed fix (RHEL6) (flags: none)

Description Nate Straz 2011-03-08 05:29:26 UTC
Description of problem:

When trying to umount after running a brawl scenario (a combination of local and shared workloads), the umount hangs and the hang checker prints out this backtrace every two minutes:


INFO: task umount:11506 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
umount        D 000000000000000d     0 11506  11505 0x00000080
 ffff8805ec0e9d98 0000000000000082 0000000000000000 ffff8803221a77c0
 ffff8805ec0e9d08 ffffffff814d998d ffff8805ec0e9d68 0000000100480b16
 ffff88061ddfc6b8 ffff8805ec0e9fd8 000000000000f558 ffff88061ddfc6b8
Call Trace:
 [<ffffffff814d998d>] ? wait_for_completion+0x1d/0x20
 [<ffffffffa033cd6d>] gfs2_gl_hash_clear+0x7d/0xc0 [gfs2]
 [<ffffffff8108dce0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0359dfb>] gfs2_put_super+0x17b/0x220 [gfs2]
 [<ffffffff81173016>] generic_shutdown_super+0x56/0xe0
 [<ffffffff811730d1>] kill_block_super+0x31/0x50
 [<ffffffffa034b0e1>] gfs2_kill_sb+0x61/0x90 [gfs2]
 [<ffffffff81174180>] deactivate_super+0x70/0x90
 [<ffffffff8118f63f>] mntput_no_expire+0xbf/0x110
 [<ffffffff8118fa6b>] sys_umount+0x7b/0x3a0
 [<ffffffff810d1652>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b

I didn't see any other information in the logs.

Version-Release number of selected component (if applicable):
2.6.32-119.el6bz635041b.x86_64

How reproducible:
I've hit this a few times on two different clusters.

Steps to Reproduce:
Hit during a brawl run in the umount step
  
Actual results:


Expected results:


Additional info:

Comment 2 Steve Whitehouse 2011-03-08 12:27:16 UTC
Does this only occur with the one single kernel version mentioned above, or with other kernel versions too?

Comment 3 Steve Whitehouse 2011-03-08 12:33:40 UTC
Also, a glock dump would be helpful. The glock debugfs file should still be accessible at that point in time. The fs is basically waiting for all replies to arrive from the dlm at that stage, and it will sit there until they have all arrived.

If it is reproducible, then a trace from the glock tracepoints would be very helpful in tracking down the issue.

Comment 4 Nate Straz 2011-03-08 14:12:44 UTC
The glock debugfs file on buzz-01 (where the umount is stuck) is empty.  No waiters were found on any of the other nodes.  Tracing was not enabled before the test case.  I'll make sure I enable that as I retry on other kernels.

Comment 5 Nate Straz 2011-03-08 14:20:48 UTC
Just a data point, I'm doing a git bisect between -114.el6 and -119.el6 for an unrelated issue on the dash nodes.  I was able to reproduce this umount hang there with a -117.el6 based kernel.

Comment 6 Steve Whitehouse 2011-03-08 14:23:33 UTC
If the debugfs file is empty, that is a good indication that gfs2 has sent unlock requests for all of the glocks. Unfortunately, there is no easy way to get at the counter that is used to check that all of the glocks have been freed. The waiting should end when that counter hits zero.
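
Roughly, umount is waiting in something like this (a sketch only, not the exact RHEL6 source; the helper name here is made up, while sd_glock_disposal and sd_glock_wait are the counter and wait queue in the GFS2 superblock):

/* Sketch: by this point every cached glock has been queued for disposal,
 * and each queued glock has bumped sd_glock_disposal. The final free of
 * each glock decrements the counter and wakes sd_glock_wait, so umount
 * stays blocked here until the counter drops back to zero. */
static void wait_for_glock_disposal(struct gfs2_sbd *sdp)
{
	wait_event(sdp->sd_glock_wait,
		   atomic_read(&sdp->sd_glock_disposal) == 0);
}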

Comment 7 Steve Whitehouse 2011-03-08 14:28:41 UTC
Actually, I have a thought... from glock.c:

void gfs2_glock_free(struct rcu_head *rcu)
{
	struct gfs2_glock *gl = container_of(rcu, struct gfs2_glock, gl_rcu);
	struct gfs2_sbd *sdp = gl->gl_sbd;

	/* Return the glock's memory to the appropriate slab cache */
	if (gl->gl_ops->go_flags & GLOF_ASPACE)
		kmem_cache_free(gfs2_glock_aspace_cachep, gl);
	else
		kmem_cache_free(gfs2_glock_cachep, gl);

	/* Last outstanding glock freed: wake the waiter in gfs2_gl_hash_clear() */
	if (atomic_dec_and_test(&sdp->sd_glock_disposal))
		wake_up(&sdp->sd_glock_wait);
}

I wonder whether the problem is that we need to ensure that RCU flushes out its list of glocks. It might be stuck waiting for that to happen. If so, then we can either move the code which does the wake-up to before the call_rcu(), or try to add a synchronize_rcu() into the code somewhere....
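
As a sketch of the first option (based only on the snippet above, not the actual patch attached to this bug; the gfs2_glock_dealloc name and the changed gfs2_glock_free() signature are assumptions for illustration), the slab frees stay in the RCU callback, while the counter decrement and wake-up happen as soon as the glock is handed to call_rcu(), so the umount wait no longer depends on an RCU grace period:

/* Deferred part: only freeing the glock's memory needs to wait for the
 * RCU grace period. (Callback name assumed for illustration.) */
static void gfs2_glock_dealloc(struct rcu_head *rcu)
{
	struct gfs2_glock *gl = container_of(rcu, struct gfs2_glock, gl_rcu);

	if (gl->gl_ops->go_flags & GLOF_ASPACE)
		kmem_cache_free(gfs2_glock_aspace_cachep, gl);
	else
		kmem_cache_free(gfs2_glock_cachep, gl);
}

/* Immediate part: callers pass the glock directly; the disposal counter
 * is dropped and the umount waiter woken right away, rather than from
 * inside the RCU callback. */
void gfs2_glock_free(struct gfs2_glock *gl)
{
	struct gfs2_sbd *sdp = gl->gl_sbd;

	call_rcu(&gl->gl_rcu, gfs2_glock_dealloc);
	if (atomic_dec_and_test(&sdp->sd_glock_disposal))
		wake_up(&sdp->sd_glock_wait);
}

One consequence of waking the waiter before the memory is actually freed is that glocks can still be sitting in RCU's queue at module unload, so something like an rcu_barrier() would probably be needed before the glock slab caches are destroyed.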

Comment 8 Nate Straz 2011-03-08 16:02:55 UTC
Created attachment 482938 [details]
trace logs from all buzz nodes during umount hang

Comment 9 Steve Whitehouse 2011-03-09 10:22:02 UTC
Created attachment 483148 [details]
Proposed fix (upstream)

Comment 10 Steve Whitehouse 2011-03-09 10:22:32 UTC
Created attachment 483149 [details]
Proposed fix (RHEL6)

Comment 11 RHEL Program Management 2011-03-09 11:40:00 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 12 Nate Straz 2011-03-14 14:35:24 UTC
I built and tested a -119.el6 based kernel with patches from 635041 and 682951.  I ran on two clusters for about 4 days and was not able to hit either issue.  Both patches have passed QE testing.

Comment 13 Steve Whitehouse 2011-03-14 14:44:43 UTC
Thanks for the confirmation that this fixed the problem.

Comment 14 Aristeu Rozanski 2011-03-22 14:50:57 UTC
Patch(es) available on kernel-2.6.32-125.el6

Comment 17 Nate Straz 2011-03-25 22:06:26 UTC
Made it through a brawl run w/ -125.el6 kernel without hitting this on umount.

Comment 18 errata-xmlrpc 2011-05-23 20:43:15 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html