Bug 682951 - GFS2: umount stuck on gfs2_gl_hash_clear
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Assigned To: Steve Whitehouse
QA Contact: Cluster QE
Keywords: Regression, TestBlocker
Blocks: 635041
Reported: 2011-03-08 00:29 EST by Nate Straz
Modified: 2011-05-23 16:43 EDT
CC: 4 users
Fixed In Version: kernel-2.6.32-125.el6
Doc Type: Bug Fix
Cloned to: 803384
Last Closed: 2011-05-23 16:43:15 EDT


Attachments
trace logs from all buzz nodes during umount hang (9.29 MB, application/x-gzip) - 2011-03-08 11:02 EST, Nate Straz
Proposed fix (upstream) (2.76 KB, patch) - 2011-03-09 05:22 EST, Steve Whitehouse
Proposed fix (RHEL6) (2.76 KB, patch) - 2011-03-09 05:22 EST, Steve Whitehouse


External Trackers
Red Hat Product Errata RHSA-2011:0542 (SHIPPED_LIVE): Important: Red Hat Enterprise Linux 6.1 kernel security, bug fix and enhancement update (last updated 2011-05-19 07:58:07 EDT)

Description Nate Straz 2011-03-08 00:29:26 EST
Description of problem:

When trying to umount after running a brawl scenario (a combination of local and shared workloads), the umount hangs and the hung task checker prints this backtrace every two minutes:


INFO: task umount:11506 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
umount        D 000000000000000d     0 11506  11505 0x00000080
 ffff8805ec0e9d98 0000000000000082 0000000000000000 ffff8803221a77c0
 ffff8805ec0e9d08 ffffffff814d998d ffff8805ec0e9d68 0000000100480b16
 ffff88061ddfc6b8 ffff8805ec0e9fd8 000000000000f558 ffff88061ddfc6b8
Call Trace:
 [<ffffffff814d998d>] ? wait_for_completion+0x1d/0x20
 [<ffffffffa033cd6d>] gfs2_gl_hash_clear+0x7d/0xc0 [gfs2]
 [<ffffffff8108dce0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0359dfb>] gfs2_put_super+0x17b/0x220 [gfs2]
 [<ffffffff81173016>] generic_shutdown_super+0x56/0xe0
 [<ffffffff811730d1>] kill_block_super+0x31/0x50
 [<ffffffffa034b0e1>] gfs2_kill_sb+0x61/0x90 [gfs2]
 [<ffffffff81174180>] deactivate_super+0x70/0x90
 [<ffffffff8118f63f>] mntput_no_expire+0xbf/0x110
 [<ffffffff8118fa6b>] sys_umount+0x7b/0x3a0
 [<ffffffff810d1652>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b

I didn't see any other information in the logs.

Version-Release number of selected component (if applicable):
2.6.32-119.el6bz635041b.x86_64

How reproducible:
I've hit this a few times on two different clusters.

Steps to Reproduce:
Hit during a brawl run in the umount step.
  
Actual results:


Expected results:


Additional info:
Comment 2 Steve Whitehouse 2011-03-08 07:27:16 EST
Does this only occur with the one single kernel version mentioned above, or with other kernel versions too?
Comment 3 Steve Whitehouse 2011-03-08 07:33:40 EST
Also, a glock dump would be helpful. The glock debugfs file should still be accessible at that point in time. The fs is basically waiting for all replies to arrive from the dlm at that stage, and it will sit there until they have all arrived.

If it is reproducible, then a trace from the glock tracepoints would be very helpful in tracking down the issue.
Comment 4 Nate Straz 2011-03-08 09:12:44 EST
The glock debugfs file on buzz-01 (where the umount is stuck) is empty.  No waiters were found on any of the other nodes.  Tracing was not enabled before the test case.  I'll make sure I enable that as I retry on other kernels.
Comment 5 Nate Straz 2011-03-08 09:20:48 EST
Just a data point, I'm doing a git bisect between -114.el6 and -119.el6 for an unrelated issue on the dash nodes.  I was able to reproduce this umount hang there with a -117.el6 based kernel.
Comment 6 Steve Whitehouse 2011-03-08 09:23:33 EST
If the debugfs file is empty, then that is a good indication that gfs2 has sent unlock requests for all of the glocks. Unfortunately there is no easy way to get at the counter which is used to check that all glocks have been freed. The waiting should end when that counter hits zero.
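In outline, the waiting side that umount is stuck in looks like this (a sketch following the upstream glock.c of this era, not the verbatim RHEL6 source):

void gfs2_gl_hash_clear(struct gfs2_sbd *sdp)
{
        /* ... request that every cached glock be unlocked and freed ... */

        /*
         * Sketch: umount blocks here until every queued gfs2_glock_free()
         * has dropped sd_glock_disposal to zero.
         */
        wait_event(sdp->sd_glock_wait,
                   atomic_read(&sdp->sd_glock_disposal) == 0);
}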
Comment 7 Steve Whitehouse 2011-03-08 09:28:41 EST
Actually, I have a thought... from glock.c:

void gfs2_glock_free(struct rcu_head *rcu)
{
        struct gfs2_glock *gl = container_of(rcu, struct gfs2_glock, gl_rcu);
        struct gfs2_sbd *sdp = gl->gl_sbd;

        if (gl->gl_ops->go_flags & GLOF_ASPACE)
                kmem_cache_free(gfs2_glock_aspace_cachep, gl);
        else
                kmem_cache_free(gfs2_glock_cachep, gl);

        if (atomic_dec_and_test(&sdp->sd_glock_disposal))
                wake_up(&sdp->sd_glock_wait);
}

I wonder whether the problem is that we need to ensure that RCU flushes out its list of glocks. It might be stuck waiting for that to happen. If so, then we can either move the code which does the wake up to before the call_rcu(), or try to add a synchronize_rcu() into the code somewhere....
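In outline, the first option would split the free into an RCU callback that only frees memory and a wrapper that does the wake up immediately. This is a sketch of the idea only, not the attached patch, and gfs2_glock_dealloc is an illustrative name here:

/* RCU callback: now does nothing but free the memory */
static void gfs2_glock_dealloc(struct rcu_head *rcu)
{
        struct gfs2_glock *gl = container_of(rcu, struct gfs2_glock, gl_rcu);

        if (gl->gl_ops->go_flags & GLOF_ASPACE)
                kmem_cache_free(gfs2_glock_aspace_cachep, gl);
        else
                kmem_cache_free(gfs2_glock_cachep, gl);
}

void gfs2_glock_free(struct gfs2_glock *gl)
{
        struct gfs2_sbd *sdp = gl->gl_sbd;

        call_rcu(&gl->gl_rcu, gfs2_glock_dealloc);

        /* wake the umount waiter without waiting for an RCU grace period */
        if (atomic_dec_and_test(&sdp->sd_glock_disposal))
                wake_up(&sdp->sd_glock_wait);
}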
Comment 8 Nate Straz 2011-03-08 11:02:55 EST
Created attachment 482938 [details]
trace logs from all buzz nodes during umount hang
Comment 9 Steve Whitehouse 2011-03-09 05:22:02 EST
Created attachment 483148 [details]
Proposed fix (upstream)
Comment 10 Steve Whitehouse 2011-03-09 05:22:32 EST
Created attachment 483149 [details]
Proposed fix (RHEL6)
Comment 11 RHEL Product and Program Management 2011-03-09 06:40:00 EST
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.
Comment 12 Nate Straz 2011-03-14 10:35:24 EDT
I built and tested a -119.el6 based kernel with patches from 635041 and 682951.  I ran on two clusters for about 4 days and was not able to hit either issue.  Both patches have passed QE testing.
Comment 13 Steve Whitehouse 2011-03-14 10:44:43 EDT
Thanks for the confirmation that this fixed the problem.
Comment 14 Aristeu Rozanski 2011-03-22 10:50:57 EDT
Patch(es) available on kernel-2.6.32-125.el6
Comment 17 Nate Straz 2011-03-25 18:06:26 EDT
Made it through a brawl run with the -125.el6 kernel without hitting this on umount.
Comment 18 errata-xmlrpc 2011-05-23 16:43:15 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html
