Bug 682951 - GFS2: umount stuck on gfs2_gl_hash_clear
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Assigned To: Steve Whitehouse
QA Contact: Cluster QE
Keywords: Regression, TestBlocker
Blocks: 635041
Reported: 2011-03-08 00:29 EST by Nate Straz
Modified: 2011-05-23 16:43 EDT
CC: 4 users
Fixed In Version: kernel-2.6.32-125.el6
Doc Type: Bug Fix
Cloned to: 803384
Last Closed: 2011-05-23 16:43:15 EDT


Attachments
trace logs from all buzz nodes during umount hang (9.29 MB, application/x-gzip) - 2011-03-08 11:02 EST, Nate Straz
Proposed fix (upstream) (2.76 KB, patch) - 2011-03-09 05:22 EST, Steve Whitehouse
Proposed fix (RHEL6) (2.76 KB, patch) - 2011-03-09 05:22 EST, Steve Whitehouse


External Trackers
Red Hat Product Errata RHSA-2011:0542 (SHIPPED_LIVE): Important: Red Hat Enterprise Linux 6.1 kernel security, bug fix and enhancement update (last updated 2011-05-19 07:58:07 EDT)

Description Nate Straz 2011-03-08 00:29:26 EST
Description of problem:

When trying to umount after running a brawl scenario (a combination of local and shared workloads), the umount hangs and the hung task checker prints this backtrace every two minutes:


INFO: task umount:11506 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
umount        D 000000000000000d     0 11506  11505 0x00000080
 ffff8805ec0e9d98 0000000000000082 0000000000000000 ffff8803221a77c0
 ffff8805ec0e9d08 ffffffff814d998d ffff8805ec0e9d68 0000000100480b16
 ffff88061ddfc6b8 ffff8805ec0e9fd8 000000000000f558 ffff88061ddfc6b8
Call Trace:
 [<ffffffff814d998d>] ? wait_for_completion+0x1d/0x20
 [<ffffffffa033cd6d>] gfs2_gl_hash_clear+0x7d/0xc0 [gfs2]
 [<ffffffff8108dce0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0359dfb>] gfs2_put_super+0x17b/0x220 [gfs2]
 [<ffffffff81173016>] generic_shutdown_super+0x56/0xe0
 [<ffffffff811730d1>] kill_block_super+0x31/0x50
 [<ffffffffa034b0e1>] gfs2_kill_sb+0x61/0x90 [gfs2]
 [<ffffffff81174180>] deactivate_super+0x70/0x90
 [<ffffffff8118f63f>] mntput_no_expire+0xbf/0x110
 [<ffffffff8118fa6b>] sys_umount+0x7b/0x3a0
 [<ffffffff810d1652>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b

I didn't see any other information in the logs.

Version-Release number of selected component (if applicable):
2.6.32-119.el6bz635041b.x86_64

How reproducible:
I've hit this a few times on two different clusters.

Steps to Reproduce:
Hit during a brawl run in the umount step.
  
Actual results:


Expected results:


Additional info:
Comment 2 Steve Whitehouse 2011-03-08 07:27:16 EST
Does this only occur with the one single kernel version mentioned above, or with other kernel versions too?
Comment 3 Steve Whitehouse 2011-03-08 07:33:40 EST
Also, a glock dump would be helpful. The glock debugfs file should still be accessible at that point in time. The fs is basically waiting for all replies to arrive from the dlm at that stage, and it will sit there until they have all arrived.

If it is reproducible, then a trace from the glock tracepoints would be very helpful in tracking down the issue.
Comment 4 Nate Straz 2011-03-08 09:12:44 EST
The glock debugfs file on buzz-01 (where the umount is stuck) is empty.  No waiters were found on any of the other nodes.  Tracing was not enabled before the test case.  I'll make sure I enable that as I retry on other kernels.
Comment 5 Nate Straz 2011-03-08 09:20:48 EST
Just a data point, I'm doing a git bisect between -114.el6 and -119.el6 for an unrelated issue on the dash nodes.  I was able to reproduce this umount hang there with a -117.el6 based kernel.
Comment 6 Steve Whitehouse 2011-03-08 09:23:33 EST
If the debugfs file is empty, then that is a good indication that gfs2 has sent unlock requests for all of the glocks. Unfortunately there is no easy way to get at the counter which is used to check that all glocks have been freed. The waiting should end when that counter hits zero.
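In outline, the waiting side that umount is stuck in looks like this (a sketch following the upstream glock.c of this era, not the verbatim RHEL6 source):

void gfs2_gl_hash_clear(struct gfs2_sbd *sdp)
{
        /* ... request that every cached glock be unlocked and freed ... */

        /*
         * Sketch: umount blocks here until every queued gfs2_glock_free()
         * has dropped sd_glock_disposal to zero.
         */
        wait_event(sdp->sd_glock_wait,
                   atomic_read(&sdp->sd_glock_disposal) == 0);
}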
Comment 7 Steve Whitehouse 2011-03-08 09:28:41 EST
Actually, I have a thought... from glock.c:

void gfs2_glock_free(struct rcu_head *rcu)
{
        struct gfs2_glock *gl = container_of(rcu, struct gfs2_glock, gl_rcu);
        struct gfs2_sbd *sdp = gl->gl_sbd;

        if (gl->gl_ops->go_flags & GLOF_ASPACE)
                kmem_cache_free(gfs2_glock_aspace_cachep, gl);
        else
                kmem_cache_free(gfs2_glock_cachep, gl);

        if (atomic_dec_and_test(&sdp->sd_glock_disposal))
                wake_up(&sdp->sd_glock_wait);
}

I wonder whether the problem is that we need to ensure that RCU flushes out its list of glocks. It might be stuck waiting for that to happen. If so, then we can either move the code which does the wake up to before the call_rcu(), or try to add a synchronize_rcu() into the code somewhere....
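In outline, the first option would split the free into an RCU callback that only frees memory and a wrapper that does the wake up immediately. This is a sketch of the idea only, not the attached patch, and gfs2_glock_dealloc is an illustrative name here:

/* RCU callback: now does nothing but free the memory */
static void gfs2_glock_dealloc(struct rcu_head *rcu)
{
        struct gfs2_glock *gl = container_of(rcu, struct gfs2_glock, gl_rcu);

        if (gl->gl_ops->go_flags & GLOF_ASPACE)
                kmem_cache_free(gfs2_glock_aspace_cachep, gl);
        else
                kmem_cache_free(gfs2_glock_cachep, gl);
}

void gfs2_glock_free(struct gfs2_glock *gl)
{
        struct gfs2_sbd *sdp = gl->gl_sbd;

        call_rcu(&gl->gl_rcu, gfs2_glock_dealloc);

        /* wake the umount waiter without waiting for an RCU grace period */
        if (atomic_dec_and_test(&sdp->sd_glock_disposal))
                wake_up(&sdp->sd_glock_wait);
}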
Comment 8 Nate Straz 2011-03-08 11:02:55 EST
Created attachment 482938 [details]
trace logs from all buzz nodes during umount hang
Comment 9 Steve Whitehouse 2011-03-09 05:22:02 EST
Created attachment 483148 [details]
Proposed fix (upstream)
Comment 10 Steve Whitehouse 2011-03-09 05:22:32 EST
Created attachment 483149 [details]
Proposed fix (RHEL6)
Comment 11 RHEL Product and Program Management 2011-03-09 06:40:00 EST
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.
Comment 12 Nate Straz 2011-03-14 10:35:24 EDT
I built and tested a -119.el6 based kernel with patches from 635041 and 682951.  I ran on two clusters for about 4 days and was not able to hit either issue.  Both patches have passed QE testing.
Comment 13 Steve Whitehouse 2011-03-14 10:44:43 EDT
Thanks for the confirmation that this fixed the problem.
Comment 14 Aristeu Rozanski 2011-03-22 10:50:57 EDT
Patch(es) available on kernel-2.6.32-125.el6
Comment 17 Nate Straz 2011-03-25 18:06:26 EDT
Made it through a brawl run with the -125.el6 kernel without hitting this on umount.
Comment 18 errata-xmlrpc 2011-05-23 16:43:15 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html
