Bug 682951
Summary: | GFS2: umount stuck on gfs2_gl_hash_clear | |
---|---|---|---
Product: | Red Hat Enterprise Linux 6 | Reporter: | Nate Straz <nstraz>
Component: | kernel | Assignee: | Steve Whitehouse <swhiteho>
Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 6.1 | CC: | adas, bmarzins, rpeterso, rwheeler
Target Milestone: | rc | Keywords: | Regression, TestBlocker
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | kernel-2.6.32-125.el6 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | |
: | 803384 | Environment: |
Last Closed: | 2011-05-23 20:43:15 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 635041 | |
Attachments: | | |
Description
Nate Straz
2011-03-08 05:29:26 UTC
Does this only occur with the single kernel version mentioned above, or with other kernel versions too? Also, a glock dump would be helpful. The glock debugfs file should still be accessible at that point in time. The fs is basically waiting for all replies to arrive from the dlm at that stage, and it will sit there until they have all arrived. If it is reproducible, then a trace from the glock tracepoints would be very helpful in tracking down the issue.

The glock debugfs file on buzz-01 (where the umount is stuck) is empty. No waiters were found on any of the other nodes. Tracing was not enabled before the test case. I'll make sure I enable that as I retry on other kernels.

Just a data point: I'm doing a git bisect between -114.el6 and -119.el6 for an unrelated issue on the dash nodes. I was able to reproduce this umount hang there with a -117.el6 based kernel.

If the debugfs file is empty, then that is a good indication that gfs2 has sent unlock requests to all of the glocks. There is no easy way to get at the counter which is checking to ensure that all glocks are freed, unfortunately. The waiting should end when that counter hits zero.

Actually, I have a thought... from glock.c:

```c
void gfs2_glock_free(struct rcu_head *rcu)
{
	struct gfs2_glock *gl = container_of(rcu, struct gfs2_glock, gl_rcu);
	struct gfs2_sbd *sdp = gl->gl_sbd;

	if (gl->gl_ops->go_flags & GLOF_ASPACE)
		kmem_cache_free(gfs2_glock_aspace_cachep, gl);
	else
		kmem_cache_free(gfs2_glock_cachep, gl);

	if (atomic_dec_and_test(&sdp->sd_glock_disposal))
		wake_up(&sdp->sd_glock_wait);
}
```

I wonder whether the problem is that we need to ensure that RCU flushes out its list of glocks. It might be stuck waiting for that to happen. If so, then we can either move the code which does the wake up to before the call_rcu() or try to add a synchronize_rcu() into the code somewhere....

Created attachment 482938: trace logs from all buzz nodes during umount hang

Created attachment 483148: Proposed fix (upstream)

Created attachment 483149: Proposed fix (RHEL6)
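The two orderings discussed above are easier to compare with a toy model. The following is a minimal single-threaded C sketch, not kernel code and not the contents of either attached patch. Every `model_*` name is hypothetical, the `pending` list merely stands in for callbacks queued with `call_rcu()`, and a boolean stands in for `wake_up()`. It only illustrates why a waiter that depends on a count dropped inside a deferred callback cannot proceed until those callbacks run, whereas dropping the count before deferring lets it proceed immediately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy, single-threaded model. All names are hypothetical; none of this is
 * kernel code or the content of the attached patches. */

struct model_sb {
	int glock_disposal; /* stands in for sdp->sd_glock_disposal */
	bool waiter_woken;  /* set where the kernel would call wake_up() */
};

struct model_glock {
	struct model_sb *sb;
	bool freed;               /* memory released by the deferred callback */
	bool drop_in_callback;    /* whether the callback also drops the count */
	struct model_glock *next; /* link in the pending-callback list */
};

enum model_variant {
	DROP_IN_CALLBACK, /* ordering of the quoted gfs2_glock_free() */
	DROP_BEFORE_DEFER /* count drop and wake-up moved ahead of the deferral */
};

/* Stands in for the list of callbacks queued with call_rcu(). */
static struct model_glock *pending;

static void model_drop_disposal(struct model_sb *sb)
{
	if (--sb->glock_disposal == 0)
		sb->waiter_woken = true;
}

static void model_glock_free(struct model_glock *gl, enum model_variant v)
{
	gl->drop_in_callback = (v == DROP_IN_CALLBACK);
	if (v == DROP_BEFORE_DEFER)
		model_drop_disposal(gl->sb);
	gl->next = pending;
	pending = gl;
}

/* Models RCU eventually running the queued callbacks. */
static void model_run_pending(void)
{
	while (pending != NULL) {
		struct model_glock *gl = pending;

		pending = gl->next;
		gl->freed = true;
		if (gl->drop_in_callback)
			model_drop_disposal(gl->sb);
	}
}

/* Returns true if the waiter was woken before the callbacks were run. */
static bool model_woken_before_flush(enum model_variant v)
{
	struct model_sb sb = { .glock_disposal = 2, .waiter_woken = false };
	struct model_glock a = { .sb = &sb }, b = { .sb = &sb };
	bool early;

	model_glock_free(&a, v);
	model_glock_free(&b, v);
	early = sb.waiter_woken;

	model_run_pending();
	assert(sb.waiter_woken && a.freed && b.freed);
	return early;
}
```

In this model, `model_woken_before_flush(DROP_IN_CALLBACK)` returns false and `model_woken_before_flush(DROP_BEFORE_DEFER)` returns true. The model says nothing about whether delayed callbacks were in fact the cause of the hang reported here.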
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

I built and tested a -119.el6 based kernel with patches from 635041 and 682951. I ran on two clusters for about 4 days and was not able to hit either issue. Both patches have passed QE testing.

Thanks for the confirmation that this fixed the problem.

Patch(es) available on kernel-2.6.32-125.el6

Made it through a brawl run w/ -125.el6 kernel without hitting this on umount.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html