Bug 455309
| Field | Value |
|---|---|
| Summary | gfs umount deadlock dlm:release_lockspace |
| Product | [Retired] Red Hat Cluster Suite |
| Component | gfs |
| Version | 4 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED DUPLICATE |
| Severity | high |
| Priority | high |
| Reporter | Corey Marthaler <cmarthal> |
| Assignee | Abhijith Das <adas> |
| QA Contact | Cluster QE <mspqa-list> |
| CC | dejohnso, edamato, jwilleford, michael.hagmann, mwhitehe, rpeterso, swhiteho, tao, teigland |
| Doc Type | Bug Fix |
| Last Closed | 2009-05-07 21:28:31 UTC |
Description
Corey Marthaler
2008-07-14 19:21:33 UTC
Created attachment 311756 [details]
log from grant-01
Created attachment 311757 [details]
log from grant-02
Created attachment 311758 [details]
log from grant-03
Is this reproducible, or just a one-off?

No new information; moved to 4.9.

*** Bug 495969 has been marked as a duplicate of this bug. ***

Created attachment 340011 [details]
sosreport from node135

Added sosreport from node135 from IT283633. That errata had already been applied to this server before the crash. Is this a new manifestation of the same problem or something new? This event sent from IssueTracker by calvin_g_smith, issue 283633.

QA is currently attempting to reproduce this issue on two different 4.8 clusters. The last time we saw it, however, was 9 months ago.

Created attachment 340847 [details]
Bob's Tech Notes
I spent a couple of hours analyzing the crash dump. The unmount is
waiting to acquire the glock associated with syncing the statfs_fast
file. The glock shows it is "held" by a process that has, as far as
I can tell, already ended.
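
For readers reconstructing this kind of analysis, a minimal crash(8) sketch of how the stuck tasks can be located in a vmcore; the paths and pids are placeholders, not values taken from this dump:

```
# Open the dump against a matching debuginfo kernel (paths illustrative):
#   crash /path/to/vmlinux /path/to/vmcore

crash> ps | grep UN          # tasks in uninterruptible (D state) sleep
crash> bt <umount_pid>       # backtrace of the hung umount
crash> bt <gfs_quotad_pid>   # backtrace of the gfs_quotad kernel thread
```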
The most important thing to note: this goes back to statfs_fast=1,
as we've seen before.
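
A companion sketch of chasing the glock state described above; the addresses are the ones quoted in the holder and dlm_lock dumps later in this report, and gl_req_gh / gl->gl_lock are the fields named there:

```
crash> struct gfs_glock 0x117f9a3f428    # gh_gl from the holder dumps below
crash> struct gfs_holder 0x11806b42180   # the holder hanging off gl_req_gh
crash> ps 14987                          # gfs_quotad, owner of the second
                                         # holder quoted below; ps <pid>
                                         # verifies an owner task still exists
crash> struct dlm_lock 0x1009eb346c0     # lock_dlm's copy, via gl->gl_lock
```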
Adding myself and Dave T. to the cc list. Bottom line: we're still looking at the problem and trying to figure out what went wrong. Hopefully my attachment from comment #19 will help.

FWIW, QA was unable to reproduce any problems running mount stress tests against 15 GFS file systems on two different clusters.

Hopefully you were using statfs_fast for your tests? First, it doesn't look like the stuck unmount in bug 495969 is the same as the original bug, which is stuck on kthread_stop. So let's forget about everything prior to comment 6, consider that bug closed, and redefine this bz to be the new issue.

Bob, I was surprised to see multiple holders on the linode glock; I only expected the gfs_quotad thread to take that glock when syncing the statfs info. Looking through the code, it appears that's not quite true. If you run gfs_tool to set statfs_fast, gfs_tool will call gfs_statfs_init() -> gfs_statfs_start(), which takes the glock (a sketch of that command sequence appears at the end of this report). It doesn't appear that df would ever take it, and I can't see any other possibilities. It seems likely that the statfs_fast code is at fault somehow, and that code path is not used by QA AFAIK. Can you look at the dlm lockspace to see what the state of that glock is? You could also look at lock_dlm's copy: the lock_dlm dlm_lock_t struct is accessible through gl->gl_lock in the crash dump.

Last night I wasn't using statfs_fast, but I hacked up the test to set it on every mount so that it would be on during all the unmounting. I still haven't seen an unmount hang yet (it's been running for 3 hours now). Let me know if there's something else I should be trying.

```
struct dlm_lock 0x1009eb346c0
struct dlm_lock {
  dlm = 0x117fd14e200,
  lockname = {
    ln_number = 25,
    ln_type = 2
  },
  lvb = 0x0,
  lksb = {
    sb_status = 0,
    sb_lkid = 92930403,
    sb_flags = 0 '\0',
    sb_lvbptr = 0x0
  },
  cur = 5,
  req = 5,
  prev_req = 0,
  lkf = 0,
  type = 0,
  flags = 0,
  bast_mode = 0,
  uast_wait = {
    done = 0,
    wait = {
      lock = {
        lock = 1,
        magic = 3735899821
      },
      task_list = {
        next = 0x1009eb34728,
        prev = 0x1009eb34728
      }
    }
  },
  clist = { next = 0x100100, prev = 0x200200 },
  blist = { next = 0x100100, prev = 0x200200 },
  dlist = { next = 0x100100, prev = 0x200200 },
  slist = { next = 0x100100, prev = 0x200200 },
  hold_null = 0x0,
  posix = 0x0,
  null_list = { next = 0x0, prev = 0x0 }
}
```

The following messages are the same issue as bug 495600:

```
dlm: gfs0: cancel reply ret 0
lock_dlm: unlock sb_status 0 2,19 flags 0
dlm: gfs0: process_lockqueue_reply id 58a0163 state 0
```

Mar 30 06:18:29 -- dlm bug 495600 occurred on the fast_statfs glock 2,0x19. The messages in comment 26 show this. A holder is left waiting for that unlock to complete (which never will):

```
gl_req_gh = 0x11806b42180,
gl_req_bh = 0xffffffffa025255a <drop_bh>,

crash> struct gfs_holder 0x11806b42180
struct gfs_holder {
  gh_list = {
    next = 0x117f9a3f480,
    prev = 0x117f9a3f480
  },
  gh_gl = 0x117f9a3f428,
  gh_owner = 0x0,
  gh_state = 0,
  gh_flags = 1,
  gh_error = 0,
  gh_iflags = 52,
  gh_wait = {
    done = 0,
    wait = {
      lock = {
        lock = 1,
        magic = 3735899821
      },
      task_list = {
        next = 0x11806b421c8,
        prev = 0x11806b421c8
      }
    }
  }
}
```

When the umount happens, it blocks waiting for gfs_quotad to exit, but gfs_quotad is blocked trying to get an EX lock on the fast_statfs glock, which never happens because the unlock above is not completing.
```
crash> struct gfs_holder 0x117f85f9e88
struct gfs_holder {
  gh_list = {
    next = 0x117f9a3f490,
    prev = 0x117f9a3f490
  },
  gh_gl = 0x117f9a3f428,
  gh_owner = 0x117fa22f7f0,   /* pid 14987 gfs_quotad */
  gh_state = 1,
  gh_flags = 1056,
  gh_error = 0,
  gh_iflags = 2,
  gh_wait = {
    done = 0,
    wait = {
      lock = {
        lock = 1,
        magic = 3735899821
      },
      task_list = {
        next = 0x117f85f9da0,
        prev = 0x117f85f9da0
      }
    }
  }
}
```

I would think we'd see a process somewhere waiting for the unlock to complete, but I don't. This is the fast_statfs glock we're talking about, though, so it could be used in some incorrect, unknown way, different from normal glocks.

We're hoping that this is a duplicate of bug 495600. We're also hoping that a new fast_statfs implementation might materialize.

The patch for bug #488318 will fix this problem.

Is the customer willing to try the patch for bug #488318?

*** This bug has been marked as a duplicate of bug 488318 ***
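
As referenced earlier in the discussion, a minimal sketch of how statfs_fast would have been enabled on these file systems; the mount point is hypothetical, and the assumption (from the comments above) is that the settune path is what drives gfs_statfs_init():

```
# Enable fast statfs on a mounted GFS file system; per the analysis above,
# this is the path that calls gfs_statfs_init() -> gfs_statfs_start():
gfs_tool settune /mnt/gfs statfs_fast 1

# Read the tunables back to confirm the setting took effect:
gfs_tool gettune /mnt/gfs | grep statfs_fast
```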