Bug 553803
| Field | Value |
|---|---|
| Summary | GFS2: recovery stuck on transaction lock |
| Product | Red Hat Enterprise Linux 5 |
| Component | kernel |
| Version | 5.5 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | urgent |
| Reporter | Nate Straz <nstraz> |
| Assignee | Robert Peterson <rpeterso> |
| QA Contact | Cluster QE <mspqa-list> |
| CC | adas, bmarzins, brsmith, cww, dhoward, rpeterso, ssaha, swhiteho, teigland |
| Keywords | ZStream |
| Target Milestone | rc |
| Target Release | --- |
| Doc Type | Bug Fix |
| Cloned As | 672600 (view as bug list) |
| Bug Depends On | 570263, 590878 |
| Bug Blocks | 672600, 733678 |
| Last Closed | 2011-07-21 10:29:01 UTC |
**Description** (Nate Straz, 2010-01-08 22:18:31 UTC)
I can't see anything wrong with this:

```
G:  s:UN n:1/2 f:l t:SH d:EX/0 l:0 a:0 r:5
 H: s:SH f:epcW e:0 p:2971 [gfs2_recoverd] gfs2_recover_journal+0x182/0x814 [gfs2]
 H: s:SH f:W e:0 p:2988 [d_doio] gfs2_do_trans_begin+0xae/0x119 [gfs2]
```

The question is: what is holding the lock in a different mode? The only time we take the transaction lock in a mode other than shared is when freezing the fs, so it looks like another node has probably frozen the fs, which is what is preventing recovery from proceeding.

The other two nodes had been rebooted by revolver; morph-03 should be the only node left with valid locks.

Well, GFS2 thinks something else is holding that lock, so I guess the next step is to look at the dlm lock dumps to see what it thinks has happened.

Created attachment 382989 [details]
dlm lock dump from morph-03, morph-cluster2 lockspace
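Dumps like the glock trace above can be triaged mechanically. Here is a minimal sketch assuming the RHEL5-era glock dump layout quoted in this report; the regexes and the `summarize` helper are illustrative, not GFS2 code:

```python
import re

# Assumed field layout, based only on the dump quoted in this report.
GLOCK_RE = re.compile(r'G:\s+s:(?P<state>\S+) n:(?P<num>\S+) f:(?P<flags>\S*) t:(?P<target>\S+)')
HOLDER_RE = re.compile(r'H:\s+s:(?P<state>\S+) f:(?P<flags>\S+) e:\d+ p:(?P<pid>\d+) \[(?P<comm>[^\]]+)\]')

def summarize(dump):
    """Split a glock dump into its state/target and granted vs waiting holders."""
    glock = GLOCK_RE.search(dump).groupdict()
    granted, waiting = [], []
    for m in HOLDER_RE.finditer(dump):
        h = m.groupdict()
        # A 'W' in the holder flags means the request is still waiting.
        (waiting if 'W' in h['flags'] else granted).append(h)
    return {'state': glock['state'], 'target': glock['target'],
            'granted': granted, 'waiting': waiting}

dump = """G:  s:UN n:1/2 f:l t:SH d:EX/0 l:0 a:0 r:5
 H: s:SH f:epcW e:0 p:2971 [gfs2_recoverd] gfs2_recover_journal+0x182/0x814 [gfs2]
 H: s:SH f:W e:0 p:2988 [d_doio] gfs2_do_trans_begin+0xae/0x119 [gfs2]"""

info = summarize(dump)
# Unlocked glock, target SH, two waiters, nothing granted: the stuck picture.
print(info['state'], info['target'], len(info['waiting']), len(info['granted']))
```

Running this against the dump reports an unlocked glock with two SH waiters (gfs2_recoverd and d_doio) and nothing granted, which matches the diagnosis above.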
Here are the dlm locks still on morph-03.
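When lining a glock dump up against a dlm lock dump, the lock-mode translation matters: lock_dlm maps GFS2's glock states onto DLM modes. A quick-reference sketch (the dict and helper names are ours; the mapping follows lock_dlm):

```python
# GFS2 glock state -> DLM lock mode, as used by lock_dlm.
GLOCK_TO_DLM = {
    'UN': 'NL',  # unlocked  -> null lock
    'SH': 'PR',  # shared    -> protected read
    'DF': 'CW',  # deferred  -> concurrent write
    'EX': 'EX',  # exclusive -> exclusive
}

def dlm_mode(glock_state):
    return GLOCK_TO_DLM[glock_state]

# The recovery holder wants s:SH, so the dlm dump should show a PR request;
# instead the resource discussed below has only a granted NL lock.
print(dlm_mode('SH'))
print(dlm_mode('UN'))
```

So an SH request on the glock side should appear as a PR lock or conversion in the dlm dump.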
So this is the corresponding dlm lock:

```
Resource f1c2aac0 Name (len=24) " 1 2"
Master Copy
Granted Queue
012b0002 NL
Conversion Queue
Waiting Queue
```

and it claims that there is one NL mode lock, which seems to be local. There is nothing in the queue for some reason; I would expect to see at least the SH lock request. It's not at all obvious what is going on here.

This looks very similar to #570363. Dave, can you take a look at this and see if you can figure out what's going wrong? We have a node stuck on the transaction lock. This lock is only ever taken in the SH (PR) mode except when the fs is being frozen. Since there is no fsfreeze going on here, we'd not expect to see anything except the following transitions:

```
new -> PR
PR  -> NL
NL  -> PR
NL  -> unlocked
```

It looks like the stuck node has sent a request to the dlm, but for some strange reason it doesn't appear to show up in the dlm lock dump. Bearing in mind that this report was prior to the recent dlm fix (rhel6/upstream), I suppose that it is just possible that this is the same thing. I'm not sure how we can prove that, though.

I'd look at the NOEXP conditions in lock_dlm and the cases where locks granted during recovery need to be dropped by lock_dlm and reacquired. There were definitely some subtle conditions and uncommon corner cases in those parts of lock_dlm -- I was pretty sure that stuff would break as a result of the lock_dlm rewrite/rework early in RHEL5 (which went along with me transferring lock_dlm ownership to the gfs2 folks). The main rule was that only locks used for doing journal recovery could be acquired until journal recovery was complete. If the dlm granted any other lock (which happens due to dlm recovery), lock_dlm needed to release it and attempt to reacquire it.

The rule is still the same, and the code in lock_dlm which deals with that is pretty much unchanged in rhel5. We've just fixed a similar (the same?) bug in rhel6 where the problem was that glocks with noexp which were queued after recovery had started were occasionally getting stuck. This was happening where the reply from the DLM arrived after recovery had started, but before the noexp request had been queued (i.e. there was a race with another lock-acquiring process). That was relatively easy to fix, since the GLF_FROZEN flag indicated which glocks are frozen with pending replies. It will be more complicated to fix in rhel5, since most of the knowledge regarding frozen locks is within lock_dlm, but the place which needs to be updated in order to check for the particular condition found in rhel6 is in gfs2 (in the glock state machine). So, assuming for the moment that we have the same problem here, the fix will have to be different, and I suspect more complicated.

This might be related to #656032. Bob, I'm wondering whether we should use this bz to apply the transaction lock bit of the #656032 patch.

I'd prefer to use the upstream solution, which is the one-liner here:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=846f40455276617275284a4b76b89311b4aed0b9

There is no need to add the flush back in the umount path, because it is already there (just hidden in one of the functions that are already called in that code path). That way we can get part of the #656032 patch in sooner rather than later, since we know that this does fix a real bug.

Excellent idea. Requesting ack flags to get this into a release.

Created attachment 475208 [details]
Patch to fix the problem

Here is the RHEL5 crosswrite patch.
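The allowed transaction-lock transitions called out earlier in this bug (new to PR, PR to NL, NL to PR, NL to unlocked, absent an fsfreeze) can be encoded as a small sanity check. This is a sketch for reasoning about traces, not kernel code:

```python
# Only these transaction-lock transitions are expected when no fsfreeze
# is in progress, per the discussion in this bug.
ALLOWED = {
    ('new', 'PR'),
    ('PR', 'NL'),
    ('NL', 'PR'),
    ('NL', 'unlocked'),
}

def first_bad_transition(trace):
    """Return the first transition not in ALLOWED, or None if the trace is clean."""
    for src, dst in zip(trace, trace[1:]):
        if (src, dst) not in ALLOWED:
            return (src, dst)
    return None

# A healthy node cycling the transaction lock:
print(first_bad_transition(['new', 'PR', 'NL', 'PR', 'NL', 'unlocked']))  # None
# A freeze (or a bug) shows up as an unexpected jump, e.g. straight to EX:
print(first_bad_transition(['new', 'PR', 'NL', 'EX']))  # ('NL', 'EX')
```

Any trace that trips this check would indicate either a freeze in progress or the kind of misbehavior investigated here.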
The patch was posted to rhkernel-list for inclusion into RHEL5.7. Changing status to POST.

Cloned for a RHEL6 crosswrite as bug #672600.

in kernel-2.6.18-242.el5. You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5. Detailed testing feedback is always welcomed.

I ran the load for over a week and I have not been able to reproduce this recovery error. Calling it verified on 2.6.18-261.el5.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html