Bug 127031
| Summary: | GFS won't mount 11th node |
|---|---|
| Product: | [Retired] Red Hat Cluster Suite |
| Component: | gfs |
| Version: | 3 |
| Hardware: | i686 |
| OS: | Linux |
| Status: | CLOSED ERRATA |
| Severity: | medium |
| Priority: | medium |
| Reporter: | AJ Lewis <157070.alewis> |
| Assignee: | michael conrad tadpol tilstra <mtilstra> |
| QA Contact: | Cluster QE <mspqa-list> |
| CC: | amanthei, kanderso |
| Doc Type: | Bug Fix |
| Bug Blocks: | 137219 |
| Attachments: | patch to fix 11th node hang (attachment 101608) |
| Last Closed: | 2004-07-12 16:10:14 UTC |
Description (AJ Lewis, 2004-06-30 20:06:16 UTC)

Created attachment 101608: patch to fix 11th node hang

The attached patch unlocks the JID journal header lock when an unlock callback request arrives, before grabbing the shared lock again. It does not unhold the LVB, so the JID count contained within it remains valid.
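
For reference, here is a minimal, compilable sketch of the callback logic the patch describes. It is not the attached patch itself: the lock helpers (jid_header_unlock(), jid_header_lock_shared()) and the counter variable are stand-ins invented for illustration, and only jid_rehold_lvbs() and the unlock-before-rehold ordering come from the description above.

```c
/*
 * Illustrative sketch only, not the actual lock_gulm code.  The lock
 * calls below are stubs so the control flow can be compiled and traced.
 */
#include <stdio.h>

static unsigned int jid_count_lvb = 10;  /* JID count carried in the header LVB */

/* Stub lock operations standing in for the real gulm lock requests. */
static void jid_header_unlock(void)      { puts("released shared hold on JID header lock"); }
static void jid_header_lock_shared(void) { puts("re-acquired JID header lock shared"); }

/* Re-grab the header lock shared and reread the JID count from its LVB. */
static void jid_rehold_lvbs(void)
{
        jid_header_lock_shared();
        printf("JID mapping locks allocated: %u\n", jid_count_lvb);
}

/*
 * Callback run when another node asks for the JID header lock exclusive.
 * The original code only called jid_rehold_lvbs(), never dropping the
 * shared hold; the patch drops it first.  The LVB itself stays held, so
 * the count it carries remains valid across the unlock.
 */
static void jid_header_drop_callback(void)
{
        jid_header_unlock();    /* the step the original callback was missing */
        jid_rehold_lvbs();
}

int main(void)
{
        jid_header_drop_callback();
        return 0;
}
```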
Well, this solves the problem for 11 nodes if they aren't mounted simultaneously. If they are mounted simultaneously, we still have the original hang/race. Here's a more complete problem description - I believe this is correct - Mike, correct me if I'm wrong:

Background - In GFS 6.0, the gulm JID (Journal ID) server was removed, and a method was developed for keeping track of JIDs within the lock_gulm kernel module using LVBs. The issue we're seeing is a race to get control of the journal header lock, whose LVB contains the number of JID mapping locks/LVBs that are currently allocated. (Note this does not mean all the mapping locks are actually mapped.)

Problem - Currently, JID mapping locks are allocated in groups of 10. When the first mounter tries to get a journal ID for itself, the lock_gulm module notes that there aren't any free JIDs (there are 0 allocated) and allocates the original group of 10. The next 9 nodes have no problems, since there are already 10 JID mapping locks allocated - they just fill in their information. They simply hold the journal header LVB and grab that lock shared.

The 11th mounter, however, needs to expand the JID mapping space. It calls jid_grow_space(), which requires an exclusive lock on the journal header lock (in order to change the LVB value the lock holds). When the 11th node attempts to get the exclusive lock, a notice is sent to the first 10 nodes that someone wants the journal header lock exclusive, so they need to drop their shared hold of that lock. The callback registered for this unlock request, however, never unlocks the journal header lock. It simply calls jid_rehold_lvbs(), which grabs the journal header lock shared (which it already holds) so it can reread the JID count in the journal header LVB. Each node needs to know this count so it can make sure it has a hold on the LVBs that contain the JID mappings. But since the first 10 nodes never relinquish their shared hold on the journal header lock, the 11th node can't get it exclusive, and callbacks pile up - locking up the mount process on the 11th node, and occasionally locking up unmount processes on some of the other 10 nodes.

The naive fix, which I submitted above, simply unlocks the journal header lock and then immediately calls jid_rehold_lvbs(). This works until you get 11 *simultaneous* mounts, and then the race to get the journal header lock shows up again.

A temporary fix is to simply change the initial pool count to some large number, perhaps 200 or 500, which we think is larger than any of our customers' clusters. This will degrade JID operations for small clusters, but quite possibly will not be noticeable. (I plan on trying this tomorrow morning.)

In the long term, we need to somehow guarantee that the node trying to grow the JID pool (the 11th mounter) gets the exclusive lock, and that every other node grabs the lock shared again and reholds all JID mapping LVBs immediately after. If they do not, and the 11th mounter dies, there will be no record that the pool was grown, and the 11th JID map will be lost, potentially resulting in file system corruption since that journal will not be replayed.

There is also a very small possibility that the callbacks themselves are broken (I say a small possibility because I think if this were true, very little would work in GFS 6.0). If that is the case, the code as it stands could be correct, and the bug we're seeing is merely a symptom of a larger problem. This is something I need Mike's opinion on.
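
To make the flow above concrete, here is a small userspace model of the mount-time JID lookup, assuming hypothetical names throughout (JID_GROW_STEP, claim_free_jid_mapping(), get_journal_id()); only jid_grow_space() is an identifier from this report. It shows where the exclusive-lock requirement, and therefore the race, enters the picture, and why raising the grow step to something larger than any real cluster (the temporary fix discussed below) avoids growth for everyone but the very first mounter.

```c
/* Sketch with stand-in names; the real lock_gulm code differs. */
#include <stdbool.h>
#include <stdio.h>

#define JID_GROW_STEP 300   /* temporary fix raises this from 10; name is hypothetical */

static unsigned int jid_count_lvb = 0;   /* value carried in the journal header LVB */
static unsigned int jids_in_use   = 0;   /* mappings actually claimed so far */

/* Requires only a shared hold on the journal header lock. */
static bool claim_free_jid_mapping(unsigned int *jid)
{
        if (jids_in_use >= jid_count_lvb)
                return false;            /* no free JID mapping lock allocated */
        *jid = jids_in_use++;
        return true;
}

/* Requires the journal header lock *exclusive*: the step the mounters race on. */
static void jid_grow_space(unsigned int new_count)
{
        printf("growing JID space from %u to %u (needs exclusive lock)\n",
               jid_count_lvb, new_count);
        jid_count_lvb = new_count;
}

static unsigned int get_journal_id(void)
{
        unsigned int jid;

        while (!claim_free_jid_mapping(&jid))
                jid_grow_space(jid_count_lvb + JID_GROW_STEP);
        return jid;
}

int main(void)
{
        /* With a 300-entry step, only the very first mounter ever grows the space. */
        for (int i = 0; i < 11; i++)
                printf("mounter %d got JID %u\n", i + 1, get_journal_id());
        return 0;
}
```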
Sounds right. Shame on me for forgetting to call unlock from the callback. The `naive' fix should be the correct one. Not sure why simultaneous mounts are locking up - will dig into this.

We have a temporary fix until a real solution can be found. The temp fix changes the step at which the JID space is grown: it was 10, it is now 300. (The initial grow from 0 is working, but subsequent grows are racy.) 300 was picked because this is the advertised maximum cluster size, so for the time being this should be OK. Once a real fix is determined, we will put that in and truly mark this as fixed.

This fix has been verified by QA.

An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-379.html