Bug 127031 - GFS won't mount 11th node
Summary: GFS won't mount 11th node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: gfs
Version: 3
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: michael conrad tadpol tilstra
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 137219
 
Reported: 2004-06-30 20:06 UTC by AJ Lewis
Modified: 2010-01-12 02:53 UTC
CC: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2004-07-12 16:10:14 UTC
Embargoed:


Attachments
patch to fix 11th node hang (580 bytes, patch)
2004-07-02 20:11 UTC, AJ Lewis


Links
Red Hat Product Errata RHBA-2004:379 (normal, SHIPPED_LIVE): Updated GFS packages, last updated 2004-07-12 04:00:00 UTC

Description AJ Lewis 2004-06-30 20:06:16 UTC
Description of problem:
I have 12 mounters in my GFS cluster, all connected to a NetApp via
iSCSI.  A 13th node is an SLM gulm server, also connected via iSCSI to
the NetApp.  The 11th node starts the mount process, and then hangs.

There is no output in syslog on the gulm server node, or the node
whose mount is hanging.

Version-Release number of selected component (if applicable):
GFS-6.0.0-1.2 GFS-modules-smp-6.0.0-1.2

How reproducible:
Always

Steps to Reproduce:
1. Start pool, ccsd, and lock_gulm on all nodes
2. Load gfs module on all nodes
3. Mount gfs file system on all nodes in turn
    

Actual Results:  On the 11th node, the mount process hangs.

Expected Results:  All 12 nodes should have mounted the GFS file system

Additional info:

I am running the cisco iscsi initiator from
http://sourceforge.net/projects/linux-iscsi version 3.6.0.2 to connect
these nodes to a NetApp iscsi target.  All nodes can talk to the
netapp.  The order of the nodes does not matter.

Comment 3 AJ Lewis 2004-07-02 20:11:20 UTC
Created attachment 101608 [details]
patch to fix 11th node hang

The attached patch unlocks the JID journal lock on an unlock callback request, before
grabbing the shared lock again.  It does not unhold the LVB, so the JID count contained
within it remains valid.
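
Roughly, the change amounts to this (illustrative C only; the callback and lock-primitive
names are stand-ins, not the exact symbols in lock_gulm):

/* Drop-lock callback for the journal header (JID) lock.  Before the
 * patch, this never released the shared hold; the patch releases it
 * first and then reholds it shared. */
static void jid_header_drop_callback(struct mount_ctx *mp)
{
        unlock(mp->jid_header_lock);   /* give up our shared hold so the
                                        * requester can take it exclusive */
        jid_rehold_lvbs(mp);           /* re-grab shared and reread the JID
                                        * count; the LVB itself was never
                                        * unheld, so the count stays valid */
}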

Comment 4 AJ Lewis 2004-07-05 23:27:18 UTC
Well, this solves the problem for 11 nodes if they aren't mounted simultaneously.  If they 
are mounted simultaneously, we still have the original hang/race.

Here's a more complete problem description - I believe this is correct - Mike, correct me if 
I'm wrong:

Background - in GFS 6.0, the gulm JID (Journal ID) server was removed, and a method was 
developed for keeping track of JIDs within the lock_gulm kernel module using LVBs.  The 
issue we're seeing is a race to get control of the journal header lock, whose LVB contains the 
number of JID mapping locks/LVBs that are currently allocated. (Note this does not mean 
all the mapping locks are actually mapped.)
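
Roughly, the state involved looks like this (illustrative C only; the struct names and LVB
layout are placeholders, not the actual lock_gulm definitions):

#include <stdint.h>

/* The journal header lock's LVB carries a single number: how many JID
 * mapping locks/LVBs have been allocated so far.  Allocated does not
 * mean mapped -- a slot can exist with no node using it. */
struct jid_header_lvb {
        uint32_t jid_count;
};

/* Each JID mapping lock's LVB records which node, if any, currently
 * owns that journal ID. */
struct jid_map_lvb {
        char owner[64];        /* empty string == slot is free */
};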

Problem - Currently, JID mapping locks are allocated in groups of 10.  When the first 
mounter tries to get a journal ID for itself, the lock_gulm module notes that there aren't 
any free JIDs (there are 0 allocated), and allocates the original group of 10.  The next 9 
nodes have no problems, since there are already 10 JID mapping locks allocated - they 
just fill in their information.  They simply hold the journal header LVB, and grab that lock 
shared.  The 11th mounter, however, needs to expand the JID mapping space.  It calls 
jid_grow_space(), which requires an exclusive lock on the journal header lock (in order to 
change the LVB value the lock holds).
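
In outline, the mount-time path is something like the following (pseudocode in C;
hold_lvb(), lock_shared(), find_free_jid() and friends are hypothetical stand-ins for the
real gulm calls):

#define JID_ALLOC_STEP 10      /* GFS 6.0 grows the JID space 10 locks at a time */

static int acquire_jid(struct mount_ctx *mp)
{
        hold_lvb(mp->jid_header_lock);    /* keep the header LVB pinned */
        lock_shared(mp->jid_header_lock); /* read jid_count; the shared hold
                                           * is kept for the life of the mount */

        int jid = find_free_jid(mp);      /* mounters 2-10 find a free slot here */
        if (jid >= 0) {
                claim_jid(mp, jid);       /* just fill in our information */
                return jid;
        }

        /* The first mounter (0 allocated) and the 11th mounter (all 10 slots
         * taken) land here: growing the space means trading the shared hold
         * for an exclusive one so jid_count in the header LVB can be
         * rewritten. */
        return jid_grow_space(mp, JID_ALLOC_STEP);
}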

When the 11th node attempts to get the exclusive lock, a notice is sent to the first 10 
nodes that someone wants the journal header lock exclusive, so they need to drop their 
shared hold of that lock.  The callback that is registered for this unlock request, however, 
doesn't ever unlock the journal header lock.  It simply calls jid_rehold_lvbs(), which grabs 
the journal header lock shared (which it already has) so it can reread the jid count in the 
journal header LVB. Each node needs to know this count so it can make sure it holds the 
LVBs that contain the JID mappings. But since the first 10 nodes never relinquish their 
shared hold on the journal header lock, the 11th node can never get it exclusive, and 
callbacks pile up, hanging the mount process on the 11th node and occasionally hanging 
unmount processes on some of the other 10 nodes.
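
The callback in question currently amounts to no more than this (illustrative names again):

/* Called on each of the first 10 nodes when the 11th node asks for the
 * journal header lock exclusive. */
static void jid_header_drop_callback(struct mount_ctx *mp)
{
        /* BUG: the shared hold on the journal header lock is never released
         * here, so the exclusive request from the node trying to grow the
         * JID space is never granted and drop callbacks keep piling up. */
        jid_rehold_lvbs(mp);   /* re-grabs the header lock shared (which we
                                * already hold) and rereads jid_count */
}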

The naive fix, which I submitted above, simply unlocks the journal header lock, and then 
immediately calls jid_rehold_lvbs().  This works until you get 11 *simultaneous* mounts, 
and then the race to get the journal header lock shows up again.

A temporary fix is to simply change the initial pool count to some large number, perhaps 
200 or 500, which we think is larger than any of our customers' clusters.  This will 
degrade JID operations for small clusters, but the slowdown quite possibly will not be 
noticeable (I plan on trying this tomorrow morning).

In the long term, we need to somehow guarantee the node trying to grow the JID pool (the 
11th mounter) gets the exclusive lock, and that every other node grabs the lock shared 
again and reholds all JID mapping lvbs immediately after.  If they do not, and the 11th 
mounter dies, there will be no record that the pool was grown, and the 11th JID map will 
be lost, potentially resulting in file system corruption since that journal will not be 
replayed.
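
In other words, the grow path eventually needs to follow an ordering like the one below (a
sketch only, with hypothetical helpers; the hard part, not shown, is guaranteeing the
grower actually wins the exclusive lock and that every other node reholds immediately):

static int jid_grow_space_safe(struct mount_ctx *mp, uint32_t step)
{
        lock_exclusive(mp->jid_header_lock);     /* must be guaranteed to win */

        uint32_t count = read_jid_count(mp) + step;
        allocate_jid_mapping_lvbs(mp, count);    /* create the new slots */
        write_jid_count(mp, count);              /* commit the new count to the LVB */

        unlock(mp->jid_header_lock);

        /* Every other node must now re-grab the header lock shared and rehold
         * all `count` mapping LVBs right away, so the grown space (and the
         * 11th journal) is still known even if this node dies before its
         * journal can be replayed. */
        return 0;
}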

There is also a very small possibility that the callbacks themselves are broken (I say a small 
possibility because I think if this were true very little would work with GFS 6.0) - if this is 
true, the code as it stands could be correct, and the bug we're seeing is merely a symptom 
of a larger problem.  This is something I need Mike's opinion on.

Comment 5 michael conrad tadpol tilstra 2004-07-06 16:10:37 UTC
Sounds right.  Shame on me for forgetting to call unlock from the
callback.  The `naive' fix should be the correct one.  Not sure why
simultaneous mounts are still locking up; I will dig into this.


Comment 6 michael conrad tadpol tilstra 2004-07-07 20:30:43 UTC
We have a temporary fix until a real solution can be found.
The temp fix changes the step at which the JID space is grown.  It was
10, now it is 300.  (The initial grow from 0 works, but subsequent grows
are racy.)  300 was picked because this is the advertised max cluster
size, so for the time being this should be OK.
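
Code-wise the temporary fix is just the one constant (illustrative symbol name; the actual
define in the source may differ):

#define JID_ALLOC_STEP 300     /* was 10; 300 is the advertised maximum
                                * cluster size, so grows beyond the initial
                                * one (the racy case) should never happen */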

Once a real fix is determined, we will put that in and truly mark this
as fixed.

Comment 7 Corey Marthaler 2004-07-08 20:32:25 UTC
This fix has been verified by QA.

Comment 8 David Lawrence 2004-07-12 16:10:14 UTC
An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-379.html


