Description of problem:
I have 12 mounters in my GFS cluster, all connected to a NetApp via
iSCSI. A 13th node is an SLM gulm server, also connected via iSCSI to
the NetApp. The 11th node starts the mount process, and then hangs.
There is no output in syslog on the gulm server node, or the node
whose mount is hanging.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Start pool, ccsd, and lock_gulm on all nodes
2. Load gfs module on all nodes
3. Mount gfs file system on all nodes in turn
Actual Results: at the 11th node, the mount process hangs.
Expected Results: All 12 nodes should have mounted the GFS file system
I am running the cisco iscsi initiator from
http://sourceforge.net/projects/linux-iscsi version 188.8.131.52 to connect
these nodes to a NetApp iscsi target. All nodes can talk to the
netapp. The order of the nodes does not matter.
Created attachment 101608
patch to fix 11th node hang
Attached patch unlocks the JID journal lock on an unlock callback request,
before grabbing the shared lock again. It does not unhold the LVB, so the
JIDcount contained within it will remain valid.
Well, this solves the problem for 11 nodes if they aren't mounted simultaneously. If they
are mounted simultaneously, we still have the original hang/race.
Here's a more complete problem description - I believe this is correct - Mike, correct me if
I'm wrong.
Background - in GFS 6.0, the gulm JID (Journal ID) server was removed, and a method was
developed for keeping track of JIDs within the lock_gulm kernel module using LVBs. The
issue we're seeing is a race to get control of the journal header lock, which contains the
number of JID mapping locks/lvbs that are currently allocated. (Note this does not mean
all the mapping locks are actually mapped.)
Problem - Currently, JID mapping locks are allocated in groups of 10. When the first
mounter tries to get a journal ID for itself, the lock_gulm module notes that there aren't
any free JIDs (there are 0 allocated), and allocates the original group of 10. The next 9
nodes have no problems, since there are already 10 JID mapping locks allocated - they
just fill in their information. They simply hold the journal header LVB, and grab that lock
shared. The 11th mounter, however, needs to expand the JID mapping space. It calls
jid_grow_space(), which requires an exclusive lock on the journal header lock (in order to
change the LVB value the lock holds).
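To make the allocation pattern above concrete, here is a toy model of which mounters end up needing to grow the JID mapping space. It is only a sketch: the function and variable names are made up for illustration and are not taken from lock_gulm, and it assumes mounts happen one at a time.

```python
# Toy model only: names are hypothetical, not lock_gulm identifiers.
def mounters_that_grow(num_mounters, step=10):
    """Return the 1-based positions of mounters that must grow the JID pool."""
    allocated = 0  # JID mapping locks allocated so far (the count kept in the header LVB)
    used = 0       # JID mappings already claimed by earlier mounters
    growers = []
    for position in range(1, num_mounters + 1):
        if used == allocated:
            # No free JID: this mounter would need the journal header lock
            # exclusive in order to grow the pool by one group.
            allocated += step
            growers.append(position)
        used += 1  # the mounter fills in its own JID mapping
    return growers


print(mounters_that_grow(12))  # [1, 11]
```

With 12 mounters and groups of 10, only the 1st and the 11th mounter need the exclusive lock, which matches where the hang shows up.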
When the 11th node attempts to get the exclusive lock, a notice is sent to the first 10
nodes that someone wants the journal header lock exclusive, so they need to drop their
shared hold of that lock. The callback that is registered for this unlock request, however,
doesn't ever unlock the journal header lock. It simply calls jid_rehold_lvbs(), which grabs
the journal header lock shared (which it already has) so it can reread the jid count in the
journal header lvb. It needs to know this count, so each node can make sure they have a
hold on the lvbs that contain the jid mappings, but since the first 10 nodes don't
relinquish their shared hold on the journal header lock, the 11th node can't get it
exclusive, and callbacks pile up - locking the mount process on the 11th node, and
occasionally locking up unmount processes on some of the other 10 nodes.
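The stall described above can be sketched with a small single-threaded model of a shared/exclusive lock. This is not the gulm locking code: HeaderLock, both callbacks, and eleventh_mount_succeeds are hypothetical names, and the model assumes an exclusive request is granted only once every other shared holder has dropped the lock. Re-taking the lock shared after the grower is done is deliberately left out.

```python
# Toy model only: class and function names are hypothetical, not gulm code.
class HeaderLock:
    def __init__(self):
        self.shared = set()    # nodes holding the lock shared
        self.exclusive = None  # node holding the lock exclusive, if any

    def lock_shared(self, node):
        if self.exclusive is None:
            self.shared.add(node)
            return True
        return False

    def unlock(self, node):
        self.shared.discard(node)
        if self.exclusive == node:
            self.exclusive = None

    def try_exclusive(self, node, callback):
        # Ask every other shared holder to drop the lock.
        for holder in sorted(self.shared - {node}):
            callback(self, holder)
        # Grant exclusive only if nobody else still holds the lock.
        if not (self.shared - {node}) and self.exclusive is None:
            self.shared.discard(node)
            self.exclusive = node
            return True
        return False


def buggy_callback(lock, holder):
    # Behaviour as described: re-take the lock shared (already held), never unlock.
    lock.lock_shared(holder)


def unlocking_callback(lock, holder):
    # Drop the shared hold so the exclusive request can be granted.
    lock.unlock(holder)


def eleventh_mount_succeeds(callback):
    lock = HeaderLock()
    for node in range(1, 11):  # the first 10 mounters hold the lock shared
        lock.lock_shared(node)
    return lock.try_exclusive(11, callback)


print(eleventh_mount_succeeds(buggy_callback))      # False
print(eleventh_mount_succeeds(unlocking_callback))  # True
```

In this model the 11th node can never get the lock exclusive as long as the callback only re-takes the shared hold. The model is synchronous, so it says nothing about the simultaneous-mount race.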
The naive fix, which I submitted above, simply unlocks the journal header lock, and then
immediately calls jid_rehold_lvbs(). This works until you get 11 *simultaneous* mounts,
and then the race to get the journal header lock shows up again.
A temporary fix is to simply change the initial pool count to some large number, perhaps
200 or 500, which we think is larger than any of our customers' clusters. This will
degrade any JID operations for small clusters, but quite possibly will not be noticeable. (I
plan on trying this tomorrow morning.)
In the long term, we need to somehow guarantee the node trying to grow the JID pool (the
11th mounter) gets the exclusive lock, and that every other node grabs the lock shared
again and reholds all JID mapping lvbs immediately after. If they do not, and the 11th
mounter dies, there will be no record that the pool was grown, and the 11th JID map will
be lost, potentially resulting in file system corruption since that journal will not be
There is also a very small possibility that the callbacks themselves are broken (I say a small
possibility because I think if this were true very little would work with GFS 6.0) - if this is
true, the code as it stands could be correct, and the bug we're seeing is merely a symptom
of a larger problem. This is something I need Mike's opinion on.
Sounds right. Shame on me for forgetting to call unlock from the
callback. The `naive' fix should be the correct one. Not sure why
simultaneous mounts are locking up; I will dig into this.
We have a temporary fix until a real solution can be found.
The temp fix changes the step by which the jid space is grown. It was
10 and is now 300. (The initial grow from 0 works, but subsequent grows
are racy.) 300 was picked because this is the advertised max cluster
size, so for the time being this should be ok.
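As a rough check on why a step of 300 sidesteps the race, the sketch below counts how many grows happen after the initial one for a given cluster size. The function name is hypothetical, and it assumes one JID per mounter and that the pool grows by exactly one step each time it runs out.

```python
# Toy arithmetic only: racy_grows is a hypothetical helper, not GFS code.
def racy_grows(cluster_size, step):
    """Count pool grows after the initial grow from 0 (the ones reported as racy)."""
    if cluster_size <= 0:
        return 0
    total_grows = -(-cluster_size // step)  # ceiling division
    return total_grows - 1


print(racy_grows(12, 10))    # 1
print(racy_grows(300, 10))   # 29
print(racy_grows(300, 300))  # 0
```

Under these assumptions, any cluster up to the advertised maximum of 300 nodes only ever performs the initial grow, so the racy path is not reached.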
Once a real fix is determined, we will put that in and truly mark this
This fix has been verified by QA
An errata has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.