127031 – GFS won't mount 11th node

Summary: GFS won't mount 11th node

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	gfs
Sub Component:
Version:	3
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	michael conrad tadpol tilstra
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	137219

Reported:	2004-06-30 20:06 UTC by AJ Lewis
Modified:	2010-01-12 02:53 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2004-07-12 16:10:14 UTC
Embargoed:

Attachments
patch to fix 11th node hang (580 bytes, patch) 2004-07-02 20:11 UTC, AJ Lewis	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2004:379	0	normal	SHIPPED_LIVE	Updated GFS packages	2004-07-12 04:00:00 UTC

Description AJ Lewis 2004-06-30 20:06:16 UTC

Description of problem:
I have 12 mounters in my GFS cluster, all connected to a NetApp via
iSCSI.  A 13th node is an SLM gulm server, also connected via iSCSI to
the NetApp.  The 11th node starts the mount process, and then hangs.

There is no output in syslog on the gulm server node, or the node
whose mount is hanging.

Version-Release number of selected component (if applicable):
GFS-6.0.0-1.2 GFS-modules-smp-6.0.0-1.2

How reproducible:
Always

Steps to Reproduce:
1. Start pool, ccsd, and lock_gulm on all nodes
2. Load gfs module on all nodes
3. Mount gfs file system on all nodes in turn
    

Actual Results:  at the 11th node, the mount process hangs.

Expected Results:  All 12 nodes should have mounted the GFS file system

Additional info:

I am running the Cisco iSCSI initiator from
http://sourceforge.net/projects/linux-iscsi version 3.6.0.2 to connect
these nodes to a NetApp iSCSI target.  All nodes can talk to the
NetApp.  The order of the nodes does not matter.

Comment 3 AJ Lewis 2004-07-02 20:11:20 UTC

Created attachment 101608
patch to fix 11th node hang

Attached patch unlocks the JID journal lock on an unlock callback request,
before grabbing the shared lock again.  It does not unhold the LVB, so the
JIDcount contained within it will remain valid.

Comment 4 AJ Lewis 2004-07-05 23:27:18 UTC

Well, this solves the problem for 11 nodes if they aren't mounted simultaneously. If they
are mounted simultaneously, we still have the original hang/race.

Here's a more complete problem description - I believe this is correct - Mike, correct me if
I'm wrong:

Background - in GFS 6.0, the gulm JID (Journal ID) server was removed, and a method was
developed for keeping track of JIDs within the lock_gulm kernel module using LVBs. The
issue we're seeing is a race to get control of the journal header lock, which contains the
number of JID mapping locks/lvbs that are currently allocated. (Note this does not mean
all the mapping locks are actually mapped.)

Problem - Currently, JID mapping locks are allocated in groups of 10. When the first
mounter tries to get a journal ID for itself, the lock_gulm module notes that there aren't
any free JIDs (there are 0 allocated), and allocates the original group of 10. The next 9
nodes have no problems, since there are already 10 JID mapping locks allocated - they
just fill in their information. They simply hold the journal header LVB, and grab that lock
shared. The 11th mounter, however, needs to expand the JID mapping space. It calls
jid_grow_space(), which requires an exclusive lock on the journal header lock (in order to
change the LVB value the lock holds).
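
The allocation pattern described above can be sketched in a few lines of Python. This is an illustrative model only, not code from lock_gulm: the function name mounters_that_grow and the idea of numbering mounters by arrival position are assumptions made for the example. It shows which mounters find no free JID mapping lock and therefore have to grow the space.

```python
def mounters_that_grow(num_mounters, step):
    """Return the 1-based positions of mounters that must grow the JID space.

    Hypothetical model: `step` is the number of JID mapping locks
    allocated per grow (10 in the behaviour described in this report).
    """
    allocated = 0  # JID mapping locks allocated so far
    growers = []
    for position in range(1, num_mounters + 1):
        if position > allocated:  # no free slot left for this mounter
            allocated += step
            growers.append(position)
    return growers


print(mounters_that_grow(12, 10))  # [1, 11]
```

With a step of 10 and 12 mounters, only the 1st and 11th mounters grow the space, which matches the node at which the hang appears.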

When the 11th node attempts to get the exclusive lock, a notice is sent to the first 10
nodes that someone wants the journal header lock exclusive, so they need to drop their
shared hold of that lock. The callback that is registered for this unlock request, however,
doesn't ever unlock the journal header lock. It simply calls jid_rehold_lvbs(), which grabs
the journal header lock shared (which it already has) so it can reread the jid count in the
journal header lvb. It needs to know this count, so each node can make sure they have a
hold on the lvbs that contain the jid mappings, but since the first 10 nodes don't
relinquish their shared hold on the journal header lock, the 11th node can't get it
exclusive, and callbacks pile up - locking the mount process on the 11th node, and
occasionally locking up unmount processes on some of the other 10 nodes.

The naive fix, which I submitted above, simply unlocks the journal header lock, and then
immediately calls jid_rehold_lvbs(). This works until you get 11 *simultaneous* mounts,
and then the race to get the journal header lock shows up again.
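
The hang and the naive fix can be illustrated with a toy, single-threaded model of a shared/exclusive lock. This is a sketch under stated assumptions, not the lock_gulm code: the class HeaderLock, both callback functions, and the rule that shared requests wait behind a pending exclusive request are all hypothetical. Because the model is sequential, it cannot reproduce the simultaneous-mount race; it only shows why a callback that never unlocks blocks the 11th mounter.

```python
class HeaderLock:
    """Toy model of the journal header lock (hypothetical, not lock_gulm)."""

    def __init__(self):
        self.shared = set()         # nodes holding the lock shared
        self.exclusive = None       # node holding the lock exclusive, if any
        self.want_exclusive = None  # node waiting for exclusive, if any
        self.queued_shared = []     # shared requests parked behind the waiter

    def lock_shared(self, node):
        if node in self.shared:
            return  # already held: asking again changes nothing
        if self.exclusive is not None or self.want_exclusive is not None:
            # ASSUMPTION: a pending exclusive request is served first.
            self.queued_shared.append(node)
        else:
            self.shared.add(node)

    def unlock_shared(self, node):
        self.shared.discard(node)

    def request_exclusive(self, node, callback):
        """Ask every shared holder to drop the lock, then try to take it."""
        self.want_exclusive = node
        for holder in sorted(self.shared):  # sorted() copies the set
            callback(self, holder)
        if not self.shared:
            self.exclusive, self.want_exclusive = node, None
        return self.exclusive == node

    def unlock_exclusive(self):
        self.exclusive = None
        self.shared.update(self.queued_shared)
        self.queued_shared.clear()


def buggy_callback(lock, node):
    """Models the original callback: re-take shared without unlocking."""
    lock.lock_shared(node)


def naive_fix_callback(lock, node):
    """Models the naive fix: unlock first, then ask for shared again."""
    lock.unlock_shared(node)
    lock.lock_shared(node)


def eleventh_mount(callback):
    lock = HeaderLock()
    for node in range(1, 11):  # the first 10 mounters hold the lock shared
        lock.lock_shared(node)
    granted = lock.request_exclusive(11, callback)
    return granted, lock


print(eleventh_mount(buggy_callback)[0])      # False: node 11 never gets the lock
print(eleventh_mount(naive_fix_callback)[0])  # True: node 11 can grow the space
```

In this model, once node 11 releases the exclusive lock with unlock_exclusive(), the ten queued shared requests are granted again, which corresponds to the other nodes re-reading the JID count.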

A temporary fix is to simply change the initial pool count to some large number, perhaps
200 or 500, which we think is larger than any of our customers' clusters. This will
degrade any JID operations for small clusters, but quite possibly will not be noticeable (I
plan on trying this tomorrow morning)

In the long term, we need to somehow guarantee the node trying to grow the JID pool (the
11th mounter) gets the exclusive lock, and that every other node grabs the lock shared
again and reholds all JID mapping lvbs immediately after. If they do not, and the 11th
mounter dies, there will be no record that the pool was grown, and the 11th JID map will
be lost, potentially resulting in file system corruption since that journal will not be
replayed.

There is also a very small possibility that the callbacks themselves are broken (I say a small
possibility because I think if this were true very little would work with GFS 6.0) - if this is
true, the code as it stands could be correct, and the bug we're seeing is merely a symptom
of a larger problem. This is something I need Mike's opinion on.

Comment 5 michael conrad tadpol tilstra 2004-07-06 16:10:37 UTC

Sounds right.  Shame on me for forgetting to call unlock from the
callback.  The `naive' fix should be the correct one.  Not sure why
simultaneous mounts are locking; will dig into this.

Comment 6 michael conrad tadpol tilstra 2004-07-07 20:30:43 UTC

We have a temporary fix until a real solution can be found.
The temp fix changes the step at which the jid space is grown: it was
10 and is now 300.  (The initial grow from 0 is working, but following
grows are racy.)  300 was picked because this is the advertised max
cluster size.  So for the time being this should be ok.
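
The effect of the larger step can be checked with simple arithmetic. The snippet below is a hypothetical illustration, not part of the fix: followup_grows is a made-up helper that counts how many grows happen after the initial grow from 0, which are the racy ones, for a given cluster size and step.

```python
def followup_grows(cluster_size, step):
    """Count grows after the initial one (hypothetical helper).

    Each grow allocates `step` JID mapping locks, so a cluster needs
    ceil(cluster_size / step) grows in total; all but the first are
    the follow-up grows described as racy.
    """
    if cluster_size <= 0:
        return 0
    total_grows = (cluster_size + step - 1) // step  # integer ceiling
    return total_grows - 1


for step in (10, 300):
    print(step, followup_grows(12, step), followup_grows(300, step))
```

With a step of 300, no cluster up to the advertised maximum of 300 nodes needs a follow-up grow, so the racy path is never taken.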

Once a real fix is determined, we will put that in and truly mark this
as fixed.

Comment 7 Corey Marthaler 2004-07-08 20:32:25 UTC

This fix has been verified by QA

Comment 8 David Lawrence 2004-07-12 16:10:14 UTC

An erratum has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-379.html
