Description of problem:
If a node holding the listlock for a filesystem expires, recovery and
mounting of that filesystem will hang. This is because, in order to
clean up the lock space for GFS, you must first access the lock space
for GFS (which is where the jid mappings are stored). Since these
jid-specific locks are being accessed without the IgnoreExpire flag set,
they will block until the holder of the listlock is replayed, which
cannot happen unless the listlock is held.
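The circular wait described above can be illustrated with a small toy model. This is only a sketch of the behaviour as described in this report, not code from lock_gulm or the lock server; `toy_lock`, `toy_request_granted`, `toy_recovery_can_start` and `REQ_IGNORE_EXPIRED` are hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the ignore-expired request flag.
 * The value is made up for this sketch. */
#define REQ_IGNORE_EXPIRED 0x1u

/* Toy model of one lock as seen by the lock server. */
struct toy_lock {
    bool held;            /* some node currently holds the lock */
    bool holder_expired;  /* ...and that holder has expired */
};

/* Returns true if a request would be granted now, false if it would block. */
bool toy_request_granted(const struct toy_lock *lk, unsigned flags)
{
    if (!lk->held)
        return true;
    /* Locks of an expired holder stay frozen until its journal is
     * replayed, unless the requester asks to ignore the expired state. */
    if (lk->holder_expired && (flags & REQ_IGNORE_EXPIRED))
        return true;
    return false;
}

/* Recovery has to read the jid mappings first, so it can only start
 * if its request for the listlock is granted. */
bool toy_recovery_can_start(const struct toy_lock *listlock, unsigned flags)
{
    return toy_request_granted(listlock, flags);
}
```

In this model, a listlock whose holder has expired makes `toy_recovery_can_start(&listlock, 0)` return false, and recovery is the only thing that could clear the expired state, so nothing ever makes progress. Passing `REQ_IGNORE_EXPIRED` breaks the cycle.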
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Mount a GFS filesystem on Node1
2. Cause Node2 to expire
3. Crash Node1 while it holds the listlock
4. The GFS filesystem is now hung. No other node can mount or recover
that filesystem until the entire lockspace is cleaned (all lock
servers are shut down and then restarted) or the filesystem is given
a new name.
I ran into this problem twice while running some recovery tests. Both
cases required that the nodes got bounced rather frequently (every
Created attachment 102993
insert a couple breakpoints into lock_gulm.o for testing
The attached patch creates some breakpoints for reproducing the bug. To
reproduce the bug with this patch:
1. Mount GFS on Node1
2. Load lock_gulm.o with the breakpoint number on Node2
`insmod lock_gulm.o gulm_breakpoint=1`
3. Crash Node1
4. Node2 will now panic when trying to recover Node1. Once this happens, no
new nodes can mount (any other node that had the filesystem mounted at the
time will not be able to replay the journal for Node1 or Node2 either)
Created attachment 103002
make jid mappings ignore expired state of locks
The attached patch makes the jid mapping requests use the
lg_lock_flag_IgnoreExp (ignore expired flag) when acquiring the listlock and
journal locks. I've done some rather basic testing and it seems to work. I'm
waiting for Mike Tilstra to review the code once he returns from vacation.
(Does this need to also be set when dealing with unlock requests? I don't
think it does, since it would mean that the holder of the lock is trying to
unlock its locks while expired, which is just not allowed)
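As a rough illustration of the change (not the patch itself), the idea is that jid mapping lock requests carry the ignore-expired flag while unlock requests are left alone. The names `toy_op`, `toy_jid_request_flags` and `TOY_FLAG_IGNORE_EXP` are hypothetical, and the flag value is made up; the real lg_lock_flag_IgnoreExp comes from the gulm headers.

```c
#include <assert.h>

/* Hypothetical stand-in for lg_lock_flag_IgnoreExp; the value here is
 * invented for the sketch and is not the real constant. */
#define TOY_FLAG_IGNORE_EXP 0x1u

/* The two kinds of jid mapping request this sketch distinguishes. */
enum toy_op { TOY_OP_LOCK, TOY_OP_UNLOCK };

/* Compute the flags for a jid mapping request on the listlock or a
 * journal lock. Lock requests get the ignore-expired flag added;
 * unlock requests keep whatever flags the caller passed in. */
unsigned toy_jid_request_flags(enum toy_op op, unsigned base_flags)
{
    if (op == TOY_OP_LOCK)
        return base_flags | TOY_FLAG_IGNORE_EXP;
    return base_flags;
}
```

Leaving the unlock path untouched follows the assumption in the parenthetical above that an expired holder is not allowed to unlock its own locks anyway.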
Unlocks always work. The only thing that will block an unlock request
is a state update to the slave servers, which happens pretty quickly.
looking at patch now......
looks ok to me.
This bugzilla is reported to have been fixed years ago.