Description of problem: If a node holding the listlock for a filesystem expires, recovery and mounting of that filesystem will hang. This is because in order to clean up the lockspace for GFS, you must first access the lockspace for GFS (which is where the jid mappings are stored). Since these jid-specific locks are being accessed without the IgnoreExpire flag set, they will block until the holder of the listlock is replayed, which cannot happen unless the listlock is held.

Version-Release number of selected component (if applicable): GFS-6.0.0-1.2

How reproducible: Very rare.

Steps to Reproduce:
1. Mount a GFS filesystem on Node1
2. Cause Node2 to expire
3. Crash Node1 while it holds the listlock
4. The GFS filesystem is now hung. No other node can mount or recover that filesystem until the entire lockspace is cleaned (all lock servers are shut down and then restarted) or the filesystem is given a new name.

Additional info: I ran into this problem twice while running some recovery tests. Both cases required that the nodes got bounced rather frequently (every couple of minutes).
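The circular wait described above can be sketched with a small model. This is purely illustrative Python, not the gulm code: the names `Lock` and `can_acquire` are hypothetical, and the boolean `ignore_expired` stands in for the IgnoreExpire flag behavior.

```python
# Hypothetical model of the recovery deadlock described above.
# These names are illustrative stand-ins, not the real gulm API.

class Lock:
    def __init__(self, holder=None, holder_expired=False):
        self.holder = holder
        self.holder_expired = holder_expired

def can_acquire(lock, ignore_expired=False):
    """A request on a lock held by an expired node blocks until that
    node's journal is replayed, unless the ignore-expired flag is set."""
    if lock.holder is None:
        return True
    if lock.holder_expired:
        return ignore_expired  # without the flag, blocks forever here
    return False

# Node1 died while holding the listlock:
listlock = Lock(holder="Node1", holder_expired=True)

# Recovery needs the listlock to read the jid mappings, but the
# request blocks on the expired holder, whose replay in turn needs
# the listlock -- the circular wait.
print(can_acquire(listlock))  # False
```

The deadlock is the last line: replay is the only thing that would clear the expired state, and replay itself needs the lock that is blocked.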
Created attachment 102993 [details]
insert a couple breakpoints into lock_gulm.o for testing

The attached patch adds some breakpoints for reproducing the bug. To reproduce the bug with this patch:
1. Mount GFS on Node1
2. Load lock_gulm.o with the breakpoint number on Node2: `insmod gulm_breakpoint=1 lock_gulm.o`
3. Crash Node1
4. Node2 will now panic when trying to recover Node1. Once this happens, no new nodes can mount (any other node that was mounted at the time will not be able to replay the journal for Node1 or Node2 either).
Created attachment 103002 [details]
make jid mappings ignore expired state of locks

The attached patch makes the jid mapping requests use the lg_lock_flag_IgnoreExp (ignore expired) flag when acquiring the listlock and journal locks. I've done some rather basic testing and it seems to work. I'm waiting for Mike Tilstra to review the code once he returns from vacation.

(Does this flag also need to be set on unlock requests? I don't think so, since that would mean the holder of the lock is trying to unlock its locks while expired, which is simply not allowed.)
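To make the effect of the patch concrete, here is a hypothetical sketch in Python (not the actual kernel code): `request`, `Lock`, and the `IGNORE_EXP` constant are invented stand-ins modeled on lg_lock_flag_IgnoreExp, showing how a jid mapping request that passed the flag would be granted despite the expired holder.

```python
# Hypothetical sketch of the fix. All names here are illustrative;
# IGNORE_EXP stands in for gulm's lg_lock_flag_IgnoreExp.

class Lock:
    def __init__(self, holder=None, holder_expired=False):
        self.holder = holder
        self.holder_expired = holder_expired

IGNORE_EXP = 0x1  # stand-in for lg_lock_flag_IgnoreExp

def request(lock, requester, flags=0):
    """Grant the request unless it would have to wait on an expired
    holder whose journal has not yet been replayed."""
    if lock.holder is None:
        return "granted"
    if lock.holder_expired and (flags & IGNORE_EXP):
        return "granted"  # the fix: recovery may proceed past expiry
    return "blocked"

listlock = Lock(holder="Node1", holder_expired=True)

print(request(listlock, "Node2"))              # blocked (the bug)
print(request(listlock, "Node2", IGNORE_EXP))  # granted (the fix)
```

With the flag set on the jid mapping requests, recovery can read the mappings and replay Node1's journal instead of waiting on it.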
Unlocks always work. The only thing that will block an unlock request is a state update to the slave servers, which happens pretty quickly. Looking at the patch now......
looks ok to me.
This bugzilla is reported to have been fixed years ago.