Description of problem:

This was found while trying to reproduce bz 299061. Running the test in comment 17 of that bz seems to reproduce this bug quite easily, at least in my Xen cluster, which is not SMP. The bug may be harder to hit on SMP machines.

It's easy to tell if you've hit this bug, because a message like this will always appear in /var/log/messages:

    SM: 02000378 ignoring service callback id=2000144 event=1324

If you look at /proc/cluster/lock_dlm/debug on this node at this point, you'll see something like this at the end, which shows what the problem is:

    others_may_mount start_done 1322 b

The event_id that others_may_mount() uses when calling kcl_start_done() is incorrect; it's using 1322 when it should be 1324.

I believe the fix is for others_may_mount() to read the event_id after taking the umount_lock semaphore, which serializes others_may_mount() with a start callback from the lock_dlm thread. In this case, I believe the start callback is changing the event_id after others_may_mount() reads it and before others_may_mount() gets the umount_lock semaphore.
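To illustrate the suspected ordering problem, here is a minimal user-space sketch in C. It is not the lock_dlm code and not the attached patch: struct mount_state, the pthread mutex standing in for the umount_lock semaphore, and the interleaved_id parameter that forces the start callback to land inside the race window are all assumptions made for illustration. Each function returns the id that would be passed to kcl_start_done().

```c
#include <pthread.h>

/* Hypothetical stand-in for the per-mount lock_dlm state. */
struct mount_state {
    pthread_mutex_t umount_lock; /* stands in for the umount_lock semaphore */
    int event_id;                /* id recorded by the last start callback */
};

/* Start callback: records the new event id while holding umount_lock. */
static void start_callback(struct mount_state *ms, int new_event_id)
{
    pthread_mutex_lock(&ms->umount_lock);
    ms->event_id = new_event_id;
    pthread_mutex_unlock(&ms->umount_lock);
}

/*
 * Suspected buggy ordering: event_id is read before the lock is taken.
 * A nonzero interleaved_id simulates a start callback arriving between
 * the read and the lock acquisition.
 */
static int others_may_mount_racy(struct mount_state *ms, int interleaved_id)
{
    int id = ms->event_id;                  /* read too early */

    if (interleaved_id)
        start_callback(ms, interleaved_id); /* callback wins the race */

    pthread_mutex_lock(&ms->umount_lock);
    /* ... mount bookkeeping would happen here ... */
    pthread_mutex_unlock(&ms->umount_lock);
    return id;                              /* stale id */
}

/*
 * Proposed ordering: event_id is read only after the lock is held, so a
 * start callback that ran earlier is always observed.
 */
static int others_may_mount_fixed(struct mount_state *ms, int interleaved_id)
{
    int id;

    if (interleaved_id)
        start_callback(ms, interleaved_id); /* same interleaving as above */

    pthread_mutex_lock(&ms->umount_lock);
    id = ms->event_id;                      /* read under the lock */
    /* ... mount bookkeeping would happen here ... */
    pthread_mutex_unlock(&ms->umount_lock);
    return id;                              /* current id */
}
```

Starting from event_id 1322 with a callback carrying 1324 in the window, the racy variant returns the stale 1322 seen in the debug output, while the fixed variant returns 1324.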
This has not been reproducible on SMP machines so far.
Created attachment 227681: possible patch

This patch seems to fix the problem.
Fix checked into the RHEL4 branch:

    Checking in mount.c;
    /cvs/cluster/cluster/gfs-kernel/src/dlm/Attic/mount.c,v  <--  mount.c
    new revision: 1.11.2.4; previous revision: 1.11.2.3
This seemed to be missing flags.
This was fixed way back in January of 2008; it's already in 4.7.