Red Hat Bugzilla – Bug 324881
first mount can hang during parallel mounts
Last modified: 2010-01-11 22:19:02 EST
Description of problem:
This was found while trying to reproduce bz 299061. Running the test in
comment 17 of that bz seems to reproduce this bug quite easily, at least
on my xen cluster, which is not smp. This bug may be harder to hit on
smp machines.
It's easy to tell if you've hit this bug, because a message like this will
always appear in /var/log/messages:
SM: 02000378 ignoring service callback id=2000144 event=1324
If you look at /proc/cluster/lock_dlm/debug on this node at this point,
you'll see something like this at the end, which shows what the problem is:
others_may_mount start_done 1322 b
The event_id that others_may_mount uses when calling kcl_start_done()
is incorrect; it's using 1322 when it should be 1324.
I believe the fix is for others_may_mount() to read the event_id
after taking the umount_lock semaphore which serializes
others_may_mount() with a start callback from the lock_dlm thread.
In this case, I believe the start callback is changing the event_id
after others_may_mount() reads it, and before others_may_mount() gets
the umount_lock semaphore.
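
The ordering problem described above can be sketched in user space as
follows. This is only an illustration, not the attached patch: struct
mount_state, the race_hook parameter and start_1324() are hypothetical
names, and a pthread mutex stands in for the umount_lock semaphore. The
hook deterministically replays a start callback landing in the window
between the read of event_id and the lock acquisition.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-mount lock_dlm state. */
struct mount_state {
    pthread_mutex_t umount_lock; /* stands in for the umount_lock semaphore */
    int event_id;                /* updated by each start callback */
};

/* Models the start callback from the lock_dlm thread. */
void start_callback(struct mount_state *ms, int new_event_id)
{
    pthread_mutex_lock(&ms->umount_lock);
    ms->event_id = new_event_id;
    pthread_mutex_unlock(&ms->umount_lock);
}

/* Hypothetical hook: a start callback for event 1324 arriving in the window. */
void start_1324(struct mount_state *ms)
{
    start_callback(ms, 1324);
}

typedef void (*race_hook)(struct mount_state *);

/* Racy ordering: returns the id that would be handed to kcl_start_done(). */
int others_may_mount_racy(struct mount_state *ms, race_hook hook)
{
    int id = ms->event_id;      /* read before taking the lock: can go stale */

    if (hook != NULL)
        hook(ms);               /* start callback slips in here */
    pthread_mutex_lock(&ms->umount_lock);
    /* ... mount bookkeeping would happen here ... */
    pthread_mutex_unlock(&ms->umount_lock);
    return id;
}

/* Proposed ordering: read event_id only once the lock is held. */
int others_may_mount_fixed(struct mount_state *ms, race_hook hook)
{
    int id;

    if (hook != NULL)
        hook(ms);               /* same interleaving as above */
    pthread_mutex_lock(&ms->umount_lock);
    id = ms->event_id;          /* serialized against start_callback() */
    pthread_mutex_unlock(&ms->umount_lock);
    return id;
}
```

Starting from event_id 1322 and passing start_1324 as the hook, the racy
ordering reports the stale 1322 while the fixed ordering reports 1324,
which matches the mismatch seen in the debug output.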
Version-Release number of selected component (if applicable):
Steps to Reproduce:
This has not been reproducible on smp machines so far.
Created attachment 227681
This patch seems to fix the problem.
fix checked into RHEL4 branch
Checking in mount.c;
/cvs/cluster/cluster/gfs-kernel/src/dlm/Attic/mount.c,v <-- mount.c
new revision: 18.104.22.168; previous revision: 22.214.171.124
This seemed to be missing flags.
This was fixed way back in January of 2008; it's already in 4.7.