324881 – first mount can hang during parallel mounts

Summary: first mount can hang during parallel mounts

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	GFS-kernel
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	low
Severity:	low
Target Milestone:	---
Assignee:	David Teigland
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-10-09 13:47 UTC by David Teigland
Modified:	2010-01-12 03:19 UTC
CC List:	5 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2009-03-12 19:54:47 UTC
Embargoed:

Attachments
possible patch (600 bytes, text/plain) 2007-10-15 13:39 UTC, David Teigland	no flags

Description David Teigland 2007-10-09 13:47:17 UTC

Description of problem:

This was found while trying to reproduce bz 299061. Running the test in
comment 17 of that bz seems to reproduce this bug quite easily, at least
in my Xen cluster, which is not SMP; this bug may be harder to hit on
SMP machines.

It's easy to tell if you've hit this bug, because a message like this will
always appear in /var/log/messages:

SM: 02000378 ignoring service callback id=2000144 event=1324

If you look at /proc/cluster/lock_dlm/debug on this node at this point,
you'll see something like this at the end, which shows what the problem
is:

others_may_mount start_done 1322 b

The event_id that others_may_mount uses when calling kcl_start_done()
is incorrect; it's using 1322 when it should be 1324.

I believe the fix is for others_may_mount() to read the event_id
after taking the umount_lock semaphore, which serializes
others_may_mount() with a start callback from the lock_dlm thread.
In this case, I believe the start callback changes the event_id
after others_may_mount() reads it but before others_may_mount()
acquires the umount_lock semaphore.
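
The race described above can be modeled in a few lines of userspace C. This is a sketch under stated assumptions, not the actual mount.c code or the attached patch: struct lockspace, start_callback(), others_may_mount_buggy() and others_may_mount_fixed() are hypothetical names, a pthread mutex stands in for the umount_lock semaphore, and start_done() only models a stale event id being ignored instead of calling kcl_start_done(). The race window is forced deterministically by invoking the callback at the point where it would otherwise interleave.

```c
#include <pthread.h>
#include <stdio.h>

/* Hypothetical stand-in for the lock_dlm per-filesystem state.
   Field names mirror the bug text; the layout is an assumption. */
struct lockspace {
    pthread_mutex_t umount_lock; /* models the umount_lock semaphore */
    unsigned event_id;           /* event recorded by the last start callback */
};

/* Hypothetical model: a start_done for a stale event is ignored,
   which would leave the first mount waiting. Returns 1 if accepted. */
int start_done(struct lockspace *ls, unsigned id)
{
    if (id != ls->event_id) {
        printf("model: ignoring start_done %u, current event %u\n",
               id, ls->event_id);
        return 0;
    }
    return 1;
}

/* Models the lock_dlm thread recording a new event under umount_lock. */
void start_callback(struct lockspace *ls, unsigned new_event)
{
    pthread_mutex_lock(&ls->umount_lock);
    ls->event_id = new_event;
    pthread_mutex_unlock(&ls->umount_lock);
}

/* Buggy order: event_id is read before umount_lock is taken, so a
   start callback running in between leaves the copy stale. */
int others_may_mount_buggy(struct lockspace *ls, unsigned racing_event)
{
    unsigned id = ls->event_id;
    start_callback(ls, racing_event); /* race window, forced here */
    pthread_mutex_lock(&ls->umount_lock);
    int ok = start_done(ls, id);
    pthread_mutex_unlock(&ls->umount_lock);
    return ok;
}

/* Fixed order: event_id is read only while umount_lock is held, so it
   matches the event most recently recorded by the callback. */
int others_may_mount_fixed(struct lockspace *ls, unsigned racing_event)
{
    start_callback(ls, racing_event); /* same callback, same moment */
    pthread_mutex_lock(&ls->umount_lock);
    unsigned id = ls->event_id;
    int ok = start_done(ls, id);
    pthread_mutex_unlock(&ls->umount_lock);
    return ok;
}
```

Starting from event_id 1322 with a racing callback that records 1324, the buggy ordering passes 1322 to start_done() and is rejected, while the fixed ordering reads 1324 under the lock and is accepted, matching the numbers seen in the debug output.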

Comment 1 David Teigland 2007-10-10 14:08:40 UTC

This has not been reproducible on SMP machines so far.

Comment 2 David Teigland 2007-10-15 13:39:07 UTC

Created attachment 227681
possible patch

This patch seems to fix the problem.

Comment 3 David Teigland 2008-01-14 15:36:03 UTC

fix checked into RHEL4 branch

Checking in mount.c;
/cvs/cluster/cluster/gfs-kernel/src/dlm/Attic/mount.c,v  <--  mount.c
new revision: 1.11.2.4; previous revision: 1.11.2.3

Comment 4 Steve Whitehouse 2009-01-20 15:19:41 UTC

This seemed to be missing flags.

Comment 5 Chris Feist 2009-03-12 19:54:47 UTC

This was fixed back in January 2008; it's already in 4.7.