525739 – GFS2 oops when looping on mount/umount

Summary: GFS2 oops when looping on mount/umount

Keywords:
Status:	CLOSED DUPLICATE of bug 537010
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	GFS-kernel
Sub Component:
Version:	rawhide
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Steve Whitehouse
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2009-09-25 13:57 UTC by Fabio Massimo Di Nitto
Modified:	2009-11-13 17:19 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2009-11-13 17:19:47 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
OOPS (40.99 KB, text/plain) 2009-09-25 13:57 UTC, Fabio Massimo Di Nitto	no flags
script to loop mount/umount (430 bytes, text/plain) 2009-09-25 14:00 UTC, Fabio Massimo Di Nitto	no flags

Description Fabio Massimo Di Nitto 2009-09-25 13:57:53 UTC

Created attachment 362680
OOPS

Description of problem:

Use a loop to mount/umount GFS2 on Fedora rawhide.
After some time, umount fails with the attached oops.
the cluster will 


Version-Release number of selected component (if applicable):

Latest kernel in Fedora rawhide: 2.6.31-40.fc12.i686.PAE

How reproducible:

random/often

Comment 1 Fabio Massimo Di Nitto 2009-09-25 14:00:00 UTC

Created attachment 362681
script to loop mount/umount

Comment 2 Steve Whitehouse 2009-09-25 14:12:18 UTC

Looks like the glock has a broken ref count somehow. Probably the dlm reply is referencing deallocated memory in that case.

Comment 3 Fabio Massimo Di Nitto 2009-09-25 14:35:17 UTC

The cluster has 6 nodes, mixed (3 x x86 and 3 x x86_64).

4 nodes mount the same gfs2 partition (2 of them are writing to it and 2 are idle).

There is one x86 and one x86_64 mounter/umounter running the script. As you can see, the script adds random timeouts, so it could be a race in the refcount handling.
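
The attached script is not reproduced here. A minimal sketch of the kind of loop described (mount, random pause, umount, random pause) might look like the following. It is not the attached script: the device path, mount point, iteration count and delay bound are all hypothetical, and the `run`, `sleep` and `rng` parameters exist only so the loop can be exercised without touching a real filesystem.

```python
import random
import subprocess
import time


def mount_umount_loop(device, mountpoint, iterations, max_delay=5.0,
                      run=subprocess.check_call, sleep=time.sleep,
                      rng=random.random):
    """Repeatedly mount and unmount a GFS2 filesystem with random pauses.

    All argument values are illustrative assumptions, not taken from the
    attachment on this bug.
    """
    for _ in range(iterations):
        run(["mount", "-t", "gfs2", device, mountpoint])
        sleep(rng() * max_delay)   # random hold time while mounted
        run(["umount", mountpoint])
        sleep(rng() * max_delay)   # random gap before the next mount
```

On a test node this would be called as, for example, `mount_umount_loop("/dev/vg/gfs2lv", "/mnt/gfs2", iterations=100)`, where both paths are placeholders. Since `check_call` raises on a non-zero exit status, the loop stops at the first failed mount or umount.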

Comment 4 Steve Whitehouse 2009-09-25 14:41:06 UTC

Hmm, well, that's still odd. The glock in question is:

 G:  s:UN n:2/264793 f:I t:UN d:EX/0 a:0 r:0

and it's triggering a test which checks that the ref count of the glock is above 0 when the bast comes in. So the question is: why are we getting a bast for a lock which appears to already be unlocked? That makes no sense to me.

It is just possible that the glock is between the ref count hitting zero and being deallocated since the deallocation happens from the dlm ast callback, but even so I can't see why we should be getting any dlm callbacks at that stage.
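
For anyone reading the dump line above, a small parser can make the fields explicit. This is only a sketch: the field meanings (s = state, n = type/number, f = flags, t = target state, d = demote state/time, a = AIL count, r = ref count) are my reading of the GFS2 glock dump format and should be treated as assumptions, and `parse_glock` and `is_unlocked_and_unreferenced` are hypothetical helper names.

```python
import re

# Pattern for a "G:" line as shown in this bug; other kernel versions
# may print additional fields, which this sketch does not handle.
GLOCK_RE = re.compile(
    r"G:\s+s:(?P<state>\S+)\s+n:(?P<type>\d+)/(?P<number>[0-9a-fA-F]+)\s+"
    r"f:(?P<flags>\S*)\s+t:(?P<target>\S+)\s+"
    r"d:(?P<demote>\S+)/(?P<demote_time>\d+)\s+"
    r"a:(?P<ail>-?\d+)\s+r:(?P<ref>-?\d+)"
)


def parse_glock(line):
    """Split a glock dump line into a dict of named fields."""
    m = GLOCK_RE.search(line)
    if m is None:
        raise ValueError("not a glock dump line: %r" % line)
    fields = m.groupdict()
    # The lock number is kept as a string (it is believed to be hex).
    for key in ("type", "demote_time", "ail", "ref"):
        fields[key] = int(fields[key])
    return fields


def is_unlocked_and_unreferenced(glock):
    """True for the suspicious case discussed here: state UN and r:0."""
    return glock["state"] == "UN" and glock["ref"] == 0


glock = parse_glock(" G:  s:UN n:2/264793 f:I t:UN d:EX/0 a:0 r:0")
suspicious = is_unlocked_and_unreferenced(glock)
```

For the line quoted in this comment, `suspicious` is true: the glock is unlocked with a zero ref count, yet a demote to EX is recorded against it.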

Comment 5 Steve Whitehouse 2009-09-25 14:52:03 UTC

The second oops is probably down to trying to wake a kernel thread (astd) which has died (in the earlier oops).

Comment 6 Steve Whitehouse 2009-09-25 16:12:53 UTC

Also, since this is upstream and it has the tracepoints in it, that would be the best way to debug this issue. It should show clearly at what point in the lifetime of the glock the issue occurred.
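
As a rough sketch of how the tracepoints could be switched on, the snippet below writes `1` to the `enable` file of the gfs2 event group. It assumes debugfs is mounted at `/sys/kernel/debug` and that the running kernel exposes the events under `events/gfs2`; `enable_gfs2_tracepoints` is a hypothetical helper name, and it needs to run as root.

```python
import os


def enable_gfs2_tracepoints(tracing_dir="/sys/kernel/debug/tracing"):
    """Enable every event in the gfs2 group; return the trace_pipe path.

    The default tracing_dir is an assumption about where debugfs is mounted.
    """
    enable_path = os.path.join(tracing_dir, "events", "gfs2", "enable")
    with open(enable_path, "w") as f:
        f.write("1")
    # Reading this file streams events as they are recorded.
    return os.path.join(tracing_dir, "trace_pipe")
```

With tracing enabled before the mount/umount loop starts, the events recorded for glock 2/264793 just before the oops should show whether the bast arrived after the final put.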

Comment 7 Steve Whitehouse 2009-11-13 17:19:47 UTC


*** This bug has been marked as a duplicate of bug 537010 ***