Bug 525739 - GFS2 oops when looping on mount/umount
Summary: GFS2 oops when looping on mount/umount
Keywords:
Status: CLOSED DUPLICATE of bug 537010
Alias: None
Product: Fedora
Classification: Fedora
Component: GFS-kernel
Version: rawhide
Hardware: All
OS: Linux
medium
high
Target Milestone: ---
Assignee: Steve Whitehouse
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2009-09-25 13:57 UTC by Fabio Massimo Di Nitto
Modified: 2009-11-13 17:19 UTC

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2009-11-13 17:19:47 UTC
Type: ---
Embargoed:


Attachments
OOPS (40.99 KB, text/plain)
2009-09-25 13:57 UTC, Fabio Massimo Di Nitto
script to loop mount/umount (430 bytes, text/plain)
2009-09-25 14:00 UTC, Fabio Massimo Di Nitto

Description Fabio Massimo Di Nitto 2009-09-25 13:57:53 UTC
Created attachment 362680 [details]
OOPS

Description of problem:

Use a loop to mount/umount GFS2 on Fedora rawhide.
umount will fail with the attached oops after some time.
the cluster will 


Version-Release number of selected component (if applicable):

Latest kernel in Fedora rawhide: 2.6.31-40.fc12.i686.PAE

How reproducible:

random/often

Comment 1 Fabio Massimo Di Nitto 2009-09-25 14:00:00 UTC
Created attachment 362681 [details]
script to loop mount/umount

Comment 2 Steve Whitehouse 2009-09-25 14:12:18 UTC
Looks like the glock has a broken ref count somehow. Probably the dlm reply is referencing deallocated memory in that case.

Comment 3 Fabio Massimo Di Nitto 2009-09-25 14:35:17 UTC
The cluster has 6 nodes, mixed (3 x x86 and 3 x x86_64).

4 nodes mount the same gfs2 partition (2 of them are writing to it and 2 are idle).

There is one x86 and one x86_64 mounter/umounter running the script. As you can see, the script adds random timeouts, so it could be a race in the refcount handling.
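
For readers without access to the attachment, the kind of loop described here can be sketched roughly as follows. This is not the attached script, only a minimal Python illustration of "mount, wait a random time, umount, wait a random time". The function name, the `max_delay` parameter and the injectable `run`/`sleep` hooks are assumptions made for the example.

```python
import random
import subprocess
import time


def mount_umount_loop(device, mountpoint, iterations, max_delay=5.0,
                      run=subprocess.check_call, sleep=time.sleep):
    """Mount and unmount a GFS2 filesystem repeatedly with random pauses.

    `run` and `sleep` are injectable so the loop can be dry-run without
    touching a real cluster (hypothetical design choice for this sketch).
    """
    for _ in range(iterations):
        run(["mount", "-t", "gfs2", device, mountpoint])
        sleep(random.uniform(0, max_delay))  # random pause while mounted
        run(["umount", mountpoint])
        sleep(random.uniform(0, max_delay))  # random pause while unmounted
```

On a real node this would be called with the actual block device and mount point, as root, for example `mount_umount_loop(device, mountpoint, iterations=1000)`; `check_call` raises as soon as mount or umount fails, which stops the loop at the failing iteration.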

Comment 4 Steve Whitehouse 2009-09-25 14:41:06 UTC
Hmm, well that's still odd. The glock in question is:

 G:  s:UN n:2/264793 f:I t:UN d:EX/0 a:0 r:0

and it's triggering a test which checks that the ref count of the glock is above 0 when the bast comes in. So the question is: why are we getting a bast for a lock which appears to already be unlocked? That makes no sense to me.

It is just possible that the glock is between the ref count hitting zero and being deallocated since the deallocation happens from the dlm ast callback, but even so I can't see why we should be getting any dlm callbacks at that stage.
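
To make the glock line quoted above easier to read, here is a small Python sketch that splits it into named fields. The field meanings (s = state, n = type/number, f = flags, t = target state, d = demote state/time, a = AIL count, r = ref count) and the number being hexadecimal follow my understanding of the GFS2 glock dump format of that era and should be treated as assumptions. `parse_glock` and `is_unlocked_and_unreferenced` are hypothetical helper names.

```python
import re

# One "G:" line from a glock dump, e.g.
#   G:  s:UN n:2/264793 f:I t:UN d:EX/0 a:0 r:0
GLOCK_RE = re.compile(
    r"G:\s+s:(?P<state>\S+)\s+n:(?P<type>\d+)/(?P<number>[0-9a-fA-F]+)\s+"
    r"f:(?P<flags>\S*)\s+t:(?P<target>\S+)\s+"
    r"d:(?P<demote>\S+)/(?P<demote_time>\d+)\s+"
    r"a:(?P<ail>\d+)\s+r:(?P<ref>-?\d+)"
)


def parse_glock(line):
    """Return the fields of a glock dump line as a dict."""
    m = GLOCK_RE.search(line)
    if m is None:
        raise ValueError("not a glock line: %r" % line)
    d = m.groupdict()
    return {
        "state": d["state"],
        "type": int(d["type"]),
        "number": int(d["number"], 16),  # assumed to be printed in hex
        "flags": d["flags"],
        "target": d["target"],
        "demote": d["demote"],
        "demote_time": int(d["demote_time"]),
        "ail": int(d["ail"]),
        "ref": int(d["ref"]),
    }


def is_unlocked_and_unreferenced(glock):
    """True for the suspicious case discussed here: state UN with ref count 0."""
    return glock["state"] == "UN" and glock["ref"] == 0
```

Applied to the line from this comment, the parser reports state `UN`, target `UN` and a ref count of 0, which is the combination that makes an incoming bast surprising.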

Comment 5 Steve Whitehouse 2009-09-25 14:52:03 UTC
The second oops is probably down to trying to wake a kernel thread (astd) which has died (in the earlier oops).

Comment 6 Steve Whitehouse 2009-09-25 16:12:53 UTC
Also, since this is upstream and it has the tracepoints in it, that would be the best way to debug this issue. It should show clearly at what point in the lifetime of the glock the issue occurred.
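
A rough sketch of how the tracepoints could be switched on and the output narrowed to one glock is shown below. It assumes debugfs is mounted at `/sys/kernel/debug` and that the kernel exposes a `gfs2` event group under `tracing/events`; both should be checked on the machine in question. The helper names are hypothetical, and the filter is a plain substring match because the exact trace line format is not assumed here.

```python
import os


def enable_gfs2_tracepoints(tracing_root="/sys/kernel/debug/tracing"):
    """Write "1" to the gfs2 event group's enable file and return its path.

    The default path is an assumption about where debugfs is mounted.
    """
    enable_file = os.path.join(tracing_root, "events", "gfs2", "enable")
    with open(enable_file, "w") as f:
        f.write("1")
    return enable_file


def filter_trace_lines(lines, needle):
    """Yield only the trace lines that mention `needle` (e.g. a glock name)."""
    for line in lines:
        if needle in line:
            yield line
```

After enabling the events (as root), the trace output can be read from the `trace_pipe` file in the same tracing directory while the mount/umount loop runs, and passed through `filter_trace_lines` to follow a single glock through its lifetime.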

Comment 7 Steve Whitehouse 2009-11-13 17:19:47 UTC

*** This bug has been marked as a duplicate of bug 537010 ***

