Created attachment 362680 [details]
OOPS
Description of problem:
Use a loop to mount/umount GFS2 on Fedora rawhide. umount will fail with the attached oops after some time the cluster will
Version-Release number of selected component (if applicable):
Latest kernel in Fedora rawhide, 2.6.31-40.fc12.i686.PAE
How reproducible:
Random/often
Created attachment 362681 [details] script to loop mount/umount
Looks like the glock has a broken ref count somehow. Probably the dlm reply is referencing deallocated memory in that case.
The cluster has 6 nodes, mixed (3 x x86 and 3 x x86_64). 4 nodes mount the same GFS2 partition (2 of them are writing to it and 2 are idle). There is one x86 and one x86_64 mounter/umounter running the script. As you can see, the script adds random timeouts, so it could be a race in the refcount handling.
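For anyone without access to the attachment, the kind of loop described here can be sketched roughly as below. This is not the attached script, only a minimal illustration of mounting and unmounting with a random delay in between. The function name, the `max_delay` parameter, and the injectable `run`/`sleep` hooks are all made up for the example, and the device and mount point must be supplied by the caller.

```python
import random
import subprocess
import time


def mount_umount_loop(device, mountpoint, iterations, max_delay=5.0,
                      run=subprocess.check_call, sleep=time.sleep):
    """Mount and unmount a GFS2 filesystem repeatedly with random pauses.

    `run` and `sleep` are injectable so the loop can be dry-run without
    touching a real cluster (hypothetical convenience, not from the report).
    """
    for _ in range(iterations):
        # Mount the shared filesystem on this node.
        run(["mount", "-t", "gfs2", device, mountpoint])
        # Random pause so mount/umount timing varies between nodes.
        sleep(random.uniform(0, max_delay))
        # Unmount again; this is the step reported to oops.
        run(["umount", mountpoint])
        sleep(random.uniform(0, max_delay))
```

Running this on two nodes at once against the same device would approximate the setup described above. Passing `run=print` gives a dry run that only prints the commands.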
Hmm, well that's still odd. The glock in question is:
G: s:UN n:2/264793 f:I t:UN d:EX/0 a:0 r:0
and it's triggering a test which checks that the ref count of the glock is above 0 when the bast comes in. So the question is why are we getting a bast for a lock which appears to already be unlocked? That makes no sense to me. It is just possible that the glock is between the ref count hitting zero and being deallocated, since the deallocation happens from the dlm ast callback, but even so I can't see why we should be getting any dlm callbacks at that stage.
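To make the fields of that glock line easier to pick apart when scanning many dumps, here is a small parsing sketch. The field meanings in the comments (s = state, n = type/number, f = flags, t = target state, d = demote state/time, a = AIL count, r = ref count) are my reading of the dump format and should be checked against the kernel source for the version in use. The function names are made up for the example.

```python
import re

# Pattern for a "G:" glock dump line in the layout quoted above.
GLOCK_RE = re.compile(
    r"G:\s+s:(?P<state>\S+)\s+n:(?P<type>\d+)/(?P<number>[0-9a-fA-F]+)\s+"
    r"f:(?P<flags>\S*)\s+t:(?P<target>\S+)\s+"
    r"d:(?P<demote>[^/\s]+)/(?P<demote_time>\d+)\s+"
    r"a:(?P<ail>-?\d+)\s+r:(?P<ref>-?\d+)"
)


def parse_glock_line(line):
    """Return the fields of a glock dump line as a dict."""
    m = GLOCK_RE.search(line)
    if m is None:
        raise ValueError("not a glock dump line: %r" % line)
    fields = m.groupdict()
    # The lock number is kept as a string so no number base is assumed.
    for key in ("type", "demote_time", "ail", "ref"):
        fields[key] = int(fields[key])
    return fields


def is_unlocked_and_unreferenced(fields):
    """True when the glock is in state UN and its ref count is zero."""
    return fields["state"] == "UN" and fields["ref"] == 0
```

Applied to the line above, `parse_glock_line` gives state `UN`, target `UN` and a ref count of 0, which is the combination that makes the incoming bast surprising.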
The second oops is probably down to trying to wake a kernel thread (astd) which has died (in the earlier oops).
Also, since this is upstream and it has the tracepoints in it, that would be the best way to debug this issue. It should show clearly at what point in the lifetime of the glock the issue occurred.
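As a rough sketch of how the tracepoints could be switched on while the mount/umount loop runs: the code below writes to the ftrace event interface in debugfs. It assumes debugfs is mounted at /sys/kernel/debug and that the kernel exposes a `gfs2` event group there. The helper names are made up for the example, and it needs to run as root.

```python
import os

# Assumption: debugfs is mounted here; adjust if it lives elsewhere.
DEFAULT_TRACING_ROOT = "/sys/kernel/debug/tracing"


def gfs2_event_enable_path(tracing_root=DEFAULT_TRACING_ROOT):
    """Path of the control file that toggles every gfs2 trace event."""
    return os.path.join(tracing_root, "events", "gfs2", "enable")


def set_gfs2_tracing(enabled, tracing_root=DEFAULT_TRACING_ROOT):
    """Write 1 or 0 to the gfs2 event group's enable file."""
    with open(gfs2_event_enable_path(tracing_root), "w") as f:
        f.write("1" if enabled else "0")


def follow_trace(tracing_root=DEFAULT_TRACING_ROOT):
    """Yield trace lines as they arrive; blocks while waiting for events."""
    with open(os.path.join(tracing_root, "trace_pipe")) as pipe:
        for line in pipe:
            yield line.rstrip("\n")
```

Calling `set_gfs2_tracing(True)` before starting the loop and saving the output of `follow_trace()` to a file should capture the glock state changes leading up to the oops, provided the trace is collected somewhere that survives the crash.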
*** This bug has been marked as a duplicate of bug 537010 ***