Red Hat Bugzilla – Bug 161146
intermittent OOPS in DLM kernel module inside add_to_astqueue
Last modified: 2009-04-24 10:43:30 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.7) Gecko/20050417 Fedora/1.7.7-1.3.1
Description of problem:
I got the following oops messages on my cluster nodes, each at a different
time. The first was on node A: I was running a clustat and did a ctrl-4 to
kill it (it was taking a long while to run and seemed to be blocked by
something); after doing that, OOPS #1 showed up. The second oops showed up
on node B while the cluster was running. I wasn't actually doing anything
other than watching some data flow by in a tcpdump; I went away for about
10 minutes, and when I came back node B had locked up and had been fenced by A.
These events were separated by about a week, and in between I had updated
everything to RHEL4 U1 and recompiled the cluster code, checked out from
the RHEL4 branch, for the new kernel.
Yes, these nodes both have VMware loaded. I can move the virtual machines
off to another host, disable VMware, and try to replicate the problem if
you think VMware might be causing it. (It may take a week or so, since this
problem seems to be intermittent.)
Two nodes in the cluster, shared ext3 partitions, a few services (apache,
postgresql, a VMware virtual machine). All nodes run Red Hat Enterprise
Linux 4 on identical HP DL380 G4 dual-Xeon boxes with hyperthreading
enabled. A Memtest86 run on node B soon after the oops completed two
successful passes.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Has happened twice on two nodes in the same cluster, one week apart. No detailed steps to reproduce as of now.
Created attachment 115714 [details]
OOPS Messages, Number #1 and #2
Forgot to mention, this was originally posted to linux-cluster. Added to
bugzilla at the request of Patrick Caulfield.
I'm having real trouble locating this bug, and certainly can't reproduce it
(though that's hardly surprising given what you said).
What applications do you have that are using the DLM? Is it just clvmd, or
are there others? "cat /proc/cluster/services" should help. If other
applications are using the DLM, do you know what they are doing?
Kernel-level applications (e.g. GFS) can be discounted here, as the oops is
in the kernel->userland interface code.
Heh, kind of funny you post this today, because it happened again yesterday
on one of my nodes (while I wasn't doing anything), the first time since I
posted this.
Is there any debug code you'd like me to add to the dlm kernel module, so when
it does happen again we have some information to go on?
I'm running fenced, clvm, rgmanager.
Service Name GID LID State Code
Fence Domain: "default" 1 2 run -
DLM Lock Space: "clvmd" 2 3 run -
DLM Lock Space: "Magma" 7 5 run -
User: "usrm::manager" 6 4 run -
Created attachment 116707 [details]
Debug patch against dlm-kernel/src branch STABLE
This patch might be a little heavy-handed, but it doesn't look like you're
doing very much userspace locking, so it should be OK. If you get another
oops with this, can you post it again? I've poisoned the kfree'd blocks to
show which one is being freed too early (I think).
The patch has been installed for over a week now on both nodes, still with
no OOPS. I'm going to keep running on these nodes until it happens.
Created attachment 117328 [details]
The debug-patch oops message, and some preceding lock debugging
It finally oopsed; attached is a small piece of my 300 MB messages file.
Created attachment 117338 [details]
Thanks, that's really helpful. This patch should get rid of the oops. I've
left the debugging in for now (if that's OK) because I'm still not sure
exactly how this is happening; the patch is effectively a workaround.
It applies over the existing device.c you have.
Do you want me to wait for another oops, or just post some aggregate info
after some time? Also, is there any other debug code you'd like to add to
help track down the culprit?
I've installed the updated module on one of the nodes, and moved a few resources
over to that node. I'll install it on the other node once I can find a few
minutes to fail the rest of the resources over.
Sorry, that was a little unclear (I'm just back from holiday!).
I hope you won't see the oops again - I suppose a different one might trigger if
we're really unlucky.
So, if you do get an oops then please send me the tail end of the log as you
kindly did last time.
If there is no oops after a reasonable period of time, could you also gzip
the log file (filtering out any non-DLMDEBUG messages) and make it
available to me somewhere? I suspect it might be a bit large for bugzilla!
A tidier version of the last patch has been committed to CVS STABLE & RHEL4.