Red Hat Bugzilla – Bug 161146
intermittent OOPS in DLM kernel module inside add_to_astqueue
Last modified: 2009-04-24 10:43:30 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.7) Gecko/20050417 Fedora/1.7.7-1.3.1
Description of problem:
I got the following oops messages on my cluster nodes, each at a different
time. The first was on node A: I was running a clustat and did a ctrl-4 to
kill it (it was taking a long while to run and seemed to be blocked by
something); after doing that, OOPS #1 showed up. The second oops showed up
on node B while the cluster was running. I wasn't actually doing anything
other than watching some data flow by in a tcpdump; I went away for about
10 minutes, and when I came back node B had locked up and had been fenced by A.
These events were separated by about a week, and in between I had updated
everything to RHEL4 U1 and recompiled the cluster code, checked out from
the RHEL4 branch, for the new kernel.
Yes, these nodes both have VMware loaded. I can move the virtual machines
off to another host, disable VMware, and try to replicate the problem if
you think VMware might be causing it. (It may take a week or so, since this
problem seems to be intermittent.)
Two nodes in the cluster, shared ext3 partitions, a few services (apache,
postgresql, a VMware virtual machine). All nodes run Red Hat Enterprise
Linux 4 on identical HP DL380 G4 dual-Xeon boxes with hyperthreading
enabled. A Memtest86 run on node B soon after the oops completed two
successful passes.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Has happened twice on two nodes in the same cluster, one week apart. No detailed steps to reproduce as of now.
Created attachment 115714 [details]
OOPS Messages, Number #1 and #2
Forgot to mention, this was originally posted to linux-cluster. Added to
bugzilla at the request of Patrick Caulfield.
I'm having real trouble locating this bug, and certainly can't reproduce it
(though that's hardly surprising given what you said).
What applications do you have that are using the DLM? Is it just clvmd, or
are there others? "cat /proc/cluster/services" should help. If other
applications are using the DLM, do you know what they are doing?
Kernel-level applications (e.g. GFS) can be discounted here, as the oops is
in the kernel->userland interface code.
Heh, kind of funny you post this today, because it happened again yesterday
on one of my nodes (while I wasn't doing anything), the first time since I
posted this.
Is there any debug code you'd like me to add to the dlm kernel module, so when
it does happen again we have some information to go on?
I'm running fenced, clvm, rgmanager.
Service Name GID LID State Code
Fence Domain: "default" 1 2 run -
DLM Lock Space: "clvmd" 2 3 run -
DLM Lock Space: "Magma" 7 5 run -
User: "usrm::manager" 6 4 run -
Created attachment 116707 [details]
Debug patch against dlm-kernel/src branch STABLE
This patch might be a little heavy-handed, but it doesn't look like you're
doing very much userspace locking, so it should be OK. If you get another
oops with this, can you post it again? I've poisoned the kfree'd blocks to
show which one is being freed too early (I think).
The patch has been installed for over a week now on both nodes, still with
no OOPS. I'm going to keep running on these nodes until it happens.
Created attachment 117328 [details]
The debug-patch oops message, and some preceding lock debugging
It finally oopsed; attached is a small piece of my 300 MB messages file.
Created attachment 117338 [details]
Thanks, that's really helpful. This patch should get rid of the oops. I've
left the debugging in for now (if that's OK) because I'm still not sure
exactly how this is happening; the patch is effectively a workaround.
It applies over the existing device.c you have.
Do you want me to wait for another oops, or just post some aggregate info
after some time? Also, is there any other debug code you'd like to add to
help track down the culprit?
I've installed the updated module on one of the nodes, and moved a few resources
over to that node. I'll install it on the other node once I can find a few
minutes to fail the rest of the resources over.
Sorry, that was a little unclear (I'm just back from holiday!).
I hope you won't see the oops again - I suppose a different one might trigger if
we're really unlucky.
So, if you do get an oops then please send me the tail end of the log as you
kindly did last time.
If there is no oops after a reasonable period of time, could you also gzip
the log file (filtering out any non-DLMDEBUG messages) and make it
available to me somewhere? I suspect it might be a bit large for bugzilla!
A tidier version of the last patch has been committed to CVS STABLE & RHEL4.