Bug 161146
Summary: intermittent OOPS in DLM kernel module inside add_to_astqueue

Product: [Retired] Red Hat Cluster Suite
Component: dlm
Version: 4
Hardware: i386
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: medium
Reporter: Eric Kerin <eric>
Assignee: Christine Caulfield <ccaulfie>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint, poelstra, tao
Doc Type: Bug Fix
Last Closed: 2009-04-24 14:43:30 UTC
Description
Eric Kerin, 2005-06-20 20:12:39 UTC

Created attachment 115714 [details]
OOPS messages #1 and #2

Forgot to mention: this was originally posted to linux-cluster, and added to bugzilla at the request of Patrick Caulfield.
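For context on where the crash lives: add_to_astqueue is part of the DLM's kernel-to-userland interface (device.c), which hands completed lock requests back to userspace. A rough sketch of that path, with invented names rather than the actual dlm-kernel source, makes the use-after-free discussed below easier to picture:

```c
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

/* Hypothetical paraphrase, not the actual dlm-kernel source: when a
 * userspace lock request completes, its tracking block is linked onto a
 * per-process AST queue and the reader blocked in userspace is woken to
 * collect it. */
struct li_sketch {
	struct list_head ast_list;	/* link on the per-file AST queue */
};

static void add_to_astqueue_sketch(struct li_sketch *li,
				   struct list_head *astqueue,
				   spinlock_t *ast_lock,
				   wait_queue_head_t *waitq)
{
	spin_lock(ast_lock);
	/* If li was already kfree'd, this list insertion dereferences
	 * freed memory, which is the kind of oops reported here. */
	list_add_tail(&li->ast_list, astqueue);
	spin_unlock(ast_lock);
	wake_up_interruptible(waitq);	/* wake the userspace reader */
}
```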
Patrick Caulfield:

I'm having real trouble locating this bug, and I certainly can't reproduce it (though that's hardly surprising, given what you said). What applications do you have that are using the DLM? Is it just clvmd, or are there others? "cat /proc/cluster/services" should help. If there are other applications using the DLM, do you know what they are doing? Kernel-level applications (e.g. GFS) can be discounted, as the oops is in the kernel->userland interface code.

Eric Kerin:

Heh, kind of funny you post this today, because it happened again yesterday on one of my nodes (while I wasn't doing anything), the first time since I posted this. Is there any debug code you'd like me to add to the dlm kernel module, so that when it does happen again we have some information to go on? I'm running fenced, clvmd, and rgmanager.

/proc/cluster/services:

    Service          Name              GID  LID  State  Code
    Fence Domain:    "default"           1    2  run    -
                     [2 1]
    DLM Lock Space:  "clvmd"             2    3  run    -
                     [2 1]
    DLM Lock Space:  "Magma"             7    5  run    -
                     [2 1]
    User:            "usrm::manager"     6    4  run    -
                     [2 1]

Patrick Caulfield:

Created attachment 116707 [details]
Debug patch against dlm-kernel/src branch STABLE
This patch might be a little heavy-handed, but it doesn't look like you're
doing very much userspace locking, so it should be OK. If you get another oops
with this, can you post it again? I've poisoned the kfree'd blocks to
show which one is being freed too early (I think).
Thanks.
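Concretely, the poisoning idea looks something like the following minimal sketch (invented names; the real change is in attachment 116707): each free site stamps the block with its own byte pattern, so the next oops identifies which kfree happened too early.

```c
#include <linux/slab.h>
#include <linux/string.h>

/* Illustration only, not the attached patch: overwrite a block with a
 * pattern unique to the free site before kfree'ing it.  If a later oops
 * shows that pattern in the faulting pointer or registers, we know
 * which call site freed the block prematurely. */
static void poison_and_free(void *block, size_t size, unsigned char site)
{
	memset(block, site, size);	/* e.g. 0xA1, 0xA2, ... per site */
	kfree(block);
}
```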
Eric Kerin:

The patch has been installed for over a week now on both nodes, and still no OOPS. I'm going to keep running on the nodes till it does.

Eric Kerin:

Created attachment 117328 [details]
The debug-patch oops message, and some preceding lock debugging
It finally oopsed. Attached is a small piece of my 300 MB messages file.
Patrick Caulfield:

Created attachment 117338 [details]
Another patch
Thanks, that's really helpful. This patch should get rid of the oops. I've left
in the debugging for now (if that's OK) because I'm still not sure exactly how
this is happening; the patch is effectively a workaround.
It applies over the existing device.c you have.
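The attached patch is the authoritative change. Purely as a hedged illustration of one common shape for this kind of workaround, assuming the oops comes from a block freed while an AST for it was still queued, a reference count can keep the block alive until its last user lets go:

```c
#include <linux/atomic.h>
#include <linux/list.h>
#include <linux/slab.h>

/* Hypothetical sketch, not the attached patch: give the tracking block
 * a reference count so that a queued AST holds a reference, and the
 * block can only be kfree'd after the last put. */
struct li_refcounted {
	atomic_t refcount;
	struct list_head ast_list;
};

static void li_get(struct li_refcounted *li)
{
	atomic_inc(&li->refcount);
}

static void li_put(struct li_refcounted *li)
{
	/* Both the lock path and AST delivery drop their references
	 * here; whoever drops the last one frees the block, so the
	 * AST-queueing code can never see freed memory. */
	if (atomic_dec_and_test(&li->refcount))
		kfree(li);
}
```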
Eric Kerin:

Do you want me to wait for another oops, or just post some aggregate info after some time? Also, is there any other debug code you want to add to help track down who the culprit is? I've installed the updated module on one of the nodes and moved a few resources over to that node. I'll install it on the other node once I can find a few minutes to fail the rest of the resources over.

Patrick Caulfield:

Sorry, that was a little unclear (I'm just back from holiday!). I hope you won't see the oops again, though I suppose a different one might trigger if we're really unlucky. So, if you do get an oops, please send me the tail end of the log as you kindly did last time. If there is no oops after some reasonable period of time, could you also gzip the log file (filtered to just the DLMDEBUG messages) and make it available to me somewhere? I suspect it might be a bit large for bugzilla! Thanks.

Patrick Caulfield:

A tidier version of the last patch has been committed to the CVS STABLE and RHEL4 branches.