Red Hat Bugzilla – Bug 161146
intermittent OOPS in DLM kernel module inside add_to_astqueue
Last modified: 2009-04-24 10:43:30 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.7) Gecko/20050417 Fedora/1.7.7-1.3.1
Description of problem:
I got the following oops messages on my cluster nodes, each at a different
time. The first was on node A: I was running a clustat and did a ctrl-4 to
kill it (it was taking a long while to run and seemed to be blocked by
something); after doing that, OOPS #1 showed up. The second oops showed up
on node B while the cluster was running. I wasn't actually doing anything
other than watching some data flow by in a tcpdump; I went away for about
10 minutes, and when I came back node B had locked up and had been fenced by A.
These events were separated by about a week, and in between I had updated
everything to RHEL4 U1 and recompiled the cluster code, checked out from
the RHEL4 branch, for the new kernel.
Yes, these nodes both have VMware loaded. I can move the virtual machines
off to another host, disable VMware, and try to replicate the problem if
you think VMware might be causing it. (It may take a week or so, since this
problem seems to be intermittent.)
Two nodes in the cluster, shared ext3 partitions, a few services (apache,
postgresql, a VMware virtual machine). All nodes run Red Hat Enterprise
Linux 4 on identical HP DL380 G4 dual-Xeon boxes with hyperthreading
enabled. A Memtest86 run on node B soon after the oops completed two
successful passes.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Has happened twice on two nodes in the same cluster, one week apart. No detailed steps to reproduce as of now.
Created attachment 115714 [details]
OOPS Messages, Number #1 and #2
Forgot to mention, this was originally posted to linux-cluster. Added to
bugzilla at the request of Patrick Caulfield.
I'm having real trouble locating this bug, and certainly can't reproduce it
(though that's hardly surprising given what you said).
What applications do you have that are using the DLM? Is it just clvmd, or
are there others? "cat /proc/cluster/services" should help. If other
applications are using the DLM, do you know what they are doing?
Kernel-level applications (e.g. GFS) can be discounted here, as the oops is
in the kernel->userland interface code.
Heh, kind of funny you post this today, because it happened again yesterday
on one of my nodes (while I wasn't doing anything), the first time since I
posted this.
Is there any debug code you'd like me to add to the dlm kernel module, so when
it does happen again we have some information to go on?
I'm running fenced, clvm, rgmanager.
Service Name GID LID State Code
Fence Domain: "default" 1 2 run -
DLM Lock Space: "clvmd" 2 3 run -
DLM Lock Space: "Magma" 7 5 run -
User: "usrm::manager" 6 4 run -
Created attachment 116707 [details]
Debug patch against dlm-kernel/src branch STABLE
This patch might be a little heavy-handed, but it doesn't look like you're
doing very much userspace locking, so it should be OK. If you get another
oops with this, can you post it again? I've poisoned the kfree'd blocks to
show which one is being freed too early (I think).
The patch has been installed for over a week now on both nodes, still with
no OOPS. I'm going to keep running on these nodes until it happens.
Created attachment 117328 [details]
The debug-patch oops message, and some preceding lock debugging
It finally oopsed; attached is a small piece of my 300 MB messages file.
Created attachment 117338 [details]
Thanks, that's really helpful. This patch should get rid of the oops. I've
left the debugging in for now (if that's OK) because I'm still not sure
exactly how this is happening; the patch is effectively a workaround.
It applies over the existing device.c you have.
Do you want me to wait for another oops, or just post some aggregate info
after some time? Also, is there any other debug code you'd like to add to
help track down the culprit?
I've installed the updated module on one of the nodes, and moved a few resources
over to that node. I'll install it on the other node once I can find a few
minutes to fail the rest of the resources over.
Sorry, that was a little unclear (I'm just back from holiday!).
I hope you won't see the oops again - I suppose a different one might trigger if
we're really unlucky.
So, if you do get an oops then please send me the tail end of the log as you
kindly did last time.
If there is no oops after a reasonable period of time, could you also gzip
the log file (filtering out any non-DLMDEBUG messages) and make it
available to me somewhere? I suspect it might be a bit large for bugzilla!
A tidier version of the last patch has been committed to CVS STABLE & RHEL4.