
Bug 176838

Summary:	extra dlm completion callback during d_rwrandirectlarge
Product:	[Retired] Red Hat Cluster Suite	Reporter:	Nate Straz <nstraz>
Component:	GFS-kernel	Assignee:	David Teigland <teigland>
Status:	CLOSED ERRATA	QA Contact:	Cluster QE <mspqa-list>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4	CC:	ccaulfie
Target Milestone:	---	Keywords:	Reopened
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	RHBA-2007-0998	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2007-11-21 21:14:07 UTC	Type:	---

Comment 6 David Teigland 2006-01-06 21:19:48 UTC

Below are what appear to be the relevant bits of debug info.
The test uses directio (CW locks) in combination with
modification to the file (EX locks).  We should try to
narrow this down to as simple a test as possible that still
creates the problem (or even triggers it more quickly or
easily, if possible).  We should probably start by extracting
the key bits of d_rwrandirectlarge into a simple standalone
program.

tank-01
Resource d299ac2c (parent 00000000). Name (len=24) "       2         47350f8"
Local Copy, Master is node 3
Granted Queue
0133012e CW Master:     00e003e5
Conversion Queue
Waiting Queue

tank-03
Resource d58e086c (parent 00000000). Name (len=24) "       2         47350f8"
Master Copy
LVB: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
Granted Queue
00e003e5 CW Remote:   1 0133012e
Conversion Queue
Waiting Queue

tank-05
Jan  3 17:34:55 tank-05 kernel: dlm: vedder: process_lockqueue_reply id 31f0254 state 0
Jan  3 17:34:55 tank-05 kernel: lock_dlm: extra completion 2,47350f8 2,5 id 31f0254 flags 0
(2,5 == CW->EX)
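
For anyone scanning logs for this message, the following is a minimal
sketch of how the "extra completion" line could be picked apart.  It is
not code from lock_dlm.  The field layout (lock name as type,number in
hex, then current,requested mode, then lock id and flags) is inferred
only from the log lines quoted in this report, and the struct and
function names are hypothetical.  The mode numbering (NL=0, CR=1, CW=2,
PR=3, PW=4, EX=5) follows the DLM lock mode constants, which is what
makes 2,5 read as CW->EX.

```c
#include <stdio.h>
#include <string.h>

/* DLM lock modes indexed by their numeric value (NL=0 ... EX=5). */
static const char *const mode_names[] = { "NL", "CR", "CW", "PR", "PW", "EX" };

/* Hypothetical record for one "extra completion" message. */
struct extra_completion {
    unsigned int lock_type;          /* first part of the lock name */
    unsigned long long lock_number;  /* second part of the lock name (hex) */
    unsigned int cur;                /* mode currently held */
    unsigned int req;                /* mode being requested */
    unsigned int lkid;               /* dlm lock id (hex) */
    unsigned int flags;              /* flags field (hex) */
};

/* Map a numeric mode to its name; unknown values become "??". */
static const char *mode_name(unsigned int mode)
{
    if (mode < sizeof(mode_names) / sizeof(mode_names[0]))
        return mode_names[mode];
    return "??";
}

/*
 * Parse one syslog line.  Returns 1 if the line contains an
 * "extra completion" message and all six fields were read, else 0.
 * The format string is an assumption based on the messages in this bug.
 */
static int parse_extra_completion(const char *line, struct extra_completion *out)
{
    const char *p = strstr(line, "lock_dlm: extra completion ");

    if (!p)
        return 0;
    return sscanf(p, "lock_dlm: extra completion %x,%llx %u,%u id %x flags %x",
                  &out->lock_type, &out->lock_number,
                  &out->cur, &out->req,
                  &out->lkid, &out->flags) == 6;
}
```

Feeding it the tank-05 line above and printing mode_name(cur) and
mode_name(req) would give the CW->EX conversion noted in this comment.
Lines that do not contain the message are rejected.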

Comment 7 David Teigland 2006-01-23 16:06:22 UTC

I've been trying to reproduce this with my own direct-io
test and haven't been able to.  Hope to try the original
test myself next.

Comment 8 Kiersten (Kerri) Anderson 2006-09-22 19:10:21 UTC

Devel ACK for 4.5 if reproducible.  If not reproducible, let's close this one
out until more information is provided.

Comment 9 David Teigland 2006-10-17 16:27:15 UTC

To fix this I will need a test I can run to reliably reproduce the
problem while adding progressively more refined debugging.

Comment 10 Nate Straz 2006-10-17 19:19:14 UTC

Dave,

I haven't hit this bug in quite a while.  d_rwrandirectlarge is a distributed
test case so it's not very easy to break it down to a simple script. I just made
some changes to our test suite to make it easier to run just this test case.  I
can now easily create new scenarios for dd_io (which is the mid-level script
that runs d_rwrandirectlarge) so we can pull out just that tag and run it in a loop.

Comment 11 David Teigland 2006-11-01 15:38:32 UTC

Ran this test for some days and didn't have a problem, closing it.

Comment 13 David Teigland 2007-05-30 18:26:16 UTC

It's not the backtrace that was interesting in this bug, it was this key
error message:

  lock_dlm: extra completion 2,8ebf69b 2,5 id bd03cf flags 0

If that error message didn't appear, then it's unrelated and we should reclose
this bug.

Comment 15 David Teigland 2007-05-30 18:57:39 UTC

It looks like this bug still exists.  Do we know what the customer was running
that might give us a clue as to how to reproduce this ourselves?

Comment 16 Bryn M. Reeves 2007-06-06 09:15:39 UTC

The problem is described as happening "randomly" and only occurs about once a
fortnight. I'll try to get more details of the workload & services that are
configured.

Comment 18 Bryn M. Reeves 2007-08-31 11:56:16 UTC

Does anyone have any suggestions on pre-emptive monitoring that could be done here?

After several months of inactivity and stable running, the affected cluster
panicked again with the same message, but there's no real additional data on
what was happening to trigger the problem.

Comment 19 David Teigland 2007-08-31 15:20:48 UTC

Using my questionable disassembly skills, I think the instruction that
causes the null pointer dereference in process_complete() is this one:

5565:       f3 ab                   rep stos %eax,%es:(%edi)

which I think corresponds to the memset of the lvbptr:

        if (lp->lksb.sb_flags & DLM_SBF_VALNOTVALID)
                memset(lp->lksb.sb_lvbptr, 0, DLM_LVB_LEN);

which may very well be null, although I'm not certain why we'd be
sent VALNOTVALID if we had no lvb.  The VALNOTVALID conditions have
always been a bit shaky, though.

I'll commit the following code change to the RHEL4 branch to check
for a null lvb before doing the memset.  I think there's a pretty
good chance this will fix the problem, but I'm not certain.  If they
see the new printk, it will verify the fix was correct.

RCS file: /cvs/cluster/cluster/gfs-kernel/src/dlm/Attic/thread.c,v
retrieving revision 1.16.2.5
diff -u -r1.16.2.5 thread.c
--- thread.c    8 Dec 2006 17:31:31 -0000       1.16.2.5
+++ thread.c    31 Aug 2007 15:20:30 -0000
@@ -116,8 +116,13 @@
                goto out;
        }
 
-       if (lp->lksb.sb_flags & DLM_SBF_VALNOTVALID)
-               memset(lp->lksb.sb_lvbptr, 0, DLM_LVB_LEN);
+       if (lp->lksb.sb_flags & DLM_SBF_VALNOTVALID) {
+               if (lp->lksb.sb_lvbptr)
+                       memset(lp->lksb.sb_lvbptr, 0, DLM_LVB_LEN);
+               else
+                       log_all("no lvb for VALNOTVALID lkid %x",
+                               lp->lksb.sb_lkid);
+       }
 
        if (lp->lksb.sb_flags & DLM_SBF_ALTMODE) {
                if (lp->req == DLM_LOCK_PR)
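
The guard in the patch above can be tried outside the kernel with a
small userspace model.  This is only a sketch under stated assumptions:
struct lksb_model and clear_invalid_lvb() are hypothetical stand-ins for
the real lksb and process_complete(), DLM_LVB_LEN is taken as 32 to
match the 32-byte LVB dump in Comment 6, the DLM_SBF_VALNOTVALID value
of 0x02 is assumed rather than copied from the RHEL4 headers, and
fprintf() stands in for log_all().

```c
#include <stdio.h>
#include <string.h>

#define DLM_LVB_LEN          32    /* assumed: matches the LVB dump in Comment 6 */
#define DLM_SBF_VALNOTVALID  0x02  /* assumed flag value, for illustration only */

/* Hypothetical cut-down model of the lock status block fields used here. */
struct lksb_model {
    unsigned int sb_flags;
    unsigned int sb_lkid;
    char *sb_lvbptr;   /* may be NULL when the lock has no LVB */
};

/*
 * Same shape as the patched check:
 *   returns  0 if VALNOTVALID is not set (nothing to do),
 *   returns  1 if the LVB was present and has been zeroed,
 *   returns -1 if VALNOTVALID is set but there is no LVB; the original
 *              code would have passed the NULL pointer to memset here.
 */
static int clear_invalid_lvb(struct lksb_model *lksb)
{
    if (!(lksb->sb_flags & DLM_SBF_VALNOTVALID))
        return 0;

    if (!lksb->sb_lvbptr) {
        fprintf(stderr, "no lvb for VALNOTVALID lkid %x\n", lksb->sb_lkid);
        return -1;
    }

    memset(lksb->sb_lvbptr, 0, DLM_LVB_LEN);
    return 1;
}
```

The -1 case is the one that matters for this bug: it replaces the
suspected null dereference with a log message, so seeing that message
on a patched system would support the diagnosis.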

Comment 23 errata-xmlrpc 2007-11-21 21:14:07 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0998.html