Bug 176838 - extra dlm completion callback during d_rwrandirectlarge
Status: CLOSED ERRATA
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: GFS-kernel
Version: 4
Hardware: All Linux
Priority: medium
Severity: medium
Assigned To: David Teigland
QA Contact: Cluster QE
Keywords: Reopened
Reported: 2006-01-03 10:59 EST by Nate Straz
Modified: 2010-10-21 23:50 EDT
Fixed In Version: RHBA-2007-0998
Doc Type: Bug Fix
Last Closed: 2007-11-21 16:14:07 EST


Comment 6 David Teigland 2006-01-06 16:19:48 EST
Below are what appear to be the relevant bits of debug info.
The test uses directio (CW locks) in combination with
modification to the file (EX locks).  We should try to
narrow this down to as simple a test as possible that still
creates the problem (or even triggers it more quickly or
easily if possible).  We should probably start by extracting
the key bits of d_rwrandirectlarge into a simple standalone
program.

tank-01
Resource d299ac2c (parent 00000000). Name (len=24) "       2         47350f8"
Local Copy, Master is node 3
Granted Queue
0133012e CW Master:     00e003e5
Conversion Queue
Waiting Queue

tank-03
Resource d58e086c (parent 00000000). Name (len=24) "       2         47350f8"
Master Copy
LVB: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
Granted Queue
00e003e5 CW Remote:   1 0133012e
Conversion Queue
Waiting Queue

tank-05
Jan  3 17:34:55 tank-05 kernel: dlm: vedder: process_lockqueue_reply id 31f0254 state 0
Jan  3 17:34:55 tank-05 kernel: lock_dlm: extra completion 2,47350f8 2,5 id 31f0254 flags 0
(2,5 == CW->EX)
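
For anyone reading the "extra completion" line above, here is a minimal
userspace sketch of how its fields can be decoded.  It assumes the first
pair is the lock name (type,number), which matches the resource name in
the dumps above, that the second pair is the current and requested DLM
mode, and that modes are numbered NL=0 through EX=5 as in dlm.h.  The
struct and function names are made up for illustration; this is not
lock_dlm code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* DLM lock modes by number (NL=0 .. EX=5), as in dlm.h. */
static const char *mode_name(int mode)
{
	static const char *names[] = { "NL", "CR", "CW", "PR", "PW", "EX" };

	if (mode < 0 || mode > 5)
		return "??";
	return names[mode];
}

/* Hypothetical holder for the fields of one "extra completion" line. */
struct extra_completion {
	unsigned int ln_type;          /* lock name: type */
	unsigned long long ln_number;  /* lock name: number */
	int cur;                       /* current mode */
	int req;                       /* requested mode */
	unsigned int lkid;             /* dlm lock id */
	unsigned long flags;
};

/* Returns 0 if the line contains a well-formed message, -1 otherwise. */
static int parse_extra_completion(const char *line,
				  struct extra_completion *ec)
{
	const char *p;

	assert(line != NULL && ec != NULL);

	p = strstr(line, "extra completion ");
	if (!p)
		return -1;

	if (sscanf(p, "extra completion %x,%llx %d,%d id %x flags %lx",
		   &ec->ln_type, &ec->ln_number, &ec->cur, &ec->req,
		   &ec->lkid, &ec->flags) != 6)
		return -1;
	return 0;
}
```

Feeding it the tank-05 line should give lock name 2,47350f8 and a
conversion from mode_name(2) to mode_name(5), i.e. CW to EX.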
Comment 7 David Teigland 2006-01-23 11:06:22 EST
I've been trying to reproduce this with my own direct-io
test and haven't been able to.  Hope to try the original
test myself next.
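
As a rough illustration of what a standalone direct-io test of this
kind might look like (this is not the test referred to above, and not
d_rwrandirectlarge), here is a minimal sketch.  It assumes a 4096-byte
alignment is acceptable to the filesystem, uses ftruncate() as a
stand-in for "modification to the file", and would have to be run
concurrently from several nodes against the same file on a shared GFS
mount to generate any cross-node lock traffic.  All names and sizes are
hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* needed for O_DIRECT */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define IO_BLOCK	4096	/* assumed direct-io alignment and size */
#define FILE_BLOCKS	256	/* hypothetical file size, in blocks */

/* Pick a block-aligned offset somewhere inside the file. */
static off_t random_aligned_offset(unsigned int *seed)
{
	return (off_t)(rand_r(seed) % FILE_BLOCKS) * IO_BLOCK;
}

/*
 * Mix direct-io writes and reads with periodic file size changes.
 * Returns 0 on success, -1 on the first failure.
 */
static int directio_stress(const char *path, int iterations,
			   unsigned int seed)
{
	void *buf;
	int fd, i, rc = -1;

	assert(path != NULL && iterations >= 0);

	/* O_DIRECT requires an aligned buffer. */
	if (posix_memalign(&buf, IO_BLOCK, IO_BLOCK) != 0)
		return -1;
	memset(buf, 0xA5, IO_BLOCK);

	fd = open(path, O_RDWR | O_CREAT | O_DIRECT, 0644);
	if (fd < 0)
		goto out_free;

	for (i = 0; i < iterations; i++) {
		off_t off = random_aligned_offset(&seed);

		/* direct write followed by a direct read of the same block */
		if (pwrite(fd, buf, IO_BLOCK, off) != IO_BLOCK)
			goto out_close;
		if (pread(fd, buf, IO_BLOCK, off) != IO_BLOCK)
			goto out_close;

		/* every 8th round, set the file size explicitly */
		if (i % 8 == 7 &&
		    ftruncate(fd, (off_t)FILE_BLOCKS * IO_BLOCK) != 0)
			goto out_close;
	}
	rc = 0;

out_close:
	close(fd);
out_free:
	free(buf);
	return rc;
}
```

A caller on each node would invoke something like
directio_stress(path, n, seed) with a different seed per node.  Note
that open() with O_DIRECT fails on filesystems that do not support it,
in which case the function returns -1 before doing any I/O.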
Comment 8 Kiersten (Kerri) Anderson 2006-09-22 15:10:21 EDT
Devel ACK for 4.5 if reproducible.  If not reproducible, let's close this one
out until more information is provided.
Comment 9 David Teigland 2006-10-17 12:27:15 EDT
To fix this I will need a test I can run to reliably reproduce the
problem while adding progressively more refined debugging.
Comment 10 Nate Straz 2006-10-17 15:19:14 EDT
Dave,

I haven't hit this bug in quite a while.  d_rwrandirectlarge is a distributed
test case, so it's not very easy to break it down to a simple script. I just made
some changes to our test suite to make it easier to run just this test case.  I
can now easily create new scenarios for dd_io (which is the mid-level script
that runs d_rwrandirectlarge) so we can pull out just that tag and run it in a loop.
Comment 11 David Teigland 2006-11-01 10:38:32 EST
Ran this test for some days and didn't have a problem, closing it.
Comment 13 David Teigland 2007-05-30 14:26:16 EDT
The backtrace wasn't the interesting part of this bug; it was this key
error message:

  lock_dlm: extra completion 2,8ebf69b 2,5 id bd03cf flags 0

If that error message didn't appear, then it's unrelated and we should reclose
this bug.
Comment 15 David Teigland 2007-05-30 14:57:39 EDT
It looks like this bug still exists.  Do we know what the customer was running
that might give us a clue as to how to reproduce this ourselves?
Comment 16 Bryn M. Reeves 2007-06-06 05:15:39 EDT
The problem is described as happening "randomly" and only occurs about once a
fortnight. I'll try to get more details of the workload & services that are
configured.
Comment 18 Bryn M. Reeves 2007-08-31 07:56:16 EDT
Does anyone have any suggestions on pre-emptive monitoring that could be done here?

After several months of inactivity and stable running, the affected cluster
panicked again with the same message, but there's no real additional data on
what was happening to trigger the problem.
Comment 19 David Teigland 2007-08-31 11:20:48 EDT
Using my questionable disassembly skills, I think the instruction that
causes the null pointer dereference in process_complete() is this one:

5565:       f3 ab                   rep stos %eax,%es:(%edi)

which I think corresponds to the memset of the lvbptr:

        if (lp->lksb.sb_flags & DLM_SBF_VALNOTVALID)
                memset(lp->lksb.sb_lvbptr, 0, DLM_LVB_LEN);

which may very well be null, although I'm not certain why we'd be
sent VALNOTVALID if we had no lvb.  The VALNOTVALID conditions have
always been a bit shaky, though.

I'll commit the following code change to the RHEL4 branch to check
for a null lvb before doing the memset.  I think there's a pretty
good chance this will fix the problem, but I'm not certain.  If they
see the new printk, it will verify the fix was correct.

RCS file: /cvs/cluster/cluster/gfs-kernel/src/dlm/Attic/thread.c,v
retrieving revision 1.16.2.5
diff -u -r1.16.2.5 thread.c
--- thread.c    8 Dec 2006 17:31:31 -0000       1.16.2.5
+++ thread.c    31 Aug 2007 15:20:30 -0000
@@ -116,8 +116,13 @@
                goto out;
        }
 
-       if (lp->lksb.sb_flags & DLM_SBF_VALNOTVALID)
-               memset(lp->lksb.sb_lvbptr, 0, DLM_LVB_LEN);
+       if (lp->lksb.sb_flags & DLM_SBF_VALNOTVALID) {
+               if (lp->lksb.sb_lvbptr)
+                       memset(lp->lksb.sb_lvbptr, 0, DLM_LVB_LEN);
+               else
+                       log_all("no lvb for VALNOTVALID lkid %x",
+                               lp->lksb.sb_lkid);
+       }
 
        if (lp->lksb.sb_flags & DLM_SBF_ALTMODE) {
                if (lp->req == DLM_LOCK_PR)
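
The guard added by the patch can be tried out in userspace with a small
sketch like the one below.  It is not the lock_dlm code: the struct is
a cut-down stand-in for the dlm lksb, fprintf() stands in for log_all(),
and the values DLM_SBF_VALNOTVALID = 0x02 and DLM_LVB_LEN = 32 are
assumptions (32 does match the 32-byte LVB in the tank-03 dump above).
The function name and return codes are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define DLM_SBF_VALNOTVALID	0x02	/* assumed value */
#define DLM_LVB_LEN		32	/* assumed value */

/* Cut-down stand-in for the dlm lock status block. */
struct lksb_sketch {
	int sb_status;
	unsigned int sb_lkid;
	char sb_flags;
	char *sb_lvbptr;
};

/*
 * Zero the lvb if the dlm flagged it as not valid, but only when
 * there is an lvb to zero.
 *
 * Returns  1 if the lvb was cleared,
 *          0 if VALNOTVALID was not set,
 *         -1 if VALNOTVALID was set but there is no lvb (logged).
 */
static int clear_invalid_lvb(struct lksb_sketch *lksb)
{
	assert(lksb != NULL);

	if (!(lksb->sb_flags & DLM_SBF_VALNOTVALID))
		return 0;

	if (lksb->sb_lvbptr) {
		memset(lksb->sb_lvbptr, 0, DLM_LVB_LEN);
		return 1;
	}

	/* without the null check this is where the memset would oops */
	fprintf(stderr, "no lvb for VALNOTVALID lkid %x\n", lksb->sb_lkid);
	return -1;
}
```

The -1 case corresponds to the new log message in the patch: seeing it
would indicate that a VALNOTVALID completion did arrive for a lock with
no lvb.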

Comment 23 errata-xmlrpc 2007-11-21 16:14:07 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0998.html
