Below are what appear to be the relevant bits of debug info. The test uses directio (CW locks) in combination with modification to the file (EX locks). We should try to narrow this down to as simple a test as possible that still creates the problem (or even triggers it more quickly or easily, if possible). We should probably start by extracting the key bits of d_rwrandirectlarge into a simple standalone program.

tank-01
Resource d299ac2c (parent 00000000). Name (len=24) " 2 47350f8"
Local Copy, Master is node 3
Granted Queue
0133012e CW Master: 00e003e5
Conversion Queue
Waiting Queue

tank-03
Resource d58e086c (parent 00000000). Name (len=24) " 2 47350f8"
Master Copy
LVB: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Granted Queue
00e003e5 CW Remote: 1 0133012e
Conversion Queue
Waiting Queue

tank-05
Jan 3 17:34:55 tank-05 kernel: dlm: vedder: process_lockqueue_reply id 31f0254 state 0
Jan 3 17:34:55 tank-05 kernel: lock_dlm: extra completion 2,47350f8 2,5 id 31f0254 flags 0
(2,5 == CW->EX)
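As a starting point for the standalone program suggested above, here is a rough C sketch of one "direct I/O plus modification" round. It is not taken from d_rwrandirectlarge. The function names, the 4096-byte block size, and the use of ftruncate() as the "modification" step are all assumptions made for illustration, and whether this pattern takes the same CW and EX locks as the real test is untested.

```c
#define _GNU_SOURCE          /* exposes O_DIRECT on Linux/glibc */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK 4096           /* assumed alignment; must suit the device */

/* Map an arbitrary number to a block-aligned offset inside the file. */
off_t aligned_offset(unsigned long r, off_t file_size)
{
    off_t nblocks = file_size / BLOCK;

    if (nblocks <= 0)
        return 0;
    return (off_t)(r % (unsigned long)nblocks) * BLOCK;
}

/*
 * One round (hypothetical helper): a direct-I/O write at a random
 * aligned offset, followed by a size change on the same file.
 * Returns 0 on success, -1 on any error.
 */
int directio_round(const char *path, off_t file_size)
{
    void *buf = NULL;
    int rc = -1;
    int fd = open(path, O_RDWR | O_CREAT | O_DIRECT, 0644);

    if (fd < 0)
        return -1;
    /* O_DIRECT needs an aligned buffer, length, and offset. */
    if (posix_memalign(&buf, BLOCK, BLOCK) != 0)
        goto out;
    memset(buf, 0xA5, BLOCK);

    if (ftruncate(fd, file_size) != 0)
        goto out;
    if (pwrite(fd, buf, BLOCK,
               aligned_offset((unsigned long)rand(), file_size)) != BLOCK)
        goto out;
    /* Assumed stand-in for "modification": shrink, then regrow. */
    if (ftruncate(fd, file_size / 2) != 0)
        goto out;
    if (ftruncate(fd, file_size) != 0)
        goto out;
    rc = 0;
out:
    free(buf);
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

The idea would be to call directio_round() in a loop from several nodes at once against the same file on the shared filesystem, then check the logs for the "extra completion" message.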
I've been trying to reproduce this with my own direct-io test and haven't been able to. Hope to try the original test myself next.
Devel ACK for 4.5 if reproducible. If not reproducible, let's close this one out until more information is provided.
To fix this I will need a test I can run to reliably reproduce the problem while adding progressively more refined debugging.
Dave, I haven't hit this bug in quite a while. d_rwrandirectlarge is a distributed test case so it's not very easy to break it down to a simple script. I just made some changes to our test suite to make it easier to run just this test case. I can now easily create new scenarios for dd_io (which is the mid-level script that runs d_rwrandirectlarge) so we can pull out just that tag and run it in a loop.
Ran this test for some days and didn't have a problem, closing it.
It's not the backtrace that was interesting in this bug; it was this key error message:

lock_dlm: extra completion 2,8ebf69b 2,5 id bd03cf flags 0

If that error message didn't appear, then it's unrelated and we should reclose this bug.
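To check collected logs for that message, a small C helper along these lines could pick the line out and split it into fields. This is only a sketch: the struct, the function names, and the scanf format are assumptions based on the two sample lines in this report, and the reading of the second pair as (current mode, requested mode) follows the "2,5 == CW->EX" note earlier in this bug.

```c
#include <stdio.h>
#include <string.h>

/* Fields of an "extra completion" line; names are illustrative. */
struct extra_completion {
    unsigned int lock_type;        /* first part of the lock name */
    unsigned long long lock_num;   /* second part, printed in hex */
    int cur_mode;                  /* assumed: current lock mode  */
    int req_mode;                  /* assumed: requested mode     */
    unsigned int id;               /* lock id, printed in hex     */
    unsigned int flags;
};

/* Returns 1 and fills *ec if the line carries the message, else 0. */
int parse_extra_completion(const char *line, struct extra_completion *ec)
{
    const char *p = strstr(line, "lock_dlm: extra completion ");

    if (p == NULL)
        return 0;
    return sscanf(p, "lock_dlm: extra completion %x,%llx %d,%d id %x flags %x",
                  &ec->lock_type, &ec->lock_num,
                  &ec->cur_mode, &ec->req_mode,
                  &ec->id, &ec->flags) == 6;
}

/* Standard DLM lock mode numbering: NL=0 ... EX=5. */
const char *dlm_mode_name(int mode)
{
    static const char *names[] = { "NL", "CR", "CW", "PR", "PW", "EX" };

    if (mode < 0 || mode > 5)
        return "??";
    return names[mode];
}
```

Fed one syslog line at a time, this would let a script flag any occurrence and record which lock and which mode conversion were involved.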
It looks like this bug still exists. Do we know what the customer was running that might give us a clue as to how to reproduce this ourselves?
The problem is described as happening "randomly" and only occurs about once a fortnight. I'll try to get more details of the workload & services that are configured.
Does anyone have any suggestions on pre-emptive monitoring that could be done here? After several months of inactivity and stable running, the affected cluster panicked again with the same message, but there's no real additional data on what was happening to trigger the problem.
Using my questionable disassembly skills, I think the instruction that causes the null pointer dereference in process_complete() is this one:

5565: f3 ab rep stos %eax,%es:(%edi)

which I think corresponds to the memset of the lvbptr:

if (lp->lksb.sb_flags & DLM_SBF_VALNOTVALID)
	memset(lp->lksb.sb_lvbptr, 0, DLM_LVB_LEN);

which may very well be null, although I'm not certain why we'd be sent VALNOTVALID if we had no lvb. The VALNOTVALID conditions have always been a bit shaky, though.

I'll commit the following code change to the RHEL4 branch to check for a null lvb before doing the memset. I think there's a pretty good chance this will fix the problem, but I'm not certain. If they see the new printk, it will verify the fix was correct.

RCS file: /cvs/cluster/cluster/gfs-kernel/src/dlm/Attic/thread.c,v
retrieving revision 1.16.2.5
diff -u -r1.16.2.5 thread.c
--- thread.c 8 Dec 2006 17:31:31 -0000 1.16.2.5
+++ thread.c 31 Aug 2007 15:20:30 -0000
@@ -116,8 +116,13 @@
 		goto out;
 	}
 
-	if (lp->lksb.sb_flags & DLM_SBF_VALNOTVALID)
-		memset(lp->lksb.sb_lvbptr, 0, DLM_LVB_LEN);
+	if (lp->lksb.sb_flags & DLM_SBF_VALNOTVALID) {
+		if (lp->lksb.sb_lvbptr)
+			memset(lp->lksb.sb_lvbptr, 0, DLM_LVB_LEN);
+		else
+			log_all("no lvb for VALNOTVALID lkid %x",
+				lp->lksb.sb_lkid);
+	}
 
 	if (lp->lksb.sb_flags & DLM_SBF_ALTMODE) {
 		if (lp->req == DLM_LOCK_PR)
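For anyone who wants to see the guarded path outside the kernel, the following userspace C model mirrors the logic of the patch. It is a sketch, not the lock_dlm code: struct lksb_model, clear_invalid_lvb(), and the two constant values are stand-ins defined here for illustration, and fprintf() takes the place of log_all().

```c
#include <stdio.h>
#include <string.h>

/* Stand-in values for this model; the real ones come from the dlm headers. */
#define DLM_LVB_LEN          32
#define DLM_SBF_VALNOTVALID  0x02

/* Minimal model of the status-block fields the patch touches. */
struct lksb_model {
    int sb_flags;
    unsigned int sb_lkid;
    char *sb_lvbptr;
};

/*
 * Returns  1 if the LVB was zeroed,
 *          0 if VALNOTVALID was not set (nothing to do),
 *         -1 if VALNOTVALID was set but there is no LVB to clear.
 */
int clear_invalid_lvb(struct lksb_model *sb)
{
    if (!(sb->sb_flags & DLM_SBF_VALNOTVALID))
        return 0;

    if (sb->sb_lvbptr) {
        memset(sb->sb_lvbptr, 0, DLM_LVB_LEN);
        return 1;
    }

    /* The unpatched code would have passed NULL to memset here. */
    fprintf(stderr, "no lvb for VALNOTVALID lkid %x\n", sb->sb_lkid);
    return -1;
}
```

The -1 case corresponds to the new log message in the patch: the flag is set but no LVB was attached, which is the situation the original memset did not guard against.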
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0998.html