Bug 176838
Summary: | extra dlm completion callback during d_rwrandirectlarge | ||
---|---|---|---|
Product: | [Retired] Red Hat Cluster Suite | Reporter: | Nate Straz <nstraz> |
Component: | GFS-kernel | Assignee: | David Teigland <teigland> |
Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 4 | CC: | ccaulfie |
Target Milestone: | --- | Keywords: | Reopened |
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | RHBA-2007-0998 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2007-11-21 21:14:07 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Comment 6
David Teigland
2006-01-06 21:19:48 UTC
I've been trying to reproduce this with my own direct-io test and haven't been able to. Hope to try the original test myself next.

Devel ACK for 4.5 if reproducible. If not reproducible, let's close this one out until more information is provided.

To fix this I will need a test I can run to reliably reproduce the problem while adding progressively more refined debugging.

Dave, I haven't hit this bug in quite a while. d_rwrandirectlarge is a distributed test case so it's not very easy to break it down to a simple script. I just made some changes to our test suite to make it easier to run just this test case. I can now easily create new scenarios for dd_io (which is the mid-level script that runs d_rwrandirectlarge) so we can pull out just that tag and run it in a loop.

Ran this test for some days and didn't have a problem, closing it.

It's not the backtrace that was interesting in this bug, it was this key error message:

    lock_dlm: extra completion 2,8ebf69b 2,5 id bd03cf flags 0

If that error message didn't appear, then it's unrelated and we should reclose this bug.

It looks like this bug still exists. Do we know what the customer was running that might give us a clue as to how to reproduce this ourselves?

The problem is described as happening "randomly" and only occurs about once a fortnight. I'll try to get more details of the workload & services that are configured. Does anyone have any suggestions on pre-emptive monitoring that could be done here?

After several months inactivity and stable running, the affected cluster panic'ed again with the same message but there's no real additional data on what was happening to trigger the problem.
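On the monitoring question raised above, one low-cost option is to watch the kernel log for the key message quoted in this report. The following is only a minimal sketch, not something proposed in the thread: the helper names `is_extra_completion` and `count_extra_completions` are hypothetical, and where the kernel log is stored on the affected cluster is an assumption left to the caller.

```c
#include <stdio.h>
#include <string.h>

/* The key message quoted in this report; the fields after it vary per lock. */
#define EXTRA_COMPLETION_MSG "lock_dlm: extra completion"

/* Hypothetical helper: returns 1 if a log line contains the key message. */
static int is_extra_completion(const char *line)
{
    return strstr(line, EXTRA_COMPLETION_MSG) != NULL;
}

/*
 * Hypothetical helper: counts matching lines in an already-open log stream.
 * Lines longer than the buffer are read in pieces, so a message that
 * straddles a buffer boundary would be missed.
 */
static int count_extra_completions(FILE *fp)
{
    char line[1024];
    int hits = 0;

    while (fgets(line, sizeof line, fp))
        if (is_extra_completion(line))
            hits++;
    return hits;
}
```

A caller would open whatever file receives kernel messages on the node, pass the stream to `count_extra_completions()`, and raise an alert on a non-zero count, so that the workload at the time can be captured before the cluster panics.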
Using my questionable disassembly skills, I think the instruction that causes the null pointer dereference in process_complete() is this one:

    5565: f3 ab    rep stos %eax,%es:(%edi)

which I think corresponds to the memset of the lvbptr:

    if (lp->lksb.sb_flags & DLM_SBF_VALNOTVALID)
            memset(lp->lksb.sb_lvbptr, 0, DLM_LVB_LEN);

which may very well be null, although I'm not certain why we'd be sent VALNOTVALID if we had no lvb. The VALNOTVALID conditions have always been a bit shaky, though.

I'll commit the following code change to the RHEL4 branch to check for a null lvb before doing the memset. I think there's a pretty good chance this will fix the problem, but I'm not certain. If they see the new printk, it will verify the fix was correct.

RCS file: /cvs/cluster/cluster/gfs-kernel/src/dlm/Attic/thread.c,v
retrieving revision 1.16.2.5
diff -u -r1.16.2.5 thread.c
--- thread.c	8 Dec 2006 17:31:31 -0000	1.16.2.5
+++ thread.c	31 Aug 2007 15:20:30 -0000
@@ -116,8 +116,13 @@
 		goto out;
 	}
 
-	if (lp->lksb.sb_flags & DLM_SBF_VALNOTVALID)
-		memset(lp->lksb.sb_lvbptr, 0, DLM_LVB_LEN);
+	if (lp->lksb.sb_flags & DLM_SBF_VALNOTVALID) {
+		if (lp->lksb.sb_lvbptr)
+			memset(lp->lksb.sb_lvbptr, 0, DLM_LVB_LEN);
+		else
+			log_all("no lvb for VALNOTVALID lkid %x",
+				lp->lksb.sb_lkid);
+	}
 
 	if (lp->lksb.sb_flags & DLM_SBF_ALTMODE) {
 		if (lp->req == DLM_LOCK_PR)

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0998.html
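The null-lvb check in the patch above can be exercised outside the kernel with a small stand-in. This is a sketch under stated assumptions rather than the driver code: `struct sketch_lksb`, `SKETCH_LVB_LEN`, `SKETCH_SBF_VALNOTVALID` and `sketch_clear_invalid_lvb()` are hypothetical names with placeholder values, and `fprintf` stands in for the `log_all()` call used in thread.c.

```c
#include <stdio.h>
#include <string.h>

/* Placeholder values for illustration only; the real ones come from the DLM headers. */
#define SKETCH_LVB_LEN         32
#define SKETCH_SBF_VALNOTVALID 0x02

/* Hypothetical stand-in for the lock status block fields the patch touches. */
struct sketch_lksb {
    int          sb_flags;
    unsigned int sb_lkid;
    char        *sb_lvbptr;
};

/*
 * Mirrors the patched logic: zero the LVB only when VALNOTVALID is set
 * and an LVB is actually attached.
 * Returns 1 if the LVB was cleared, 0 if the flag was not set,
 * and -1 if the flag was set but there was no LVB (the case logged by the patch).
 */
static int sketch_clear_invalid_lvb(struct sketch_lksb *lksb)
{
    if (!(lksb->sb_flags & SKETCH_SBF_VALNOTVALID))
        return 0;

    if (!lksb->sb_lvbptr) {
        /* Stand-in for log_all() in the real code. */
        fprintf(stderr, "no lvb for VALNOTVALID lkid %x\n", lksb->sb_lkid);
        return -1;
    }

    memset(lksb->sb_lvbptr, 0, SKETCH_LVB_LEN);
    return 1;
}
```

Before the patch, the -1 case would have reached the memset with a null destination, which matches the `rep stos` fault identified in the disassembly.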