Bug 384861
Summary: OOPS in dlm_astd
Product: [Retired] Red Hat Cluster Suite
Component: dlm-kernel
Version: 4
Status: CLOSED WORKSFORME
Severity: medium
Priority: medium
Hardware: All
OS: Linux
Reporter: Bryn M. Reeves <bmr>
Assignee: David Teigland <teigland>
QA Contact: Cluster QE <mspqa-list>
CC: ccaulfie, cluster-maint
Doc Type: Bug Fix
Last Closed: 2008-02-26 17:20:58 UTC

Description
Bryn M. Reeves
2007-11-15 15:48:04 UTC
Created attachment 259951 [details]
disassembly of dlm_astd
Disassembled dlm_astd with comments/annotations.

The first thing to consider is setting the following two config options under /proc/cluster/config/dlm/ to 0:

  deadlocktime
  lock_timeout

This will disable the sections of code that are likely in question here, which return EDEADLK and ETIMEDOUT, assuming that the program doesn't rely on these.

Created attachment 260371 [details]
revised disassembly of dlm_astd
I noticed a mistake in the previous attachment - the pointer manipulation
around dlm_astd+1308 (list_for_each_entry_safe() {/* ... */} setup) was all
wrong.
I think this version is correct - the faulting instruction seems to be the
attempt to load the next member from the saved lkb before continuing back
around the loop:
load safe->lkb_deadlockqueue->next into %r12
For some reason, the prev/next pointers in lkb_deadlockqueue for the last entry
in the _deadlockqueue both seem to be NULL.
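
To make the failure mode above easier to picture, here is a small user-space model of a "safe" walk over a circular doubly linked list. It is only a sketch: `ListHead`, `list_add_tail` and `walk_safe` are illustrative names written for this note, not the dlm-kernel code, and the model reports the broken link instead of faulting the way the kernel does.

```python
class ListHead:
    """Minimal stand-in for a kernel-style struct list_head (illustrative only)."""

    def __init__(self, name):
        self.name = name
        # An empty circular list points at itself in both directions.
        self.next = self
        self.prev = self


def list_add_tail(new, head):
    """Link `new` in just before `head`, i.e. at the tail of the list."""
    prev = head.prev
    new.prev = prev
    new.next = head
    prev.next = new
    head.prev = new


def walk_safe(head):
    """Walk the list, saving each entry's successor before visiting it.

    Returns (visited_names, completed). `completed` is False when the walk
    reaches a None link, which is the point where kernel code would
    dereference a bad pointer.
    """
    visited = []
    pos = head.next
    while pos is not head:
        if pos is None:
            return visited, False
        safe = pos.next  # saved before the loop body runs
        visited.append(pos.name)
        pos = safe
    return visited, True


# Build a healthy three-entry queue, then clear the last entry's links
# to mimic the state described above (next and prev both NULL).
queue = ListHead("head")
a, b, c = ListHead("a"), ListHead("b"), ListHead("c")
for entry in (a, b, c):
    list_add_tail(entry, queue)

healthy = walk_safe(queue)
c.next = c.prev = None
broken = walk_safe(queue)
```

In the healthy case the walk returns to the head and stops. Once the last entry's links are cleared, the saved successor is no longer the head, so the loop keeps going with an invalid pointer; the model flags that step rather than crashing.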

Need to check I have the correct crash invocation here, but this appears to be the content of the _deadlockqueue at the time of the crash:

crash> list _deadlockqueue
ffffffffa0211880
100f173acd8
100f173a110
100f173adc0 ***

crash> list -H _deadlockqueue dlm_lkb.lkb_deadlockq -s dlm_lkb
[...]
100f173ad00
struct dlm_lkb {
  lkb_flags = 262144,
  lkb_status = 2,
  lkb_rqmode = 5 '\005',
  lkb_grmode = 0 '\0',
  lkb_retstatus = 4294967285,
  lkb_id = 65971,
  lkb_lksb = 0x0,
  lkb_idtbl_list = {
    next = 0x100f20cb660,
    prev = 0x100f20cb660
  },
  lkb_statequeue = {
    next = 0x100f173aef8,
    prev = 0x100f1ede098
  },
  lkb_resource = 0x100f1ede048,
  lkb_parent = 0x0,
  lkb_childcnt = {
    counter = 0
  },
  lkb_lockqueue = {
    next = 0x0,
    prev = 0x0
  },
  lkb_lockqueue_state = 0,
  lkb_lockqueue_flags = 5,
  lkb_ownpid = 5469,
  lkb_lockqueue_time = 0,
  lkb_duetime = 0,
  lkb_remid = 66260,
  lkb_nodeid = 2,
  lkb_astaddr = 0x1,
  lkb_bastaddr = 0x0,
  lkb_astparam = 0,
  lkb_astqueue = {
    next = 0x0,
    prev = 0x0
  },
  lkb_astflags = 0,
  lkb_bastmode = 0 '\0',
  lkb_highbast = 0 '\0',
  lkb_request = 0x0,
  lkb_deadlockq = {
    next = 0x0,    | *** These should not be NULL!
    prev = 0x0     | ***
  },
  lkb_lvbseq = 0,
  lkb_lvbptr = 0x0,
  lkb_range = 0x0
}

Created attachment 260381 [details]
list -H _deadlockqueue dlm_lkb.lkb_deadlockq -s dlm_lkb
Complete dump of the deadlockqueue

Bryn's analysis confirms that we should set deadlocktime to 0, which should avoid this problem. lock_timeout should also be set to 0 to avoid any similar problems there. We might want to consider changing some code to set the NOTIMERS flag on the lockspace to avoid the deadlock queue altogether. Of course, the cause of the apparent list corruption would also be interesting to find.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

This was fixed by setting /proc/cluster/config/dlm/deadlocktime to 0.
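
For anyone applying the workaround from a script, a minimal sketch follows. The directory and the two option names come from this bug; the helper `disable_dlm_timers` is a hypothetical name, and the sketch assumes the entries accept a plain "0" written as text, that the caller has permission to write them, and that the values may need to be reapplied after a reboot since they live under /proc.

```python
import os

# Directory named in this bug report.
DLM_CONFIG_DIR = "/proc/cluster/config/dlm"


def disable_dlm_timers(config_dir=DLM_CONFIG_DIR,
                       names=("deadlocktime", "lock_timeout")):
    """Write 0 to each named option under config_dir.

    Returns the list of paths written. Raises OSError if an entry cannot
    be opened or written (for example, without sufficient privileges).
    """
    written = []
    for name in names:
        path = os.path.join(config_dir, name)
        with open(path, "w") as f:
            f.write("0\n")
        written.append(path)
    return written
```

Calling `disable_dlm_timers()` with no arguments targets both options, matching the earlier advice in this bug; pass `names=("deadlocktime",)` to change only the option named in the final comment.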