Description of problem:

This is related to bug 240453. In that bugzilla, a stray assertion caused a BUG() when a custom dlm application was used. Removing the assert fixed that problem, but the same application is now triggering a NULL pointer dereference in the astd code:

cpqci: module license 'Proprietary' taints kernel.
ACPI: PCI Interrupt 0000:01:02.2[B] -> GSI 17 (level, low) -> IRQ 185
CMAN: removing node cnode1 from the cluster : Missed too many heartbeats
CMAN: node cnode1 rejoining
CMAN: removing node cnode1 from the cluster : Missed too many heartbeats
CMAN: node cnode1 rejoining
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
<ffffffffa01f1ed3>{:dlm:dlm_astd+1482}
PML4 f2f75067 PGD f2f3d067 PMD 0
Oops: 0000 [1] SMP
CPU 0
Modules linked in: cpqci(U) netconsole netdump i2c_dev i2c_core dlm(U) cman(U) md5 ipv6 8021q button battery ac ohci_hcd hw_random k8_edac edac_mc shpchp tg3 e1000 bonding(U) floppy sg st dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod cciss mptscsih mptsas mptspi mptscsi mptbase sd_mod scsi_mod
Pid: 4496, comm: dlm_astd Tainted: P      2.6.9-55.0.9.ELsmp
RIP: 0010:[<ffffffffa01f1ed3>] <ffffffffa01f1ed3>{:dlm:dlm_astd+1482}
RSP: 0018:00000100f2077ed8  EFLAGS: 00010246
RAX: 00000100f2077fd8 RBX: ffffffffa02118a0 RCX: ffffffffffffff40
RDX: 0000000000000000 RSI: 000000000000006c RDI: ffffffffa02118a0
RBP: ffffffffa02118a0 R08: 0000000000003a97 R09: 0000000000000008
R10: 0000000000000000 R11: 0000000000000010 R12: ffffffffffffff40
R13: 00000102fc77ac80 R14: ffffffffa01f253d R15: 00000100f567a800
FS:  0000002a95ef16e0(0000) GS:ffffffff804ee100(0000) knlGS:00000000f7e9c6c0
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0
Process dlm_astd (pid: 4496, threadinfo 00000100f2076000, task 00000101fbebf7f0)
Stack: 00020100f2e97ce8 00000100f1edee18 00000100f2077f18 0000000000000000
       ffffffffa01f1909 00000100f2e97cf8 00000000fffffffc 00000100f2e97ce8
       ffffffff8014b930 ffffffff8014b907
Call Trace:<ffffffffa01f1909>{:dlm:dlm_astd+0} <ffffffff8014b930>{keventd_create_kthread+0}
       <ffffffff8014b907>{kthread+200} <ffffffff80110f47>{child_rip+8}
       <ffffffff8014b930>{keventd_create_kthread+0} <ffffffff8014b83f>{kthread+0}
       <ffffffff80110f3f>{child_rip+0}
Code: 4d 8b a4 24 c0 00 00 00 48 8d 81 c0 00 00 00 49 81 ec c0 00
RIP <ffffffffa01f1ed3>{:dlm:dlm_astd+1482} RSP <00000100f2077ed8>
CR2: 0000000000000000

Version-Release number of selected component (if applicable):
dlm-kernel-smp-2.6.9-46.16.0.10-x86_64

How reproducible:
Occasional; cluster runs fine for a couple of days then a node panics.

Steps to Reproduce:
1. Unknown - involves custom DLM application

Actual results:
Above oops.

Expected results:
DLM does not cause an oops with 3rd party application.

Additional info:
Created attachment 259951 [details]
disassembly of dlm_astd

Disassembled dlm_astd with comments/annotations.
The first thing to consider is setting the following two config options under /proc/cluster/config/dlm/ to 0:

  deadlocktime
  lock_timeout

This will disable the sections of code that are likely in question here, which return EDEADLCK and ETIMEDOUT, assuming that the program doesn't rely on these. See the commands after this comment for the concrete steps.
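Concretely, assuming those proc entries are writable on the running nodes (the paths are as given above), that would be something like:

  echo 0 > /proc/cluster/config/dlm/deadlocktime
  echo 0 > /proc/cluster/config/dlm/lock_timeout

run on each cluster node. Whether this needs to happen before the lockspace is created is not something I have verified.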
Created attachment 260371 [details]
revised disassembly of dlm_astd

I noticed a mistake in the previous attachment - the pointer manipulation around dlm_astd+1308 (the list_for_each_entry_safe() setup) was all wrong. I think this version is correct - the faulting instruction appears to be the attempt to load the next member from the saved lkb before continuing back around the loop:

    load safe->lkb_deadlockq.next into %r12

For some reason, the next/prev pointers in lkb_deadlockq for the last entry in the _deadlockqueue both seem to be NULL.
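For context, a minimal sketch of what that loop corresponds to at the source level, assuming the astd thread walks the deadlock queue with list_for_each_entry_safe() - the variable names and loop body here are illustrative, not the actual RHEL4 dlm-kernel source:

#include <linux/list.h>
/* struct dlm_lkb and _deadlockqueue are from the dlm-kernel sources */

struct dlm_lkb *lkb, *safe;

/*
 * list_for_each_entry_safe() pre-loads the next entry into 'safe' so the
 * current entry can be removed inside the loop body.  The advance step at
 * the end of each iteration is effectively:
 *
 *     lkb  = safe;
 *     safe = list_entry(lkb->lkb_deadlockq.next, struct dlm_lkb,
 *                       lkb_deadlockq);
 *
 * If lkb_deadlockq.next is NULL, list_entry() yields a bogus pointer of
 * -offsetof(struct dlm_lkb, lkb_deadlockq) (0xc0 here, which matches
 * %r12 = 0xffffffffffffff40 in the oops), and the next attempt to read
 * ->lkb_deadlockq.next through that pointer is a load from address 0 -
 * consistent with the "mov 0xc0(%r12),%r12" bytes in the Code: line.
 */
list_for_each_entry_safe(lkb, safe, &_deadlockqueue, lkb_deadlockq) {
        /* deadlock/timeout processing; may remove lkb from the queue */
}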
Need to check I have the correct crash invocation here, but this appears to be the content of the _deadlockqueue at the time of the crash:

crash> list _deadlockqueue
ffffffffa0211880
100f173acd8
100f173a110
100f173adc0   ***

crash> list -H _deadlockqueue dlm_lkb.lkb_deadlockq -s dlm_lkb
[...]
100f173ad00
struct dlm_lkb {
  lkb_flags = 262144,
  lkb_status = 2,
  lkb_rqmode = 5 '\005',
  lkb_grmode = 0 '\0',
  lkb_retstatus = 4294967285,
  lkb_id = 65971,
  lkb_lksb = 0x0,
  lkb_idtbl_list = {
    next = 0x100f20cb660,
    prev = 0x100f20cb660
  },
  lkb_statequeue = {
    next = 0x100f173aef8,
    prev = 0x100f1ede098
  },
  lkb_resource = 0x100f1ede048,
  lkb_parent = 0x0,
  lkb_childcnt = {
    counter = 0
  },
  lkb_lockqueue = {
    next = 0x0,
    prev = 0x0
  },
  lkb_lockqueue_state = 0,
  lkb_lockqueue_flags = 5,
  lkb_ownpid = 5469,
  lkb_lockqueue_time = 0,
  lkb_duetime = 0,
  lkb_remid = 66260,
  lkb_nodeid = 2,
  lkb_astaddr = 0x1,
  lkb_bastaddr = 0x0,
  lkb_astparam = 0,
  lkb_astqueue = {
    next = 0x0,
    prev = 0x0
  },
  lkb_astflags = 0,
  lkb_bastmode = 0 '\0',
  lkb_highbast = 0 '\0',
  lkb_request = 0x0,
  lkb_deadlockq = {
    next = 0x0,    | *** These should not be NULL!
    prev = 0x0     | ***
  },
  lkb_lvbseq = 0,
  lkb_lvbptr = 0x0,
  lkb_range = 0x0
}
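For reference (this is not from the dump), a sketch of the states the standard <linux/list.h> primitives leave in a list_head on this kernel. None of them produce NULL next/prev, which is why the all-zero lkb_deadlockq above suggests either corruption or an lkb that became reachable through the _deadlockqueue without ever having been list_add()ed onto it:

#include <linux/list.h>

/* Sketch only: the states a list_head normally passes through. */
static void list_head_states(void)
{
        LIST_HEAD(queue);             /* empty head: next == prev == &queue  */
        struct list_head entry;

        INIT_LIST_HEAD(&entry);       /* unlinked: points at itself          */
        list_add(&entry, &queue);     /* linked: entry <-> queue, no NULLs   */
        list_del(&entry);             /* 2.6.9 list_del() poisons the entry:
                                       * next = LIST_POISON1 (0x00100100),
                                       * prev = LIST_POISON2 (0x00200200)
                                       * by default                          */
        INIT_LIST_HEAD(&entry);       /* list_del_init() leaves it like this,
                                       * pointing at itself, never NULL      */
}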
Created attachment 260381 [details]
list -H _deadlockqueue dlm_lkb.lkb_deadlockq -s dlm_lkb

Complete dump of the deadlockqueue.
Bryn's analysis confirms that we should set deadlocktime to 0, which should avoid this problem. lock_timeout should also be set to 0 to avoid any similar problems there. We might want to consider changing some code to set the NOTIMERS flag on the lockspace to avoid the deadlock queue altogether. Of course, it would also be interesting to find the cause of the apparent list corruption.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
This was fixed by setting /proc/cluster/config/dlm/deadlocktime to 0.