Description of problem:

DLM: Assertion failed on line 1390 of file /usr/src/build/678338-x86_64/BUILD/dlm-kernel-2.6.9-39/smp/src/locking.c
DLM: assertion: "lkb->lkb_status == GDLM_LKSTS_CONVERT"
DLM: time = 8621694772
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at locking:1390
invalid operand: 0000 [1] SMP
CPU 2
Modules linked in: sg cpqci(U) mptctl i2c_dev i2c_core dlm(U) cman(U) md5 ipv6 8021q iptable_nat ipt_REJECT ipt_multiport ipt_state ip_conntrack iptable_filter ip_tables button battery ac ohci_hcd hw_random tg3 e1000 bonding(U) floppy st dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod cciss mptscsih mptbase sd_mod scsi_mod
Pid: 26490, comm: dlm_astd Tainted: P 2.6.9-22.0.2.ELsmp
RIP: 0010:[<ffffffffa01cd05c>] <ffffffffa01cd05c>{:dlm:conversion_deadlock_check+77}
RSP: 0018:00000102b7f25eb8 EFLAGS: 00010212
RAX: 0000000000000001 RBX: ffffffffa01e7820 RCX: 0000000100000000
RDX: ffffffff803d7d48 RSI: 0000000000000246 RDI: ffffffff803d7d40
RBP: 00000102bcba9220 R08: ffffffff803d7d48 R09: ffffffffa01e7820
R10: ffffffff8011de54 R11: ffffffff8011de54 R12: 00000102ca015088
R13: 00000101ff4b9ec0 R14: ffffffffa01c951f R15: 0000010306894400
FS: 0000002a95566780(0000) GS:ffffffff804d3700(0000) knlGS:00000000f7feebb0
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000002aaef2c000 CR3: 00000003fff90000 CR4: 00000000000006e0
Process dlm_astd (pid: 26490, threadinfo 00000102b7f24000, task 000001006fb93030)
Stack: ffffffffa01e7820 ffffffffa01e7820 ffffffffa01e7740 ffffffffa01c8e9b
       00010103752c3ce8 00000102ca015e58 00000102b7f25f18 0000000000000000
       ffffffffa01c8909 00000103752c3cf8
Call Trace:<ffffffffa01c8e9b>{:dlm:dlm_astd+1426} <ffffffffa01c8909>{:dlm:dlm_astd+0}
       <ffffffff8014a380>{keventd_create_kthread+0} <ffffffff8014a357>{kthread+200}
       <ffffffff80110ce3>{child_rip+8} <ffffffff8014a380>{keventd_create_kthread+0}
       <ffffffff8014a28f>{kthread+0} <ffffffff80110cdb>{child_rip+0}

Code: 0f 0b 81 ac 1d a0 ff ff ff ff 6e 05 48 c7 c7 89 ac 1d a0 31
RIP <ffffffffa01cd05c>{:dlm:conversion_deadlock_check+77} RSP <00000102b7f25eb8>
<0>Kernel panic - not syncing: Oops

Version-Release number of selected component (if applicable):
kernel-smp-2.6.9-22.0.2.EL-x86_64
dlm-1.0.0-5-x86_64
dlm-devel-1.0.0-5-x86_64
dlm-kernel-smp-2.6.9-39.1.2-x86_64
dlm-kernel-2.6.9-39.1.2-x86_64
dlm-kernheaders-2.6.9-39.1.2-x86_64
rgmanager-1.9.43-0-x86_64
magma-1.0.3-2-x86_64
cman-kernel-smp-2.6.9-41.0.2-x86_64
cman-kernel-2.6.9-41.0.2-x86_64
cman-devel-1.0.4-0-x86_64
magma-plugins-1.0.5-0-x86_64
system-config-cluster-1.0.16-1.0-noarch
cman-kernheaders-2.6.9-41.0.2-x86_64
cman-1.0.4-0-x86_64
magma-devel-1.0.3-2-x86_64

How reproducible:
Difficult

Steps to Reproduce:
This has been seen in two- and three-node clusters which do not use GFS and do not have any rgmanager services defined. Each time it has occurred on one of the remaining nodes following eviction of another node (missed heartbeat caused by sysrq-t over slow serial consoles).

Actual results:
Above backtrace.

Expected results:
Remaining nodes recover following eviction of the node that missed the heartbeat.
This is code that we've never used or tested; I'm surprised it works at all! Just so expectations are set appropriately: if you use the rhel4 dlm for anything beyond gfs/clvm/rgmanager, you're in uncharted territory and will definitely find a lot of broken things. Rewriting the dlm (the result being in rhel5) was the only way to make the dlm more generally usable. In rhel5 it's definitely our aim to make the dlm work in general for users' apps. (The kind of deadlock detection involved in this bug is a feature that I'm working on right now, actually, and is planned for 5.1.)

Now, on to this specific bug in conversion_deadlock_check(): it should be pretty trivial to fix. I'd suggest changing

    DLM_ASSERT(lkb->lkb_status == GDLM_LKSTS_CONVERT,);

into

    if (lkb->lkb_status != GDLM_LKSTS_CONVERT)
            return NULL;
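To make the difference between the two variants concrete, here is a minimal userspace sketch of the change suggested above. It is not the dlm-kernel code: the struct, the enum values and both function names are hypothetical stand-ins, the standard assert() only approximates DLM_ASSERT, and the real deadlock scan is replaced by a placeholder return.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder status values (assumption); the real GDLM_LKSTS_* constants
 * are defined in the dlm-kernel sources. */
enum mock_lkb_status {
    MOCK_LKSTS_WAITING = 1,
    MOCK_LKSTS_GRANTED = 2,
    MOCK_LKSTS_CONVERT = 3
};

/* Hypothetical stand-in for the lock block; only the field the check uses. */
struct mock_lkb {
    int lkb_status;
};

/* Old behaviour: a lock that is no longer on the convert queue trips the
 * assertion. In the kernel this is what ends in "Kernel BUG at locking:1390". */
struct mock_lkb *check_with_assert(struct mock_lkb *lkb)
{
    assert(lkb->lkb_status == MOCK_LKSTS_CONVERT);
    return lkb; /* placeholder for the real deadlock scan */
}

/* Suggested behaviour: a lock in any other state is simply skipped. */
struct mock_lkb *check_with_early_return(struct mock_lkb *lkb)
{
    if (lkb->lkb_status != MOCK_LKSTS_CONVERT)
        return NULL;
    return lkb; /* placeholder for the real deadlock scan */
}
```

With a lock whose status has already changed (for example to the granted state), check_with_assert() aborts the process, while check_with_early_return() returns NULL and lets the caller carry on. Whether skipping such a lock is always safe in the real module is a judgment for the dlm maintainers; the sketch only shows the control-flow change.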
If the person reporting this problem can test and confirm that the change in comment 3 works, then I'll check in that change.
patch added to cvs

Checking in locking.c;
/cvs/cluster/cluster/dlm-kernel/src/Attic/locking.c,v  <--  locking.c
new revision: 1.50.2.11; previous revision: 1.50.2.10
done
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0995.html
(In reply to comment #6)
> patch added to cvs

I notice that this patch isn't included in the STABLE branch in CVS, but is in the RHEL46 branch.