Bug 240453

Summary: DLM locking assertion failure line 1390
Product: [Retired] Red Hat Cluster Suite
Component: dlm-kernel
Version: 4
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: medium
Reporter: Bryn M. Reeves <bmr>
Assignee: David Teigland <teigland>
QA Contact: Cluster QE <mspqa-list>
CC: ccaulfie, cfeist, cluster-maint, jplans
Keywords: ZStream
Target Milestone: ---
Target Release: ---
Fixed In Version: RHBA-2007-0995
Doc Type: Bug Fix
Last Closed: 2007-11-21 21:55:48 UTC
Bug Blocks: 301511

Description Bryn M. Reeves 2007-05-17 17:04:15 UTC
Description of problem:
DLM:  Assertion failed on line 1390 of file
/usr/src/build/678338-x86_64/BUILD/dlm-kernel-2.6.9-39/smp/src/locking.c
DLM:  assertion:  "lkb->lkb_status == GDLM_LKSTS_CONVERT"
DLM:  time = 8621694772
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at locking:1390
invalid operand: 0000 [1] SMP
CPU 2
Modules linked in: sg cpqci(U) mptctl i2c_dev i2c_core dlm(U) cman(U) md5 ipv6
8021q iptable_nat ipt_REJECT ipt_multiport ipt_state ip_conntrack iptable_filter
ip_tables button battery ac ohci_hcd hw_random tg3 e1000 bonding(U) floppy st
dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod cciss mptscsih mptbase sd_mod scsi_mod
Pid: 26490, comm: dlm_astd Tainted: P      2.6.9-22.0.2.ELsmp
RIP: 0010:[<ffffffffa01cd05c>] <ffffffffa01cd05c>{:dlm:conversion_deadlock_check+77}
RSP: 0018:00000102b7f25eb8  EFLAGS: 00010212
RAX: 0000000000000001 RBX: ffffffffa01e7820 RCX: 0000000100000000
RDX: ffffffff803d7d48 RSI: 0000000000000246 RDI: ffffffff803d7d40
RBP: 00000102bcba9220 R08: ffffffff803d7d48 R09: ffffffffa01e7820
R10: ffffffff8011de54 R11: ffffffff8011de54 R12: 00000102ca015088
R13: 00000101ff4b9ec0 R14: ffffffffa01c951f R15: 0000010306894400
FS:  0000002a95566780(0000) GS:ffffffff804d3700(0000) knlGS:00000000f7feebb0
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000002aaef2c000 CR3: 00000003fff90000 CR4: 00000000000006e0
Process dlm_astd (pid: 26490, threadinfo 00000102b7f24000, task 000001006fb93030)
Stack: ffffffffa01e7820 ffffffffa01e7820 ffffffffa01e7740 ffffffffa01c8e9b
       00010103752c3ce8 00000102ca015e58 00000102b7f25f18 0000000000000000
       ffffffffa01c8909 00000103752c3cf8
Call Trace:<ffffffffa01c8e9b>{:dlm:dlm_astd+1426}
<ffffffffa01c8909>{:dlm:dlm_astd+0}
       <ffffffff8014a380>{keventd_create_kthread+0} <ffffffff8014a357>{kthread+200}
       <ffffffff80110ce3>{child_rip+8} <ffffffff8014a380>{keventd_create_kthread+0}
       <ffffffff8014a28f>{kthread+0} <ffffffff80110cdb>{child_rip+0}

Code: 0f 0b 81 ac 1d a0 ff ff ff ff 6e 05 48 c7 c7 89 ac 1d a0 31
RIP <ffffffffa01cd05c>{:dlm:conversion_deadlock_check+77} RSP <00000102b7f25eb8>
 <0>Kernel panic - not syncing: Oops


Version-Release number of selected component (if applicable):
kernel-smp-2.6.9-22.0.2.EL-x86_64
dlm-1.0.0-5-x86_64
dlm-devel-1.0.0-5-x86_64
dlm-kernel-smp-2.6.9-39.1.2-x86_64
dlm-kernel-2.6.9-39.1.2-x86_64
dlm-kernheaders-2.6.9-39.1.2-x86_64
rgmanager-1.9.43-0-x86_64
magma-1.0.3-2-x86_64
cman-kernel-smp-2.6.9-41.0.2-x86_64
cman-kernel-2.6.9-41.0.2-x86_64
cman-devel-1.0.4-0-x86_64
magma-plugins-1.0.5-0-x86_64
system-config-cluster-1.0.16-1.0-noarch
cman-kernheaders-2.6.9-41.0.2-x86_64
cman-1.0.4-0-x86_64
magma-devel-1.0.3-2-x86_64


How reproducible:
Difficult

Steps to Reproduce:
This has been seen in two- and three-node clusters that do not use GFS and do
not have any rgmanager services defined. Each time it has occurred on one of the
remaining nodes following eviction of another node (missed heartbeat caused by
sysrq-t over slow serial consoles).

Actual results:
Above backtrace.

Expected results:
Remaining nodes recover following eviction of node that missed heartbeat.

Comment 3 David Teigland 2007-05-17 18:08:21 UTC
This is code that we've never used or tested; I'm surprised it works at all!

Just so expectations are set appropriately, if you use the rhel4 dlm for
anything beyond gfs/clvm/rgmanager, you're in uncharted territory and will
definitely find a lot of broken things.  Rewriting the dlm (the result being in
rhel5) was the only way to make the dlm more generally usable.  In rhel5
it's definitely our aim to make the dlm work in general for users' apps.
(The kind of deadlock detection involved in this bug is a feature that I'm
working on right now, actually, and is planned for 5.1.)

Now, on to this specific bug in conversion_deadlock_check(): it should be pretty
trivial to fix. I'd suggest changing

  DLM_ASSERT(lkb->lkb_status == GDLM_LKSTS_CONVERT,);

into

  if (lkb->lkb_status != GDLM_LKSTS_CONVERT)
    return NULL;
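
For illustration only, here is a minimal user-space sketch of the difference
between the two forms. It is not the code from locking.c: the struct, the enum
values, the lkb_blocker field, and both function names are hypothetical
stand-ins, and assert() stands in for DLM_ASSERT. The assumption it makes is
that a lock which has left the convert queue (for example, during recovery
after a node eviction) can simply be skipped by the deadlock check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real values are defined in the dlm-kernel sources. */
enum sketch_lksts {
    SKETCH_LKSTS_WAITING = 1,
    SKETCH_LKSTS_GRANTED = 2,
    SKETCH_LKSTS_CONVERT = 3
};

struct sketch_lkb {
    int lkb_status;                 /* which queue the lock is on */
    struct sketch_lkb *lkb_blocker; /* hypothetical: lock it would deadlock with */
};

/* Old form: a lock that is not on the convert queue aborts the program
 * (in the kernel, the equivalent assertion ends in BUG() and a panic). */
static struct sketch_lkb *check_with_assert(struct sketch_lkb *lkb)
{
    assert(lkb->lkb_status == SKETCH_LKSTS_CONVERT);
    return lkb->lkb_blocker;
}

/* Suggested form: a lock that is not on the convert queue is treated as
 * "no conversion deadlock" and the caller carries on. */
static struct sketch_lkb *check_with_early_return(struct sketch_lkb *lkb)
{
    if (lkb->lkb_status != SKETCH_LKSTS_CONVERT)
        return NULL;
    return lkb->lkb_blocker;
}
```

With a lock whose status is SKETCH_LKSTS_GRANTED, check_with_early_return()
returns NULL, whereas check_with_assert() would abort. For a lock that is
still converting, both return the same pointer.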


Comment 4 David Teigland 2007-05-30 18:06:53 UTC
If the person reporting this problem can test and confirm that the
change in comment 3 works, then I'll check in that change.


Comment 6 David Teigland 2007-08-14 17:15:57 UTC
patch added to cvs

Checking in locking.c;
/cvs/cluster/cluster/dlm-kernel/src/Attic/locking.c,v  <--  locking.c
new revision: 1.50.2.11; previous revision: 1.50.2.10
done


Comment 18 errata-xmlrpc 2007-11-21 21:55:48 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0995.html


Comment 19 Charlie Brady 2007-12-04 22:07:49 UTC
(In reply to comment #6)
> patch added to cvs

I notice that this patch isn't included in the STABLE branch in CVS, but is in
the RHEL46 branch.