Bug 197966

Summary: Kernel oops or panic during fast DLM lock take/release loops on SMP x86_64

Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.0
Hardware: x86_64
OS: Linux
Severity: high
Priority: medium
Status: CLOSED UPSTREAM
Reporter: Lon Hohberger <lhh>
Assignee: David Teigland <teigland>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: ccaulfie, cluster-maint
Doc Type: Bug Fix
Last Closed: 2006-12-14 17:45:27 UTC

Attachments: Reproducer (flags: none)

Description Lon Hohberger 2006-07-07 18:56:28 UTC
Description of problem:

kernel-2.6.17-1.2339.fc6 
cman-2.0.0-0.fc6.10
2-cpu EM64T


Oops:

Unable to handle kernel paging request at 0000000000200200 RIP:
<ffffffff8844c6d4>{:dlm_device:ast_routine+204}
PGD 0
Oops: 0000 [1] SMP
last sysfs file: /class/misc/dlm_testing_lockspa/dev
CPU 0
Modules linked in: md5 sctp dlm_device dlm configfs autofs4 hidp rfcomm l2cap
bluetooth sunrpc xt_state ip_conntrack nfnetlink xt_tcpudp ip6table_filter
ip6_tables x_tables ipv6 acpi_cpufreq video button battery asus_acpi ac
parport_pc lp parport intel_rng e100 ehci_hcd pcspkr i2c_i801 sg ohci1394 mii
i2c_core ieee1394 uhci_hcd ide_cd cdrom dm_snapshot dm_zero dm_mirror dm_mod
ext3 jbd ata_piix libata sd_mod scsi_mod
Pid: 2354, comm: dlm_astd Not tainted 2.6.17-1.2339.fc6 #1
RIP: 0010:[<ffffffff8844c6d4>] <ffffffff8844c6d4>{:dlm_device:ast_routine+204}
RSP: 0018:ffff81002e4fbe88  EFLAGS: 00010202
RAX: 0000000000200200 RBX: ffff8100347c2c78 RCX: 0000000000000001
RDX: 00000000ffffffff RSI: 0000000000000ddf RDI: ffff8100310514e0
RBP: ffff8100347c2cf8 R08: ffff810031051538 R09: 0000000000000000
R10: ffff810034235cc0 R11: ffff8100347c2c78 R12: ffff81002f5bda00
R13: 0000000000000000 R14: ffff81002f5bda48 R15: ffffffff8029f12d
FS:  0000000000000000(0000) GS:ffffffff8069d000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000200200 CR3: 000000003119b000 CR4: 00000000000006e0
Process dlm_astd (pid: 2354, threadinfo ffff81002e4fa000, task ffff810030209820)
Stack:  0000000000000000 ffff81002d9919e8 0000000000000001 ffffffff8842a2f7
       ffff81002f5bda58 0000000000000000 ffffffff8842a1a6 ffff81002f5bda58
       0000000000000296 ffffffff80234fb7 ffffffffffffffff ffffffffffffffff
Call Trace: <ffffffff8842a2f7>{:dlm:dlm_astd+337} <ffffffff80234fb7>{kthread+246}
       <ffffffff80260892>{child_rip+8}

Code: 48 8b 10 48 39 ea 74 1b 48 89 ee 48 c7 c7 f2 d8 44 88 31 c0
RIP <ffffffff8844c6d4>{:dlm_device:ast_routine+204} RSP <ffff81002e4fbe88>
CR2: 0000000000200200
 BUG: dlm_astd/2354, lock held at task exit time!
 [ffffffff8844a640] {dlm_astd_start}
.. held by:          dlm_astd: 2354 [ffff810030209820, 110]
... acquired at:               dlm_astd+0x49/0x1c0 [dlm]


Panic:
BUG: spinlock bad magic on CPU#1, dlm_astd/2358 (Not tainted)
 lock: ffff810031100ec8, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0

Call Trace: <ffffffff8026dd87>{dump_stack+18} <ffffffff802134e2>{spin_bug+176}
       <ffffffff80207846>{_raw_spin_lock+25}
<ffffffff802672e6>{_spin_lock_irqsave+9}
       <ffffffff8022fca1>{__wake_up+34} <ffffffff8842e2f7>{:dlm:dlm_astd+337}
       <ffffffff80234fb7>{kthread+246} <ffffffff80260892>{child_rip+8}
Kernel panic - not syncing: bad locking

Call Trace: <ffffffff8026dd87>{dump_stack+18} <ffffffff8028f8b6>{panic+134}
       <ffffffff8021350b>{spin_bug+217} <ffffffff80207846>{_raw_spin_lock+25}
       <ffffffff802672e6>{_spin_lock_irqsave+9} <ffffffff8022fca1>{__wake_up+34}
      <ffffffff8842e2f7>{:dlm:dlm_astd+337} <ffffffff80234fb7>{kthread+246}
       <ffffffff80260892>{child_rip+8}


How reproducible: 100%


Steps to Reproduce:
1. gcc -o hammer hammer.c -ldlm -lpthread
2. ./hammer
  
Actual results:  Output as above, and...

[root@crumble usertest]# ./hammer
Opening lockspace testing_lockspace
Failed to open: No such file or directory; trying create.
Lockspace creation attempt complete!
Entering lock loop...
Acquired lock 1 times
...
Acquired lock 688 times

Message from syslogd@crumble at Fri Jul  7 15:02:53 2006 ...
crumble kernel: Oops: 0000 [1] SMP

Message from syslogd@crumble at Fri Jul  7 15:02:53 2006 ...
crumble kernel: CR2: 0000000000200200


Additional info:

Test program simply does sequential LKM_EXMODE lock/unlock operations in a
tight loop, using wrappers around the DLM calls (for parameter validation,
waiting via dlm_dispatch, etc.).  There are no other users of the DLM, and no
other locks are being taken or released.  This occurs on a single node.
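
Roughly, the loop looks like the sketch below. This is a simplified
approximation built on the public libdlm API (dlm_open_lockspace /
dlm_create_lockspace, dlm_ls_lock / dlm_ls_unlock, and dlm_ls_get_fd /
dlm_dispatch for AST delivery), not the attached hammer.c; the resource name,
the manual fd polling, and the assumption that unlock completions are
delivered through the same completion AST are illustrative only.

/*
 * Hypothetical reproducer sketch: open (or create) a lockspace, then take
 * and release an EX lock on a single resource as fast as possible, waiting
 * for each completion AST via dlm_dispatch().
 *
 * Build, as in the steps above: gcc -o hammer hammer.c -ldlm -lpthread
 */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <poll.h>
#include <libdlm.h>

static struct dlm_lksb lksb;
static volatile int ast_fired;

/* Completion AST: invoked from dlm_dispatch() in this thread's context. */
static void compl_ast(void *arg)
{
        (void)arg;
        ast_fired = 1;
}

/* Pump the lockspace fd until the completion AST has run. */
static void wait_for_ast(int fd)
{
        struct pollfd pfd = { .fd = fd, .events = POLLIN };

        while (!ast_fired) {
                poll(&pfd, 1, -1);
                dlm_dispatch(fd);
        }
        ast_fired = 0;
}

int main(void)
{
        const char *name = "testing_lockspace";
        dlm_lshandle_t ls;
        unsigned long count = 0;
        int fd;

        ls = dlm_open_lockspace(name);
        if (!ls) {
                fprintf(stderr, "Failed to open: %s; trying create.\n",
                        strerror(errno));
                ls = dlm_create_lockspace(name, 0600);
        }
        if (!ls) {
                perror("dlm_create_lockspace");
                return 1;
        }
        fd = dlm_ls_get_fd(ls);

        for (;;) {
                memset(&lksb, 0, sizeof(lksb));
                if (dlm_ls_lock(ls, LKM_EXMODE, &lksb, 0,
                                "hammer-res", strlen("hammer-res"), 0,
                                compl_ast, NULL, NULL, NULL)) {
                        perror("dlm_ls_lock");
                        break;
                }
                wait_for_ast(fd);       /* lock granted */

                if (dlm_ls_unlock(ls, lksb.sb_lkid, 0, &lksb, NULL)) {
                        perror("dlm_ls_unlock");
                        break;
                }
                wait_for_ast(fd);       /* unlock completed (assumed to arrive
                                           via the same completion AST) */

                printf("Acquired lock %lu times\n", ++count);
        }
        return 0;
}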

Comment 1 Lon Hohberger 2006-07-07 18:56:29 UTC
Created attachment 132082 [details]
Reproducer

Comment 2 Lon Hohberger 2006-07-07 18:57:17 UTC
Note that one or the other occurs; the oops does not lead to a panic.

Comment 3 Christine Caulfield 2006-07-10 09:30:58 UTC
I'm currently redoing (based on Dave's work) the device code for the DLM,
so I'll ask you to retry this when it's done, if that's OK.

Comment 4 Lon Hohberger 2006-07-10 14:23:46 UTC
Sounds excellent; I look forward to it.

Comment 5 Christine Caulfield 2006-07-14 10:22:22 UTC
This code is now in Steve's git tree.


Comment 6 Nate Straz 2007-12-13 17:40:41 UTC
Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5, which never existed.