Description of problem: kernel-2.6.17-1.2339.fc6 cman-2.0.0-0.fc6.10 2-cpu EM64T Oops: Unable to handle kernel paging request at 0000000000200200 RIP: <ffffffff8844c6d4>{:dlm_device:ast_routine+204} PGD 0 Oops: 0000 [1] SMP last sysfs file: /class/misc/dlm_testing_lockspa/dev CPU 0 Modules linked in: md5 sctp dlm_device dlm configfs autofs4 hidp rfcomm l2cap bluetooth sunrpc xt_state ip_conntrack nfnetlink xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 acpi_cpufreq video button battery asus_acpi ac parport_pc lp parport intel_rng e100 ehci_hcd pcspkr i2c_i801 sg ohci1394 mii i2c_core ieee1394 uhci_hcd ide_cd cdrom dm_snapshot dm_zero dm_mirror dm_mod ext3 jbd ata_piix libata sd_mod scsi_mod Pid: 2354, comm: dlm_astd Not tainted 2.6.17-1.2339.fc6 #1 RIP: 0010:[<ffffffff8844c6d4>] <ffffffff8844c6d4>{:dlm_device:ast_routine+204} RSP: 0018:ffff81002e4fbe88 EFLAGS: 00010202 RAX: 0000000000200200 RBX: ffff8100347c2c78 RCX: 0000000000000001 RDX: 00000000ffffffff RSI: 0000000000000ddf RDI: ffff8100310514e0 RBP: ffff8100347c2cf8 R08: ffff810031051538 R09: 0000000000000000 R10: ffff810034235cc0 R11: ffff8100347c2c78 R12: ffff81002f5bda00 R13: 0000000000000000 R14: ffff81002f5bda48 R15: ffffffff8029f12d FS: 0000000000000000(0000) GS:ffffffff8069d000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000000200200 CR3: 000000003119b000 CR4: 00000000000006e0 Process dlm_astd (pid: 2354, threadinfo ffff81002e4fa000, task ffff810030209820)Stack: 0000000000000000 ffff81002d9919e8 0000000000000001 ffffffff8842a2f7 ffff81002f5bda58 0000000000000000 ffffffff8842a1a6 ffff81002f5bda58 0000000000000296 ffffffff80234fb7 ffffffffffffffff ffffffffffffffff Call Trace: <ffffffff8842a2f7>{:dlm:dlm_astd+337} <ffffffff80234fb7>{kthread+246} <ffffffff80260892>{child_rip+8} Code: 48 8b 10 48 39 ea 74 1b 48 89 ee 48 c7 c7 f2 d8 44 88 31 c0 RIP <ffffffff8844c6d4>{:dlm_device:ast_routine+204} RSP <ffff81002e4fbe88> CR2: 0000000000200200 BUG: dlm_astd/2354, lock held at task exit time! [ffffffff8844a640] {dlm_astd_start} .. held by: dlm_astd: 2354 [ffff810030209820, 110] ... acquired at: dlm_astd+0x49/0x1c0 [dlm] Panic: BUG: spinlock bad magic on CPU#1, dlm_astd/2358 (Not tainted) lock: ffff810031100ec8, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0 Call Trace: <ffffffff8026dd87>{dump_stack+18} <ffffffff802134e2>{spin_bug+176} <ffffffff80207846>{_raw_spin_lock+25} <ffffffff802672e6>{_spin_lock_irqsave+9} <ffffffff8022fca1>{__wake_up+34} <ffffffff8842e2f7>{:dlm:dlm_astd+337} <ffffffff80234fb7>{kthread+246} <ffffffff80260892>{child_rip+8} Kernel panic - not syncing: bad locking Call Trace: <ffffffff8026dd87>{dump_stack+18} <ffffffff8028f8b6>{panic+134} <ffffffff8021350b>{spin_bug+217} <ffffffff80207846>{_raw_spin_lock+25} <ffffffff802672e6>{_spin_lock_irqsave+9} <ffffffff8022fca1>{__wake_up+34} <ffffffff8842e2f7>{:dlm:dlm_astd+337} <ffffffff80234fb7>{kthread+246} <ffffffff80260892>{child_rip+8} How reproducible: 100% Steps to Reproduce: 1. gcc -o hammer hammer.c -ldlm -lpthread 2. ./hammer Actual results: Output as above, and... [root@crumble usertest]# ./hammer Opening lockspace testing_lockspace Failed to open: No such file or directory; trying create. Lockspace creation attempt complete! Entering lock loop... Acquired lock 1 times ... Acquired lock 688 times Message from syslogd@crumble at Fri Jul 7 15:02:53 2006 ... crumble kernel: Oops: 0000 [1] SMP Message from syslogd@crumble at Fri Jul 7 15:02:53 2006 ... crumble kernel: CR2: 0000000000200200 Additional info: Test program simply does sequential LKM_EXMODE lock / unlock in a tight loop using wrappers around DLM calls (for param validation, waiting/dlm_dispatch etc.). There are no other users of the DLM, and no other locks are being taken/released. This occurs on one node.
Created attachment 132082 [details] Reproducer
Note that one or the other occurs; the oops does not lead to a panic.
I'm currently redoing (based on Dave's work) the device code code for the DLM. so I'll ask you to retry this when it's done if that's OK.
Sounds excellent; I look forward to it.
This code is now in Steve's git tree
Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5 which never existed.