Running:

iogen -o -m random -s write,writev,readv -t 1b -T1000b 10000b:tfile1 | doio -avk

on a 6 node cluster produced:

DLM: Assertion failed on line 973 of file /usr/src/cluster/dlm-kernel/src/lockqueue.c
DLM: assertion: "lkb"
DLM: time = 12776900
dlm: reply rh_cmd 5 rh_lkid 25403dc lockstate 3989 nodeid 64 status 0 lkid f509be98 nodeid 5

------------[ cut here ]------------
kernel BUG at /usr/src/cluster/dlm-kernel/src/lockqueue.c:973!
invalid operand: 0000 [#1]
SMP
Modules linked in: gnbd lock_gulm lock_nolock lock_dlm dlm cman gfs lock_harness ipv6 parport_pc lp parport autofs4 sunrpc e1000 floppy sg microcode dm_mod uhci_hcd ehci_hcd button battery asus_acpi ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU: 0
EIP: 0060:[<f8a8a685>] Not tainted
EFLAGS: 00010246 (2.6.7)
EIP is at process_cluster_request+0x7e5/0xd30 [dlm]
eax: 00000001 ebx: 00000000 ecx: f7473e4c edx: 0000872a
esi: c237f800 edi: f7473f04 ebp: 00000000 esp: f7473e48
ds: 007b es: 007b ss: 0068
Process dlm_recvd (pid: 3684, threadinfo=f7472000 task=f62f5160)
Stack: f8a97e00 00000005 f8a98f8c f8a97dfc 00c2f5c4 f63b2c90 0000003c f63b2b6c
       00000000 00000000 f5cc1cc4 00000005 c035afc0 f7473fa4 f7473fa4 c02d0128
       00000fc4 00000040 00004000 f7473e98 00000000 c035ca80 00001000 f625b500
Call Trace:
 [<c02d0128>] inet_recvmsg+0x48/0x70
 [<c0285f1c>] sock_recvmsg+0xbc/0xc0
 [<f8a8e8c3>] midcomms_process_incoming_buffer+0x173/0x250 [dlm]
 [<c0285f1c>] sock_recvmsg+0xbc/0xc0
 [<f8a8c202>] receive_from_sock+0x142/0x320 [dlm]
 [<f8a8d239>] process_sockets+0xa9/0xd0 [dlm]
 [<f8a8d52d>] dlm_recvd+0x9d/0xf0 [dlm]
 [<f8a8d490>] dlm_recvd+0x0/0xf0 [dlm]
 [<c01042b5>] kernel_thread_helper+0x5/0x10
Code: 0f 0b cd 03 8c 8f a9 f8 e9 e8 fa ff ff 8b 57 0c 89 f0 e8 14

Version-Release number of selected component (if applicable):
DLM <CVS> (built Aug 20 2004 13:05:57) installed

How reproducible:
Didn't try

Steps to Reproduce:
1. iogen -o -m random -s write,writev,readv -t 1b -T1000b 10000b:tfile1 | doio -avk on all 6 nodes of a six node cluster.
2. Wait several hours...
3.

Additional info:
This was hit while attempting to verify 126757.
I ran this for about 24 hours on 8 nodes without a problem. I have 4 SMP machines I can also try.
I'm sorry, I should have noted that this was on SMP.
I've been running this for 6 hours on my 4 SMP machines with no problem. I'll let it continue running.
Over 24 hours on 4 SMP machines and nothing. I'll let this one sit until someone can reproduce it.
Updated with the proper version and component name.
Ran for 19 hours and did not hit this again.