Description of problem:
6-node cluster, running a moderate I/O load. Tank-06 is noticed to be missing heartbeats and is fenced by tank-01. Tank-03 panics, which then causes tank-05 to panic:

Jan 21 16:32:12 tank-03 kernel: CMAN: removing node tank-06 from the cluster : Missed too many heartbeats
Jan 21 16:32:13 tank-03 kernel: dlm: vedder: restbl_rsb_update failed -1
Jan 21 16:32:13 tank-03 fenced[4025]: fencing deferred to tank-01
Jan 21 16:32:33 tank-03 kernel: Unable to handle kernel paging request at virtual address 9b100030
Jan 21 16:32:33 tank-03 kernel: printing eip:
Jan 21 16:32:33 tank-03 kernel: f8aa6ea7
Jan 21 16:32:33 tank-03 kernel: *pde = 341ed001
Jan 21 16:32:33 tank-03 kernel: Oops: 0000 [#1]
Jan 21 16:32:33 tank-03 kernel: SMP
Jan 21 16:32:33 tank-03 kernel: Modules linked in: gnbd(U) lock_nolock(U) gfs(U) lock_dlm(U) dlm(U) cman(U) lock_harness(U) dm_mod md5 ipv6 parport_pc lp parport autofs4 sunrpc e1000 microcode uhci_hcd ehci_hcd button battery ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
Jan 21 16:32:33 tank-03 kernel: CPU: 0
Jan 21 16:32:33 tank-03 kernel: EIP: 0060:[<f8aa6ea7>] Tainted: GF VLI
Jan 21 16:32:33 tank-03 kernel: EFLAGS: 00010286 (2.6.9-5.ELsmp)
Jan 21 16:32:33 tank-03 kernel: EIP is at dlm_dir_rebuild_send+0x12f/0x2a7 [dlm]
Jan 21 16:32:33 tank-03 kernel: eax: f7fe9720 ebx: 0e000d00 ecx: 9b0ffff0 edx: 00000000
Jan 21 16:32:33 tank-03 kernel: esi: f4abafe5 edi: f2eb2694 ebp: f2eb267c esp: f67f5e40
Jan 21 16:32:33 tank-03 kernel: ds: 007b es: 007b ss: 0068
Jan 21 16:32:33 tank-03 kernel: Process dlm_recvd (pid: 4609, threadinfo=f67f5000 task=f4ee5130)
Jan 21 16:32:33 tank-03 kernel: Stack: 00000668 00000246 00000000 f7ffd180 9b0ffff0 00000680 9b100000 f7fe9600
Jan 21 16:32:33 tank-03 kernel: 02000000 00001800 f4b3968c 00000feb f7fe9600 00000004 f67f5f00 f8ab1b85
Jan 21 16:32:33 tank-03 kernel: f2eb2014 00000feb 00000004 00004040 00001000 f2eb2000 00000004 00000000
Jan 21 16:32:33 tank-03 kernel: Call Trace:
Jan 21 16:32:33 tank-03 kernel: [<f8ab1b85>] rcom_process_message+0x194/0x4ac [dlm]
Jan 21 16:32:33 tank-03 kernel: [<f8ab1f61>] process_reply_sync+0xc4/0xcb [dlm]
Jan 21 16:32:33 tank-03 kernel: [<f8ab20c4>] process_recovery_comm+0x3b/0xaa [dlm]
Jan 21 16:32:33 tank-03 kernel: [<f8aade00>] midcomms_process_incoming_buffer+0x1ba/0x1f6 [dlm]
Jan 21 16:32:33 tank-03 kernel: [<c011e912>] autoremove_wake_function+0x0/0x2d
Jan 21 16:32:33 tank-03 kernel: [<c013efe8>] __alloc_pages+0xb4/0x298
Jan 21 16:32:33 tank-03 kernel: [<f8aac072>] receive_from_sock+0x192/0x26c [dlm]
Jan 21 16:32:33 tank-03 kernel: [<f8aacf27>] dlm_recvd+0x0/0x95 [dlm]
Jan 21 16:32:33 tank-03 kernel: [<f8aacdd9>] process_sockets+0x52/0x85 [dlm]
Jan 21 16:32:33 tank-03 kernel: [<f8aacfac>] dlm_recvd+0x85/0x95 [dlm]
Jan 21 16:32:33 tank-03 kernel: [<c0131dcd>] kthread+0x73/0x9b
Jan 21 16:32:33 tank-03 kernel: [<c0131d5a>] kthread+0x0/0x9b
Jan 21 16:32:33 tank-03 kernel: [<c01041f1>] kernel_thread_helper+0x5/0xb
Jan 21 16:32:33 tank-03 kernel: Code: 00 89 54 24 18 c7 44 24 14 00 00 00 00 8b 44 24 1c 05 20 01 00 00 39 44 24 18 0f 84 0b 01 00 00 8b 4c 24 18 83 e9 10 89 4c 24 10 <8b> 59 40 85 db 0f 85 db 00 00 00 89 c8 e8 34 fa ff ff 3b 44 24

------

Jan 21 16:32:13 tank-05 kernel: CMAN: removing node tank-03 from the cluster : Missed too many heartbeats
Jan 21 16:32:14 tank-05 kernel: SM: 01000005 process_recovery_barrier status=-104
Jan 21 16:32:14 tank-05 kernel: dlm: vedder: dlm_dir_rebuild_local failed -1
Jan 21 16:32:14 tank-05 fenced[4025]: fencing deferred to tank-01
Jan 21 16:32:33 tank-05 kernel: Unable to handle kernel paging request at virtual address 75ffff44
Jan 21 16:32:33 tank-05 kernel: printing eip:
Jan 21 16:32:33 tank-05 kernel: f8aa6843
Jan 21 16:32:33 tank-05 kernel: *pde = 00000000
Jan 21 16:32:33 tank-05 kernel: Oops: 0000 [#1]
Jan 21 16:32:33 tank-05 kernel: SMP
Jan 21 16:32:33 tank-05 kernel: Modules linked in: gnbd(U) lock_nolock(U) gfs(U) lock_dlm(U) dlm(U) cman(U) lock_harness(U) dm_mod md5 ipv6 parport_pc lp parport autofs4 sunrpc e1000 microcode uhci_hcd ehci_hcd button battery ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
Jan 21 16:32:33 tank-05 kernel: CPU: 0
Jan 21 16:32:33 tank-05 kernel: EIP: 0060:[<f8aa6843>] Tainted: GF VLI
Jan 21 16:32:33 tank-05 kernel: EFLAGS: 00010246 (2.6.9-5.ELsmp)
Jan 21 16:32:33 tank-05 kernel: EIP is at name_to_directory_nodeid+0x9/0xb3 [dlm]
Jan 21 16:32:33 tank-05 kernel: eax: f4750088 ebx: 75ffff00 ecx: 000000f4 edx: f4750088
Jan 21 16:32:33 tank-05 kernel: esi: f47d7b85 edi: 00000000 ebp: f2142c9c esp: f61b4e30
Jan 21 16:32:33 tank-05 kernel: ds: 007b es: 007b ss: 0068
Jan 21 16:32:33 tank-05 kernel: Process dlm_recvd (pid: 4615, threadinfo=f61b4000 task=f4de03b0)
Jan 21 16:32:33 tank-05 kernel: Stack: 00000000 f47d7b85 f2142cb4 f8aa6eb9 00000c88 00000246 00000000 f7ffd180
Jan 21 16:32:33 tank-05 kernel: f4750007 00000ca0 f4750017 c2262600 04000000 00001800 f38ef54c 00000feb
Jan 21 16:32:33 tank-05 kernel: c2262600 00000001 f61b4f00 f8ab1b85 f2142014 00000feb 00000001 00004040
Jan 21 16:32:33 tank-05 kernel: Call Trace:
Jan 21 16:32:33 tank-05 kernel: [<f8aa6eb9>] dlm_dir_rebuild_send+0x141/0x2a7 [dlm]
Jan 21 16:32:33 tank-05 kernel: [<f8ab1b85>] rcom_process_message+0x194/0x4ac [dlm]
Jan 21 16:32:33 tank-05 kernel: [<f8ab1f61>] process_reply_sync+0xc4/0xcb [dlm]
Jan 21 16:32:33 tank-05 kernel: [<f8ab20c4>] process_recovery_comm+0x3b/0xaa [dlm]
Jan 21 16:32:33 tank-05 kernel: [<f8aade00>] midcomms_process_incoming_buffer+0x1ba/0x1f6 [dlm]
Jan 21 16:32:33 tank-05 kernel: [<c011e912>] autoremove_wake_function+0x0/0x2d
Jan 21 16:32:33 tank-05 kernel: [<c013efe8>] __alloc_pages+0xb4/0x298
Jan 21 16:32:33 tank-05 kernel: [<f8aac072>] receive_from_sock+0x192/0x26c [dlm]
Jan 21 16:32:33 tank-05 kernel: [<f8aacf27>] dlm_recvd+0x0/0x95 [dlm]
Jan 21 16:32:33 tank-05 kernel: [<f8aacdd9>] process_sockets+0x52/0x85 [dlm]
Jan 21 16:32:33 tank-05 kernel: [<f8aacfac>] dlm_recvd+0x85/0x95 [dlm]
Jan 21 16:32:33 tank-05 kernel: [<c0131dcd>] kthread+0x73/0x9b
Jan 21 16:32:33 tank-05 kernel: [<c0131d5a>] kthread+0x0/0x9b
Jan 21 16:32:33 tank-05 kernel: [<c01041f1>] kernel_thread_helper+0x5/0xb
Jan 21 16:32:33 tank-05 kernel: Code: 89 c8 c7 01 00 01 10 00 c7 41 04 00 02 20 00 e8 f9 71 00 00 eb d0 8d 83 04 01 00 00 5b e9 57 e6 81 c7 57 31 ff 56 53 89 c3 89 d0 <83> 7b 44 01 75 0a e8 18 78 00 00 e9 96 00 00 00 89 ca e8 be e2

DLM <CVS> (built Jan 18 2005 13:36:03) installed
Lock_DLM (built Jan 18 2005 14:28:48) installed

Have not tried to reproduce.
I hit this again with the 01-21-2005 builds.
There's a fair chance that today's check-in has fixed this.
Fix verified.