Bug 145831

Summary: dlm_recvd panic after node removed from cluster
Product: [Retired] Red Hat Cluster Suite
Reporter: Dean Jansa <djansa>
Component: dlm
Assignee: David Teigland <teigland>
Status: CLOSED CURRENTRELEASE
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Priority: medium
Version: 4
CC: cluster-maint
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2005-02-01 16:36:08 UTC

Description Dean Jansa 2005-01-21 23:13:53 UTC
Description of problem:

6-node cluster, running a moderate I/O load.  Tank-06 is seen to be
missing heartbeats and is fenced by tank-01.  Tank-03 panics,
which then causes tank-05 to panic:


Jan 21 16:32:12 tank-03 kernel: CMAN: removing node tank-06 from the
cluster : Missed too many heartbeats
Jan 21 16:32:13 tank-03 kernel: dlm: vedder: restbl_rsb_update failed -1
Jan 21 16:32:13 tank-03 fenced[4025]: fencing deferred to tank-01
Jan 21 16:32:33 tank-03 kernel: Unable to handle kernel paging request
at virtual address 9b100030
Jan 21 16:32:33 tank-03 kernel:  printing eip:
Jan 21 16:32:33 tank-03 kernel: f8aa6ea7
Jan 21 16:32:33 tank-03 kernel: *pde = 341ed001
Jan 21 16:32:33 tank-03 kernel: Oops: 0000 [#1]
Jan 21 16:32:33 tank-03 kernel: SMP
Jan 21 16:32:33 tank-03 kernel: Modules linked in: gnbd(U)
lock_nolock(U) gfs(U) lock_dlm(U) dlm(U) cman(U) lock_harness(U)
dm_mod md5 ipv6 parport_pc lp parport autofs4 sunrpc e1000 microcode
uhci_hcd ehci_hcd button battery ac ext3 jbd qla2300 qla2xxx
scsi_transport_fc sd_mod scsi_mod
Jan 21 16:32:33 tank-03 kernel: CPU:    0
Jan 21 16:32:33 tank-03 kernel: EIP:    0060:[<f8aa6ea7>]    Tainted:
GF     VLI
Jan 21 16:32:33 tank-03 kernel: EFLAGS: 00010286   (2.6.9-5.ELsmp)
Jan 21 16:32:33 tank-03 kernel: EIP is at
dlm_dir_rebuild_send+0x12f/0x2a7 [dlm]
Jan 21 16:32:33 tank-03 kernel: eax: f7fe9720   ebx: 0e000d00   ecx:
9b0ffff0   edx: 00000000
Jan 21 16:32:33 tank-03 kernel: esi: f4abafe5   edi: f2eb2694   ebp:
f2eb267c   esp: f67f5e40
Jan 21 16:32:33 tank-03 kernel: ds: 007b   es: 007b   ss: 0068
Jan 21 16:32:33 tank-03 kernel: Process dlm_recvd (pid: 4609,
threadinfo=f67f5000 task=f4ee5130)
Jan 21 16:32:33 tank-03 kernel: Stack: 00000668 00000246 00000000
f7ffd180 9b0ffff0 00000680 9b100000 f7fe9600
Jan 21 16:32:33 tank-03 kernel:        02000000 00001800 f4b3968c
00000feb f7fe9600 00000004 f67f5f00 f8ab1b85
Jan 21 16:32:33 tank-03 kernel:        f2eb2014 00000feb 00000004
00004040 00001000 f2eb2000 00000004 00000000
Jan 21 16:32:33 tank-03 kernel: Call Trace:
Jan 21 16:32:33 tank-03 kernel:  [<f8ab1b85>]
rcom_process_message+0x194/0x4ac [dlm]
Jan 21 16:32:33 tank-03 kernel:  [<f8ab1f61>]
process_reply_sync+0xc4/0xcb [dlm]
Jan 21 16:32:33 tank-03 kernel:  [<f8ab20c4>]
process_recovery_comm+0x3b/0xaa [dlm]
Jan 21 16:32:33 tank-03 kernel:  [<f8aade00>]
midcomms_process_incoming_buffer+0x1ba/0x1f6 [dlm]
Jan 21 16:32:33 tank-03 kernel:  [<c011e912>]
autoremove_wake_function+0x0/0x2d
Jan 21 16:32:33 tank-03 kernel:  [<c013efe8>] __alloc_pages+0xb4/0x298
Jan 21 16:32:33 tank-03 kernel:  [<f8aac072>]
receive_from_sock+0x192/0x26c [dlm]
Jan 21 16:32:33 tank-03 kernel:  [<f8aacf27>] dlm_recvd+0x0/0x95 [dlm]
Jan 21 16:32:33 tank-03 kernel:  [<f8aacdd9>]
process_sockets+0x52/0x85 [dlm]
Jan 21 16:32:33 tank-03 kernel:  [<f8aacfac>] dlm_recvd+0x85/0x95 [dlm]
Jan 21 16:32:33 tank-03 kernel:  [<c0131dcd>] kthread+0x73/0x9b
Jan 21 16:32:33 tank-03 kernel:  [<c0131d5a>] kthread+0x0/0x9b
Jan 21 16:32:33 tank-03 kernel:  [<c01041f1>] kernel_thread_helper+0x5/0xb
Jan 21 16:32:33 tank-03 kernel: Code: 00 89 54 24 18 c7 44 24 14 00 00
00 00 8b 44 24 1c 05 20 01 00 00 39 44 24 18 0f 84 0b 01 00 00 8b 4c
24 18 83 e9 10 89 4c 24 10 <8b> 59 40 85 db 0f 85 db 00 00 00 89 c8 e8
34 fa ff ff 3b 44 24

------

Jan 21 16:32:13 tank-05 kernel: CMAN: removing node tank-03 from the
cluster : Missed too many heartbeats
Jan 21 16:32:14 tank-05 kernel: SM: 01000005 process_recovery_barrier
status=-104
Jan 21 16:32:14 tank-05 kernel: dlm: vedder: dlm_dir_rebuild_local
failed -1
Jan 21 16:32:14 tank-05 fenced[4025]: fencing deferred to tank-01
Jan 21 16:32:33 tank-05 kernel: Unable to handle kernel paging request
at virtual address 75ffff44
Jan 21 16:32:33 tank-05 kernel:  printing eip:
Jan 21 16:32:33 tank-05 kernel: f8aa6843
Jan 21 16:32:33 tank-05 kernel: *pde = 00000000
Jan 21 16:32:33 tank-05 kernel: Oops: 0000 [#1]
Jan 21 16:32:33 tank-05 kernel: SMP
Jan 21 16:32:33 tank-05 kernel: Modules linked in: gnbd(U)
lock_nolock(U) gfs(U) lock_dlm(U) dlm(U) cman(U) lock_harness(U)
dm_mod md5 ipv6 parport_pc lp parport autofs4 sunrpc e1000 microcode
uhci_hcd ehci_hcd button battery ac ext3 jbd qla2300 qla2xxx
scsi_transport_fc sd_mod scsi_mod
Jan 21 16:32:33 tank-05 kernel: CPU:    0
Jan 21 16:32:33 tank-05 kernel: EIP:    0060:[<f8aa6843>]    Tainted:
GF     VLI
Jan 21 16:32:33 tank-05 kernel: EFLAGS: 00010246   (2.6.9-5.ELsmp)
Jan 21 16:32:33 tank-05 kernel: EIP is at
name_to_directory_nodeid+0x9/0xb3 [dlm]
Jan 21 16:32:33 tank-05 kernel: eax: f4750088   ebx: 75ffff00   ecx:
000000f4   edx: f4750088
Jan 21 16:32:33 tank-05 kernel: esi: f47d7b85   edi: 00000000   ebp:
f2142c9c   esp: f61b4e30
Jan 21 16:32:33 tank-05 kernel: ds: 007b   es: 007b   ss: 0068
Jan 21 16:32:33 tank-05 kernel: Process dlm_recvd (pid: 4615,
threadinfo=f61b4000 task=f4de03b0)
Jan 21 16:32:33 tank-05 kernel: Stack: 00000000 f47d7b85 f2142cb4
f8aa6eb9 00000c88 00000246 00000000 f7ffd180
Jan 21 16:32:33 tank-05 kernel:        f4750007 00000ca0 f4750017
c2262600 04000000 00001800 f38ef54c 00000feb
Jan 21 16:32:33 tank-05 kernel:        c2262600 00000001 f61b4f00
f8ab1b85 f2142014 00000feb 00000001 00004040
Jan 21 16:32:33 tank-05 kernel: Call Trace:
Jan 21 16:32:33 tank-05 kernel:  [<f8aa6eb9>]
dlm_dir_rebuild_send+0x141/0x2a7 [dlm]
Jan 21 16:32:33 tank-05 kernel:  [<f8ab1b85>]
rcom_process_message+0x194/0x4ac [dlm]
Jan 21 16:32:33 tank-05 kernel:  [<f8ab1f61>]
process_reply_sync+0xc4/0xcb [dlm]
Jan 21 16:32:33 tank-05 kernel:  [<f8ab20c4>]
process_recovery_comm+0x3b/0xaa [dlm]
Jan 21 16:32:33 tank-05 kernel:  [<f8aade00>]
midcomms_process_incoming_buffer+0x1ba/0x1f6 [dlm]
Jan 21 16:32:33 tank-05 kernel:  [<c011e912>]
autoremove_wake_function+0x0/0x2d
Jan 21 16:32:33 tank-05 kernel:  [<c013efe8>] __alloc_pages+0xb4/0x298
Jan 21 16:32:33 tank-05 kernel:  [<f8aac072>]
receive_from_sock+0x192/0x26c [dlm]
Jan 21 16:32:33 tank-05 kernel:  [<f8aacf27>] dlm_recvd+0x0/0x95 [dlm]
Jan 21 16:32:33 tank-05 kernel:  [<f8aacdd9>]
process_sockets+0x52/0x85 [dlm]
Jan 21 16:32:33 tank-05 kernel:  [<f8aacfac>] dlm_recvd+0x85/0x95 [dlm]
Jan 21 16:32:33 tank-05 kernel:  [<c0131dcd>] kthread+0x73/0x9b
Jan 21 16:32:33 tank-05 kernel:  [<c0131d5a>] kthread+0x0/0x9b
Jan 21 16:32:33 tank-05 kernel:  [<c01041f1>] kernel_thread_helper+0x5/0xb
Jan 21 16:32:33 tank-05 kernel: Code: 89 c8 c7 01 00 01 10 00 c7 41 04
00 02 20 00 e8 f9 71 00 00 eb d0 8d 83 04 01 00 00 5b e9 57 e6 81 c7
57 31 ff 56 53 89 c3 89 d0 <83> 7b 44 01 75 0a e8 18 78 00 00 e9 96 00
00 00 89 ca e8 be e2
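
The faulting addresses in both oopses above can be cross-checked against
the register dumps and the bytes marked <..> in the Code: lines.  The
sketch below is only an illustration, not part of the DLM sources: the
names OOPSES, REG32 and effective_address are made up for this example,
and the decoder handles just the one x86 addressing form that appears in
these two dumps (opcode, ModRM byte with an 8-bit displacement, no SIB
byte).  All register values and addresses are copied from the logs.

```python
# Values copied from the two oops reports; nothing here is measured anew.
OOPSES = {
    "tank-03": {
        # Bytes starting at <8b> in the Code: line (8b 59 40).
        "code": bytes.fromhex("8b5940"),
        "regs": {
            "eax": 0xF7FE9720, "ebx": 0x0E000D00,
            "ecx": 0x9B0FFFF0, "edx": 0x00000000,
            "esi": 0xF4ABAFE5, "edi": 0xF2EB2694,
            "ebp": 0xF2EB267C, "esp": 0xF67F5E40,
        },
        "fault": 0x9B100030,
    },
    "tank-05": {
        # Bytes starting at <83> in the Code: line (83 7b 44 01).
        "code": bytes.fromhex("837b4401"),
        "regs": {
            "eax": 0xF4750088, "ebx": 0x75FFFF00,
            "ecx": 0x000000F4, "edx": 0xF4750088,
            "esi": 0xF47D7B85, "edi": 0x00000000,
            "ebp": 0xF2142C9C, "esp": 0xF61B4E30,
        },
        "fault": 0x75FFFF44,
    },
}

# 32-bit register numbering used by the r/m field of the ModRM byte.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def effective_address(code, regs):
    """Return (base register, displacement, address) for [reg+disp8].

    Assumption: code[0] is a one-byte opcode, code[1] the ModRM byte and
    code[2] a signed 8-bit displacement.  Other forms are rejected.
    """
    modrm = code[1]
    mod = modrm >> 6          # top two bits: addressing mode
    rm = modrm & 0b111        # low three bits: base register
    if mod != 0b01 or rm == 0b100:
        raise ValueError("only [reg+disp8] without a SIB byte is handled")
    disp = int.from_bytes(code[2:3], "little", signed=True)
    base = REG32[rm]
    return base, disp, (regs[base] + disp) & 0xFFFFFFFF


if __name__ == "__main__":
    for node, oops in OOPSES.items():
        base, disp, addr = effective_address(oops["code"], oops["regs"])
        match = "matches" if addr == oops["fault"] else "differs from"
        print(f"{node}: [{base}+{disp:#x}] = {addr:#010x}, "
              f"{match} fault address {oops['fault']:#010x}")
```

On tank-03 the access is through ecx at offset 0x40, and on tank-05
through ebx at offset 0x44.  In both cases the base register itself
already holds an address the kernel could not map, so the bad pointer
was loaded before the faulting instruction ran.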

DLM <CVS> (built Jan 18 2005 13:36:03) installed
Lock_DLM (built Jan 18 2005 14:28:48) installed

Have not tried to reproduce.

Comment 1 Dean Jansa 2005-01-24 21:50:16 UTC
I hit this again with the 01-21-2005 builds.

Comment 2 David Teigland 2005-01-27 09:33:52 UTC
There's a fair chance that today's check-in has fixed this.

Comment 3 Dean Jansa 2005-02-01 16:36:08 UTC
Fix verified.