Description of problem:
Best I can tell, nodeid_to_ipaddr is being called with a nodeid that has been
removed from the cluster, and after a while cman doesn't like it.

device-mapper: Unable to convert nodeid_to_ipaddr in _consult_server
No address list for nodeid 0
device-mapper: Unable to convert nodeid_to_ipaddr in _consult_server
No address list for nodeid 0
device-mapper: Unable to convert nodeid_to_ipaddr in _consult_server
No address list for nodeid 0
device-mapper: Unable to convert nodeid_to_ipaddr in _consult_server
SM: 01000001 sm_stop: SG still joined
Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
d0a134c9
*pde = 00000000
Oops: 0000 [#1]
Modules linked in: dm_cmirror autofs4 nfs lockd dlm cman md5 ipv6 sunrpc ohci_hcd i2c_piix4 i2c_core e100 mii floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod lpfc scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<d0a134c9>]    Tainted: GF     VLI
EFLAGS: 00010246   (2.6.12)
EIP is at find_node_by_nodeid+0x89/0x170 [cman]
eax: 00000000   ebx: 00000000   ecx: 00000000   edx: c4a39400
esi: cfbe7dd8   edi: cfbe7de8   ebp: cfbe7d68   esp: cfbe7d48
ds: 007b   es: 007b   ss: 0068
Process kmirrord/0 (pid: 276, threadinfo=cfbe6000 task=cfee3000)
Stack: c0109e85 cbcbb280 00000001 c0103c5e 00000246 cfbe7dd8 cfbe7de8 00000000
       cfbe7d70 d0a0b938 cfbe7d84 d09157ae 00000202 cfbe7e30 cbcbb280 cfbe7df4
       d09117db cffdbc40 cfec7aa0 000039a5 cfebfe78 c0399bb2 00000000 00000000
Call Trace:
 [<c0109e85>] end_8259A_irq+0x25/0x30
 [<c0103c5e>] common_interrupt+0x1a/0x20
 [<d0a0b938>] kcl_get_node_addresses+0x8/0x20 [cman]
 [<d09157ae>] nodeid_to_ipaddr+0xe/0x50 [dm_cmirror]
 [<d09117db>] _consult_server+0xbb/0x340 [dm_cmirror]
 [<c0399bb2>] schedule+0x362/0x790
 [<c0103c5e>] common_interrupt+0x1a/0x20
 [<d0911f06>] consult_server+0x4a6/0x6e0 [dm_cmirror]
 [<c011cf80>] default_wake_function+0x0/0x10
 [<c011cf80>] default_wake_function+0x0/0x10
 [<d091308d>] cluster_get_resync_work+0x1d/0x20 [dm_cmirror]
 [<d0833336>] __rh_recovery_prepare+0x16/0x1d0 [dm_mirror]
 [<c0122d57>] printk+0x17/0x20
 [<d09131be>] cluster_get_sync_count+0xde/0x140 [dm_cmirror]
 [<d0833516>] rh_recovery_prepare+0x26/0x50 [dm_mirror]
 [<d0833b8b>] do_recovery+0x1b/0xa0 [dm_mirror]
 [<d0834ab9>] do_mirror+0x119/0x1d0 [dm_mirror]
 [<d0834ba7>] do_work+0x37/0x70 [dm_mirror]
 [<d0834b70>] do_work+0x0/0x70 [dm_mirror]
 [<c013aaf1>] worker_thread+0x211/0x440
 [<c011cf80>] default_wake_function+0x0/0x10
 [<c011cf80>] default_wake_function+0x0/0x10
 [<c013a8e0>] worker_thread+0x0/0x440
 [<c0141536>] kthread+0x96/0xe0
 [<c01414a0>] kthread+0x0/0xe0
 [<c010132d>] kernel_thread_helper+0x5/0x18
Code: e0 a4 a2 d0 3c 4b 24 1d b8 01 00 00 00 a3 e4 a4 a2 d0 b8 80 d4 a1 d0 a3 f0 a4 a2 d0 b8 93 0b 00 00 a3 f4 a4 a2 d0 a1 fc a4 a2 d0 <8b> 1c 98 74 27 c7 04 24 48 d4 a1 d0 b9 e0 a4 a2 d0 ba 95 0b 00

Entering kdb (current=0xcfee3000, pid 276) Oops: Oops
due to oops @ 0xd0a134c9
eax = 0x00000000  ebx = 0x00000000  ecx = 0x00000000  edx = 0xc4a39400
esi = 0xcfbe7dd8  edi = 0xcfbe7de8  esp = 0xcfbe7d48  eip = 0xd0a134c9
ebp = 0xcfbe7d68  xss = 0xc0310068  xcs = 0x00000060  eflags = 0x00010246
xds = 0xc4a3007b  xes = 0x0000007b  origeax = 0xffffffff  &regs = 0xcfbe7d14
kdb> btp 276
Stack traceback for pid 276
0xcfee3000      276        5  1    0   R  0xcfee31c0 *kmirrord/0
    ESP        EIP        Function (args)
0xcfbe7b94 0xd0a134c9 [cman]find_node_by_nodeid+0x89
0xcfbe7d70 0xd0a0b938 [cman]kcl_get_node_addresses+0x8 (0x202, 0xcfbe7e30, 0xcbcbb280)
0xcfbe7d78 0xd09157ae [dm_cmirror]nodeid_to_ipaddr+0xe (0xcffdbc40, 0xcfec7aa0, 0x39a5, 0xcfebfe78, 0xc0399bb2)
0xcfbe7d8c 0xd09117db [dm_cmirror]_consult_server+0xbb (0x6, 0xcfbe7eb8, 0xcfbe7e30, 0xd09167c6, 0x400)
0xcfbe7dfc 0xd0911f06 [dm_cmirror]consult_server+0x4a6 (0x6, 0xcfbe7eb8)
0xcfbe7e88 0xd091308d [dm_cmirror]cluster_get_resync_work+0x1d (0xcd431800, 0xcfbe7ed8, 0xc0122d57, 0xd0916790, 0xcfbe7eb4)
0xcfbe7e98 0xd0833336 [dm_mirror]__rh_recovery_prepare+0x16 (0xc4a3980c)
0xcfbe7ed4 0xd0833516 [dm_mirror]rh_recovery_prepare+0x26 (0xc8fd6a40, 0xc4a39800, 0xd0838b40, 0xd0838e20)
0xcfbe7ee0 0xd0833b8b [dm_mirror]do_recovery+0x1b (0xcfae03e8, 0x1, 0x3, 0xcfbe7f38, 0x96)
0xcfbe7ef8 0xd0834ab9 [dm_mirror]do_mirror+0x119 (0x0, 0xcfae03a0)
0xcfbe7f30 0xd0834ba7 [dm_mirror]do_work+0x37 (0x0, 0x0, 0x0, 0x0, 0x1)
0xcfbe7f40 0xc013aaf1 worker_thread+0x211
0xcfbe7fcc 0xc0141536 kthread+0x96
0xcfbe7fec 0xc010132d kernel_thread_helper+0x5

Version-Release number of selected component (if applicable):
RHEL4, using STABLE branch of cluster tree from Jul 27 13:08

How reproducible:
Haven't tried

Steps to Reproduce:
It seems that cmirror got itself into a bind and then this problem occurred.
I don't know how to reproduce this problem directly or with short steps...
1. install all cmirror patches/components
2. do looping pvmove/lvs --all/lmdd on different machines to same LV
3. wait for oops

Actual results:
panic

Expected results:
loop indefinitely about nodeid_to_ipaddr if you have to, but don't panic
That message:

  SM: 01000001 sm_stop: SG still joined

bothers me. It might be the case that cman has shut down (thus freeing its list
of nodes) before you've called kcl_get_node_addresses(). There's no loop in
find_node_by_nodeid; it simply indexes an array (with bounds checking), so this
seems the most likely scenario.

In fact, looking at the code, there is an off-by-one (OBO) error that allows a
request for nodeid == 0 even when the cluster is down. This is the real bug, I
think.
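The class of bug described above can be sketched in a few lines of user-space C. This is illustrative only, not the actual cman source; the structure names, the MAX_NODES bound, and the cluster_is_up flag are all assumptions:

```c
#include <stddef.h>

#define MAX_NODES 8

struct cluster_node { int nodeid; };

/* Illustrative stand-ins for cman state; slot 0 is never populated. */
static struct cluster_node *members[MAX_NODES];
static int cluster_is_up = 0;

/* Buggy bounds check of the kind described above: `nodeid < 0` admits
 * nodeid == 0, and nothing rejects lookups after the cluster (and its
 * node list) has been torn down. */
static struct cluster_node *find_node_buggy(int nodeid)
{
    if (nodeid < 0 || nodeid >= MAX_NODES)
        return NULL;
    return members[nodeid];  /* in the kernel this can touch freed memory */
}

/* Fixed check: nodeid 0 is invalid, and a down cluster answers nothing. */
static struct cluster_node *find_node_fixed(int nodeid)
{
    if (!cluster_is_up)
        return NULL;
    if (nodeid <= 0 || nodeid >= MAX_NODES)
        return NULL;
    return members[nodeid];
}
```

In user space the buggy version merely reads a NULL slot; in the kernel, indexing a node list that cman has already freed is exactly the NULL-dereference oops in the report.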
Here are a couple more panics seen in cman membership on cman-1.0.4-0 at bootup:

> ---------
> Kernel BUG at membership:279
> invalid operand: 0000 [1] SMP
>
> Entering kdb (current=0x00000100e63c27f0, pid 7400) on processor 0
> Oops: <NULL>
> due to oops @ 0xffffffffa01d70dd
> r15 = 0x000001007d7b5dc8  r14 = 0x00000100e78b6400
> r13 = 0x0000000000000003  r12 = 0x0000000000000003
> rbp = 0x000001007d7b5e38  rbx = 0x000001007d7b5da8
> r11 = 0xffffffff8011bac4  r10 = 0x0000000100000000
> r9  = 0x000001007d7b5da8  r8  = 0xffffffff803dc9c8
> rax = 0x000000000000001e  rcx = 0xffffffff803dc9c8
> rdx = 0xffffffff803dc9c8  rsi = 0x0000000000000246
> rdi = 0xffffffff803dc9c0  orig_rax = 0xffffffffffffffff
> rip = 0xffffffffa01d70dd  cs = 0x0000000000000010
> eflags = 0x0000000000010212  rsp = 0x000001007d7b5d30
> ss = 0x000001007d7b4000  &regs = 0x000001007d7b5c98
> [0]kdb> bt
> Stack traceback for pid 7400
> 0x00000100e63c27f0     7400        1  1    0   R  0x00000100e63c2bf0 *cman_memb
>     RSP        RIP        Function (args)
> 0x1007d7b5d30 0xffffffffa01d70dd [cman]dispatch_messages+0x11ce (0x0, 0x100e5c78a80, 0x20100001e, 0x900020104, 0x0)
> 0x1007d7b5e48 0xffffffffa01d8180 [cman]membership_kthread+0xbd9
> 0x1007d7b5f58 0xffffffff8010fdeb child_rip+0x8
> [0]kdb>
>
>
> lizard03
> ----------- [cut here ] --------- [please bite here ] ---------
> Kernel BUG at membership:3150
> invalid operand: 0000 [1] SMP
>
> Entering kdb (current=0x000001007cda1030, pid 7494) on processor 0
> Oops: <NULL>
> due to oops @ 0xffffffffa01d4131
> r15 = 0x0000000000000003  r14 = 0xffffffffa01ee5e0
> r13 = 0x00000100e2961a78  r12 = 0xffffffffffffffff
> rbp = 0xffffffffa01ee750  rbx = 0xffffffffa01ee5e0
> r11 = 0xffffffff80000000  r10 = 0xffffffff80000000
> r9  = 0x0000000000000001  r8  = 0x0000000000000004
> rax = 0x0000000000000000  rcx = 0x0000000000000080
> rdx = 0x0000000000000080  rsi = 0x0000000000000080
> rdi = 0x000001007c851e08  orig_rax = 0xffffffffffffffff
> rip = 0xffffffffa01d4131  cs = 0x0000000000000010
> eflags = 0x0000000000010246  rsp = 0x000001007c851df0
> ss = 0x000001007c850000  &regs = 0x000001007c851d58
> [0]kdb> bt
> Stack traceback for pid 7494
> 0x000001007cda1030     7494        1  1    0   R  0x000001007cda1430 *cman_memb
>     RSP        RIP        Function (args)
> 0x1007c851df0 0xffffffffa01d4131 [cman]elect_master+0x3a (0x100e322a180, 0x27c850712)
> 0x1007c851e08 0xffffffffa01d5373 [cman]a_node_just_died+0x186 (0xffffffffa01ee640, 0x1007cda1030)
> 0x1007c851e28 0xffffffffa01d56a6 [cman]process_dead_nodes+0x5e (0x100005efa120000, 0x23d10d00000039, 0x2014b001e, 0x1400020104, 0xd00000039010000)
> 0x1007c851e48 0xffffffffa01d8158 [cman]membership_kthread+0xbb1
> 0x1007c851f58 0xffffffff8010fdeb child_rip+0x8
> [0]kdb>
I've seen this on the mailing list too. cman is trying to elect a master node,
but (it thinks) there are no active nodes in the cluster, so it can't carry on.
Are there any other cman-related messages on the other nodes, or earlier in the
syslog of this node?
Here's an attempt to fix this:

-rSTABLE:
Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/membership.c,v  <--  membership.c
new revision: 1.44.2.18.6.4; previous revision: 1.44.2.18.6.3
done

-rRHEL4:
Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/membership.c,v  <--  membership.c
new revision: 1.44.2.22; previous revision: 1.44.2.21
done
Here are a few cman messages that QA saw before the panic. Once we get the new
cman build we'll verify it's fixed.

CMAN: removing node link-07 from the cluster : No response to messages
CMAN: Attempt to re-add node with id 1
CMAN: existing node is link-01
CMAN: new node is link-08
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at membership:279
invalid operand: 0000 [1] SMP
CPU 1
Modules linked in: gnbd(U) lock_nolock(U) gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core button battery ac ohci_hcd hw_random tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc mptscsih mptsas mptspi mptfc mptscsi mptbase sd_mod scsi_mod
Pid: 4055, comm: cman_memb Not tainted 2.6.9-34.ELsmp
RIP: 0010:[<ffffffffa02150d5>] <ffffffffa02150d5>{:cman:dispatch_messages+4558}
RSP: 0018:0000010039f2dd48  EFLAGS: 00010212
RAX: 000000000000001d RBX: 0000010039f2dda8 RCX: 0000000000000246
RDX: 0000000000004b39 RSI: 0000000000000246 RDI: ffffffff803d9e60
RBP: 0000010039f2de38 R08: 000000000000000d R09: 0000010039f2dda8
R10: 0000000100000000 R11: 0000000000000000 R12: 0000000000000001
R13: 0000000000000001 R14: 000001003d26fe00 R15: 0000010039f2ddc8
FS:  0000002a95574b00(0000) GS:ffffffff804d7b80(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a9556c000 CR3: 0000000037e22000 CR4: 00000000000006e0
Process cman_memb (pid: 4055, threadinfo 0000010039f2c000, task 0000010039a91030)
Stack: 0000010039a91030 0000010020712a60 0000000000000000 ffffffff80304a85
       0000010039f2de38 ffffffff80304add 000000001ff89030 0000010039a92a00
       0000000000000000 0000000000000074
Call Trace:
 <ffffffff80304a85>{thread_return+0} <ffffffff80304add>{thread_return+88}
 <ffffffffa0216178>{:cman:membership_kthread+3033}
 <ffffffff801333c8>{default_wake_function+0}
 <ffffffff801333c8>{default_wake_function+0}
 <ffffffff801333c8>{default_wake_function+0}
 <ffffffff8013212e>{schedule_tail+55} <ffffffff80110e17>{child_rip+8}
 <ffffffffa021559f>{:cman:membership_kthread+0}
 <ffffffff80110e0f>{child_rip+0}
Code: 0f 0b e3 de 21 a0 ff ff ff ff 17 01 48 c7 c7 60 d2 22 a0 e8
RIP <ffffffffa02150d5>{:cman:dispatch_messages+4558} RSP <0000010039f2dd48>
<0>Kernel panic - not syncing: Oops
*** Bug 190230 has been marked as a duplicate of this bug. ***
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2006-0559.html