Bug 164535

Summary:	cman causing kernel panic
Product:	[Retired] Red Hat Cluster Suite	Reporter:	Jonathan Earl Brassow <jbrassow>
Component:	cman	Assignee:	Christine Caulfield <ccaulfie>
Status:	CLOSED ERRATA	QA Contact:	Cluster QE <mspqa-list>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4	CC:	cluster-maint, cmarthal
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	RHBA-2006-0559	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2006-08-10 21:32:10 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Jonathan Earl Brassow 2005-07-28 15:42:49 UTC

Description of problem:
As best I can tell, nodeid_to_ipaddr is being called with a nodeid that has been
removed from the cluster, and after a while cman doesn't like it.

device-mapper: Unable to convert nodeid_to_ipaddr in _consult_server
No address list for nodeid 0
device-mapper: Unable to convert nodeid_to_ipaddr in _consult_server
No address list for nodeid 0
device-mapper: Unable to convert nodeid_to_ipaddr in _consult_server
No address list for nodeid 0
device-mapper: Unable to convert nodeid_to_ipaddr in _consult_server
SM: 01000001 sm_stop: SG still joined
Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
d0a134c9
*pde = 00000000
Oops: 0000 [#1]
Modules linked in: dm_cmirror autofs4 nfs lockd dlm cman md5 ipv6 sunrpc
ohci_hcd i2c_piix4 i2c_core e100 mii floppy dm_snapshot dm_zero dm_mirror ext3
jbd dm_mod lpfc scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<d0a134c9>]    Tainted: GF     VLI
EFLAGS: 00010246   (2.6.12)
EIP is at find_node_by_nodeid+0x89/0x170 [cman]
eax: 00000000   ebx: 00000000   ecx: 00000000   edx: c4a39400
esi: cfbe7dd8   edi: cfbe7de8   ebp: cfbe7d68   esp: cfbe7d48
ds: 007b   es: 007b   ss: 0068
Process kmirrord/0 (pid: 276, threadinfo=cfbe6000 task=cfee3000)
Stack: c0109e85 cbcbb280 00000001 c0103c5e 00000246 cfbe7dd8 cfbe7de8 00000000
       cfbe7d70 d0a0b938 cfbe7d84 d09157ae 00000202 cfbe7e30 cbcbb280 cfbe7df4
       d09117db cffdbc40 cfec7aa0 000039a5 cfebfe78 c0399bb2 00000000 00000000
Call Trace:
 [<c0109e85>] end_8259A_irq+0x25/0x30
 [<c0103c5e>] common_interrupt+0x1a/0x20
 [<d0a0b938>] kcl_get_node_addresses+0x8/0x20 [cman]
 [<d09157ae>] nodeid_to_ipaddr+0xe/0x50 [dm_cmirror]
 [<d09117db>] _consult_server+0xbb/0x340 [dm_cmirror]
 [<c0399bb2>] schedule+0x362/0x790
 [<c0103c5e>] common_interrupt+0x1a/0x20
 [<d0911f06>] consult_server+0x4a6/0x6e0 [dm_cmirror]
 [<c011cf80>] default_wake_function+0x0/0x10
 [<c011cf80>] default_wake_function+0x0/0x10
 [<d091308d>] cluster_get_resync_work+0x1d/0x20 [dm_cmirror]
 [<d0833336>] __rh_recovery_prepare+0x16/0x1d0 [dm_mirror]
 [<c0122d57>] printk+0x17/0x20
 [<d09131be>] cluster_get_sync_count+0xde/0x140 [dm_cmirror]
 [<d0833516>] rh_recovery_prepare+0x26/0x50 [dm_mirror]
 [<d0833b8b>] do_recovery+0x1b/0xa0 [dm_mirror]
 [<d0834ab9>] do_mirror+0x119/0x1d0 [dm_mirror]
 [<d0834ba7>] do_work+0x37/0x70 [dm_mirror]
 [<d0834b70>] do_work+0x0/0x70 [dm_mirror]
 [<c013aaf1>] worker_thread+0x211/0x440
 [<c011cf80>] default_wake_function+0x0/0x10
 [<c011cf80>] default_wake_function+0x0/0x10
 [<c013a8e0>] worker_thread+0x0/0x440
 [<c0141536>] kthread+0x96/0xe0
 [<c01414a0>] kthread+0x0/0xe0
 [<c010132d>] kernel_thread_helper+0x5/0x18
Code: e0 a4 a2 d0 3c 4b 24 1d b8 01 00 00 00 a3 e4 a4 a2 d0 b8 80 d4 a1 d0 a3 f0
a4 a2 d0 b8 93 0b 00 00 a3 f4 a4 a2 d0 a1 fc a4 a2 d0 <8b> 1c 98 74 27 c7 04 24
48 d4 a1 d0 b9 e0 a4 a2 d0 ba 95 0b 00

Entering kdb (current=0xcfee3000, pid 276) Oops: Oops
due to oops @ 0xd0a134c9
eax = 0x00000000 ebx = 0x00000000 ecx = 0x00000000 edx = 0xc4a39400
esi = 0xcfbe7dd8 edi = 0xcfbe7de8 esp = 0xcfbe7d48 eip = 0xd0a134c9
ebp = 0xcfbe7d68 xss = 0xc0310068 xcs = 0x00000060 eflags = 0x00010246
xds = 0xc4a3007b xes = 0x0000007b origeax = 0xffffffff &regs = 0xcfbe7d14
kdb> btp 276
Stack traceback for pid 276
0xcfee3000      276        5  1    0   R  0xcfee31c0 *kmirrord/0
ESP        EIP        Function (args)
0xcfbe7b94 0xd0a134c9 [cman]find_node_by_nodeid+0x89
0xcfbe7d70 0xd0a0b938 [cman]kcl_get_node_addresses+0x8 (0x202, 0xcfbe7e30,
0xcbcbb280)
0xcfbe7d78 0xd09157ae [dm_cmirror]nodeid_to_ipaddr+0xe (0xcffdbc40,
0xcfec7aa0,0x39a5, 0xcfebfe78, 0xc0399bb2)
0xcfbe7d8c 0xd09117db [dm_cmirror]_consult_server+0xbb (0x6, 0xcfbe7eb8,
0xcfbe7e30, 0xd09167c6, 0x400)
0xcfbe7dfc 0xd0911f06 [dm_cmirror]consult_server+0x4a6 (0x6, 0xcfbe7eb8)
0xcfbe7e88 0xd091308d [dm_cmirror]cluster_get_resync_work+0x1d (0xcd431800,
0xcfbe7ed8, 0xc0122d57, 0xd0916790, 0xcfbe7eb4)
0xcfbe7e98 0xd0833336 [dm_mirror]__rh_recovery_prepare+0x16 (0xc4a3980c)
0xcfbe7ed4 0xd0833516 [dm_mirror]rh_recovery_prepare+0x26 (0xc8fd6a40,
0xc4a39800, 0xd0838b40, 0xd0838e20)
0xcfbe7ee0 0xd0833b8b [dm_mirror]do_recovery+0x1b (0xcfae03e8, 0x1, 0x3,
0xcfbe7f38, 0x96)
0xcfbe7ef8 0xd0834ab9 [dm_mirror]do_mirror+0x119 (0x0, 0xcfae03a0)
0xcfbe7f30 0xd0834ba7 [dm_mirror]do_work+0x37 (0x0, 0x0, 0x0, 0x0, 0x1)
0xcfbe7f40 0xc013aaf1 worker_thread+0x211
0xcfbe7fcc 0xc0141536 kthread+0x96
0xcfbe7fec 0xc010132d kernel_thread_helper+0x5


Version-Release number of selected component (if applicable):
RHEL4/ using STABLE branch of cluster tree from Jul 27 13:08

How reproducible:
Haven't tried

Steps to Reproduce:
It seems that cmirror got itself into a bind and then this problem occurred.  I
don't know how to produce this problem directly or with short steps...

1. install all cmirror patches/components
2. do looping pvmove/lvs --all/lmdd on different machines to same LV
3. wait for oops
  
Actual results:
panic

Expected results:
loop indefinitely about nodeid_to_ipaddr if you have to, but don't panic

Comment 1 Christine Caulfield 2005-08-01 10:04:39 UTC

That message:
 SM: 01000001 sm_stop: SG still joined

bothers me. It might be the case that cman has shut down (thus freeing its list
of nodes) before you've called kcl_get_node_addresses().

There's no loop in find_node_by_nodeid; it simply indexes an array (with bounds
checking), so this seems the most likely scenario. In fact, looking at the code,
there is an off-by-one error that allows a request for nodeid==0 even when the
cluster is down. This is the real bug, I think.
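
To illustrate the kind of check involved, here is a minimal sketch. It is not the actual cman source: struct node_table, struct cluster_node and lookup_node are hypothetical names, and the assumption is that the node array is freed (NULL, size 0) once the cluster is down. The oops above is consistent with this picture: the faulting instruction (<8b> 1c 98) is an indexed load with eax and ebx both zero, i.e. slot 0 of a NULL array.

```c
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real cman structures. */
struct cluster_node {
    int nodeid;
};

struct node_table {
    struct cluster_node **slots; /* assumed NULL after cluster shutdown */
    int size;                    /* number of slots; assumed 0 after shutdown */
};

/*
 * Look up a node by ID, returning NULL for anything out of range.
 *
 * An inclusive check such as "nodeid > t->size" would let nodeid 0
 * through when size is 0 and then dereference a NULL array. The
 * exclusive ">=" check below, plus the explicit NULL test, rejects it.
 */
static struct cluster_node *lookup_node(const struct node_table *t, int nodeid)
{
    if (t->slots == NULL)
        return NULL;
    if (nodeid < 0 || nodeid >= t->size)
        return NULL;
    return t->slots[nodeid];
}
```

With a check like this, a caller such as nodeid_to_ipaddr gets a failed lookup back and can report it and retry, rather than taking the machine down.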

Comment 2 Henry Harris 2006-04-10 22:01:20 UTC

Here are a couple more panics seen in cman membership on cman-1.0.4-0 at 
bootup:

> --------- Kernel BUG at membership:279 invalid operand: 0000 [1] SMP
> 
> Entering kdb (current=0x00000100e63c27f0, pid 7400) on processor 0 
> Oops: <NULL> due to oops @ 0xffffffffa01d70dd
>      r15 = 0x000001007d7b5dc8      r14 = 0x00000100e78b6400
>      r13 = 0x0000000000000003      r12 = 0x0000000000000003
>      rbp = 0x000001007d7b5e38      rbx = 0x000001007d7b5da8
>      r11 = 0xffffffff8011bac4      r10 = 0x0000000100000000
>       r9 = 0x000001007d7b5da8       r8 = 0xffffffff803dc9c8
>      rax = 0x000000000000001e      rcx = 0xffffffff803dc9c8
>      rdx = 0xffffffff803dc9c8      rsi = 0x0000000000000246
>      rdi = 0xffffffff803dc9c0 orig_rax = 0xffffffffffffffff
>      rip = 0xffffffffa01d70dd       cs = 0x0000000000000010
>   eflags = 0x0000000000010212      rsp = 0x000001007d7b5d30
>       ss = 0x000001007d7b4000 &regs = 0x000001007d7b5c98 [0]kdb> bt 
> Stack traceback for pid 7400
> 0x00000100e63c27f0     7400        1  1    0   R  
> 0x00000100e63c2bf0 *cman_memb
> RSP           RIP                Function (args)
> 0x1007d7b5d30 0xffffffffa01d70dd
> [cman]dispatch_messages+0x11ce (0x0, 0x100e5c78a80, 0x20100001e, 
> 0x900020104, 0x0)
> 0x1007d7b5e48 0xffffffffa01d8180 [cman]membership_kthread+0xbd9
> 0x1007d7b5f58 0xffffffff8010fdeb child_rip+0x8 [0]kdb>
> 
> 
> lizard03
> ----------- [cut here ] --------- [please bite here ]
> --------- Kernel BUG at membership:3150 invalid operand: 0000 [1] SMP
> 
> Entering kdb (current=0x000001007cda1030, pid 7494) on processor 0 
> Oops: <NULL> due to oops @ 0xffffffffa01d4131
>      r15 = 0x0000000000000003      r14 = 0xffffffffa01ee5e0
>      r13 = 0x00000100e2961a78      r12 = 0xffffffffffffffff
>      rbp = 0xffffffffa01ee750      rbx = 0xffffffffa01ee5e0
>      r11 = 0xffffffff80000000      r10 = 0xffffffff80000000
>       r9 = 0x0000000000000001       r8 = 0x0000000000000004
>      rax = 0x0000000000000000      rcx = 0x0000000000000080
>      rdx = 0x0000000000000080      rsi = 0x0000000000000080
>      rdi = 0x000001007c851e08 orig_rax = 0xffffffffffffffff
>      rip = 0xffffffffa01d4131       cs = 0x0000000000000010
>   eflags = 0x0000000000010246      rsp = 0x000001007c851df0
>       ss = 0x000001007c850000 &regs = 0x000001007c851d58 [0]kdb> bt 
> Stack traceback for pid 7494
> 0x000001007cda1030     7494        1  1    0   R  
> 0x000001007cda1430 *cman_memb
> RSP           RIP                Function (args)
> 0x1007c851df0 0xffffffffa01d4131 [cman]elect_master+0x3a 
> (0x100e322a180, 0x27c850712)
> 0x1007c851e08 0xffffffffa01d5373 [cman]a_node_just_died+0x186 
> (0xffffffffa01ee640, 0x1007cda1030)
> 0x1007c851e28 0xffffffffa01d56a6
> [cman]process_dead_nodes+0x5e (0x100005efa120000, 0x23d10d00000039, 
> 0x2014b001e, 0x1400020104, 0xd00000039010000)
> 0x1007c851e48 0xffffffffa01d8158 [cman]membership_kthread+0xbb1
> 0x1007c851f58 0xffffffff8010fdeb child_rip+0x8 [0]kdb>

Comment 3 Christine Caulfield 2006-04-11 07:29:38 UTC

I've seen this on the mailing list too. cman is trying to elect a master node
but there are no active nodes in the cluster (it thinks), so it can't carry on.

Are there any other cman-related messages on the other nodes and earlier on in
the syslog of this node?
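
For illustration only, here is a sketch of an election helper that tolerates an empty membership by returning NULL instead of hitting a BUG. This is not the cman code: pick_master, struct cluster_node and the "lowest node ID wins" rule are all assumptions made for the example.

```c
#include <stddef.h>

/* Hypothetical node states and structure; not the real cman definitions. */
enum node_state { NODE_DEAD, NODE_MEMBER };

struct cluster_node {
    int nodeid;
    enum node_state state;
};

/*
 * Pick the active member with the lowest node ID (an assumed rule).
 * Returns NULL when no node is active, so the caller can decide how
 * to recover instead of asserting.
 */
static const struct cluster_node *
pick_master(const struct cluster_node *nodes, size_t count)
{
    const struct cluster_node *best = NULL;
    size_t i;

    for (i = 0; i < count; i++) {
        if (nodes[i].state != NODE_MEMBER)
            continue; /* skip dead nodes */
        if (best == NULL || nodes[i].nodeid < best->nodeid)
            best = &nodes[i];
    }
    return best;
}
```

A caller would treat a NULL result as "no quorum view yet" and log it, which also leaves a message in syslog to help explain how the node list ended up empty.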

Comment 4 Christine Caulfield 2006-04-19 15:35:26 UTC

Here's an attempt to fix this

-rSTABLE:
Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/membership.c,v  <--  membership.c
new revision: 1.44.2.18.6.4; previous revision: 1.44.2.18.6.3
done

-rRHEL4:
Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/membership.c,v  <--  membership.c
new revision: 1.44.2.22; previous revision: 1.44.2.21
done

Comment 5 Corey Marthaler 2006-05-01 15:52:11 UTC

Here are a few cman messages that QA saw before the panic. Once we get the new
cman build we'll verify it's fixed.


CMAN: removing node link-07 from the cluster : No response to messages
CMAN: Attempt to re-add node with id 1
CMAN: existing node is link-01
CMAN: new node is link-08
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at membership:279
invalid operand: 0000 [1] SMP
CPU 1
Modules linked in: gnbd(U) lock_nolock(U) gfs(U) lock_harness(U) dlm(U) cman(U)
md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket
pcmcia_core button battery ac ohci_hcd hw_random tg3 floppy dm_snapshot dm_zero
dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc mptscsih mptsas
mptspi mptfc mptscsi mptbase sd_mod scsi_mod
Pid: 4055, comm: cman_memb Not tainted 2.6.9-34.ELsmp
RIP: 0010:[<ffffffffa02150d5>] <ffffffffa02150d5>{:cman:dispatch_messages+4558}
RSP: 0018:0000010039f2dd48  EFLAGS: 00010212
RAX: 000000000000001d RBX: 0000010039f2dda8 RCX: 0000000000000246
RDX: 0000000000004b39 RSI: 0000000000000246 RDI: ffffffff803d9e60
RBP: 0000010039f2de38 R08: 000000000000000d R09: 0000010039f2dda8
R10: 0000000100000000 R11: 0000000000000000 R12: 0000000000000001
R13: 0000000000000001 R14: 000001003d26fe00 R15: 0000010039f2ddc8
FS:  0000002a95574b00(0000) GS:ffffffff804d7b80(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a9556c000 CR3: 0000000037e22000 CR4: 00000000000006e0
Process cman_memb (pid: 4055, threadinfo 0000010039f2c000, task 0000010039a91030)
Stack: 0000010039a91030 0000010020712a60 0000000000000000 ffffffff80304a85
       0000010039f2de38 ffffffff80304add 000000001ff89030 0000010039a92a00
       0000000000000000 0000000000000074
Call Trace:<ffffffff80304a85>{thread_return+0} <ffffffff80304add>{thread_return+88}
       <ffffffffa0216178>{:cman:membership_kthread+3033}
<ffffffff801333c8>{default_wake_function+0}
       <ffffffff801333c8>{default_wake_function+0}
<ffffffff801333c8>{default_wake_function+0}
       <ffffffff8013212e>{schedule_tail+55} <ffffffff80110e17>{child_rip+8}
       <ffffffffa021559f>{:cman:membership_kthread+0}
<ffffffff80110e0f>{child_rip+0}


Code: 0f 0b e3 de 21 a0 ff ff ff ff 17 01 48 c7 c7 60 d2 22 a0 e8
RIP <ffffffffa02150d5>{:cman:dispatch_messages+4558} RSP <0000010039f2dd48>
 <0>Kernel panic - not syncing: Oops

Comment 6 Christine Caulfield 2006-05-02 07:51:52 UTC

*** Bug 190230 has been marked as a duplicate of this bug. ***

Comment 9 Red Hat Bugzilla 2006-08-10 21:32:10 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0559.html