Bug 164535 - cman causing kernel panic
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: cman
Platform: All Linux
Priority: medium   Severity: medium
Assigned To: Christine Caulfield
QA Contact: Cluster QE
Duplicates: 190230
Reported: 2005-07-28 11:42 EDT by Jonathan Earl Brassow
Modified: 2009-04-16 16:00 EDT
Fixed In Version: RHBA-2006-0559
Doc Type: Bug Fix
Last Closed: 2006-08-10 17:32:10 EDT

Description Jonathan Earl Brassow 2005-07-28 11:42:49 EDT
Description of problem:
As best I can tell, nodeid_to_ipaddr is being called with a nodeid that has been
removed from the cluster, and after a while cman doesn't like it.

device-mapper: Unable to convert nodeid_to_ipaddr in _consult_server
No address list for nodeid 0
device-mapper: Unable to convert nodeid_to_ipaddr in _consult_server
No address list for nodeid 0
device-mapper: Unable to convert nodeid_to_ipaddr in _consult_server
No address list for nodeid 0
device-mapper: Unable to convert nodeid_to_ipaddr in _consult_server
SM: 01000001 sm_stop: SG still joined
Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
*pde = 00000000
Oops: 0000 [#1]
Modules linked in: dm_cmirror autofs4 nfs lockd dlm cman md5 ipv6 sunrpc
ohci_hcd i2c_piix4 i2c_core e100 mii floppy dm_snapshot dm_zero dm_mirror ext3
jbd dm_mod lpfc scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<d0a134c9>]    Tainted: GF     VLI
EFLAGS: 00010246   (2.6.12)
EIP is at find_node_by_nodeid+0x89/0x170 [cman]
eax: 00000000   ebx: 00000000   ecx: 00000000   edx: c4a39400
esi: cfbe7dd8   edi: cfbe7de8   ebp: cfbe7d68   esp: cfbe7d48
ds: 007b   es: 007b   ss: 0068
Process kmirrord/0 (pid: 276, threadinfo=cfbe6000 task=cfee3000)
Stack: c0109e85 cbcbb280 00000001 c0103c5e 00000246 cfbe7dd8 cfbe7de8 00000000
       cfbe7d70 d0a0b938 cfbe7d84 d09157ae 00000202 cfbe7e30 cbcbb280 cfbe7df4
       d09117db cffdbc40 cfec7aa0 000039a5 cfebfe78 c0399bb2 00000000 00000000
Call Trace:
 [<c0109e85>] end_8259A_irq+0x25/0x30
 [<c0103c5e>] common_interrupt+0x1a/0x20
 [<d0a0b938>] kcl_get_node_addresses+0x8/0x20 [cman]
 [<d09157ae>] nodeid_to_ipaddr+0xe/0x50 [dm_cmirror]
 [<d09117db>] _consult_server+0xbb/0x340 [dm_cmirror]
 [<c0399bb2>] schedule+0x362/0x790
 [<c0103c5e>] common_interrupt+0x1a/0x20
 [<d0911f06>] consult_server+0x4a6/0x6e0 [dm_cmirror]
 [<c011cf80>] default_wake_function+0x0/0x10
 [<c011cf80>] default_wake_function+0x0/0x10
 [<d091308d>] cluster_get_resync_work+0x1d/0x20 [dm_cmirror]
 [<d0833336>] __rh_recovery_prepare+0x16/0x1d0 [dm_mirror]
 [<c0122d57>] printk+0x17/0x20
 [<d09131be>] cluster_get_sync_count+0xde/0x140 [dm_cmirror]
 [<d0833516>] rh_recovery_prepare+0x26/0x50 [dm_mirror]
 [<d0833b8b>] do_recovery+0x1b/0xa0 [dm_mirror]
 [<d0834ab9>] do_mirror+0x119/0x1d0 [dm_mirror]
 [<d0834ba7>] do_work+0x37/0x70 [dm_mirror]
 [<d0834b70>] do_work+0x0/0x70 [dm_mirror]
 [<c013aaf1>] worker_thread+0x211/0x440
 [<c011cf80>] default_wake_function+0x0/0x10
 [<c011cf80>] default_wake_function+0x0/0x10
 [<c013a8e0>] worker_thread+0x0/0x440
 [<c0141536>] kthread+0x96/0xe0
 [<c01414a0>] kthread+0x0/0xe0
 [<c010132d>] kernel_thread_helper+0x5/0x18
Code: e0 a4 a2 d0 3c 4b 24 1d b8 01 00 00 00 a3 e4 a4 a2 d0 b8 80 d4 a1 d0 a3 f0
a4 a2 d0 b8 93 0b 00 00 a3 f4 a4 a2 d0 a1 fc a4 a2 d0 <8b> 1c 98 74 27 c7 04 24
48 d4 a1 d0 b9 e0 a4 a2 d0 ba 95 0b 00

Entering kdb (current=0xcfee3000, pid 276) Oops: Oops
due to oops @ 0xd0a134c9
eax = 0x00000000 ebx = 0x00000000 ecx = 0x00000000 edx = 0xc4a39400
esi = 0xcfbe7dd8 edi = 0xcfbe7de8 esp = 0xcfbe7d48 eip = 0xd0a134c9
ebp = 0xcfbe7d68 xss = 0xc0310068 xcs = 0x00000060 eflags = 0x00010246
xds = 0xc4a3007b xes = 0x0000007b origeax = 0xffffffff &regs = 0xcfbe7d14
kdb> btp 276
Stack traceback for pid 276
0xcfee3000      276        5  1    0   R  0xcfee31c0 *kmirrord/0
ESP        EIP        Function (args)
0xcfbe7b94 0xd0a134c9 [cman]find_node_by_nodeid+0x89
0xcfbe7d70 0xd0a0b938 [cman]kcl_get_node_addresses+0x8 (0x202, 0xcfbe7e30,
0xcfbe7d78 0xd09157ae [dm_cmirror]nodeid_to_ipaddr+0xe (0xcffdbc40,
0xcfec7aa0,0x39a5, 0xcfebfe78, 0xc0399bb2)
0xcfbe7d8c 0xd09117db [dm_cmirror]_consult_server+0xbb (0x6, 0xcfbe7eb8,
0xcfbe7e30, 0xd09167c6, 0x400)
0xcfbe7dfc 0xd0911f06 [dm_cmirror]consult_server+0x4a6 (0x6, 0xcfbe7eb8)
0xcfbe7e88 0xd091308d [dm_cmirror]cluster_get_resync_work+0x1d (0xcd431800,
0xcfbe7ed8, 0xc0122d57, 0xd0916790, 0xcfbe7eb4)
0xcfbe7e98 0xd0833336 [dm_mirror]__rh_recovery_prepare+0x16 (0xc4a3980c)
0xcfbe7ed4 0xd0833516 [dm_mirror]rh_recovery_prepare+0x26 (0xc8fd6a40,
0xc4a39800, 0xd0838b40, 0xd0838e20)
0xcfbe7ee0 0xd0833b8b [dm_mirror]do_recovery+0x1b (0xcfae03e8, 0x1, 0x3,
0xcfbe7f38, 0x96)
0xcfbe7ef8 0xd0834ab9 [dm_mirror]do_mirror+0x119 (0x0, 0xcfae03a0)
0xcfbe7f30 0xd0834ba7 [dm_mirror]do_work+0x37 (0x0, 0x0, 0x0, 0x0, 0x1)
0xcfbe7f40 0xc013aaf1 worker_thread+0x211
0xcfbe7fcc 0xc0141536 kthread+0x96
0xcfbe7fec 0xc010132d kernel_thread_helper+0x5

Version-Release number of selected component (if applicable):
RHEL4, using the STABLE branch of the cluster tree from Jul 27 13:08

How reproducible:
Haven't tried

Steps to Reproduce:
It seems that cmirror got itself into a bind and then this problem occurred.  I
don't know how to reproduce this problem directly or with short steps...

1. install all cmirror patches/components
2. do looping pvmove/lvs --all/lmdd on different machines to same LV
3. wait for oops
Actual results:

Expected results:
Loop indefinitely on nodeid_to_ipaddr if you have to, but don't panic.
Comment 1 Christine Caulfield 2005-08-01 06:04:39 EDT
That message:
 SM: 01000001 sm_stop: SG still joined

bothers me. It might be the case that cman has shut down (thus freeing its list
of nodes) before you've called kcl_get_node_addresses().

There's no loop in find_node_by_nodeid; it simply indexes an array (with bounds
checking), so this seems the most likely scenario. In fact, looking at the code,
there is an off-by-one error that allows a request for nodeid == 0 even when the
cluster is down. I think this is the real bug.
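A minimal sketch of the failure mode described above (hypothetical names and table layout, not the actual cman source): a bounds check that only guards the upper end of the member table still accepts nodeid == 0, so after shutdown the caller is handed a stale entry with no address list and dereferences it, instead of getting NULL back and failing gracefully.

```c
#include <stddef.h>

#define MAX_NODES 16

struct cluster_node {
    int nodeid;
    const char *addr;   /* stand-in for the per-node address list */
};

/* A stale entry left behind at index 0 after a simulated shutdown. */
static struct cluster_node stale = { 0, NULL };

/* Hypothetical member table; slot 0 holds the stale entry,
 * everything else is NULL. */
static struct cluster_node *members[MAX_NODES + 1] = { &stale };

/* Buggy lookup: the bounds check only guards the upper end, so a
 * request for nodeid == 0 slips through and hands the caller an
 * entry whose address list is gone. */
struct cluster_node *find_node_by_nodeid_buggy(int nodeid)
{
    if (nodeid > MAX_NODES)
        return NULL;
    return members[nodeid];
}

/* Fixed lookup: nonpositive ids are rejected too, so the caller
 * gets NULL and can log "No address list" instead of oopsing. */
struct cluster_node *find_node_by_nodeid_fixed(int nodeid)
{
    if (nodeid <= 0 || nodeid > MAX_NODES)
        return NULL;
    return members[nodeid];
}
```

With the fixed check, a caller like nodeid_to_ipaddr sees NULL for nodeid 0 and can return an error up the stack rather than dereference a stale pointer.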
Comment 2 Henry Harris 2006-04-10 18:01:20 EDT
Here are a couple more panics seen in cman membership on cman-1.0.4-0:

> --------- Kernel BUG at membership:279 invalid operand: 0000 [1] SMP
> Entering kdb (current=0x00000100e63c27f0, pid 7400) on processor 0 
> Oops: <NULL> due to oops @ 0xffffffffa01d70dd
>      r15 = 0x000001007d7b5dc8      r14 = 0x00000100e78b6400
>      r13 = 0x0000000000000003      r12 = 0x0000000000000003
>      rbp = 0x000001007d7b5e38      rbx = 0x000001007d7b5da8
>      r11 = 0xffffffff8011bac4      r10 = 0x0000000100000000
>       r9 = 0x000001007d7b5da8       r8 = 0xffffffff803dc9c8
>      rax = 0x000000000000001e      rcx = 0xffffffff803dc9c8
>      rdx = 0xffffffff803dc9c8      rsi = 0x0000000000000246
>      rdi = 0xffffffff803dc9c0 orig_rax = 0xffffffffffffffff
>      rip = 0xffffffffa01d70dd       cs = 0x0000000000000010
>   eflags = 0x0000000000010212      rsp = 0x000001007d7b5d30
>       ss = 0x000001007d7b4000 &regs = 0x000001007d7b5c98 [0]kdb> bt 
> Stack traceback for pid 7400
> 0x00000100e63c27f0     7400        1  1    0   R  
> 0x00000100e63c2bf0 *cman_memb
> RSP           RIP                Function (args)
> 0x1007d7b5d30 0xffffffffa01d70dd
> [cman]dispatch_messages+0x11ce (0x0, 0x100e5c78a80, 0x20100001e, 
> 0x900020104, 0x0)
> 0x1007d7b5e48 0xffffffffa01d8180 [cman]membership_kthread+0xbd9
> 0x1007d7b5f58 0xffffffff8010fdeb child_rip+0x8 [0]kdb>
> lizard03
> ----------- [cut here ] --------- [please bite here ]
> --------- Kernel BUG at membership:3150 invalid operand: 0000 [1] SMP
> Entering kdb (current=0x000001007cda1030, pid 7494) on processor 0 
> Oops: <NULL> due to oops @ 0xffffffffa01d4131
>      r15 = 0x0000000000000003      r14 = 0xffffffffa01ee5e0
>      r13 = 0x00000100e2961a78      r12 = 0xffffffffffffffff
>      rbp = 0xffffffffa01ee750      rbx = 0xffffffffa01ee5e0
>      r11 = 0xffffffff80000000      r10 = 0xffffffff80000000
>       r9 = 0x0000000000000001       r8 = 0x0000000000000004
>      rax = 0x0000000000000000      rcx = 0x0000000000000080
>      rdx = 0x0000000000000080      rsi = 0x0000000000000080
>      rdi = 0x000001007c851e08 orig_rax = 0xffffffffffffffff
>      rip = 0xffffffffa01d4131       cs = 0x0000000000000010
>   eflags = 0x0000000000010246      rsp = 0x000001007c851df0
>       ss = 0x000001007c850000 &regs = 0x000001007c851d58 [0]kdb> bt 
> Stack traceback for pid 7494
> 0x000001007cda1030     7494        1  1    0   R  
> 0x000001007cda1430 *cman_memb
> RSP           RIP                Function (args)
> 0x1007c851df0 0xffffffffa01d4131 [cman]elect_master+0x3a 
> (0x100e322a180, 0x27c850712)
> 0x1007c851e08 0xffffffffa01d5373 [cman]a_node_just_died+0x186 
> (0xffffffffa01ee640, 0x1007cda1030)
> 0x1007c851e28 0xffffffffa01d56a6
> [cman]process_dead_nodes+0x5e (0x100005efa120000, 0x23d10d00000039, 
> 0x2014b001e, 0x1400020104, 0xd00000039010000)
> 0x1007c851e48 0xffffffffa01d8158 [cman]membership_kthread+0xbb1
> 0x1007c851f58 0xffffffff8010fdeb child_rip+0x8 [0]kdb>
Comment 3 Christine Caulfield 2006-04-11 03:29:38 EDT
I've seen this on the mailing list too. cman is trying to elect a master node,
but (it thinks) there are no active nodes in the cluster, so it cannot carry on.

Are there any other cman-related messages on the other nodes, or earlier in the
syslog of this node?
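The elect_master trace above fits that description. As a hedged sketch (hypothetical names, not the actual membership.c), the difference between the crashing behaviour and a defensive one is whether the election routine assumes an active node always exists or reports the empty case to its caller:

```c
#include <stddef.h>

#define NODE_ACTIVE 1

struct node {
    int nodeid;
    int state;
};

/* Pick the lowest-numbered active node as master. Returns its nodeid,
 * or -1 when no node in the list is active. The in-kernel path that
 * oopsed effectively BUG()ed on the empty case instead of returning
 * an error the caller could handle. */
int elect_master_sketch(const struct node *nodes, int count)
{
    int best = -1;

    for (int i = 0; i < count; i++) {
        if (nodes[i].state != NODE_ACTIVE)
            continue;
        if (best < 0 || nodes[i].nodeid < best)
            best = nodes[i].nodeid;
    }
    return best;   /* caller must check for -1, not assume success */
}
```

Returning an error code pushes the "cluster has no active members" decision up to the caller, which can shut down cleanly rather than trip a BUG() in the middle of membership processing.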
Comment 4 Christine Caulfield 2006-04-19 11:35:26 EDT
Here's an attempt to fix this

Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/membership.c,v  <--  membership.c
new revision:; previous revision:

Comment 5 Corey Marthaler 2006-05-01 11:52:11 EDT
Here are a few cman messages that QA saw before the panic. Once we get the new
cman build, we'll verify it's fixed.

CMAN: removing node link-07 from the cluster : No response to messages
CMAN: Attempt to re-add node with id 1
CMAN: existing node is link-01
CMAN: new node is link-08
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at membership:279
invalid operand: 0000 [1] SMP
Modules linked in: gnbd(U) lock_nolock(U) gfs(U) lock_harness(U) dlm(U) cman(U)
md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket
pcmcia_core button battery ac ohci_hcd hw_random tg3 floppy dm_snapshot dm_zero
dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc mptscsih mptsas
mptspi mptfc mptscsi mptbase sd_mod scsi_mod
Pid: 4055, comm: cman_memb Not tainted 2.6.9-34.ELsmp
RIP: 0010:[<ffffffffa02150d5>] <ffffffffa02150d5>{:cman:dispatch_messages+4558}
RSP: 0018:0000010039f2dd48  EFLAGS: 00010212
RAX: 000000000000001d RBX: 0000010039f2dda8 RCX: 0000000000000246
RDX: 0000000000004b39 RSI: 0000000000000246 RDI: ffffffff803d9e60
RBP: 0000010039f2de38 R08: 000000000000000d R09: 0000010039f2dda8
R10: 0000000100000000 R11: 0000000000000000 R12: 0000000000000001
R13: 0000000000000001 R14: 000001003d26fe00 R15: 0000010039f2ddc8
FS:  0000002a95574b00(0000) GS:ffffffff804d7b80(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a9556c000 CR3: 0000000037e22000 CR4: 00000000000006e0
Process cman_memb (pid: 4055, threadinfo 0000010039f2c000, task 0000010039a91030)
Stack: 0000010039a91030 0000010020712a60 0000000000000000 ffffffff80304a85
       0000010039f2de38 ffffffff80304add 000000001ff89030 0000010039a92a00
       0000000000000000 0000000000000074
Call Trace:<ffffffff80304a85>{thread_return+0} <ffffffff80304add>{thread_return+88}
       <ffffffff8013212e>{schedule_tail+55} <ffffffff80110e17>{child_rip+8}

Code: 0f 0b e3 de 21 a0 ff ff ff ff 17 01 48 c7 c7 60 d2 22 a0 e8
RIP <ffffffffa02150d5>{:cman:dispatch_messages+4558} RSP <0000010039f2dd48>
 <0>Kernel panic - not syncing: Oops

Comment 6 Christine Caulfield 2006-05-02 03:51:52 EDT
*** Bug 190230 has been marked as a duplicate of this bug. ***
Comment 9 Red Hat Bugzilla 2006-08-10 17:32:10 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

