Bug 164535
Summary: | cman causing kernel panic | ||
---|---|---|---|
Product: | [Retired] Red Hat Cluster Suite | Reporter: | Jonathan Earl Brassow <jbrassow> |
Component: | cman | Assignee: | Christine Caulfield <ccaulfie> |
Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 4 | CC: | cluster-maint, cmarthal |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | RHBA-2006-0559 | Doc Type: | Bug Fix |
Last Closed: | 2006-08-10 21:32:10 UTC | Type: | --- |
Description
Jonathan Earl Brassow
2005-07-28 15:42:49 UTC
That message, "SM: 01000001 sm_stop: SG still joined", bothers me. It might be the case that cman has shut down (thus freeing its list of nodes) before you've called kcl_get_node_addresses(). There's no loop in get_node_by_nodeid(); it simply indexes an array (with bounds checking), so this seems the most likely scenario.

In fact, looking at the code, there is an off-by-one (OBO) error that allows a request for nodeid==0 even when the cluster is down. I think this is the real bug.

Here are a couple more panics seen in cman membership on cman-1.0.4-0 at bootup:
> --------- Kernel BUG at membership:279 invalid operand: 0000 [1] SMP
>
> Entering kdb (current=0x00000100e63c27f0, pid 7400) on processor 0
> Oops: <NULL> due to oops @ 0xffffffffa01d70dd
> r15 = 0x000001007d7b5dc8 r14 = 0x00000100e78b6400
> r13 = 0x0000000000000003 r12 = 0x0000000000000003
> rbp = 0x000001007d7b5e38 rbx = 0x000001007d7b5da8
> r11 = 0xffffffff8011bac4 r10 = 0x0000000100000000
> r9 = 0x000001007d7b5da8 r8 = 0xffffffff803dc9c8
> rax = 0x000000000000001e rcx = 0xffffffff803dc9c8
> rdx = 0xffffffff803dc9c8 rsi = 0x0000000000000246
> rdi = 0xffffffff803dc9c0 orig_rax = 0xffffffffffffffff
> rip = 0xffffffffa01d70dd cs = 0x0000000000000010
> eflags = 0x0000000000010212 rsp = 0x000001007d7b5d30
> ss = 0x000001007d7b4000 &regs = 0x000001007d7b5c98
> [0]kdb> bt
> Stack traceback for pid 7400
> 0x00000100e63c27f0 7400 1 1 0 R
> 0x00000100e63c2bf0 *cman_memb
> RSP RIP Function (args)
> 0x1007d7b5d30 0xffffffffa01d70dd
> [cman]dispatch_messages+0x11ce (0x0, 0x100e5c78a80, 0x20100001e,
> 0x900020104, 0x0)
> 0x1007d7b5e48 0xffffffffa01d8180 [cman]membership_kthread+0xbd9
> 0x1007d7b5f58 0xffffffff8010fdeb child_rip+0x8 [0]kdb>
>
>
> lizard03
> ----------- [cut here ] --------- [please bite here ]
> --------- Kernel BUG at membership:3150 invalid operand: 0000 [1] SMP
>
> Entering kdb (current=0x000001007cda1030, pid 7494) on processor 0
> Oops: <NULL> due to oops @ 0xffffffffa01d4131
> r15 = 0x0000000000000003 r14 = 0xffffffffa01ee5e0
> r13 = 0x00000100e2961a78 r12 = 0xffffffffffffffff
> rbp = 0xffffffffa01ee750 rbx = 0xffffffffa01ee5e0
> r11 = 0xffffffff80000000 r10 = 0xffffffff80000000
> r9 = 0x0000000000000001 r8 = 0x0000000000000004
> rax = 0x0000000000000000 rcx = 0x0000000000000080
> rdx = 0x0000000000000080 rsi = 0x0000000000000080
> rdi = 0x000001007c851e08 orig_rax = 0xffffffffffffffff
> rip = 0xffffffffa01d4131 cs = 0x0000000000000010
> eflags = 0x0000000000010246 rsp = 0x000001007c851df0
> ss = 0x000001007c850000 &regs = 0x000001007c851d58
> [0]kdb> bt
> Stack traceback for pid 7494
> 0x000001007cda1030 7494 1 1 0 R
> 0x000001007cda1430 *cman_memb
> RSP RIP Function (args)
> 0x1007c851df0 0xffffffffa01d4131 [cman]elect_master+0x3a
> (0x100e322a180, 0x27c850712)
> 0x1007c851e08 0xffffffffa01d5373 [cman]a_node_just_died+0x186
> (0xffffffffa01ee640, 0x1007cda1030)
> 0x1007c851e28 0xffffffffa01d56a6
> [cman]process_dead_nodes+0x5e (0x100005efa120000, 0x23d10d00000039,
> 0x2014b001e, 0x1400020104, 0xd00000039010000)
> 0x1007c851e48 0xffffffffa01d8158 [cman]membership_kthread+0xbb1
> 0x1007c851f58 0xffffffff8010fdeb child_rip+0x8 [0]kdb>
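As a rough illustration of the off-by-one (OBO) lookup problem mentioned in the description, here is a minimal sketch. It is not the cman code: the struct, the function names, and the exact form of the faulty comparison are all assumptions made for the example. It only shows how a "greater than" bounds test can let index 0 through when the table size has dropped to 0.

```c
#include <stddef.h>

/* Hypothetical stand-in for a cluster node; not the real cman structure. */
struct node {
    int nodeid;
    const char *name;
};

/* Assumed shape of the faulty test: "reject if nodeid > size".
 * With the table freed and size == 0, nodeid 0 is still accepted. */
static int buggy_check_accepts(int size, int nodeid)
{
    return !(nodeid > size);
}

/* Corrected test: valid indexes are 0 .. size-1, so size == 0 accepts nothing. */
static int fixed_check_accepts(int size, int nodeid)
{
    return nodeid >= 0 && nodeid < size;
}

/* Lookup that returns NULL instead of touching a missing or too-small table. */
static struct node *lookup_checked(struct node **table, int size, int nodeid)
{
    if (table == NULL || !fixed_check_accepts(size, nodeid))
        return NULL;
    return table[nodeid];
}
```

With a size of 0, the assumed buggy test accepts nodeid 0 while the corrected one rejects it, so a caller that arrives after shutdown gets NULL rather than a read from freed memory.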
I've seen this on the mailing list too. cman is trying to elect a master node but there are no active nodes in the cluster (it thinks), so it can't carry on. Are there any other cman-related messages on the other nodes and earlier on in the syslog of this node?

Here's an attempt to fix this:

-rSTABLE:
Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/membership.c,v <-- membership.c
new revision: 1.44.2.18.6.4; previous revision: 1.44.2.18.6.3
done

-rRHEL4:
Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/membership.c,v <-- membership.c
new revision: 1.44.2.22; previous revision: 1.44.2.21
done

Here are a few cman messages that QA saw before the panic. Once we get the new cman build we'll verify it's fixed.

CMAN: removing node link-07 from the cluster : No response to messages
CMAN: Attempt to re-add node with id 1
CMAN: existing node is link-01
CMAN: new node is link-08
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at membership:279
invalid operand: 0000 [1] SMP
CPU 1
Modules linked in: gnbd(U) lock_nolock(U) gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core button battery ac ohci_hcd hw_random tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc mptscsih mptsas mptspi mptfc mptscsi mptbase sd_mod scsi_mod
Pid: 4055, comm: cman_memb Not tainted 2.6.9-34.ELsmp
RIP: 0010:[<ffffffffa02150d5>] <ffffffffa02150d5>{:cman:dispatch_messages+4558}
RSP: 0018:0000010039f2dd48 EFLAGS: 00010212
RAX: 000000000000001d RBX: 0000010039f2dda8 RCX: 0000000000000246
RDX: 0000000000004b39 RSI: 0000000000000246 RDI: ffffffff803d9e60
RBP: 0000010039f2de38 R08: 000000000000000d R09: 0000010039f2dda8
R10: 0000000100000000 R11: 0000000000000000 R12: 0000000000000001
R13: 0000000000000001 R14: 000001003d26fe00 R15: 0000010039f2ddc8
FS: 0000002a95574b00(0000) GS:ffffffff804d7b80(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a9556c000 CR3: 0000000037e22000 CR4: 00000000000006e0
Process cman_memb (pid: 4055, threadinfo 0000010039f2c000, task 0000010039a91030)
Stack: 0000010039a91030 0000010020712a60 0000000000000000 ffffffff80304a85
0000010039f2de38 ffffffff80304add 000000001ff89030 0000010039a92a00
0000000000000000 0000000000000074
Call Trace:<ffffffff80304a85>{thread_return+0} <ffffffff80304add>{thread_return+88}
<ffffffffa0216178>{:cman:membership_kthread+3033} <ffffffff801333c8>{default_wake_function+0}
<ffffffff801333c8>{default_wake_function+0} <ffffffff801333c8>{default_wake_function+0}
<ffffffff8013212e>{schedule_tail+55} <ffffffff80110e17>{child_rip+8}
<ffffffffa021559f>{:cman:membership_kthread+0} <ffffffff80110e0f>{child_rip+0}
Code: 0f 0b e3 de 21 a0 ff ff ff ff 17 01 48 c7 c7 60 d2 22 a0 e8
RIP <ffffffffa02150d5>{:cman:dispatch_messages+4558} RSP <0000010039f2dd48>
<0>Kernel panic - not syncing: Oops

*** Bug 190230 has been marked as a duplicate of this bug. ***

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0559.html
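
For readers following the elect_master panic discussed in this report, the sketch below shows one way a master election can report "no active nodes" to its caller instead of hitting a BUG. It is only an illustration under stated assumptions: the types, the function name, and the lowest-ID selection rule are all hypothetical and are not taken from membership.c or from the fix that was checked in.

```c
#include <stddef.h>

/* Hypothetical node states and record; not the real cman definitions. */
enum member_state { NODE_DEAD, NODE_MEMBER };

struct member {
    int nodeid;
    enum member_state state;
};

/* Return the live member with the lowest node ID (a placeholder rule),
 * or NULL when no member is live so the caller can handle that case. */
static const struct member *pick_master(const struct member *nodes, size_t count)
{
    const struct member *best = NULL;

    for (size_t i = 0; i < count; i++) {
        if (nodes[i].state != NODE_MEMBER)
            continue;                      /* skip nodes that are not active */
        if (best == NULL || nodes[i].nodeid < best->nodeid)
            best = &nodes[i];
    }
    return best;                           /* NULL means "nobody to elect" */
}
```

A caller would test the result for NULL and back off or leave the cluster cleanly, rather than assuming that at least one active node always exists.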