Bug 472460

Summary: cman_tool nodes -F name segfaults when a node is out of membership
Product: Red Hat Enterprise Linux 5
Component: cman
Version: 5.3
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Nate Straz <nstraz>
Assignee: Christine Caulfield <ccaulfie>
QA Contact: Cluster QE <mspqa-list>
CC: baptiste.millemathias, cluster-maint, edamato
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-09-02 11:08:26 UTC
Attachments:
  Patch to make do_cmd_get_node_addrs() return data. (Flags: none)

Description Nate Straz 2008-11-20 22:53:40 UTC
Description of problem:

When another node in the cluster is out of membership, the output shows this:

[root@south-02 ~]# cman_tool nodes 
Node  Sts   Inc   Joined               Name
   1   X  106200                        south-01
   2   M  106028   2008-11-20 11:02:44  south-02
...

But when the -F option is used, the command segfaults.

[root@south-02 ~]# cman_tool nodes -F name,type
Segmentation fault (core dumped)

Once the node returns to membership, the command works again.

[root@south-02 ~]# cman_tool nodes -F name,type
south-01 M 
south-02 M 
...

Version-Release number of selected component (if applicable):
cman-2.0.97-1.el5

How reproducible:
100%

Steps to Reproduce:
1. on node A, service cman stop
2. on node B, cman_tool nodes -F name,type
  
Actual results:
See above

Expected results:
cman_tool nodes -F name,type
south-01 X
south-02 M

Additional info:

Comment 1 Christine Caulfield 2008-11-21 08:31:47 UTC
This isn't as easy to reproduce as you might think. Can you provide a backtrace from the coredump, please? On my cluster it works:

[root@cc-xen-03 ~]# cman_tool nodes             
Node  Sts   Inc   Joined               Name
   1   X  31928                        cc-xen-01.lab.msp.redhat.com
   2   M  31928   2008-11-14 02:01:48  cc-xen-02.lab.msp.redhat.com
   3   M  31924   2008-11-14 02:01:47  cc-xen-03.lab.msp.redhat.com
   4   X  31940                        cc-xen-04.lab.msp.redhat.com
   5   X      0                        cc-xen-05.lab.msp.redhat.com
   6   X      0                        cc-xen-06.lab.msp.redhat.com

[root@cc-xen-03 ~]# cman_tool nodes -F name,type
cc-xen-01.lab.msp.redhat.com X 
cc-xen-02.lab.msp.redhat.com M 
cc-xen-03.lab.msp.redhat.com M 
cc-xen-04.lab.msp.redhat.com X 
cc-xen-05.lab.msp.redhat.com X 
cc-xen-06.lab.msp.redhat.com X

Comment 2 Nate Straz 2008-11-21 14:02:26 UTC
Here is some info from the core south-02:/tmp/core.12531.  There are several more to choose from in the same directory.

[New process 12531]
#0  0x0000000000406aad in cman_get_node_addrs (handle=0x16770030, nodeid=1, 
    max_addrs=10, num_addrs=0x7fffe8ef8684, addrs=0x7fffe8ef8400)
    at libcman.c:1092
1092                            memcpy(&addrs[i].cna_address, &outbuf->addrs[i].addr, outbuf->addrs[i].addrlen);
(gdb) bt
#0  0x0000000000406aad in cman_get_node_addrs (handle=0x16770030, nodeid=1, 
    max_addrs=10, num_addrs=0x7fffe8ef8684, addrs=0x7fffe8ef8400)
    at libcman.c:1092
#1  0x00000000004028ab in show_nodes ()
#2  0x000081a400007fff in ?? ()
#3  0x0000001400000000 in ?? ()
#4  0x0000000000000000 in ?? ()
(gdb) list
1087
1088                    if (outbuf->numaddrs > max_addrs)
1089                            outbuf->numaddrs = max_addrs;
1090
1091                    for (i=0; i < outbuf->numaddrs; i++) {
1092                            memcpy(&addrs[i].cna_address, &outbuf->addrs[i].addr, outbuf->addrs[i].addrlen);
1093                            addrs[i].cna_addrlen = outbuf->addrs[i].addrlen;
1094                    }
1095            }
1096            return ret;
(gdb) info locals
i = 0
h = (struct cman_handle *) 0x16770030
ret = 0
buf = 0x7fffe8ef79f0 "\001"
outbuf = (struct cl_get_node_addrs *) 0x7fffe8ef79f0
(gdb) info args
handle = (cman_handle_t) 0x16770030
nodeid = 1
max_addrs = 10
num_addrs = (int *) 0x7fffe8ef8684
addrs = (struct cman_node_address *) 0x7fffe8ef8400
(gdb) print *num_addrs
$1 = 376900256
(gdb) print *outbuf
$2 = {numaddrs = 1, addrs = 0x7fffe8ef79f8}
(gdb) print *addrs
$3 = {cna_addrlen = 1, 
  cna_address = "�\000w\026\000\000\000\0008&\003A9", '\0' <repeats 14 times>}

Comment 3 Nate Straz 2008-11-21 16:21:12 UTC
I dug into this with Chrissie and I think I know what is going on here.
From gdb I get the following data from inside cman_dispatch():

(gdb) print *h
$6 = {magic = 1129136462, fd = 7, zero_fd = 8, privdata = 0x0, want_reply = 1, 
  event_callback = 0, data_callback = 0, confchg_callback = 0, 
  reply_buffer = 0x7fff8c258d40, reply_buflen = 1368, reply_status = 28, 
  saved_data_msg = 0x0, saved_event_msg = 0x0, saved_reply_msg = 0x0}
(gdb) print /x *header
$12 = {magic = 0x434d414e, version = 0x7fff, length = 0x18, 
  command = 0x400000bf, flags = 0x0}

Here we see a header for a reply whose total length is 0x18, i.e.
24 bytes.  The header itself takes 20 bytes and the reply header
takes 24 bytes, which leaves nothing for the reply data.
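
The size arithmetic can be checked with a small sketch.  The struct
layouts below are assumptions: the first mirrors the five fields gdb
printed for *header, and the second adds a single status word for the
reply.  They are not copied from the cman source, and every sketch_*
name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: the five fields gdb printed for *header. */
struct sketch_header {
    uint32_t magic;
    uint32_t version;
    uint32_t length;
    uint32_t command;
    uint32_t flags;
};

/* Assumed layout: a reply is the header followed by a status word. */
struct sketch_reply_header {
    struct sketch_header header;
    int32_t status;
};

/* Bytes of payload that follow the reply header in a message of
 * total_len bytes; 0 if the message is too short to carry any. */
size_t reply_payload_len(uint32_t total_len)
{
    if (total_len <= sizeof(struct sketch_reply_header))
        return 0;
    return total_len - sizeof(struct sketch_reply_header);
}
```

Under these assumptions, a message with length = 0x18 carries no
payload at all, which matches what the client saw.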

In do_cmd_get_node_addrs() we have the following code:

        /* AIS doesn't know about nodes that are not members */
        if (node->state != NODESTATE_MEMBER)
                return 0;

We don't set retlen here, so we don't return any data for this reply.
We need to at least return an int saying that we have 0 node
addresses to return.
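
A sketch of the kind of change described, not the attached patch
itself.  The handler below is a simplified, hypothetical stand-in for
do_cmd_get_node_addrs(): the reply struct, the state enum and the
retbuf/retlen parameters are assumptions made for illustration.  The
point is only that a non-member still produces a reply containing a
count of zero, and that the reply length is always set.

```c
#include <assert.h>
#include <string.h>

enum sketch_state { SKETCH_MEMBER, SKETCH_DEAD };   /* hypothetical */

struct sketch_addrs_reply {                         /* hypothetical payload */
    int numaddrs;
};

/* Returns 0 on success; *retlen is the number of payload bytes written
 * into retbuf, which must hold at least sizeof(struct sketch_addrs_reply). */
int sketch_get_node_addrs(enum sketch_state state, int known_addrs,
                          char *retbuf, int *retlen)
{
    struct sketch_addrs_reply reply;

    /* Non-members have no addresses, but the caller still needs a count. */
    reply.numaddrs = (state == SKETCH_MEMBER) ? known_addrs : 0;

    memcpy(retbuf, &reply, sizeof(reply));
    *retlen = (int)sizeof(reply);
    return 0;
}
```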

I'm hitting this because in cman_get_node_addrs() we allocate the
return buffer on the stack and don't initialize it.  In my failing
case, outbuf->numaddrs == 1 from whatever was left on the stack, and
it never gets overwritten because do_cmd_get_node_addrs() returns no
data.  The memcpy() loop at libcman.c:1092 then runs on stale data.
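
The client side could also be made robust against an empty reply.  A
minimal sketch, assuming the caller knows how many payload bytes
actually arrived; sketch_safe_numaddrs() and its parameters are
hypothetical and are not part of the libcman API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the number of addresses that may safely be copied out of buf.
 * reply_len is the count of payload bytes the daemon actually sent. */
int sketch_safe_numaddrs(const char *buf, size_t reply_len, int max_addrs)
{
    int numaddrs;

    /* An empty or short reply means buf still holds stale data. */
    if (reply_len < sizeof(numaddrs))
        return 0;

    memcpy(&numaddrs, buf, sizeof(numaddrs));
    if (numaddrs < 0)
        return 0;
    return numaddrs > max_addrs ? max_addrs : numaddrs;
}
```

Zero-initializing the stack buffer before sending the request would
have a similar effect in this particular case, since a stale
numaddrs of 1 would read as 0 instead.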

Comment 4 Nate Straz 2008-11-21 16:56:17 UTC
Created attachment 324322 [details]
Patch to make do_cmd_get_node_addrs() return data.

Comment 5 Christine Caulfield 2008-12-01 11:46:25 UTC
*** Bug 473873 has been marked as a duplicate of this bug. ***

Comment 6 Christine Caulfield 2008-12-01 14:03:18 UTC
Thanks for investigating this Nate, that patch looks good to me.

Comment 7 Christine Caulfield 2008-12-03 11:01:48 UTC
Patch committed for 5.4

commit 6266ee3b4f6b7c85fe87316f334d8a47a1a71165
Author: Christine Caulfield <ccaulfie>
Date:   Mon Dec 1 14:07:24 2008 +0000

    cman: Don't crash cman_tool nodes -a

Comment 8 Nate Straz 2009-06-22 18:48:07 UTC
The patch is already in the code. Adding flags to make sure this shows up in the errata.

Comment 12 errata-xmlrpc 2009-09-02 11:08:26 UTC
An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1341.html