Bug 590101

Summary: cman service engine segfaults in unbind_con (no symbols)

Product: Red Hat Enterprise Linux 5
Version: 5.4
Component: cman
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Reporter: Steven Dake <sdake>
Assignee: Fabio Massimo Di Nitto <fdinitto>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint, djansa, edamato, fdinitto, mjuricek, rdassen
Target Milestone: rc
Keywords: ZStream
Fixed In Version: cman-2.0.115-93.el5
Doc Type: Bug Fix
Bug Blocks: 795814
Last Closed: 2012-02-21 05:23:17 UTC
Attachments:
- patch which adds -g to CFLAGS for service engine to get proper debug output
- Patch that might fix

Description Steven Dake 2010-05-07 18:00:35 UTC
Description of problem:
cman triggers aisexec segfault

Version-Release number of selected component (if applicable):
[root@west-01 openais]# rpm -qa | grep cman
cman-2.0.115-1.el5_4.11
cman-debuginfo-2.0.115-1.el5_4.11


How reproducible:
Run sts on a 4-node x86_64 cluster (west-01 through west-04).

Additional info:
I have seen this stack backtrace before in cman, but symbols were not available, so we filed a bug for the missing symbols and, I thought, one for this issue as well. I can't find this particular issue in Bugzilla, so I am filing this bug; it may be a duplicate.

Program terminated with signal 11, Segmentation fault.
#0  0x00002aaaaaeb61eb in unbind_con ()
   from /usr/libexec/lcrso/service_cman.lcrso
(gdb) where
#0  0x00002aaaaaeb61eb in unbind_con ()
   from /usr/libexec/lcrso/service_cman.lcrso
#1  0x00002aaaaaeb3728 in ?? () from /usr/libexec/lcrso/service_cman.lcrso
#2  0x00002aaaaaeb38db in ?? () from /usr/libexec/lcrso/service_cman.lcrso
#3  0x00002aaaaaeb3a61 in send_status_return ()
   from /usr/libexec/lcrso/service_cman.lcrso
#4  0x00002aaaaaeb6ff5 in send_to_userport ()
   from /usr/libexec/lcrso/service_cman.lcrso
#5  0x00002aaaaaeb4c49 in ?? () from /usr/libexec/lcrso/service_cman.lcrso
#6  0x0000000000415375 in app_deliver_fn (nodeid=1, 
    iovec=<value optimized out>, iov_len=1, endian_conversion_required=0)
    at totempg.c:460
#7  0x00000000004157fc in totempg_deliver_fn (nodeid=1, 
    iovec=<value optimized out>, iov_len=<value optimized out>, 
    endian_conversion_required=0) at totempg.c:604
#8  0x0000000000410488 in messages_deliver_to_app (instance=0x2aaaaaaae010, 
    skip=0, end_point=<value optimized out>) at totemsrp.c:3555
#9  0x00000000004121cc in message_handler_orf_token (instance=0x2aaaaaaae010, 
    msg=<value optimized out>, msg_len=<value optimized out>, 
    endian_conversion_needed=<value optimized out>) at totemsrp.c:3427
#10 0x0000000000409f43 in rrp_deliver_fn (context=0x54ace60, msg=0x54d5918, 
    msg_len=70) at totemrrp.c:1308
#11 0x00000000004084fb in net_deliver_fn (handle=<value optimized out>, 
    fd=<value optimized out>, revents=<value optimized out>, data=0x54d5270)
    at totemnet.c:695
#12 0x0000000000405d10 in poll_run (handle=0) at aispoll.c:402
#13 0x0000000000418a44 in main (argc=<value optimized out>, 
    argv=<value optimized out>) at main.c:620
(gdb) up


The message contents are as follows:
(gdb) print stripped_iovec
$8 = {iov_base = 0x7fff22bb2110, iov_len = 20}
(gdb) print ((char *)stripped_iovec.iov_base)[0]
$9 = 0 '\0'
(gdb) print ((char *)stripped_iovec.iov_base)[1]
$10 = 0 '\0'
(gdb) print ((char *)stripped_iovec.iov_base)[2]
$11 = 122 'z'
(gdb) print ((unsigned char *)stripped_iovec.iov_base)[3]
$12 = 211 '�'
(gdb) print ((unsigned char *)stripped_iovec.iov_base)[4]
$13 = 0 '\0'
(gdb) print ((unsigned char *)stripped_iovec.iov_base)[5]
$14 = 0 '\0'
(gdb) print ((unsigned char *)stripped_iovec.iov_base)[6]
$15 = 0 '\0'
(gdb) print ((unsigned char *)stripped_iovec.iov_base)[7]
$16 = 0 '\0'
(gdb) print ((unsigned char *)stripped_iovec.iov_base)[8]
$17 = 1 '\001'
(gdb) print ((unsigned char *)stripped_iovec.iov_base)[9]
$18 = 0 '\0'
(gdb) print ((unsigned char *)stripped_iovec.iov_base)[10]
$19 = 0 '\0'
(gdb) print ((unsigned char *)stripped_iovec.iov_base)[11]
$20 = 0 '\0'
(gdb) print ((unsigned char *)stripped_iovec.iov_base)[12]
$21 = 0 '\0'
(gdb) print ((unsigned char *)stripped_iovec.iov_base)[13]
$22 = 0 '\0'
(gdb) print ((unsigned char *)stripped_iovec.iov_base)[14]
$23 = 0 '\0'
(gdb) print ((unsigned char *)stripped_iovec.iov_base)[15]
$24 = 0 '\0'
(gdb) print ((unsigned char *)stripped_iovec.iov_base)[16]
$25 = 7 '\a'
(gdb) print ((unsigned char *)stripped_iovec.iov_base)[17]
$26 = 36 '$'
(gdb) print ((unsigned char *)stripped_iovec.iov_base)[18]
$27 = 0 '\0'
(gdb) print ((unsigned char *)stripped_iovec.iov_base)[19]
$28 = 0 '\0'

Please note that cman uses its own totem groups and its own delivery channels for communication, not the built-in openais versions.

I also noticed that cl_protheader does not look safe for mixed 32/64-bit communication, but I could be wrong.

Comment 1 Christine Caulfield 2010-05-10 08:38:31 UTC
That's rather an odd stack trace. unbind_con is never called from any of the routines listed above it! Also, there is nowhere I can see in unbind_con that can segfault - the only pointer it uses is dereferenced several times before unbind_con is called.

Can you run it with proper debugging info, or post reliable reproduction instructions?

Comment 2 Steven Dake 2010-05-11 20:11:30 UTC
#0  unbind_con (con=0xcf4d750) at commands.c:1448
1448			port_array[con->port] = NULL;
(gdb) print con->port
$1 = 1649748712
(gdb) print con
$2 = (struct connection *) 0xcf4d750
(gdb) print *con
$3 = {fd = 1649737756, type = 53, port = 1649748712, shutdown_reply = 53, events = 0, confchg = 0, write_msgs = {n = 0x41, p = 0x3562552a38}, 
  num_write_msgs = 1649748536, next = 0x11acc2b8b9f53e60, list = {n = 0x2aaaab0bdb70, p = 0xcf4cf38}}
(gdb) up
#1  0x00002aaaaaeb3728 in remove_client (handle=<value optimized out>, con=0xcf4d750) at daemon.c:124
124		unbind_con(con);
(gdb) up
#2  0x00002aaaaaeb38db in send_reply_message (con=0xcf4d750, msg=0x7fffe372b680) at daemon.c:89
89				remove_client(ais_poll_handle, con);
(gdb) up
#3  0x00002aaaaaeb3a61 in send_status_return (con=0x7fffe372b610, cmd=<value optimized out>, status=1649748712) at daemon.c:421
421		return send_reply_message(con, (struct sock_header *)&msg);
(gdb) print con
$4 = (struct connection *) 0x7fffe372b610
(gdb) print *con
$5 = {fd = 59392, type = CON_COMMS, port = 4294967287, shutdown_reply = 4294967295, events = 3815946064, confchg = 32767, write_msgs = {
    n = 0x2aaaaaeb3728, p = 0x7fffe372b750}, num_write_msgs = 217372496, next = 0x7fffe372b680, list = {n = 0x2aaaaaeb38db, p = 0x7fffe372b750}}
(gdb) up
#4  0x00002aaaaaeb6ff5 in send_to_userport (fromport=<value optimized out>, toport=<value optimized out>, nodeid=2, tgtid=<value optimized out>, 
    recv_buf=0x7fffe372b750 "\a�\003", len=4, endian_conv=0) at commands.c:1901
1901					send_status_return(shutdown_con, CMAN_CMD_TRY_SHUTDOWN, 0);
(gdb) down
#3  0x00002aaaaaeb3a61 in send_status_return (con=0x7fffe372b610, cmd=<value optimized out>, status=1649748712) at daemon.c:421
421		return send_reply_message(con, (struct sock_header *)&msg);
(gdb) up
#4  0x00002aaaaaeb6ff5 in send_to_userport (fromport=<value optimized out>, toport=<value optimized out>, nodeid=2, tgtid=<value optimized out>, 
    recv_buf=0x7fffe372b750 "\a�\003", len=4, endian_conv=0) at commands.c:1901
1901					send_status_return(shutdown_con, CMAN_CMD_TRY_SHUTDOWN, 0);
(gdb) print shutdown_con
$6 = (struct connection *) 0xcf4d750
(gdb) print *shutdown_con
$7 = {fd = 1649737756, type = 53, port = 1649748712, shutdown_reply = 53, events = 0, confchg = 0, write_msgs = {n = 0x41, p = 0x3562552a38}, 
  num_write_msgs = 1649748536, next = 0x11acc2b8b9f53e60, list = {n = 0x2aaaab0bdb70, p = 0xcf4cf38}}
(gdb) up
#5  0x00002aaaaaeb4c49 in cman_deliver_fn (nodeid=<value optimized out>, iovec=0x7fffe372b7d0, iov_len=<value optimized out>, 
    endian_conversion_required=0) at ais.c:412
412			send_to_userport(header->srcport, header->tgtport,
(gdb) print *header
$8 = {tgtport = 0 '\0', srcport = 0 '\0', pad = 25138, flags = 0, srcid = 2, tgtid = 0}
(gdb) print *iovec
$9 = {iov_base = 0x7fffe372b740, iov_len = 20}
(gdb) print sizeof (struct cl_protheader)
$10 = 16
(gdb) print buf
$11 = 0
(gdb) print iovec->iov_base
$12 = (void *) 0x7fffe372b740
(gdb) print header
$13 = (struct cl_protheader *) 0x7fffe372b740
(gdb) print *header
$14 = {tgtport = 0 '\0', srcport = 0 '\0', pad = 25138, flags = 0, srcid = 2, tgtid = 0}
(gdb) print buf
$15 = 0
(gdb) down
#4  0x00002aaaaaeb6ff5 in send_to_userport (fromport=<value optimized out>, toport=<value optimized out>, nodeid=2, tgtid=<value optimized out>, 
    recv_buf=0x7fffe372b750 "\a�\003", len=4, endian_conv=0) at commands.c:1901
1901					send_status_return(shutdown_con, CMAN_CMD_TRY_SHUTDOWN, 0);
(gdb) print to_port
No symbol "to_port" in current context.
(gdb) up
#5  0x00002aaaaaeb4c49 in cman_deliver_fn (nodeid=<value optimized out>, iovec=0x7fffe372b7d0, iov_len=<value optimized out>, 
    endian_conversion_required=0) at ais.c:412
412			send_to_userport(header->srcport, header->tgtport,
(gdb) print to_port
No symbol "to_port" in current context.
(gdb) up
#6  0x00000000004152f5 in ?? ()
(gdb) down
#5  0x00002aaaaaeb4c49 in cman_deliver_fn (nodeid=<value optimized out>, iovec=0x7fffe372b7d0, iov_len=<value optimized out>, 
    endian_conversion_required=0) at ais.c:412
412			send_to_userport(header->srcport, header->tgtport,
(gdb) print header->srcport
$16 = 0 '\0'
(gdb) print header->srcid
$17 = 2
(gdb) print header->tgtport
$18 = 0 '\0'
(gdb) print header->tgtid
$19 = 0
(gdb) print endian_conversion_required
$20 = 0
(gdb) down
#4  0x00002aaaaaeb6ff5 in send_to_userport (fromport=<value optimized out>, toport=<value optimized out>, nodeid=2, tgtid=<value optimized out>, 
    recv_buf=0x7fffe372b750 "\a�\003", len=4, endian_conv=0) at commands.c:1901
1901					send_status_return(shutdown_con, CMAN_CMD_TRY_SHUTDOWN, 0);
(gdb) print shutdown_con
$21 = (struct connection *) 0xcf4d750
(gdb) rpint *shutdown_con
Undefined command: "rpint".  Try "help".
(gdb) print *shutdown_con
$22 = {fd = 1649737756, type = 53, port = 1649748712, shutdown_reply = 53, events = 0, confchg = 0, write_msgs = {n = 0x41, p = 0x3562552a38}, 
  num_write_msgs = 1649748536, next = 0x11acc2b8b9f53e60, list = {n = 0x2aaaab0bdb70, p = 0xcf4cf38}}
(gdb)

Comment 3 Steven Dake 2010-05-11 20:13:10 UTC
Chrissie,

I've attached a patch which adds -g to the "optimized" build environment for the service engine. This is required to get the debug symbols shown in comment #2. It has no effect on optimization and only produces a slightly larger debuginfo package. I believe there was a 5.5 issue for this, but it did not solve the problem - maybe you can reopen that bug or file a new bug with this patch.

Regards
-steve

Comment 4 Steven Dake 2010-05-11 20:14:29 UTC
Created attachment 413246 [details]
patch which adds -g to CFLAGS for service engine to get proper debug output

Comment 7 Christine Caulfield 2010-05-13 12:38:46 UTC
Created attachment 413742 [details]
Patch that might fix

Thanks Steven, that gave me the information I needed. Try this patch, I hope it will fix it.

Comment 8 Christine Caulfield 2010-05-13 15:11:55 UTC
I've just checked, and this patch is already in STABLE3/RHEL6.

commit e5d433e74f26ddab3e7eedb4c50f2b3546738ecd
Author: Christine Caulfield <ccaulfie>
Date:   Fri Sep 19 13:02:40 2008 +0100

    cman: Clean shutdown_con if the controlling process is killed.
    
    If a shutdown is initiated by a process that is then killed, the
    shutdown_con isn't cleared. So if another process replies to the
    shutdown request cman could segfault.
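
    The failure mode described in this commit message can be modeled in a
    few lines. The following is a simplified illustrative sketch, not the
    real cman source: struct connection, shutdown_con, and remove_client
    here are stripped-down stand-ins for their cman counterparts.

    ```c
    #include <assert.h>
    #include <stdlib.h>

    /* Simplified model of the bug: a global shutdown_con keeps
     * pointing at a connection after that connection is destroyed,
     * so a later shutdown reply dereferences freed (and possibly
     * reused) memory - matching the garbage *shutdown_con contents
     * seen in the gdb session in comment 2. */
    struct connection { int fd; unsigned int port; };

    static struct connection *shutdown_con;  /* pending shutdown requester */

    static void remove_client(struct connection *con)
    {
        if (shutdown_con == con)
            shutdown_con = NULL;  /* the fix: clear the pointer on teardown */
        free(con);
    }

    int main(void)
    {
        struct connection *con = calloc(1, sizeof(*con));
        shutdown_con = con;   /* a client initiated a shutdown ...       */
        remove_client(con);   /* ... and was then killed and cleaned up  */

        /* With the fix, a reply to the shutdown request finds no
         * pending requester instead of writing through a dangling
         * pointer. */
        assert(shutdown_con == NULL);
        return 0;
    }
    ```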


In this case the process is killing itself because the output queue is full. So you might still see odd failures that can be fixed by increasing the output queue size: 

<cman max_queued="1024"/>
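
In cluster.conf that attribute sits on the <cman/> element. A minimal sketch of where it goes; the cluster name and config_version here are placeholders, not values from this report:

<?xml version="1.0"?>
<!-- /etc/cluster/cluster.conf (name and config_version are placeholders) -->
<cluster name="example" config_version="2">
  <cman max_queued="1024"/>
  <!-- clusternodes, fencing, etc. unchanged -->
</cluster>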

Comment 9 Steven Dake 2010-05-17 15:01:58 UTC
Christine,

I was unable to test this patch before releasing the test cluster back to QE. The segfault did have a fairly high reproduction rate with the 5.4 QE test cases, however.

Regards
-steve

Comment 16 Lon Hohberger 2012-02-10 17:04:39 UTC
*** Bug 746936 has been marked as a duplicate of this bug. ***

Comment 17 errata-xmlrpc 2012-02-21 05:23:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0167.html