Description of problem: cman triggers aisexec segfault Version-Release number of selected component (if applicable): [root@west-01 openais]# rpm -qa | grep cman cman-2.0.115-1.el5_4.11 cman-debuginfo-2.0.115-1.el5_4.11 How reproducible: run sts on 4 node x86_64 west01-west-04 cluster Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info: I have seen this stack backtrace before in cman, but the symbols were not activated so we filed a symbol addition bug and I thought also a bug for this issue. But I can't find this particular issue in bz so filing bug possibly a dup. Program terminated with signal 11, Segmentation fault. #0 0x00002aaaaaeb61eb in unbind_con () from /usr/libexec/lcrso/service_cman.lcrso (gdb) where #0 0x00002aaaaaeb61eb in unbind_con () from /usr/libexec/lcrso/service_cman.lcrso #1 0x00002aaaaaeb3728 in ?? () from /usr/libexec/lcrso/service_cman.lcrso #2 0x00002aaaaaeb38db in ?? () from /usr/libexec/lcrso/service_cman.lcrso #3 0x00002aaaaaeb3a61 in send_status_return () from /usr/libexec/lcrso/service_cman.lcrso #4 0x00002aaaaaeb6ff5 in send_to_userport () from /usr/libexec/lcrso/service_cman.lcrso #5 0x00002aaaaaeb4c49 in ?? () from /usr/libexec/lcrso/service_cman.lcrso #6 0x0000000000415375 in app_deliver_fn (nodeid=1, iovec=<value optimized out>, iov_len=1, endian_conversion_required=0) at totempg.c:460 #7 0x00000000004157fc in totempg_deliver_fn (nodeid=1, iovec=<value optimized out>, iov_len=<value optimized out>, endian_conversion_required=0) at totempg.c:604 #8 0x0000000000410488 in messages_deliver_to_app (instance=0x2aaaaaaae010, skip=0, end_point=<value optimized out>) at totemsrp.c:3555 #9 0x00000000004121cc in message_handler_orf_token (instance=0x2aaaaaaae010, msg=<value optimized out>, msg_len=<value optimized out>, endian_conversion_needed=<value optimized out>) at totemsrp.c:3427 #10 0x0000000000409f43 in rrp_deliver_fn (context=0x54ace60, msg=0x54d5918, msg_len=70) at totemrrp.c:1308 #11 0x00000000004084fb in net_deliver_fn (handle=<value optimized out>, ---Type <return> to continue, or q <return> to quit--- fd=<value optimized out>, revents=<value optimized out>, data=0x54d5270) at totemnet.c:695 #12 0x0000000000405d10 in poll_run (handle=0) at aispoll.c:402 #13 0x0000000000418a44 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:620 (gdb) up The message contents are as follows: (gdb) print stripped_iovec $8 = {iov_base = 0x7fff22bb2110, iov_len = 20} (gdb) print ((char *)stripped_iovec.iov_base)[0] $9 = 0 '\0' (gdb) print ((char *)stripped_iovec.iov_base)[1] $10 = 0 '\0' (gdb) print ((char *)stripped_iovec.iov_base)[2] $11 = 122 'z' (gdb) print ((unsigned char *)stripped_iovec.iov_base)[3] $12 = 211 '�' (gdb) print ((unsigned char *)stripped_iovec.iov_base)[4] $13 = 0 '\0' (gdb) print ((unsigned char *)stripped_iovec.iov_base)[5] $14 = 0 '\0' (gdb) print ((unsigned char *)stripped_iovec.iov_base)[6] $15 = 0 '\0' (gdb) print ((unsigned char *)stripped_iovec.iov_base)[7] $16 = 0 '\0' (gdb) print ((unsigned char *)stripped_iovec.iov_base)[8] $17 = 1 '\001' (gdb) print ((unsigned char *)stripped_iovec.iov_base)[9] $18 = 0 '\0' (gdb) print ((unsigned char *)stripped_iovec.iov_base)[10] $19 = 0 '\0' (gdb) print ((unsigned char *)stripped_iovec.iov_base)[11] $20 = 0 '\0' (gdb) print ((unsigned char *)stripped_iovec.iov_base)[12] $21 = 0 '\0' (gdb) print ((unsigned char *)stripped_iovec.iov_base)[13] $22 = 0 '\0' (gdb) print ((unsigned char *)stripped_iovec.iov_base)[14] $23 = 0 '\0' (gdb) print ((unsigned char *)stripped_iovec.iov_base)[15] $24 = 0 '\0' (gdb) print ((unsigned char *)stripped_iovec.iov_base)[16] $25 = 7 '\a' (gdb) print ((unsigned char *)stripped_iovec.iov_base)[17] $26 = 36 '$' (gdb) print ((unsigned char *)stripped_iovec.iov_base)[18] $27 = 0 '\0' (gdb) print ((unsigned char *)stripped_iovec.iov_base)[19] $28 = 0 '\0' Please note cman uses its own totem groups for communication and its own delivery channels, not the builtin openais versions. I noticed that cl_protheader does not look safe for 32/64 cross communication but could be wrong.
That's rather an odd stacktrace. unbind_con is never called from any of the routines listed above it! Also there is nowhere that I can see in unbind_con that can segv - the only pointer is uses is defererenced several times before it is called. Can you run it with proper debugging info? or post reliable reproduction instructions ?
#0 unbind_con (con=0xcf4d750) at commands.c:1448 1448 port_array[con->port] = NULL; (gdb) print con->port $1 = 1649748712 (gdb) print con $2 = (struct connection *) 0xcf4d750 (gdb) print *con $3 = {fd = 1649737756, type = 53, port = 1649748712, shutdown_reply = 53, events = 0, confchg = 0, write_msgs = {n = 0x41, p = 0x3562552a38}, num_write_msgs = 1649748536, next = 0x11acc2b8b9f53e60, list = {n = 0x2aaaab0bdb70, p = 0xcf4cf38}} (gdb) up #1 0x00002aaaaaeb3728 in remove_client (handle=<value optimized out>, con=0xcf4d750) at daemon.c:124 124 unbind_con(con); (gdb) up #2 0x00002aaaaaeb38db in send_reply_message (con=0xcf4d750, msg=0x7fffe372b680) at daemon.c:89 89 remove_client(ais_poll_handle, con); (gdb) up #3 0x00002aaaaaeb3a61 in send_status_return (con=0x7fffe372b610, cmd=<value optimized out>, status=1649748712) at daemon.c:421 421 return send_reply_message(con, (struct sock_header *)&msg); (gdb) print con $4 = (struct connection *) 0x7fffe372b610 (gdb) print *con $5 = {fd = 59392, type = CON_COMMS, port = 4294967287, shutdown_reply = 4294967295, events = 3815946064, confchg = 32767, write_msgs = { n = 0x2aaaaaeb3728, p = 0x7fffe372b750}, num_write_msgs = 217372496, next = 0x7fffe372b680, list = {n = 0x2aaaaaeb38db, p = 0x7fffe372b750}} (gdb) up #4 0x00002aaaaaeb6ff5 in send_to_userport (fromport=<value optimized out>, toport=<value optimized out>, nodeid=2, tgtid=<value optimized out>, recv_buf=0x7fffe372b750 "\a�\003", len=4, endian_conv=0) at commands.c:1901 1901 send_status_return(shutdown_con, CMAN_CMD_TRY_SHUTDOWN, 0); (gdb) down #3 0x00002aaaaaeb3a61 in send_status_return (con=0x7fffe372b610, cmd=<value optimized out>, status=1649748712) at daemon.c:421 421 return send_reply_message(con, (struct sock_header *)&msg); (gdb) up #4 0x00002aaaaaeb6ff5 in send_to_userport (fromport=<value optimized out>, toport=<value optimized out>, nodeid=2, tgtid=<value optimized out>, recv_buf=0x7fffe372b750 "\a�\003", len=4, endian_conv=0) at commands.c:1901 1901 send_status_return(shutdown_con, CMAN_CMD_TRY_SHUTDOWN, 0); (gdb) print shutdown_con $6 = (struct connection *) 0xcf4d750 (gdb) print *shutdown_con $7 = {fd = 1649737756, type = 53, port = 1649748712, shutdown_reply = 53, events = 0, confchg = 0, write_msgs = {n = 0x41, p = 0x3562552a38}, num_write_msgs = 1649748536, next = 0x11acc2b8b9f53e60, list = {n = 0x2aaaab0bdb70, p = 0xcf4cf38}} (gdb) up #5 0x00002aaaaaeb4c49 in cman_deliver_fn (nodeid=<value optimized out>, iovec=0x7fffe372b7d0, iov_len=<value optimized out>, endian_conversion_required=0) at ais.c:412 412 send_to_userport(header->srcport, header->tgtport, (gdb) print *header $8 = {tgtport = 0 '\0', srcport = 0 '\0', pad = 25138, flags = 0, srcid = 2, tgtid = 0} (gdb) print *iovec $9 = {iov_base = 0x7fffe372b740, iov_len = 20} (gdb) print sizeof (struct cl_protheader) $10 = 16 (gdb) print buf $11 = 0 (gdb) print iovec->iov_base $12 = (void *) 0x7fffe372b740 (gdb) print header $13 = (struct cl_protheader *) 0x7fffe372b740 (gdb) print *header $14 = {tgtport = 0 '\0', srcport = 0 '\0', pad = 25138, flags = 0, srcid = 2, tgtid = 0} (gdb) print buf $15 = 0 (gdb) down #4 0x00002aaaaaeb6ff5 in send_to_userport (fromport=<value optimized out>, toport=<value optimized out>, nodeid=2, tgtid=<value optimized out>, recv_buf=0x7fffe372b750 "\a�\003", len=4, endian_conv=0) at commands.c:1901 1901 send_status_return(shutdown_con, CMAN_CMD_TRY_SHUTDOWN, 0); (gdb) print to_port No symbol "to_port" in current context. (gdb) up #5 0x00002aaaaaeb4c49 in cman_deliver_fn (nodeid=<value optimized out>, iovec=0x7fffe372b7d0, iov_len=<value optimized out>, endian_conversion_required=0) at ais.c:412 412 send_to_userport(header->srcport, header->tgtport, (gdb) print to_port No symbol "to_port" in current context. (gdb) up #6 0x00000000004152f5 in ?? () (gdb) down #5 0x00002aaaaaeb4c49 in cman_deliver_fn (nodeid=<value optimized out>, iovec=0x7fffe372b7d0, iov_len=<value optimized out>, endian_conversion_required=0) at ais.c:412 412 send_to_userport(header->srcport, header->tgtport, (gdb) print header->srcport $16 = 0 '\0' (gdb) print header->srcid $17 = 2 (gdb) print header->tgtport $18 = 0 '\0' (gdb) print header->tgtid $19 = 0 (gdb) print endian_conversion_required $20 = 0 (gdb) down #4 0x00002aaaaaeb6ff5 in send_to_userport (fromport=<value optimized out>, toport=<value optimized out>, nodeid=2, tgtid=<value optimized out>, recv_buf=0x7fffe372b750 "\a�\003", len=4, endian_conv=0) at commands.c:1901 1901 send_status_return(shutdown_con, CMAN_CMD_TRY_SHUTDOWN, 0); (gdb) print shutdown_con $21 = (struct connection *) 0xcf4d750 (gdb) rpint *shutdown_con Undefined command: "rpint". Try "help". (gdb) print *shutdown_con $22 = {fd = 1649737756, type = 53, port = 1649748712, shutdown_reply = 53, events = 0, confchg = 0, write_msgs = {n = 0x41, p = 0x3562552a38}, num_write_msgs = 1649748536, next = 0x11acc2b8b9f53e60, list = {n = 0x2aaaab0bdb70, p = 0xcf4cf38}} (gdb)
Chrissie, I've attached a patch which adds -g to the "optimized" build environment for the service engine. This is required to get debug symbols in comment #2. It has no affect on optimization and only produces a slightly larger package for the debuginfo package. I believe there was a 5.5 issue on this but it did not solve the problem - maybe you can reopen that bug or file a new bug with this patch. Regards -steve
Created attachment 413246 [details] patch which adds -g to CFLAGS for service engine to get proper debug output
Created attachment 413742 [details] Patch that might fix Thanks Steven, that gave me the information I needed. Try this patch, I hope it will fix it.
I've just checked, and this patch is already in STABLE3/RHEL6. commit e5d433e74f26ddab3e7eedb4c50f2b3546738ecd Author: Christine Caulfield <ccaulfie> Date: Fri Sep 19 13:02:40 2008 +0100 cman: Clean shutdown_con if the controlling process is killed. If a shutdown is initiated by a process that is then killed, the shutdown_con isn't cleared. So if another process replies to the shutdown request cman could segfault. In this case the process is killing itself because the output queue is full. So you might still see odd failures that can be fixed by increasing the output queue size: <cman max_queued="1024"/>
Christine, I was unable to test this patch before releasing the test cluster back to QE. It did have a fairly high production rate with the 5.4 qe test cases however resulting in a segfault. Regards -steve
*** Bug 746936 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2012-0167.html