Bug 696887

Summary: segfault due to invalid 'conn' pointer after killing "corosync-cfgtool -r"
Product: Red Hat Enterprise Linux 6 Reporter: Jan Friesse <jfriesse>
Component: corosyncAssignee: Jan Friesse <jfriesse>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: high Docs Contact:
Priority: medium    
Version: 6.2CC: agk, cluster-maint, djansa, fdinitto, jkortus, sdake, tserong
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: corosync-1.4.0-1.el6 Doc Type: Bug Fix
Doc Text:
Cause: Reference counting on the cfg server in corosync was incorrect. Consequence: killing corosync-cfgtool -r before it completed would result in Corosync segfault. Fix: Add proper reference counting as per architecture. Result: Corosync no longer segfaults with this test case.
Story Points: ---
Clone Of: 695191 Environment:
Last Closed: 2011-12-06 11:50:28 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 695191    
Bug Blocks:    

Description Jan Friesse 2011-04-15 07:16:31 UTC
+++ This bug was initially created as a clone of Bug #695191 +++

Description of problem:

If you run "corosync-cfgtool -r" and hit CTRL-C to terminate it (may require quick fingers), corosync will segfault.

Version-Release number of selected component (if applicable):

1.3.0

How reproducible:

Reasonably easy to reproduce.

Steps to Reproduce:
1. Configure two rings (rrp_mode: passive, if that matters).
2. Start corosync.
3. Drop traffic to one of the rings - in my case, ring 0, via
   "iptables -A OUTPUT -d 192.168.4.186 -j DROP".  This will tend
   to make "corosync-cfgtool -r" block a bit.
4. Run "corosync-cfgtool -r" and immediately hit CTRL-C.  You might
   have to do this a couple of times (or put it in a while loop in
   the shell and hit CTRL-C at random), but if you do terminate
   corosync-cfgtool, corosync itself will immediately segfault.

Actual results:

Corosync segfaults with the following backtrace:

(gdb) bt
#0  memcpy () at ../sysdeps/i386/i686/memcpy.S:100
#1  0xb774ded3 in coroipcs_response_send (conn=0x80f6f88,
    msg=0xbf7f7c90, mlen=24) at /usr/include/bits/string3.h:52
#2  0xada9dab4 in message_handler_req_exec_cfg_ringreenable
    (message=0xbf7f7d20, nodeid=956606656) at cfg.c:583
#3  0x0804de75 in deliver_fn (nodeid=956606656, msg=0xbf7f7d20,
    msg_len=32, endian_conversion_required=0) at main.c:852
#4  0xb77793c0 in app_deliver_fn (nodeid=956606656, msg=<value
    optimized out>, msg_len=37, endian_conversion_required=0)
    at totempg.c:506
#5  0xb777991e in totempg_deliver_fn (nodeid=956606656,
    msg=0x80f6fea, msg_len=37, endian_conversion_required=0)
    at totempg.c:618
#6  0xb7777a25 in totemmrp_deliver_fn (nodeid=956606656,
    msg=0x80f6fea, msg_len=47, endian_conversion_required=0) at
    totemmrp.c:98
#7  0xb776ed81 in messages_deliver_to_app (instance=0xb5bb4008,
    skip=0, end_point=40) at totemsrp.c:3704
#8  0xb77752a9 in message_handler_orf_token (instance=0xb5bb4008,
    msg=0x80c9dbc, msg_len=71, endian_conversion_needed=0) at
    totemsrp.c:3577
#9  0xb776d3c9 in main_deliver_fn (context=0xb5bb4008, msg=0x80c9dbc,
    msg_len=71) at totemsrp.c:4356
#10 0xb776b27b in passive_token_recv (rrp_instance=0x8082be8,
    iface_no=1, context=0xb5bb4008, msg=0x80c9dbc, msg_len=71, 
    token_seq=539) at totemrrp.c:876
#11 0xb776a4a3 in rrp_deliver_fn (context=0x8089358, msg=0x80c9dbc,
    msg_len=71) at totemrrp.c:1500
#12 0xb77651d2 in net_deliver_fn (handle=5880381755227111424, fd=13,
    revents=1, data=0x80c9758) at totemudp.c:1244
#13 0xb7761100 in poll_run (handle=5880381755227111424) at
    coropoll.c:510
#14 0x0804f508 in main (argc=Cannot access memory at address 0x6)
    at main.c:1813

Expected results:

No segfault.

Additional info:

The relevant code is:

int coroipcs_response_send (void *conn, const void *msg, size_t mlen)
{
    struct conn_info *conn_info = (struct conn_info *)conn;

    memcpy (conn_info->response_buffer, msg, mlen);

    ipc_sem_post (conn_info->control_buffer, SEMAPHORE_RESPONSE);

    api->stats_increment_value (conn_info->stats_handle, "responses");
    return (0);
}

At this point, 'conn' points to junk, presumably because
corosync-cfgtool was terminated after sending the ring re-enable
message but before receiving a response, so there's no "other end" for
the IPC.

--- Additional comment from tserong on 2011-04-14 01:40:05 EDT ---

Created attachment 491962 [details]
proposed patch

I had some discussion about this issue with asalkeld on IRC, and he suggested this patch, which seems to work for me.

I wondered though if the refcounting should go further out, say, in cfg_lib_init_fn() and cfg_lib_exit_fn() in services/cfg.c (similar to the init and exit functions in services/cpg.c)?  Or would you like me to take this discussion back to the mailing list?

Thanks,

Tim

--- Additional comment from sdake on 2011-04-14 11:43:20 EDT ---

Tim,

Patch looks good as is.  Please post to ml with signoff and I'll get it merged.  I'd rather not use refcount in lib_init_fn and lib_exit_fn but it was necessary because of the design of cpg.

Regards
-steve

--- Additional comment from jfriesse on 2011-04-15 03:15:32 EDT ---

Patch committed in upstream git as 5b92829d6c20ab550aca6c0b273618577ea92b23

Comment 1 Jan Friesse 2011-04-15 07:16:47 UTC
Patch committed in upstream git as 5b92829d6c20ab550aca6c0b273618577ea92b23

Comment 9 Steven Dake 2011-10-27 18:52:46 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: Reference counting on the cfg server in corosync was incorrect.
  Consequence: killing corosync-cfgtool -r before it completed would result in Corosync segfault.
  Fix: Add proper reference counting as per architecture.
  Result: Corosync no longer segfaults with this test case.

Comment 10 errata-xmlrpc 2011-12-06 11:50:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1515.html