695191 – segfault due to invalid 'conn' pointer after killing "corosync-cfgtool -r"

Bug 695191 - segfault due to invalid 'conn' pointer after killing "corosync-cfgtool -r"

Summary: segfault due to invalid 'conn' pointer after killing "corosync-cfgtool -r"

Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	corosync
Sub Component:
Version:	rawhide
Hardware:	All
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Assignee:	Jan Friesse
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	696887
TreeView+	depends on / blocked

Reported:	2011-04-11 04:13 UTC by Tim Serong
Modified:	2011-04-15 07:16 UTC (History)
CC List:	3 users (show)
Fixed In Version:
Clone Of:
Clones:	696887 (view as bug list)
Environment:
Last Closed:	2011-04-15 07:15:32 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
proposed patch (827 bytes, patch) 2011-04-14 05:40 UTC, Tim Serong	no flags	Details \| Diff
View All

Description Tim Serong 2011-04-11 04:13:04 UTC

Description of problem:

If you run "corosync-cfgtool -r" and hit CTRL-C to terminate it (may require quick fingers), corosync will segfault.

Version-Release number of selected component (if applicable):

1.3.0

How reproducible:

Reasonably easy to reproduce.

Steps to Reproduce:
1. Configure two rings (rrp_mode: passive, if that matters).
2. Start corosync.
3. Drop traffic to one of the rings - in my case, ring 0, via
   "iptables -A OUTPUT -d 192.168.4.186 -j DROP".  This will tend
   to make "corosync-cfgtool -r" block a bit.
4. Run "corosync-cfgtool -r" and immediately hit CTRL-C.  You might
   have to do this a couple of times (or put it in a while loop in
   the shell and hit CTRL-C at random), but if you do terminate
   corosync-cfgtool, corosync itself will immediately segfault.

Actual results:

Corosync segfaults with the following backtrace:

(gdb) bt
#0  memcpy () at ../sysdeps/i386/i686/memcpy.S:100
#1  0xb774ded3 in coroipcs_response_send (conn=0x80f6f88,
    msg=0xbf7f7c90, mlen=24) at /usr/include/bits/string3.h:52
#2  0xada9dab4 in message_handler_req_exec_cfg_ringreenable
    (message=0xbf7f7d20, nodeid=956606656) at cfg.c:583
#3  0x0804de75 in deliver_fn (nodeid=956606656, msg=0xbf7f7d20,
    msg_len=32, endian_conversion_required=0) at main.c:852
#4  0xb77793c0 in app_deliver_fn (nodeid=956606656, msg=<value
    optimized out>, msg_len=37, endian_conversion_required=0)
    at totempg.c:506
#5  0xb777991e in totempg_deliver_fn (nodeid=956606656,
    msg=0x80f6fea, msg_len=37, endian_conversion_required=0)
    at totempg.c:618
#6  0xb7777a25 in totemmrp_deliver_fn (nodeid=956606656,
    msg=0x80f6fea, msg_len=47, endian_conversion_required=0) at
    totemmrp.c:98
#7  0xb776ed81 in messages_deliver_to_app (instance=0xb5bb4008,
    skip=0, end_point=40) at totemsrp.c:3704
#8  0xb77752a9 in message_handler_orf_token (instance=0xb5bb4008,
    msg=0x80c9dbc, msg_len=71, endian_conversion_needed=0) at
    totemsrp.c:3577
#9  0xb776d3c9 in main_deliver_fn (context=0xb5bb4008, msg=0x80c9dbc,
    msg_len=71) at totemsrp.c:4356
#10 0xb776b27b in passive_token_recv (rrp_instance=0x8082be8,
    iface_no=1, context=0xb5bb4008, msg=0x80c9dbc, msg_len=71, 
    token_seq=539) at totemrrp.c:876
#11 0xb776a4a3 in rrp_deliver_fn (context=0x8089358, msg=0x80c9dbc,
    msg_len=71) at totemrrp.c:1500
#12 0xb77651d2 in net_deliver_fn (handle=5880381755227111424, fd=13,
    revents=1, data=0x80c9758) at totemudp.c:1244
#13 0xb7761100 in poll_run (handle=5880381755227111424) at
    coropoll.c:510
#14 0x0804f508 in main (argc=Cannot access memory at address 0x6)
    at main.c:1813

Expected results:

No segfault.

Additional info:

The relevant code is:

int coroipcs_response_send (void *conn, const void *msg, size_t mlen)
{
    struct conn_info *conn_info = (struct conn_info *)conn;

    memcpy (conn_info->response_buffer, msg, mlen);

    ipc_sem_post (conn_info->control_buffer, SEMAPHORE_RESPONSE);

    api->stats_increment_value (conn_info->stats_handle, "responses");
    return (0);
}

At this point, 'conn' points to junk, presumably because
corosync-cfgtool was terminated after sending the ring re-enable
message but before receiving a response, so there's no "other end" for
the IPC.

Comment 1 Tim Serong 2011-04-14 05:40:05 UTC

Created attachment 491962 [details]
proposed patch

I had some discussion about this issue with asalkeld on IRC, and he suggested this patch, which seems to work for me.

I wondered though if the refcounting should go further out, say, in cfg_lib_init_fn() and cfg_lib_exit_fn() in services/cfg.c (similar to the init and exit functions in services/cpg.c)?  Or would you like me to take this discussion back to the mailing list?

Thanks,

Tim

Comment 2 Steven Dake 2011-04-14 15:43:20 UTC

Tim,

Patch looks good as is.  Please post to ml with signoff and I'll get it merged.  I'd rather not use refcount in lib_init_fn and lib_exit_fn but it was necessary because of the design of cpg.

Regards
-steve

Comment 3 Jan Friesse 2011-04-15 07:15:32 UTC

Patch committed in upstream git as 5b92829d6c20ab550aca6c0b273618577ea92b23

Note You need to log in before you can comment on or make changes to this bug.