696887 – segfault due to invalid 'conn' pointer after killing "corosync-cfgtool -r"

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 696887 - segfault due to invalid 'conn' pointer after killing "corosync-cfgtool -r"

Summary: segfault due to invalid 'conn' pointer after killing "corosync-cfgtool -r"

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	corosync
Sub Component:
Version:	6.2
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Jan Friesse
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:	695191
Blocks:
TreeView+	depends on / blocked

Reported:	2011-04-15 07:16 UTC by Jan Friesse
Modified:	2011-12-06 11:50 UTC (History)
CC List:	7 users (show)
Fixed In Version:	corosync-1.4.0-1.el6
Doc Type:	Bug Fix
Doc Text:	Cause: Reference counting on the cfg server in corosync was incorrect. Consequence: killing corosync-cfgtool -r before it completed would result in Corosync segfault. Fix: Add proper reference counting as per architecture. Result: Corosync no longer segfaults with this test case.
Clone Of:	695191
Environment:
Last Closed:	2011-12-06 11:50:28 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2011:1515	0	normal	SHIPPED_LIVE	corosync bug fix and enhancement update	2011-12-06 00:38:47 UTC

Description Jan Friesse 2011-04-15 07:16:31 UTC

+++ This bug was initially created as a clone of Bug #695191 +++

Description of problem:

If you run "corosync-cfgtool -r" and hit CTRL-C to terminate it (may require quick fingers), corosync will segfault.

Version-Release number of selected component (if applicable):

1.3.0

How reproducible:

Reasonably easy to reproduce.

Steps to Reproduce:
1. Configure two rings (rrp_mode: passive, if that matters).
2. Start corosync.
3. Drop traffic to one of the rings - in my case, ring 0, via
   "iptables -A OUTPUT -d 192.168.4.186 -j DROP".  This will tend
   to make "corosync-cfgtool -r" block a bit.
4. Run "corosync-cfgtool -r" and immediately hit CTRL-C.  You might
   have to do this a couple of times (or put it in a while loop in
   the shell and hit CTRL-C at random), but if you do terminate
   corosync-cfgtool, corosync itself will immediately segfault.

Actual results:

Corosync segfaults with the following backtrace:

(gdb) bt
#0  memcpy () at ../sysdeps/i386/i686/memcpy.S:100
#1  0xb774ded3 in coroipcs_response_send (conn=0x80f6f88,
    msg=0xbf7f7c90, mlen=24) at /usr/include/bits/string3.h:52
#2  0xada9dab4 in message_handler_req_exec_cfg_ringreenable
    (message=0xbf7f7d20, nodeid=956606656) at cfg.c:583
#3  0x0804de75 in deliver_fn (nodeid=956606656, msg=0xbf7f7d20,
    msg_len=32, endian_conversion_required=0) at main.c:852
#4  0xb77793c0 in app_deliver_fn (nodeid=956606656, msg=<value
    optimized out>, msg_len=37, endian_conversion_required=0)
    at totempg.c:506
#5  0xb777991e in totempg_deliver_fn (nodeid=956606656,
    msg=0x80f6fea, msg_len=37, endian_conversion_required=0)
    at totempg.c:618
#6  0xb7777a25 in totemmrp_deliver_fn (nodeid=956606656,
    msg=0x80f6fea, msg_len=47, endian_conversion_required=0) at
    totemmrp.c:98
#7  0xb776ed81 in messages_deliver_to_app (instance=0xb5bb4008,
    skip=0, end_point=40) at totemsrp.c:3704
#8  0xb77752a9 in message_handler_orf_token (instance=0xb5bb4008,
    msg=0x80c9dbc, msg_len=71, endian_conversion_needed=0) at
    totemsrp.c:3577
#9  0xb776d3c9 in main_deliver_fn (context=0xb5bb4008, msg=0x80c9dbc,
    msg_len=71) at totemsrp.c:4356
#10 0xb776b27b in passive_token_recv (rrp_instance=0x8082be8,
    iface_no=1, context=0xb5bb4008, msg=0x80c9dbc, msg_len=71, 
    token_seq=539) at totemrrp.c:876
#11 0xb776a4a3 in rrp_deliver_fn (context=0x8089358, msg=0x80c9dbc,
    msg_len=71) at totemrrp.c:1500
#12 0xb77651d2 in net_deliver_fn (handle=5880381755227111424, fd=13,
    revents=1, data=0x80c9758) at totemudp.c:1244
#13 0xb7761100 in poll_run (handle=5880381755227111424) at
    coropoll.c:510
#14 0x0804f508 in main (argc=Cannot access memory at address 0x6)
    at main.c:1813

Expected results:

No segfault.

Additional info:

The relevant code is:

int coroipcs_response_send (void *conn, const void *msg, size_t mlen)
{
    struct conn_info *conn_info = (struct conn_info *)conn;

    memcpy (conn_info->response_buffer, msg, mlen);

    ipc_sem_post (conn_info->control_buffer, SEMAPHORE_RESPONSE);

    api->stats_increment_value (conn_info->stats_handle, "responses");
    return (0);
}

At this point, 'conn' points to junk, presumably because
corosync-cfgtool was terminated after sending the ring re-enable
message but before receiving a response, so there's no "other end" for
the IPC.

--- Additional comment from tserong on 2011-04-14 01:40:05 EDT ---

Created attachment 491962 [details]
proposed patch

I had some discussion about this issue with asalkeld on IRC, and he suggested this patch, which seems to work for me.

I wondered though if the refcounting should go further out, say, in cfg_lib_init_fn() and cfg_lib_exit_fn() in services/cfg.c (similar to the init and exit functions in services/cpg.c)?  Or would you like me to take this discussion back to the mailing list?

Thanks,

Tim

--- Additional comment from sdake on 2011-04-14 11:43:20 EDT ---

Tim,

Patch looks good as is.  Please post to ml with signoff and I'll get it merged.  I'd rather not use refcount in lib_init_fn and lib_exit_fn but it was necessary because of the design of cpg.

Regards
-steve

--- Additional comment from jfriesse on 2011-04-15 03:15:32 EDT ---

Patch committed in upstream git as 5b92829d6c20ab550aca6c0b273618577ea92b23

Comment 1 Jan Friesse 2011-04-15 07:16:47 UTC

Patch committed in upstream git as 5b92829d6c20ab550aca6c0b273618577ea92b23

Comment 9 Steven Dake 2011-10-27 18:52:46 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: Reference counting on the cfg server in corosync was incorrect.
  Consequence: killing corosync-cfgtool -r before it completed would result in Corosync segfault.
  Fix: Add proper reference counting as per architecture.
  Result: Corosync no longer segfaults with this test case.

Comment 10 errata-xmlrpc 2011-12-06 11:50:28 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1515.html

Note You need to log in before you can comment on or make changes to this bug.