Bug 1025321 - Corosync crash running cpg-init-load test
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync
Version: 6.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Jan Friesse
QA Contact: Cluster QE
Depends On: 1055584
Reported: 2013-10-31 09:36 EDT by Jan Friesse
Modified: 2014-10-14 03:11 EDT
Fixed In Version: corosync-1.4.1-18.el6
Doc Type: Bug Fix
Doc Text:
Cause: An application calls cpg_finalize (corosync cpg API). Consequence: In very rare circumstances, corosync can segfault. Fix: The finalize function is called from a different thread than the init and exit functions, so on a busy system the cpg list can become corrupted. The solution is to handle cpg list removal in the same thread as cpg_init. Result: Calling cpg_finalize should no longer result in a corosync segfault.
Last Closed: 2014-10-14 03:11:44 EDT
Type: Bug


Attachments: None

Description Jan Friesse 2013-10-31 09:36:53 EDT
Description of problem:
Running https://github.com/jfriesse/csts/blob/master/apps/cpg-init-load.c occasionally causes corosync to segfault. Usually the error is *** glibc detected *** corosync: free(): invalid pointer: 0x0000000001c9ae10 *** but it can also be a double free, ...

Version-Release number of selected component (if applicable):
Upstream flatiron + RHEL 6.5

How reproducible:
0.000000000001%

Steps to Reproduce:
1. Execute cpg-init-load repeatedly in a loop

Actual results:
Corosync segfault

Expected results:
No segfault

Additional info:
I tried to find out what is happening by using:
- valgrind - no results. After 24 hours of running, valgrind didn't show any errors
- DUMA (Electric Fence) - works without any problem
- MALLOC_CHECK_=3 - shows the problem, usually with the following backtrace:
#0  0x00007ffd8d39e8a5 in raise () from /lib64/libc.so.6
#1  0x00007ffd8d3a0085 in abort () from /lib64/libc.so.6
#2  0x00007ffd8d3dc7b7 in __libc_message () from /lib64/libc.so.6
#3  0x00007ffd8d3e20e6 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007ffd8b744819 in _clear_object (instance=0x1c9a740) at objdb.c:687
#5  0x00007ffd8b7449a2 in object_destroy (object_handle=3522962077188620419) at objdb.c:745
#6  0x00000000004075ba in corosync_stats_destroy_connection (handle=3522962077188620419) at main.c:1259
#7  0x00007ffd8dd2a5af in conn_info_destroy (conn_info=0x21a8bd0) at coroipcs.c:521
#8  0x00007ffd8dd2cce4 in coroipcs_handler_dispatch (fd=75, revent=17, context=0x21a8bd0) at coroipcs.c:1642
#9  0x00000000004071ef in corosync_poll_handler_dispatch (handle=3698059312501882880, fd=75, revent=17, context=0x21a8bd0)
    at main.c:1135
#10 0x00007ffd8e1406cc in poll_run (handle=3698059312501882880) at coropoll.c:513
#11 0x0000000000408e86 in main (argc=2, argv=0x7fff69719188, envp=0x7fff697191a0) at main.c:1941

My theory (for now) is that either object_destroy is called multiple times or (more probably) memory is overwritten somewhere else.
Comment 2 Christine Caulfield 2014-01-07 10:42:29 EST
commit 3c11ea7b84c109e6f8451229437351c5a14c7168
Author: Christine Caulfield <ccaulfie@redhat.com>
Date:   Tue Jan 7 15:38:41 2014 +0000

    cpg: Avoid list corruption
Comment 6 errata-xmlrpc 2014-10-14 03:11:44 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1508.html
