Bug 1025321

Summary: Corosync crash running cpg-init-load test
Product: Red Hat Enterprise Linux 6 Reporter: Jan Friesse <jfriesse>
Component: corosyncAssignee: Jan Friesse <jfriesse>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: high Docs Contact:
Priority: high    
Version: 6.6CC: ccaulfie, cluster-maint, fdinitto, jharriga, jkortus
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: corosync-1.4.1-18.el6 Doc Type: Bug Fix
Doc Text:
Cause: Application call cpg_finalize (corosync cpg API). Consequence: Corosync (in very rare circumstances) can segfault. Fix: the finalize function is called from a different thread to the init and exit functions so, on a busy system, we can get list corruption. Solution is to handle cpg list removal in same thread as cpg_init. Result: Calling cpg_finalize shouldn't result is corosync segfault.
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-10-14 07:11:44 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On: 1055584    
Bug Blocks:    

Description Jan Friesse 2013-10-31 13:36:53 UTC
Description of problem:
Running https://github.com/jfriesse/csts/blob/master/apps/cpg-init-load.c from time to time causes segfault of corosync. Usually it's *** glibc detected *** corosync: free(): invalid pointer: 0x0000000001c9ae10 *** but it can be double-free, ...

Version-Release number of selected component (if applicable):
Upstream flatiron + RHEL 6.5

How reproducible:
0.000000000001%

Steps to Reproduce:
1. Execute cpg-init-load in cycle

Actual results:
Corosync segault

Expected results:
No segfault

Additional info:
I was trying to find out WHAT is happening by using:
- valgrind - no results. After 24 hours of running, valgrind didn't showed any error
- Duma (ElectrictFence) - Works without any problem
- MALLOC_CHECK_=3 - Shows problem, usually with following bt:
#0  0x00007ffd8d39e8a5 in raise () from /lib64/libc.so.6
#1  0x00007ffd8d3a0085 in abort () from /lib64/libc.so.6
#2  0x00007ffd8d3dc7b7 in __libc_message () from /lib64/libc.so.6
#3  0x00007ffd8d3e20e6 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007ffd8b744819 in _clear_object (instance=0x1c9a740) at objdb.c:687
#5  0x00007ffd8b7449a2 in object_destroy (object_handle=3522962077188620419) at objdb.c:745
#6  0x00000000004075ba in corosync_stats_destroy_connection (handle=3522962077188620419) at main.c:1259
#7  0x00007ffd8dd2a5af in conn_info_destroy (conn_info=0x21a8bd0) at coroipcs.c:521
#8  0x00007ffd8dd2cce4 in coroipcs_handler_dispatch (fd=75, revent=17, context=0x21a8bd0) at coroipcs.c:1642
#9  0x00000000004071ef in corosync_poll_handler_dispatch (handle=3698059312501882880, fd=75, revent=17, context=0x21a8bd0)
    at main.c:1135
#10 0x00007ffd8e1406cc in poll_run (handle=3698059312501882880) at coropoll.c:513
#11 0x0000000000408e86 in main (argc=2, argv=0x7fff69719188, envp=0x7fff697191a0) at main.c:1941

My theory (for now) is that ether object_destroy is called multiple times or (this is more probable) memory is overwritten somewhere else.

Comment 2 Christine Caulfield 2014-01-07 15:42:29 UTC
commit 3c11ea7b84c109e6f8451229437351c5a14c7168
Author: Christine Caulfield <ccaulfie>
Date:   Tue Jan 7 15:38:41 2014 +0000

    cpg: Avoid list corruption

Comment 6 errata-xmlrpc 2014-10-14 07:11:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1508.html