1025321 – Corosync crash running cpg-init-load test

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	corosync
Sub Component:
Version:	6.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Jan Friesse
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:	1055584
Blocks:

Reported:	2013-10-31 13:36 UTC by Jan Friesse
Modified:	2014-10-14 07:11 UTC
CC List:	5 users
Fixed In Version:	corosync-1.4.1-18.el6
Doc Type:	Bug Fix
Doc Text:	Cause: An application calls cpg_finalize (corosync cpg API). Consequence: Corosync can, in very rare circumstances, segfault. Fix: The finalize function is called from a different thread than the init and exit functions, so on a busy system the cpg list can become corrupted. The solution is to handle cpg list removal in the same thread as cpg_init. Result: Calling cpg_finalize no longer results in a corosync segfault.
Clone Of:
Environment:
Last Closed:	2014-10-14 07:11:44 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2014:1508	0	normal	SHIPPED_LIVE	corosync bug fix update	2014-10-14 01:22:31 UTC

Description Jan Friesse 2013-10-31 13:36:53 UTC

Description of problem:
Running https://github.com/jfriesse/csts/blob/master/apps/cpg-init-load.c occasionally causes corosync to segfault. Usually the error is *** glibc detected *** corosync: free(): invalid pointer: 0x0000000001c9ae10 *** but it can also be a double free, ...

Version-Release number of selected component (if applicable):
Upstream flatiron + RHEL 6.5

How reproducible:
0.000000000001%

Steps to Reproduce:
1. Run cpg-init-load repeatedly in a loop

Actual results:
Corosync segfault

Expected results:
No segfault

Additional info:
I tried to find out what is happening by using:
- valgrind - no results. After 24 hours of running, valgrind didn't show any errors
- DUMA (Electric Fence) - works without any problem
- MALLOC_CHECK_=3 - shows the problem, usually with the following bt:
#0  0x00007ffd8d39e8a5 in raise () from /lib64/libc.so.6
#1  0x00007ffd8d3a0085 in abort () from /lib64/libc.so.6
#2  0x00007ffd8d3dc7b7 in __libc_message () from /lib64/libc.so.6
#3  0x00007ffd8d3e20e6 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007ffd8b744819 in _clear_object (instance=0x1c9a740) at objdb.c:687
#5  0x00007ffd8b7449a2 in object_destroy (object_handle=3522962077188620419) at objdb.c:745
#6  0x00000000004075ba in corosync_stats_destroy_connection (handle=3522962077188620419) at main.c:1259
#7  0x00007ffd8dd2a5af in conn_info_destroy (conn_info=0x21a8bd0) at coroipcs.c:521
#8  0x00007ffd8dd2cce4 in coroipcs_handler_dispatch (fd=75, revent=17, context=0x21a8bd0) at coroipcs.c:1642
#9  0x00000000004071ef in corosync_poll_handler_dispatch (handle=3698059312501882880, fd=75, revent=17, context=0x21a8bd0)
    at main.c:1135
#10 0x00007ffd8e1406cc in poll_run (handle=3698059312501882880) at coropoll.c:513
#11 0x0000000000408e86 in main (argc=2, argv=0x7fff69719188, envp=0x7fff697191a0) at main.c:1941

My theory (for now) is that either object_destroy is called multiple times or (more probably) memory is being overwritten somewhere else.

Comment 2 Christine Caulfield 2014-01-07 15:42:29 UTC

commit 3c11ea7b84c109e6f8451229437351c5a14c7168
Author: Christine Caulfield <ccaulfie>
Date:   Tue Jan 7 15:38:41 2014 +0000

    cpg: Avoid list corruption
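
The Doc Text for this bug describes the fix as handling cpg list removal in the same thread as cpg_init. The following is only a minimal sketch of that general pattern and not the actual corosync patch. All names here (struct conn, conn_list_reap, demo_remaining, and so on) are hypothetical and exist purely for illustration. Other threads only flag an entry for removal; the thread that owns the list is the only one that ever touches the list links.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-connection entry; not corosync's struct. */
struct conn {
    struct conn *prev, *next;  /* list links, touched by the owner thread only */
    bool remove_requested;     /* set by other threads, guarded by the lock */
};

struct conn_list {
    struct conn head;          /* sentinel node of a circular list */
    pthread_mutex_t lock;      /* guards remove_requested flags only */
};

struct remove_req {
    struct conn_list *list;
    struct conn *conn;
};

void conn_list_init(struct conn_list *l) {
    l->head.prev = l->head.next = &l->head;
    l->head.remove_requested = false;
    pthread_mutex_init(&l->lock, NULL);
}

/* Owner thread only: append an entry to the tail of the list. */
void conn_list_add(struct conn_list *l, struct conn *c) {
    c->remove_requested = false;
    c->next = &l->head;
    c->prev = l->head.prev;
    l->head.prev->next = c;
    l->head.prev = c;
}

/* Any thread: request removal instead of unlinking directly. */
void conn_request_remove(struct conn_list *l, struct conn *c) {
    pthread_mutex_lock(&l->lock);
    c->remove_requested = true;
    pthread_mutex_unlock(&l->lock);
}

/* Owner thread only: unlink flagged entries, return how many were removed. */
size_t conn_list_reap(struct conn_list *l) {
    size_t removed = 0;
    struct conn *c = l->head.next;
    while (c != &l->head) {
        struct conn *next = c->next;
        pthread_mutex_lock(&l->lock);
        bool dead = c->remove_requested;
        pthread_mutex_unlock(&l->lock);
        if (dead) {
            c->prev->next = c->next;
            c->next->prev = c->prev;
            removed++;
        }
        c = next;
    }
    return removed;
}

/* Owner thread only: count the entries currently on the list. */
size_t conn_list_len(const struct conn_list *l) {
    size_t n = 0;
    for (const struct conn *c = l->head.next; c != &l->head; c = c->next)
        n++;
    return n;
}

/* Runs in a second thread, playing the role of the finalize path. */
static void *finalize_thread(void *arg) {
    struct remove_req *req = arg;
    conn_request_remove(req->list, req->conn);
    return NULL;
}

/* Demo: two entries, one flagged from another thread, reaped by the owner. */
size_t demo_remaining(void) {
    struct conn_list list;
    struct conn a, b;
    conn_list_init(&list);
    conn_list_add(&list, &a);
    conn_list_add(&list, &b);

    struct remove_req req = { &list, &a };
    pthread_t t;
    int rc = pthread_create(&t, NULL, finalize_thread, &req);
    assert(rc == 0);
    pthread_join(t, NULL);

    size_t reaped = conn_list_reap(&list);
    assert(reaped == 1);
    pthread_mutex_destroy(&list.lock);
    return conn_list_len(&list);
}
```

Because the second thread never writes the prev/next pointers, two threads cannot unlink or free the same entry concurrently, which is the kind of corruption the backtrace above hints at. Whether the real fix uses a flag, a queue, or another mechanism is not stated in this report.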

Comment 6 errata-xmlrpc 2014-10-14 07:11:44 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1508.html
