1491342 – corosync crashed on SIGBUS (terminated with signal 7, Bus error)


Keywords:
Status:	CLOSED DUPLICATE of bug 1536219
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	corosync
Sub Component:
Version:	7.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	rc
Target Release:	---
Assignee:	Jan Friesse
QA Contact:	cluster-qe@redhat.com
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-09-13 14:19 UTC by Josef Zimek
Modified:	2021-06-10 13:01 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-01-30 11:27:15 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1475482	unspecified	CLOSED	libqb imposed crash in ringbuffer: division by zero	2021-06-10 12:40:59 UTC
Red Hat Bugzilla	1536219	urgent	CLOSED	corosync crashing on afc cluster running SAS Calibration	2021-06-10 14:16:24 UTC
Red Hat Knowledge Base (Solution)	3175191	None	None	None	2018-01-08 21:27:32 UTC

Internal Links: 1475482 1536219

Description Josef Zimek 2017-09-13 14:19:57 UTC

Description of problem:

Corosync crashed and produced a core dump that appears to be corrupted: gdb finds no symbols even though the correct package versions and debuginfo packages are installed. The crash was caused by SIGBUS, which is uncommon. This bug is to track similar issues in case we see them more often.



===========================
]# gdb -c coredump /sbin/corosync
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/corosync...Reading symbols from /usr/lib/debug/usr/sbin/corosync.debug...done.
done.
[New LWP 2603]
[New LWP 2605]
Failed to read a valid object file image from memory.
Core was generated by `corosync'.
Program terminated with signal 7, Bus error.
#0  0x00007fbee4bc3fc9 in ?? ()


(gdb) t a a bt full

Thread 2 (LWP 2605):
#0  0x00007fbee49aa79b in ?? ()
No symbol table info available.
#1  0x0000000000000005 in ?? ()
No symbol table info available.
#2  0x00007fbee4e1eca0 in ?? ()
No symbol table info available.
#3  0x00007fbee1d92e80 in ?? ()
No symbol table info available.
#4  0x00007fbee49aa82f in ?? ()
No symbol table info available.
#5  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 1 (LWP 2603):
#0  0x00007fbee4bc3fc9 in ?? ()
No symbol table info available.
#1  0x0000000000000000 in ?? ()
No symbol table info available.



Version-Release number of selected component (if applicable):

corosync-2.4.0-4.el7.x86_64                                 Thu Mar 23 09:09:18 2017
corosynclib-2.4.0-4.el7.x86_64                              Thu Mar 23 09:09:18 2017
libqb-1.0-1.el7.x86_64                                      Thu Mar 23 09:09:17 2017


How reproducible:
one time issue

Steps to Reproduce:
1. N/A at this point

Actual results:
corosync crashed on SIGBUS

Expected results:
no crash

Additional info:

Comment 3 Jan Pokorný [poki] 2017-09-18 12:48:35 UTC

In the libqb context, SIGBUS used to be raised when trying to access a
memory region that was once file-backed, but whose original file got
truncated in the interim.  This hazard, introduced with a fix for
[bug 1392835], was subsequently fixed through [bug 1459276].  Note that
the 1.0-1 package release should not be affected by any of these changes.
Either way, updating to the equivalent of the RHEL 7.4 package set may be
worth an attempt.
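
For illustration only, the general mechanism described above (a store into a shared file-backed mapping after the backing file has been truncated) can be reproduced with a minimal C sketch.  This is not libqb code: the function name `sigbus_after_truncate`, the handler `on_sigbus`, and the scratch file path passed in are hypothetical, and Linux mmap semantics are assumed.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf sigbus_jmp;

/* Hypothetical handler: jump back out of the faulting store. */
static void on_sigbus(int signo)
{
    (void)signo;
    siglongjmp(sigbus_jmp, 1);
}

/*
 * Map a scratch file, truncate it to zero length, then store into the mapping.
 * Returns 1 if the store raised SIGBUS, 0 if it succeeded, -1 on setup error.
 */
int sigbus_after_truncate(const char *path)
{
    const size_t len = 4096;
    volatile int result = -1;   /* volatile: read after siglongjmp */
    volatile char *data;
    struct sigaction sa, old_sa;
    int fd;

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0)
        goto out_close;

    data = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED)
        goto out_close;

    data[0] = 0;                /* file still backs the page: no fault */
    if (ftruncate(fd, 0) != 0)  /* the backing file shrinks under the mapping */
        goto out_unmap;

    sa.sa_handler = on_sigbus;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGBUS, &sa, &old_sa) != 0)
        goto out_unmap;

    if (sigsetjmp(sigbus_jmp, 1) == 0) {
        data[0] = 0;            /* page now lies wholly beyond end of file */
        result = 0;
    } else {
        result = 1;             /* reached via siglongjmp from on_sigbus */
    }
    sigaction(SIGBUS, &old_sa, NULL);

out_unmap:
    munmap((void *)data, len);
out_close:
    close(fd);
    unlink(path);
    return result;
}
```

Calling `sigbus_after_truncate()` with a writable scratch path should return 1 on Linux, since the mapped page no longer has any file content behind it.  Without a handler installed, the same store would terminate the process with signal 7, as seen in the core above.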

Comment 4 Jan Pokorný [poki] 2017-09-18 14:42:12 UTC

Looking at the core, which is not directly consumable (as already
observed with [bug 1475482]), one of the few at-frame instruction
pointers usable as a hint does indeed refer to libqb (lib/ringbuffer.c):

 440 void *
 441 qb_rb_chunk_alloc(struct qb_ringbuffer_s * rb, size_t len)
 442 {
 [...]
 467         write_pt = rb->shared_hdr->write_pt;
 468         /*
 469          * insert the chunk header
 470          */
>471         rb->shared_data[write_pt] = 0;
 472         QB_RB_CHUNK_MAGIC_SET(rb, write_pt, QB_RB_CHUNK_MAGIC_ALLOC);
 473 
 474         /*
 475          * return a pointer to the beginning of the chunk data
 476          */
 477         return (void *)QB_RB_CHUNK_DATA_GET(rb, write_pt);
 478 
 479 }

and from subsequent register analysis:

write_pt        := 0 (%rax)
rb->shared_data := 0x7fbee1fa7000 (%rdi)
                   0x7fbee1fa7000 ~ /dev/shm/qb-corosync-2539-blackbox-data

In other words, it seems the blackbox feature is involved here, which
makes this case quite similar to the mentioned [bug 1475482], although
the exact failure point is different, with possibly different root causes.

Comment 6 Jan Friesse 2018-01-30 11:27:15 UTC

@Pepo:
I believe this bug is a "clone" of bug 1536219, so I am closing this one in its favor.

*** This bug has been marked as a duplicate of bug 1536219 ***
