Bug 1967485 - Interleaved stats iterators can cause corosync to crash
Summary: Interleaved stats iterators can cause corosync to crash
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: corosync
Version: CentOS Stream
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: beta
Target Release: ---
Assignee: Jan Friesse
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On: 1962139
Blocks:
 
Reported: 2021-06-03 08:22 UTC by Jan Friesse
Modified: 2021-12-07 21:32 UTC
CC List: 8 users

Fixed In Version: corosync-3.1.4-1.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1962139
Environment:
Last Closed: 2021-12-07 21:30:47 UTC
Type: Bug
Target Upstream Version:
Embargoed:
pm-rhel: mirror+



Description Jan Friesse 2021-06-03 08:22:31 UTC
+++ This bug was initially created as a clone of Bug #1962139 +++

Description of problem:
Running two or more corosync-cmapctl -mstats instances at the same time can crash corosync.

Version-Release number of selected component (if applicable):
3.1.2

How reproducible:
Easily

Steps to Reproduce:
1. start corosync (one node is fine)
2. while [ $? = 0 ]; do corosync-cmapctl -mstats; done > /dev/null &
3. while [ $? = 0 ]; do corosync-cmapctl -mstats; done > /dev/null &

Actual results:
corosync will crash

Expected results:
No crash

Additional info:
I'm looking into this ATM. It seems that maybe the deleted keys are persisting in the trie somehow, but all the data has been freed.


Thread 1 "corosync" received signal SIGSEGV, Segmentation fault.
0x000055555557def0 in stats_map_get (key_name=0x7fffe889da78 "stats.ipcs.service0.27346.0x555557b30d00.procname", value=0x0, 
    value_len=0x7fffffffca88, type=0x7fffffffca84) at stats.c:339
339		switch (statinfo->type) {
Missing separate debuginfos, use: yum debuginfo-install corosynclib-3.1.0-3.el8.x86_64 libffi-3.1-22.el8.x86_64 libknet1-1.18-1.el8.x86_64 libknet1-crypto-nss-plugin-1.18-1.el8.x86_64 libqb-1.0.3-12.el8.x86_64 libtasn1-4.13-3.el8+7.x86_64 nspr-4.25.0-2.el8_2.x86_64 nss-3.53.1-17.el8_3.x86_64 nss-softokn-3.53.1-17.el8_3.x86_64 nss-softokn-freebl-3.53.1-17.el8_3.x86_64 nss-util-3.53.1-17.el8_3.x86_64 openssl-libs-1.1.1g-15.el8_3.x86_64 p11-kit-0.23.22-1.el8.x86_64 p11-kit-trust-0.23.22-1.el8.x86_64 sqlite-libs-3.26.0-13.el8.x86_64 zlib-1.2.11-17.el8.x86_64
(gdb) 
(gdb) bt
#0  0x000055555557def0 in stats_map_get (key_name=0x7fffe889da78 "stats.ipcs.service0.27346.0x555557b30d00.procname", value=0x0, 
    value_len=0x7fffffffca88, type=0x7fffffffca84) at stats.c:339
#1  0x000055555556892f in message_handler_req_lib_cmap_get (conn=0x555557b4f6c0, message=0x7fffe889da60) at cmap.c:561
#2  0x0000555555580226 in cs_ipcs_msg_process (c=0x555557b4f6c0, data=0x7fffe889da60, size=288) at ipc_glue.c:558
#3  0x00007ffff7976641 in qb_ipcs_dispatch_connection_request () from /lib64/libqb.so.0
#4  0x00007ffff7972de3 in _poll_dispatch_and_take_back_ () from /lib64/libqb.so.0
#5  0x00007ffff797285c in qb_loop_run () from /lib64/libqb.so.0
#6  0x000055555557b5fc in main (argc=2, argv=0x7fffffffe388, envp=0x7fffffffe3a0) at main.c:1601
(gdb) p item
$1 = (struct stats_item *) 0x555557b75850
(gdb) p *item
$2 = {key_name = 0x555557b7bd60 "\020X\267WUU", cs_conv = 0x60a4f82b}
(gdb) p item->key_name
$3 = 0x555557b7bd60 "\020X\267WUU"
(gdb) p item->cs_conv
$4 = (struct cs_stats_conv *) 0x60a4f82b
(gdb) p *item->cs_conv
Cannot access memory at address 0x60a4f82b
(gdb)

--- Additional comment from Christine Caulfield on 2021-05-19 12:07:31 UTC ---

It's libqb map iterators. OF COURSE it's libqb map iterators.

If an entry is removed from the map while an existing iterator holds a pointer to that entry, the application will crash when trying to dereference it. This seems to be true of both the trie and skiplist map types; the hashtable seems to be fine.
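
To make the failure mode concrete, here is a minimal single-process sketch of the pattern described above, written against libqb's qbmap.h API (qb_trie_create, qb_map_iter_create, qb_map_iter_next, qb_map_rm). Whether the final dereference actually faults depends on the libqb version and map backend; this is an illustration of the hazard, not a guaranteed reproducer:

/* Illustrative sketch of the iterator hazard: an entry is removed
 * while a live iterator still references it. Build with: gcc x.c -lqb */
#include <stdio.h>
#include <qb/qbmap.h>

int main(void)
{
	qb_map_t *map = qb_trie_create();
	qb_map_iter_t *it;
	const char *key;
	void *val;

	qb_map_put(map, "stats.a", "1");
	qb_map_put(map, "stats.b", "2");

	it = qb_map_iter_create(map);
	key = qb_map_iter_next(it, &val);   /* iterator now references "stats.a" */
	printf("first key: %s\n", key);

	qb_map_rm(map, "stats.a");          /* entry removed while 'it' still points into it */

	/* With the trie and skiplist backends this next step can touch freed
	 * memory (as in the backtrace above); the hashtable backend
	 * reportedly does not suffer from this. */
	key = qb_map_iter_next(it, &val);
	printf("next key: %s\n", key ? key : "(none)");

	qb_map_iter_free(it);
	qb_map_destroy(map);
	return 0;
}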

--- Additional comment from Jan Friesse on 2021-06-03 08:20:48 UTC ---

Reassigning back to corosync because we got a fix there (https://github.com/corosync/corosync/pull/645). LibQB might get a better API, but that's a long(er)-term project (https://github.com/ClusterLabs/libqb/issues/441).
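
The observable effect of the corosync fix (see the verification in comment 4 below) is that a key which disappears mid-iteration is reported to the client as CS_ERR_NOT_EXIST instead of crashing the daemon. The toy below sketches the general "resolve by name at the moment of use" defence that is consistent with that behaviour; it is illustrative only and is not the actual PR #645 diff (the toy store and toy_stats_get are hypothetical stand-ins for corosync's stats trie):

/* Self-contained toy: never dereference a cached item pointer; re-resolve
 * the key by name when the value is actually read, so a vanished key
 * becomes CS_ERR_NOT_EXIST rather than a use-after-free. */
#include <stdio.h>
#include <string.h>

enum cs_err { CS_OK = 1, CS_ERR_NOT_EXIST = 12 };  /* values as in corotypes.h */

struct toy_item { const char *key; int value; int deleted; };

static struct toy_item store[] = {
	{ "stats.ipcs.service0.1.sent", 42, 0 },
	{ "stats.ipcs.service0.2.sent", 7,  1 },  /* removed by another connection */
};

static enum cs_err toy_stats_get(const char *key, int *value)
{
	for (size_t i = 0; i < sizeof(store) / sizeof(store[0]); i++) {
		if (!store[i].deleted && strcmp(store[i].key, key) == 0) {
			*value = store[i].value;
			return CS_OK;
		}
	}
	return CS_ERR_NOT_EXIST;   /* key gone: report it, don't crash */
}

int main(void)
{
	int v;
	if (toy_stats_get("stats.ipcs.service0.2.sent", &v) == CS_ERR_NOT_EXIST)
		printf("Can't get value. Error CS_ERR_NOT_EXIST\n");
	return 0;
}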

Comment 4 Patrik Hagara 2021-08-19 13:36:51 UTC
See https://bugzilla.redhat.com/show_bug.cgi?id=1962139#c7 for the before-the-fix reproducer.


after fix
=========

> [root@virt-517 ~]# rpm -q corosync
> corosync-3.1.5-1.el9.x86_64

Start these 3 commands in parallel:
* journalctl -f -u corosync
* while corosync-cmapctl -mstats; do continue; done > /dev/null
* while corosync-cmapctl -mstats; do continue; done > /dev/null

Result: corosync does not crash or log any errors (even after letting the corosync-cmapctl processes loop for a few minutes), but both corosync-cmapctl processes occasionally print non-fatal errors (exit code is always zero/success) such as:

> Can't get value of stats.ipcs.service0.121900.0x562acaa5c070.sent. Error CS_ERR_NOT_EXIST

It's always the IPC service0 in these error messages. Examining the full output reveals the following:

> [root@virt-517 ~]# corosync-cmapctl -mstats stats.ipcs.service0
> stats.ipcs.service0.177855.0x562aca339120.dispatched (u64) = 0
> stats.ipcs.service0.177855.0x562aca339120.flow_control (u32) = 0
> stats.ipcs.service0.177855.0x562aca339120.flow_control_count (u64) = 0
> stats.ipcs.service0.177855.0x562aca339120.invalid_request (u64) = 0
> stats.ipcs.service0.177855.0x562aca339120.overload (u64) = 0
> stats.ipcs.service0.177855.0x562aca339120.procname (str) = corosync-cmapct
> stats.ipcs.service0.177855.0x562aca339120.queued (u32) = 0
> stats.ipcs.service0.177855.0x562aca339120.queueing (i32) = 0
> stats.ipcs.service0.177855.0x562aca339120.recv_retries (u64) = 0
> stats.ipcs.service0.177855.0x562aca339120.requests (u64) = 23
> stats.ipcs.service0.177855.0x562aca339120.responses (u64) = 24
> stats.ipcs.service0.177855.0x562aca339120.send_retries (u64) = 0
> stats.ipcs.service0.177855.0x562aca339120.sent (u32) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.dispatched (u64) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.flow_control (u32) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.flow_control_count (u64) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.invalid_request (u64) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.overload (u64) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.procname (str) = corosync-cmapct
> stats.ipcs.service0.177856.0x562acaa58fc0.queued (u32) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.queueing (i32) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.recv_retries (u64) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.requests (u64) = 146
> stats.ipcs.service0.177856.0x562acaa58fc0.responses (u64) = 146
> stats.ipcs.service0.177856.0x562acaa58fc0.send_retries (u64) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.sent (u32) = 0

When both of our corosync-cmapctl processes happen to be running at the same time, the IPC service0 entries contain the stats for both, i.e. the cmapctl processes see and read each other's IPC connection stats. When process A happens to be in the middle of reading process B's stats and process B exits, corosync cleans up the connection stats before process A can finish reading them, which results in the above-mentioned error message. With this fix, however, the corosync daemon no longer crashes.

See also https://bugzilla.redhat.com/show_bug.cgi?id=1962139#c8 as the same applies here -- the remaining CS_ERR_NOT_EXIST errors are fine.
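
For client code, the practical takeaway is to treat CS_ERR_NOT_EXIST during a stats walk as "key vanished mid-walk, skip it" rather than as a failure. A hedged sketch, assuming corosync 3.x's public libcmap API (cmap_initialize_map with CMAP_MAP_STATS, cmap_iter_next, cmap_get_uint64) and with most error handling trimmed for brevity:

/* Walk stats.ipcs.* and tolerate keys that disappear between iteration
 * and fetch (the owning IPC connection exited). Build with: gcc x.c -lcmap */
#include <stdio.h>
#include <stdint.h>
#include <corosync/corotypes.h>
#include <corosync/cmap.h>

int main(void)
{
	cmap_handle_t h;
	cmap_iter_handle_t it;
	char key[CMAP_KEYNAME_MAXLEN + 1];
	size_t len;
	cmap_value_types_t type;
	uint64_t u64;
	cs_error_t err;

	if (cmap_initialize_map(&h, CMAP_MAP_STATS) != CS_OK) {
		fprintf(stderr, "can't connect to corosync\n");
		return 1;
	}
	cmap_iter_init(h, "stats.ipcs.", &it);
	while (cmap_iter_next(h, it, key, &len, &type) == CS_OK) {
		if (type != CMAP_VALUETYPE_UINT64) {
			continue;   /* keep the example to one value type */
		}
		err = cmap_get_uint64(h, key, &u64);
		if (err == CS_ERR_NOT_EXIST) {
			/* Key deleted between iter_next and get: benign, skip. */
			continue;
		}
		if (err == CS_OK) {
			printf("%s (u64) = %llu\n", key, (unsigned long long)u64);
		}
	}
	cmap_iter_finalize(h, it);
	cmap_finalize(h);
	return 0;
}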

