Bug 1962139 - Interleaved stats iterators can cause corosync to crash
Summary: Interleaved stats iterators can cause corosync to crash
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: corosync
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: beta
Target Release: ---
Assignee: Jan Friesse
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1967485
 
Reported: 2021-05-19 11:41 UTC by Christine Caulfield
Modified: 2021-11-09 21:02 UTC
CC: 5 users

Fixed In Version: corosync-3.1.4-1.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1967485 (view as bug list)
Environment:
Last Closed: 2021-11-09 17:49:23 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments: none


Links
Red Hat Product Errata RHBA-2021:4169, last updated 2021-11-09 17:49:33 UTC

Description Christine Caulfield 2021-05-19 11:41:28 UTC
Description of problem:
Running two or more corosync-cmapctl -mstats instances at the same time can crash corosync.

Version-Release number of selected component (if applicable):
3.1.2

How reproducible:
Easily

Steps to Reproduce:
1. start corosync (one node is fine)
2. while [ $? = 0 ]; do corosync-cmapctl -mstats; done > /dev/null &
3. while [ $? = 0 ]; do corosync-cmapctl -mstats; done > /dev/null &

Actual results:
corosync will crash

Expected results:
No crash

Additional info:
I'm looking into this ATM. It seems that maybe the deleted keys are persisting in the trie somehow, but all the data has been freed.


Thread 1 "corosync" received signal SIGSEGV, Segmentation fault.
0x000055555557def0 in stats_map_get (key_name=0x7fffe889da78 "stats.ipcs.service0.27346.0x555557b30d00.procname", value=0x0, 
    value_len=0x7fffffffca88, type=0x7fffffffca84) at stats.c:339
339		switch (statinfo->type) {
Missing separate debuginfos, use: yum debuginfo-install corosynclib-3.1.0-3.el8.x86_64 libffi-3.1-22.el8.x86_64 libknet1-1.18-1.el8.x86_64 libknet1-crypto-nss-plugin-1.18-1.el8.x86_64 libqb-1.0.3-12.el8.x86_64 libtasn1-4.13-3.el8+7.x86_64 nspr-4.25.0-2.el8_2.x86_64 nss-3.53.1-17.el8_3.x86_64 nss-softokn-3.53.1-17.el8_3.x86_64 nss-softokn-freebl-3.53.1-17.el8_3.x86_64 nss-util-3.53.1-17.el8_3.x86_64 openssl-libs-1.1.1g-15.el8_3.x86_64 p11-kit-0.23.22-1.el8.x86_64 p11-kit-trust-0.23.22-1.el8.x86_64 sqlite-libs-3.26.0-13.el8.x86_64 zlib-1.2.11-17.el8.x86_64
(gdb) 
(gdb) bt
#0  0x000055555557def0 in stats_map_get (key_name=0x7fffe889da78 "stats.ipcs.service0.27346.0x555557b30d00.procname", value=0x0, 
    value_len=0x7fffffffca88, type=0x7fffffffca84) at stats.c:339
#1  0x000055555556892f in message_handler_req_lib_cmap_get (conn=0x555557b4f6c0, message=0x7fffe889da60) at cmap.c:561
#2  0x0000555555580226 in cs_ipcs_msg_process (c=0x555557b4f6c0, data=0x7fffe889da60, size=288) at ipc_glue.c:558
#3  0x00007ffff7976641 in qb_ipcs_dispatch_connection_request () from /lib64/libqb.so.0
#4  0x00007ffff7972de3 in _poll_dispatch_and_take_back_ () from /lib64/libqb.so.0
#5  0x00007ffff797285c in qb_loop_run () from /lib64/libqb.so.0
#6  0x000055555557b5fc in main (argc=2, argv=0x7fffffffe388, envp=0x7fffffffe3a0) at main.c:1601
(gdb) p item
$1 = (struct stats_item *) 0x555557b75850
(gdb) p *item
$2 = {key_name = 0x555557b7bd60 "\020X\267WUU", cs_conv = 0x60a4f82b}
(gdb) p item->key_name
$3 = 0x555557b7bd60 "\020X\267WUU"
(gdb) p item->cs_conv
$4 = (struct cs_stats_conv *) 0x60a4f82b
(gdb) p *item->cs_conv
Cannot access memory at address 0x60a4f82b
(gdb)

Comment 1 Christine Caulfield 2021-05-19 12:07:31 UTC
It's libqb map iterators. OF COURSE it's libqb map iterators.

If an entry is removed from the map while an existing iterator holds a pointer to that entry, the application will crash when it tries to dereference it. This seems to be true of both the trie and skiplist map types; hashtable seems to be fine.
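
To make the failure mode concrete, here is a toy sketch of an iterator that caches a raw node pointer. The types and names are made up for illustration; this is not the actual libqb code:

/* toy_iter_crash.c -- illustrative only, not libqb.  The iterator caches
 * a raw node pointer; removing that entry frees the node behind the
 * iterator's back, so the next dereference is use-after-free, much like
 * the stats_map_get() fault in the backtrace above. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct node {
    char *key;
    struct node *next;
};

struct iter {
    struct node *cur;   /* raw pointer into the map: the problem */
};

static void map_insert(struct node **head, const char *key)
{
    struct node *n = malloc(sizeof(*n));
    n->key = strdup(key);
    n->next = *head;
    *head = n;
}

static void map_remove(struct node **head, const char *key)
{
    for (struct node **pp = head; *pp != NULL; pp = &(*pp)->next) {
        if (strcmp((*pp)->key, key) == 0) {
            struct node *victim = *pp;
            *pp = victim->next;
            free(victim->key);  /* the iterator is not notified, */
            free(victim);       /* so it is now dangling */
            return;
        }
    }
}

int main(void)
{
    struct node *map = NULL;
    map_insert(&map, "stats.ipcs.service0.2.conn");
    map_insert(&map, "stats.ipcs.service0.1.conn");

    struct iter it = { .cur = map };    /* client A starts iterating */

    /* client B's connection is cleaned up mid-iteration: */
    map_remove(&map, it.cur->key);

    /* use-after-free: it.cur was freed above */
    printf("%s\n", it.cur->key);
    return 0;
}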

Comment 2 Jan Friesse 2021-06-03 08:20:48 UTC
Reassigning back to corosync because we got a fix there (https://github.com/corosync/corosync/pull/645). LibQB might get a better API, but that's a long(er)-term project (https://github.com/ClusterLabs/libqb/issues/441).
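
For context, one generic way to make removal safe while iterators are live is for the map to track its active iterators and step any of them off a node before freeing it. The sketch below shows the idea with the same toy map as above; it illustrates the pattern only and is not what PR #645 or libqb actually implements:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct node { char *key; struct node *next; };
struct iter { struct node *cur; struct iter *next; };

static struct node *map_head;
static struct iter *live_iters;     /* all registered iterators */

static void map_insert(const char *key)
{
    struct node *n = malloc(sizeof(*n));
    n->key = strdup(key);
    n->next = map_head;
    map_head = n;
}

static void iter_register(struct iter *it)
{
    it->cur = map_head;
    it->next = live_iters;
    live_iters = it;
}

static void map_remove(const char *key)
{
    for (struct node **pp = &map_head; *pp != NULL; pp = &(*pp)->next) {
        if (strcmp((*pp)->key, key) != 0)
            continue;
        struct node *victim = *pp;
        /* Step every live iterator off the node before freeing it. */
        for (struct iter *it = live_iters; it != NULL; it = it->next)
            if (it->cur == victim)
                it->cur = victim->next;
        *pp = victim->next;
        free(victim->key);
        free(victim);
        return;
    }
}

int main(void)
{
    map_insert("stats.ipcs.service0.2.conn");
    map_insert("stats.ipcs.service0.1.conn");

    struct iter it;
    iter_register(&it);

    map_remove(it.cur->key);        /* same interleaving as before... */

    /* ...but the iterator was advanced, not left dangling */
    printf("%s\n", it.cur ? it.cur->key : "(end)");
    return 0;
}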

Comment 7 Patrik Hagara 2021-08-19 12:36:12 UTC
before fix
==========

> [root@virt-494 ~]# rpm -q corosync
> corosync-3.1.0-3.el8.x86_64


Start these 3 commands in parallel:
* journalctl -fu corosync
* while corosync-cmapctl -mstats; do continue; done > /dev/null
* while corosync-cmapctl -mstats; do continue; done > /dev/null

Once the second corosync-cmapctl command starts running, both commands occasionally log non-fatal errors to stderr, e.g.:

> Can't get value of stats.ipcs.service0.3517.0x564d248fd550.invalid_request. Error CS_ERR_LIBRARY
> Can't get value of stats.ipcs.service0.3717.0x564d248f7940.invalid_request. Error CS_ERR_INVALID_PARAM
> Can't get value of stats.ipcs.service0.3730.0x564d248fd550.procname. Error CS_ERR_NOT_EXIST
> Can't get value of stats.ipcs.service0.3781.0x564d248fd550.flow_control_count. Error CS_ERR_LIBRARY
> Can't get value of stats.ipcs.service0.4011.0x564d248fd550.procname. Error CS_ERR_NOT_EXIST
> Can't get value of stats.ipcs.service0.4021.0x564d248fd550.flow_control_count. Error CS_ERR_LIBRARY
> Can't get value of stats.ipcs.service0.4129.0x564d248fac40.requests. Error CS_ERR_INVALID_PARAM
> Can't get value of stats.ipcs.service0.4196.0x564d248fac40.flow_control. Error CS_ERR_LIBRARY
> Can't get value of stats.ipcs.service0.4294.0x564d248fac40.queued. Error CS_ERR_LIBRARY
> Can't get value of stats.ipcs.service0.4329.0x564d248fd550.requests. Error CS_ERR_INVALID_PARAM
> Can't get value of stats.ipcs.service0.4334.0x564d248fd550.flow_control. Error CS_ERR_LIBRARY
> Can't get value of stats.ipcs.service0.4336.0x564d248fd550.dispatched. Error CS_ERR_LIBRARY
> Can't get value of stats.ipcs.service0.4354.0x564d248fd550.flow_control. Error CS_ERR_LIBRARY
> Can't get value of stats.ipcs.service0.4473.0x564d248fd550.procname. Error CS_ERR_NOT_EXIST
> Can't get value of stats.ipcs.service0.4556.0x564d248fac40.flow_control. Error CS_ERR_LIBRARY
> Can't get value of stats.ipcs.service0.4593.0x564d248fd550.responses. Error CS_ERR_LIBRARY
> Can't get value of stats.ipcs.service0.4648.0x564d248fd550.procname. Error CS_ERR_LIBRARY

After a while, corosync crashes:

> Aug 19 14:20:35 virt-494.cluster-qe.lab.eng.brq.redhat.com systemd[1]: corosync.service: Main process exited, code=dumped, status=11/SEGV                                                                                                                                       
> Aug 19 14:20:35 virt-494.cluster-qe.lab.eng.brq.redhat.com systemd[1]: corosync.service: Failed with result 'core-dump'.

and both command loops exit non-zero with the following error message:

> Failed to initialize the cmap API. Error CS_ERR_LIBRARY



after fix
=========

> [root@virt-020 ~]# rpm -q corosync
> corosync-3.1.5-1.el8.x86_64

Same procedure as before; start these 3 commands in parallel:
* journalctl -fu corosync
* while corosync-cmapctl -mstats; do continue; done > /dev/null
* while corosync-cmapctl -mstats; do continue; done > /dev/null

Result: corosync does not crash or log any errors (even after letting the corosync-cmapctl processes loop for a few minutes), but both corosync-cmapctl processes still occasionally print non-fatal errors (exit code is always zero/success) such as:

> Can't get value of stats.ipcs.service0.77279.0x5625ccf39fb0.responses. Error CS_ERR_NOT_EXIST

The logged error is now always CS_ERR_NOT_EXIST; no CS_ERR_LIBRARY or CS_ERR_INVALID_PARAM errors (as seen before the fix) were observed.

It's always the IPC service0 in these error messages. Examining the full output reveals the following:

> [root@virt-020 ~]# corosync-cmapctl -mstats stats.ipcs.service0
> stats.ipcs.service0.96611.0x5625ccf379a0.dispatched (u64) = 0
> stats.ipcs.service0.96611.0x5625ccf379a0.flow_control (u32) = 0
> stats.ipcs.service0.96611.0x5625ccf379a0.flow_control_count (u64) = 0
> stats.ipcs.service0.96611.0x5625ccf379a0.invalid_request (u64) = 0
> stats.ipcs.service0.96611.0x5625ccf379a0.overload (u64) = 0
> stats.ipcs.service0.96611.0x5625ccf379a0.procname (str) = corosync-cmapct
> stats.ipcs.service0.96611.0x5625ccf379a0.queued (u32) = 0
> stats.ipcs.service0.96611.0x5625ccf379a0.queueing (i32) = 0
> stats.ipcs.service0.96611.0x5625ccf379a0.recv_retries (u64) = 0
> stats.ipcs.service0.96611.0x5625ccf379a0.requests (u64) = 23
> stats.ipcs.service0.96611.0x5625ccf379a0.responses (u64) = 24
> stats.ipcs.service0.96611.0x5625ccf379a0.send_retries (u64) = 0
> stats.ipcs.service0.96611.0x5625ccf379a0.sent (u32) = 0
> stats.ipcs.service0.96612.0x5625ccf39fb0.dispatched (u64) = 0
> stats.ipcs.service0.96612.0x5625ccf39fb0.flow_control (u32) = 0
> stats.ipcs.service0.96612.0x5625ccf39fb0.flow_control_count (u64) = 0
> stats.ipcs.service0.96612.0x5625ccf39fb0.invalid_request (u64) = 0
> stats.ipcs.service0.96612.0x5625ccf39fb0.overload (u64) = 0
> stats.ipcs.service0.96612.0x5625ccf39fb0.procname (str) = corosync-cmapct
> stats.ipcs.service0.96612.0x5625ccf39fb0.queued (u32) = 0
> stats.ipcs.service0.96612.0x5625ccf39fb0.queueing (i32) = 0
> stats.ipcs.service0.96612.0x5625ccf39fb0.recv_retries (u64) = 0
> stats.ipcs.service0.96612.0x5625ccf39fb0.requests (u64) = 119
> stats.ipcs.service0.96612.0x5625ccf39fb0.responses (u64) = 121
> stats.ipcs.service0.96612.0x5625ccf39fb0.send_retries (u64) = 0
> stats.ipcs.service0.96612.0x5625ccf39fb0.sent (u32) = 0

When both of our corosync-cmapctl processes happen to be running at the same time, the IPC service0 entries contain the stats for both: the cmapctl processes see and read each other's IPC connection stats. When process A is in the middle of reading process B's stats and process B exits, corosync cleans up the connection stats before process A can finish reading them, which results in the above-mentioned error message. With this fix, however, the corosync daemon no longer crashes.

Comment 8 Christine Caulfield 2021-08-19 12:51:19 UTC
Just to be clear, those ERR_NOT_EXIST errors are to be expected when several clients are accessing corosync at the same time and at least one is iterating through the stats. There's an unsolvable race condition between getting the client ID and getting the stats associated with it, so that error is quite a reasonable thing to return and clients should be able to handle it.

There's an argument to be had that corosync-cmapctl should not return the error when running, but TBH it's not really for general use.
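
From a client author's point of view, a loop that handles that race could look like the sketch below. It assumes the corosync 3.x cmap client API, including cmap_initialize_map() with CMAP_MAP_STATS for the stats map (verify against cmap.h on your build), and treats CS_ERR_NOT_EXIST from cmap_get() as "the connection went away between iter_next and get":

/* tolerant-stats.c -- sketch of a race-tolerant stats reader.
 * Build (assumption): gcc tolerant-stats.c -lcmap */
#include <stdio.h>
#include <stdint.h>
#include <corosync/corotypes.h>
#include <corosync/cmap.h>

int main(void)
{
    cmap_handle_t handle;
    cmap_iter_handle_t iter;
    char key_name[CMAP_KEYNAME_MAXLEN + 1];
    size_t value_len;
    cmap_value_types_t type;
    cs_error_t err;

    err = cmap_initialize_map(&handle, CMAP_MAP_STATS);
    if (err != CS_OK) {
        fprintf(stderr, "cmap init failed: %d\n", err);
        return 1;
    }

    err = cmap_iter_init(handle, "stats.ipcs.", &iter);
    if (err != CS_OK) {
        cmap_finalize(handle);
        return 1;
    }

    while (cmap_iter_next(handle, iter, key_name, &value_len, &type) == CS_OK) {
        uint64_t u64;
        size_t len = sizeof(u64);

        if (type != CMAP_VALUETYPE_UINT64)
            continue;   /* keep the example to one value type */

        err = cmap_get(handle, key_name, &u64, &len, &type);
        if (err == CS_ERR_NOT_EXIST)
            continue;   /* owner exited between iter_next and get */
        if (err != CS_OK)
            break;      /* anything else is a real error */

        printf("%s = %llu\n", key_name, (unsigned long long)u64);
    }

    cmap_iter_finalize(handle, iter);
    cmap_finalize(handle);
    return 0;
}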

Comment 10 errata-xmlrpc 2021-11-09 17:49:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (corosync bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4169

