Bug 1962139

Summary: Interleaved stats iterators can cause corosync to crash
Product: Red Hat Enterprise Linux 8
Component: corosync
Version: 8.0
Reporter: Christine Caulfield <ccaulfie>
Assignee: Jan Friesse <jfriesse>
QA Contact: cluster-qe <cluster-qe>
CC: ccaulfie, cluster-maint, jfriesse, kgaillot, phagara
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Keywords: Triaged
Flags: pm-rhel: mirror+
Target Milestone: beta
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version: corosync-3.1.4-1.el8
Doc Type: If docs needed, set a value
Type: Bug
Regression: ---
Clone Of: ---
Clones: 1967485 (view as bug list)
Bug Blocks: 1967485
Last Closed: 2021-11-09 17:49:23 UTC

Description Christine Caulfield 2021-05-19 11:41:28 UTC
Description of problem:
Running two or more corosync-cmapctl -mstats instances at the same time can crash corosync.

Version-Release number of selected component (if applicable):
3.1.2

How reproducible:
Easily

Steps to Reproduce:
1. start corosync (one node is fine)
2. while [ $? = 0 ]; do corosync-cmapctl -mstats; done > /dev/null &
3. while [ $? = 0 ]; do corosync-cmapctl -mstats; done > /dev/null &

Actual results:
corosync will crash

Expected results:
No crash

Additional info:
I'm looking into this at the moment. It seems that the deleted keys may somehow be persisting in the trie even though all of their data has been freed.


Thread 1 "corosync" received signal SIGSEGV, Segmentation fault.
0x000055555557def0 in stats_map_get (key_name=0x7fffe889da78 "stats.ipcs.service0.27346.0x555557b30d00.procname", value=0x0, 
    value_len=0x7fffffffca88, type=0x7fffffffca84) at stats.c:339
339		switch (statinfo->type) {
Missing separate debuginfos, use: yum debuginfo-install corosynclib-3.1.0-3.el8.x86_64 libffi-3.1-22.el8.x86_64 libknet1-1.18-1.el8.x86_64 libknet1-crypto-nss-plugin-1.18-1.el8.x86_64 libqb-1.0.3-12.el8.x86_64 libtasn1-4.13-3.el8+7.x86_64 nspr-4.25.0-2.el8_2.x86_64 nss-3.53.1-17.el8_3.x86_64 nss-softokn-3.53.1-17.el8_3.x86_64 nss-softokn-freebl-3.53.1-17.el8_3.x86_64 nss-util-3.53.1-17.el8_3.x86_64 openssl-libs-1.1.1g-15.el8_3.x86_64 p11-kit-0.23.22-1.el8.x86_64 p11-kit-trust-0.23.22-1.el8.x86_64 sqlite-libs-3.26.0-13.el8.x86_64 zlib-1.2.11-17.el8.x86_64
(gdb) 
(gdb) bt
#0  0x000055555557def0 in stats_map_get (key_name=0x7fffe889da78 "stats.ipcs.service0.27346.0x555557b30d00.procname", value=0x0, 
    value_len=0x7fffffffca88, type=0x7fffffffca84) at stats.c:339
#1  0x000055555556892f in message_handler_req_lib_cmap_get (conn=0x555557b4f6c0, message=0x7fffe889da60) at cmap.c:561
#2  0x0000555555580226 in cs_ipcs_msg_process (c=0x555557b4f6c0, data=0x7fffe889da60, size=288) at ipc_glue.c:558
#3  0x00007ffff7976641 in qb_ipcs_dispatch_connection_request () from /lib64/libqb.so.0
#4  0x00007ffff7972de3 in _poll_dispatch_and_take_back_ () from /lib64/libqb.so.0
#5  0x00007ffff797285c in qb_loop_run () from /lib64/libqb.so.0
#6  0x000055555557b5fc in main (argc=2, argv=0x7fffffffe388, envp=0x7fffffffe3a0) at main.c:1601
(gdb) p item
$1 = (struct stats_item *) 0x555557b75850
(gdb) p *item
$2 = {key_name = 0x555557b7bd60 "\020X\267WUU", cs_conv = 0x60a4f82b}
(gdb) p item->key_name
$3 = 0x555557b7bd60 "\020X\267WUU"
(gdb) p item->cs_conv
$4 = (struct cs_stats_conv *) 0x60a4f82b
(gdb) p *item->cs_conv
Cannot access memory at address 0x60a4f82b
(gdb)

Comment 1 Christine Caulfield 2021-05-19 12:07:31 UTC
It's libqb map iterators. OF COURSE it's libqb map iterators.

If an entry is removed from the map while an existing iterator still holds a pointer to that entry, the application will crash when it tries to dereference it. This seems to be true of both the trie and skiplist map types; the hashtable seems to be fine.
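
To make the failure mode concrete, here is a minimal standalone sketch against libqb's qbmap API (qb/qbmap.h). This is not corosync code, the keys are made up, and error handling is omitted; it just shows an iterator left pointing at a removed entry:

/* Minimal sketch of the iterator-invalidation hazard described above,
 * written against libqb's qbmap API. Build with: gcc demo.c -lqb */
#include <stdio.h>
#include <qb/qbmap.h>

int main(void)
{
    qb_map_t *map = qb_trie_create();
    qb_map_iter_t *iter;
    const char *key;
    void *value;

    qb_map_put(map, "stats.a", "A");
    qb_map_put(map, "stats.b", "B");

    /* The iterator now holds a pointer into the map's node structure. */
    iter = qb_map_iter_create(map);
    key = qb_map_iter_next(iter, &value);

    /* Another request removes the entry the iterator points at ... */
    qb_map_rm(map, key);

    /* ... so advancing the iterator can dereference data that has
     * already been freed: undefined behaviour, and the crash above. */
    key = qb_map_iter_next(iter, &value);
    printf("next key: %s\n", key ? key : "(end)");

    qb_map_iter_free(iter);
    qb_map_destroy(map);
    return 0;
}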

Comment 2 Jan Friesse 2021-06-03 08:20:48 UTC
Reassigning back to corosync because we have a fix there (https://github.com/corosync/corosync/pull/645). LibQB might get a better API, but that's a longer-term project (https://github.com/ClusterLabs/libqb/issues/441).
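
One defensive pattern that removes this class of crash (a sketch only, assuming libqb's qbmap API; not necessarily what the pull request above actually does) is to snapshot the key names while the iterator is live and resolve each key with a fresh lookup afterwards, so that no iterator is ever held across a deletion. The visit_fn/visit_all_keys names here are hypothetical:

/* Sketch of one defensive pattern (not necessarily the change made in
 * the pull request above): copy key names under the iterator, then do
 * a fresh lookup per key, so no map-node pointer outlives a deletion. */
#include <stdlib.h>
#include <string.h>
#include <qb/qbmap.h>

typedef void (*visit_fn)(const char *key, void *value);

static void visit_all_keys(qb_map_t *map, visit_fn visit)
{
    qb_map_iter_t *iter = qb_map_iter_create(map);
    const char *key;
    void *value;
    char **keys = NULL;
    size_t n = 0, i;

    /* Pass 1: copy key names only; no user code runs under the iterator. */
    while ((key = qb_map_iter_next(iter, &value)) != NULL) {
        char **grown = realloc(keys, (n + 1) * sizeof(*keys));
        if (grown == NULL) {
            break;
        }
        keys = grown;
        keys[n++] = strdup(key);
    }
    qb_map_iter_free(iter);

    /* Pass 2: fresh lookup per key; a key deleted in the meantime just
     * comes back NULL and is skipped. */
    for (i = 0; i < n; i++) {
        value = qb_map_get(map, keys[i]);
        if (value != NULL) {
            visit(keys[i], value);
        }
        free(keys[i]);
    }
    free(keys);
}

A key deleted between the snapshot and the lookup then surfaces as a simple lookup miss rather than a dangling pointer, which is consistent with the CS_ERR_NOT_EXIST behaviour reported after the fix below.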

Comment 7 Patrik Hagara 2021-08-19 12:36:12 UTC
before fix
==========

> [root@virt-494 ~]# rpm -q corosync
> corosync-3.1.0-3.el8.x86_64


Start these 3 commands in parallel:
* journalctl -fu corosync
* while corosync-cmapctl -mstats; do continue; done > /dev/null
* while corosync-cmapctl -mstats; do continue; done > /dev/null

Once the second corosync-cmapctl command starts executing, both commands occasionally log non-fatal errors to stderr, e.g.:

> Can't get value of stats.ipcs.service0.3517.0x564d248fd550.invalid_request. Error CS_ERR_LIBRARY
> Can't get value of stats.ipcs.service0.3717.0x564d248f7940.invalid_request. Error CS_ERR_INVALID_PARAM
> Can't get value of stats.ipcs.service0.3730.0x564d248fd550.procname. Error CS_ERR_NOT_EXIST
> Can't get value of stats.ipcs.service0.3781.0x564d248fd550.flow_control_count. Error CS_ERR_LIBRARY
> Can't get value of stats.ipcs.service0.4011.0x564d248fd550.procname. Error CS_ERR_NOT_EXIST
> Can't get value of stats.ipcs.service0.4021.0x564d248fd550.flow_control_count. Error CS_ERR_LIBRARY
> Can't get value of stats.ipcs.service0.4129.0x564d248fac40.requests. Error CS_ERR_INVALID_PARAM
> Can't get value of stats.ipcs.service0.4196.0x564d248fac40.flow_control. Error CS_ERR_LIBRARY
> Can't get value of stats.ipcs.service0.4294.0x564d248fac40.queued. Error CS_ERR_LIBRARY
> Can't get value of stats.ipcs.service0.4329.0x564d248fd550.requests. Error CS_ERR_INVALID_PARAM
> Can't get value of stats.ipcs.service0.4334.0x564d248fd550.flow_control. Error CS_ERR_LIBRARY
> Can't get value of stats.ipcs.service0.4336.0x564d248fd550.dispatched. Error CS_ERR_LIBRARY
> Can't get value of stats.ipcs.service0.4354.0x564d248fd550.flow_control. Error CS_ERR_LIBRARY
> Can't get value of stats.ipcs.service0.4473.0x564d248fd550.procname. Error CS_ERR_NOT_EXIST
> Can't get value of stats.ipcs.service0.4556.0x564d248fac40.flow_control. Error CS_ERR_LIBRARY
> Can't get value of stats.ipcs.service0.4593.0x564d248fd550.responses. Error CS_ERR_LIBRARY
> Can't get value of stats.ipcs.service0.4648.0x564d248fd550.procname. Error CS_ERR_LIBRARY

After a while, corosync crashes:

> Aug 19 14:20:35 virt-494.cluster-qe.lab.eng.brq.redhat.com systemd[1]: corosync.service: Main process exited, code=dumped, status=11/SEGV                                                                                                                                       
> Aug 19 14:20:35 virt-494.cluster-qe.lab.eng.brq.redhat.com systemd[1]: corosync.service: Failed with result 'core-dump'.

and both command loops exit non-zero with the following error message:

> Failed to initialize the cmap API. Error CS_ERR_LIBRARY



after fix
=========

> [root@virt-020 ~]# rpm -q corosync
> corosync-3.1.5-1.el8.x86_64

Same procedure as before, start these 3 commands in parallel:
* journalctl -fu corosync
* while corosync-cmapctl -mstats; do continue; done > /dev/null
* while corosync-cmapctl -mstats; do continue; done > /dev/null

Result: corosync does not crash or log any errors (even after letting the corosync-cmapctl processes loop for a few minutes), but both corosync-cmapctl processes still occasionally print non-fatal errors (their exit code is always zero/success), such as:

> Can't get value of stats.ipcs.service0.77279.0x5625ccf39fb0.responses. Error CS_ERR_NOT_EXIST

The logged error is now always CS_ERR_NOT_EXIST; no CS_ERR_LIBRARY or CS_ERR_INVALID_PARAM errors (as seen before the fix) were observed.

It's always the IPC service0 in these error messages. Examining the full output reveals the following:

> [root@virt-020 ~]# corosync-cmapctl -mstats stats.ipcs.service0
> stats.ipcs.service0.96611.0x5625ccf379a0.dispatched (u64) = 0
> stats.ipcs.service0.96611.0x5625ccf379a0.flow_control (u32) = 0
> stats.ipcs.service0.96611.0x5625ccf379a0.flow_control_count (u64) = 0
> stats.ipcs.service0.96611.0x5625ccf379a0.invalid_request (u64) = 0
> stats.ipcs.service0.96611.0x5625ccf379a0.overload (u64) = 0
> stats.ipcs.service0.96611.0x5625ccf379a0.procname (str) = corosync-cmapct
> stats.ipcs.service0.96611.0x5625ccf379a0.queued (u32) = 0
> stats.ipcs.service0.96611.0x5625ccf379a0.queueing (i32) = 0
> stats.ipcs.service0.96611.0x5625ccf379a0.recv_retries (u64) = 0
> stats.ipcs.service0.96611.0x5625ccf379a0.requests (u64) = 23
> stats.ipcs.service0.96611.0x5625ccf379a0.responses (u64) = 24
> stats.ipcs.service0.96611.0x5625ccf379a0.send_retries (u64) = 0
> stats.ipcs.service0.96611.0x5625ccf379a0.sent (u32) = 0
> stats.ipcs.service0.96612.0x5625ccf39fb0.dispatched (u64) = 0
> stats.ipcs.service0.96612.0x5625ccf39fb0.flow_control (u32) = 0
> stats.ipcs.service0.96612.0x5625ccf39fb0.flow_control_count (u64) = 0
> stats.ipcs.service0.96612.0x5625ccf39fb0.invalid_request (u64) = 0
> stats.ipcs.service0.96612.0x5625ccf39fb0.overload (u64) = 0
> stats.ipcs.service0.96612.0x5625ccf39fb0.procname (str) = corosync-cmapct
> stats.ipcs.service0.96612.0x5625ccf39fb0.queued (u32) = 0
> stats.ipcs.service0.96612.0x5625ccf39fb0.queueing (i32) = 0
> stats.ipcs.service0.96612.0x5625ccf39fb0.recv_retries (u64) = 0
> stats.ipcs.service0.96612.0x5625ccf39fb0.requests (u64) = 119
> stats.ipcs.service0.96612.0x5625ccf39fb0.responses (u64) = 121
> stats.ipcs.service0.96612.0x5625ccf39fb0.send_retries (u64) = 0
> stats.ipcs.service0.96612.0x5625ccf39fb0.sent (u32) = 0

When both of our corosync-cmapctl processes happen to be running at the same time, the IPC service0 entries contain the stats for both: the cmapctl processes are seeing and reading each other's IPC connection stats. When process A is in the middle of reading process B's stats and process B exits, corosync cleans up the connection stats before process A can finish reading them, which results in the above-mentioned error message. With this fix, however, the corosync daemon no longer crashes.

Comment 8 Christine Caulfield 2021-08-19 12:51:19 UTC
Just to be clear, those CS_ERR_NOT_EXIST errors are to be expected when several clients are accessing corosync at the same time and at least one of them is iterating through the stats. There's an unsolvable race condition between getting the client ID and getting the stats associated with it, so that error is quite a reasonable thing to return, and clients should be able to handle it.
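
For client code, handling that race looks roughly like the following sketch against the corosync 3.x cmap API (it assumes cmap_initialize_map()/CMAP_MAP_STATS from corosync/cmap.h, reads only u64 counters for brevity, and skips most error handling):

/* Sketch of a stats reader that tolerates the race described above: a
 * key returned by the iterator may already be gone by the time we fetch
 * its value, so CS_ERR_NOT_EXIST means "skip", not "fail".
 * Build with: gcc reader.c -lcmap */
#include <stdio.h>
#include <corosync/cmap.h>

int main(void)
{
    cmap_handle_t handle;
    cmap_iter_handle_t iter;
    char key_name[CMAP_KEYNAME_MAXLEN + 1];
    size_t value_len;
    cmap_value_types_t type;
    uint64_t u64;
    cs_error_t err;

    if (cmap_initialize_map(&handle, CMAP_MAP_STATS) != CS_OK) {
        return 1;
    }
    if (cmap_iter_init(handle, "stats.ipcs.", &iter) != CS_OK) {
        cmap_finalize(handle);
        return 1;
    }
    while (cmap_iter_next(handle, iter, key_name, &value_len, &type) == CS_OK) {
        if (type != CMAP_VALUETYPE_UINT64) {
            continue;   /* only u64 counters, for brevity */
        }
        err = cmap_get_uint64(handle, key_name, &u64);
        if (err == CS_ERR_NOT_EXIST) {
            continue;   /* connection vanished mid-iteration: expected */
        }
        if (err == CS_OK) {
            printf("%s = %llu\n", key_name, (unsigned long long)u64);
        }
    }
    cmap_iter_finalize(handle, iter);
    cmap_finalize(handle);
    return 0;
}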

There's an argument to be had that corosync-cmapctl should not report the error in this case but, to be honest, it's not really intended for general use.

Comment 10 errata-xmlrpc 2021-11-09 17:49:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (corosync bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4169