Bug 1967485

Summary: Interleaved stats iterators can cause corosync to crash
Product: Red Hat Enterprise Linux 9 Reporter: Jan Friesse <jfriesse>
Component: corosync Assignee: Jan Friesse <jfriesse>
Status: CLOSED CURRENTRELEASE QA Contact: cluster-qe <cluster-qe>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: CentOS Stream CC: bstinson, ccaulfie, cluster-maint, cluster-qe, jfriesse, jwboyer, kgaillot, phagara
Target Milestone: beta Keywords: Triaged
Target Release: --- Flags: pm-rhel: mirror+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: corosync-3.1.4-1.el9 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1962139 Environment:
Last Closed: 2021-12-07 21:30:47 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1962139    
Bug Blocks:    

Description Jan Friesse 2021-06-03 08:22:31 UTC
+++ This bug was initially created as a clone of Bug #1962139 +++

Description of problem:
Running two or more corosync-cmapctl -mstats instances at the same time can crash corosync.

Version-Release number of selected component (if applicable):
3.1.2

How reproducible:
Easily

Steps to Reproduce:
1. start corosync (one node is fine)
2. while [ $? = 0 ]; do corosync-cmapctl -mstats; done > /dev/null &
3. while [ $? = 0 ]; do corosync-cmapctl -mstats; done > /dev/null &

Actual results:
corosync will crash

Expected results:
No crash

Additional info:
I'm looking into this ATM. It looks like the deleted keys may be persisting in the trie somehow, even though all of their data has already been freed.


Thread 1 "corosync" received signal SIGSEGV, Segmentation fault.
0x000055555557def0 in stats_map_get (key_name=0x7fffe889da78 "stats.ipcs.service0.27346.0x555557b30d00.procname", value=0x0, 
    value_len=0x7fffffffca88, type=0x7fffffffca84) at stats.c:339
339		switch (statinfo->type) {
Missing separate debuginfos, use: yum debuginfo-install corosynclib-3.1.0-3.el8.x86_64 libffi-3.1-22.el8.x86_64 libknet1-1.18-1.el8.x86_64 libknet1-crypto-nss-plugin-1.18-1.el8.x86_64 libqb-1.0.3-12.el8.x86_64 libtasn1-4.13-3.el8+7.x86_64 nspr-4.25.0-2.el8_2.x86_64 nss-3.53.1-17.el8_3.x86_64 nss-softokn-3.53.1-17.el8_3.x86_64 nss-softokn-freebl-3.53.1-17.el8_3.x86_64 nss-util-3.53.1-17.el8_3.x86_64 openssl-libs-1.1.1g-15.el8_3.x86_64 p11-kit-0.23.22-1.el8.x86_64 p11-kit-trust-0.23.22-1.el8.x86_64 sqlite-libs-3.26.0-13.el8.x86_64 zlib-1.2.11-17.el8.x86_64
(gdb) 
(gdb) bt
#0  0x000055555557def0 in stats_map_get (key_name=0x7fffe889da78 "stats.ipcs.service0.27346.0x555557b30d00.procname", value=0x0, 
    value_len=0x7fffffffca88, type=0x7fffffffca84) at stats.c:339
#1  0x000055555556892f in message_handler_req_lib_cmap_get (conn=0x555557b4f6c0, message=0x7fffe889da60) at cmap.c:561
#2  0x0000555555580226 in cs_ipcs_msg_process (c=0x555557b4f6c0, data=0x7fffe889da60, size=288) at ipc_glue.c:558
#3  0x00007ffff7976641 in qb_ipcs_dispatch_connection_request () from /lib64/libqb.so.0
#4  0x00007ffff7972de3 in _poll_dispatch_and_take_back_ () from /lib64/libqb.so.0
#5  0x00007ffff797285c in qb_loop_run () from /lib64/libqb.so.0
#6  0x000055555557b5fc in main (argc=2, argv=0x7fffffffe388, envp=0x7fffffffe3a0) at main.c:1601
(gdb) p item
$1 = (struct stats_item *) 0x555557b75850
(gdb) p *item
$2 = {key_name = 0x555557b7bd60 "\020X\267WUU", cs_conv = 0x60a4f82b}
(gdb) p item->key_name
$3 = 0x555557b7bd60 "\020X\267WUU"
(gdb) p item->cs_conv
$4 = (struct cs_stats_conv *) 0x60a4f82b
(gdb) p *item->cs_conv
Cannot access memory at address 0x60a4f82b
(gdb)

--- Additional comment from Christine Caulfield on 2021-05-19 12:07:31 UTC ---

It's libqb map iterators. OF COURSE it's libqb map iterators.

If an entry is removed from the map while an existing iterator holds a pointer to that entry, the application will crash when it tries to dereference that pointer. This seems to be true of both the trie and skiplist map types; hashtable seems to be fine.
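
For anyone who wants to poke at this outside corosync, here is a minimal standalone sketch of the pattern against the public libqb map API (qb/qbmap.h). The keys and values are made up and the sketch deliberately stops short of dereferencing anything after the removal; the point is the sequence (one iterator mid-walk while another code path removes and frees an entry), which is what the stats map hits.

/* Hypothetical standalone illustration of the iterator hazard described
 * above; this is not corosync code. Build with: gcc iter_hazard.c -lqb */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <qb/qbmap.h>

int main(void)
{
	qb_map_t *map = qb_trie_create();
	qb_map_iter_t *iter_a;
	const char *key;
	void *val;

	/* Keys stand in for stats.ipcs.* entries; values are heap-allocated,
	 * like corosync's per-connection stats items. */
	qb_map_put(map, "stats.ipcs.conn1.procname", strdup("client-a"));
	qb_map_put(map, "stats.ipcs.conn2.procname", strdup("client-b"));

	/* Iterator A starts walking the map (think: one corosync-cmapctl
	 * instance in the middle of its -mstats dump). */
	iter_a = qb_map_iter_create(map);
	key = qb_map_iter_next(iter_a, &val);
	printf("A sees %s = %s\n", key, (const char *)val);

	/* Meanwhile the IPC client behind the next entry disconnects, so its
	 * entry is removed and its value freed (what corosync does when a
	 * connection closes). */
	free(qb_map_get(map, "stats.ipcs.conn2.procname"));
	qb_map_rm(map, "stats.ipcs.conn2.procname");

	/* Iterator A still references the entry that was just removed.
	 * Advancing it can hand back a pointer into freed memory; in corosync
	 * that stale pointer is what stats_map_get() dereferenced when it
	 * segfaulted in the backtrace above.  The sketch stops here instead
	 * of dereferencing val. */
	key = qb_map_iter_next(iter_a, &val);
	(void)key;

	qb_map_iter_free(iter_a);
	free(qb_map_get(map, "stats.ipcs.conn1.procname"));
	qb_map_destroy(map);
	return 0;
}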

--- Additional comment from Jan Friesse on 2021-06-03 08:20:48 UTC ---

Reassigning back to corosync because we got a fix there (https://github.com/corosync/corosync/pull/645). LibQB might get a better API, but that's a long(er)-term project (https://github.com/ClusterLabs/libqb/issues/441).

Comment 4 Patrik Hagara 2021-08-19 13:36:51 UTC
See https://bugzilla.redhat.com/show_bug.cgi?id=1962139#c7 for before-the-fix reproducer.


after fix
=========

> [root@virt-517 ~]# rpm -q corosync
> corosync-3.1.5-1.el9.x86_64

Start these 3 commands in parallel:
* journalctl -f -u corosync
* while corosync-cmapctl -mstats; do continue; done > /dev/null
* while corosync-cmapctl -mstats; do continue; done > /dev/null

Result: corosync does not crash or log any errors (even after letting the corosync-cmapctl processes loop for a few minutes), but both corosync-cmapctl processes occasionally print non-fatal errors (exit code is always zero/success) such as:

> Can't get value of stats.ipcs.service0.121900.0x562acaa5c070.sent. Error CS_ERR_NOT_EXIST

It's always the IPC service0 in these error messages. Examining the full output reveals the following:

> [root@virt-517 ~]# corosync-cmapctl -mstats stats.ipcs.service0
> stats.ipcs.service0.177855.0x562aca339120.dispatched (u64) = 0
> stats.ipcs.service0.177855.0x562aca339120.flow_control (u32) = 0
> stats.ipcs.service0.177855.0x562aca339120.flow_control_count (u64) = 0
> stats.ipcs.service0.177855.0x562aca339120.invalid_request (u64) = 0
> stats.ipcs.service0.177855.0x562aca339120.overload (u64) = 0
> stats.ipcs.service0.177855.0x562aca339120.procname (str) = corosync-cmapct
> stats.ipcs.service0.177855.0x562aca339120.queued (u32) = 0
> stats.ipcs.service0.177855.0x562aca339120.queueing (i32) = 0
> stats.ipcs.service0.177855.0x562aca339120.recv_retries (u64) = 0
> stats.ipcs.service0.177855.0x562aca339120.requests (u64) = 23
> stats.ipcs.service0.177855.0x562aca339120.responses (u64) = 24
> stats.ipcs.service0.177855.0x562aca339120.send_retries (u64) = 0
> stats.ipcs.service0.177855.0x562aca339120.sent (u32) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.dispatched (u64) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.flow_control (u32) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.flow_control_count (u64) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.invalid_request (u64) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.overload (u64) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.procname (str) = corosync-cmapct
> stats.ipcs.service0.177856.0x562acaa58fc0.queued (u32) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.queueing (i32) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.recv_retries (u64) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.requests (u64) = 146
> stats.ipcs.service0.177856.0x562acaa58fc0.responses (u64) = 146
> stats.ipcs.service0.177856.0x562acaa58fc0.send_retries (u64) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.sent (u32) = 0

When both of our corosync-cmapctl processes happen to be running at the same time, the IPC service0 entries contain the stats for both: the cmapctl processes see and read each other's IPC connection stats. When process A happens to be in the middle of reading process B's stats and process B exits, corosync cleans up the connection stats before process A can finish reading them, which results in the above-mentioned error message. But with this fix, the corosync daemon no longer crashes.
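
To spell out how a client can cope with that: below is a minimal, hypothetical stats-reader sketch (not the corosync-cmapctl source), assuming the corosync 3.x cmap C API (cmap_initialize_map() with CMAP_MAP_STATS, cmap_iter_init()/cmap_iter_next(), cmap_get()). It simply skips any key for which cmap_get() returns CS_ERR_NOT_EXIST, which matches the behaviour observed above (the tool reports the error but still exits 0).

/* Hypothetical stats reader that skips keys which disappear between
 * listing and reading them.
 * Build with: gcc stats_walk.c -o stats_walk -lcmap */
#include <stdio.h>
#include <corosync/corotypes.h>
#include <corosync/cmap.h>

int main(void)
{
	cmap_handle_t handle;
	cmap_iter_handle_t iter;
	char key_name[CMAP_KEYNAME_MAXLEN + 1];
	size_t value_len;
	cmap_value_types_t type;
	cs_error_t err;

	/* Attach to the stats map, i.e. what "corosync-cmapctl -m stats" talks to. */
	err = cmap_initialize_map(&handle, CMAP_MAP_STATS);
	if (err != CS_OK) {
		fprintf(stderr, "cmap_initialize_map failed: %d\n", err);
		return 1;
	}

	err = cmap_iter_init(handle, "stats.ipcs.", &iter);
	if (err != CS_OK) {
		fprintf(stderr, "cmap_iter_init failed: %d\n", err);
		cmap_finalize(handle);
		return 1;
	}

	while (cmap_iter_next(handle, iter, key_name, &value_len, &type) == CS_OK) {
		char value[256];
		size_t len = sizeof(value);

		err = cmap_get(handle, key_name, value, &len, &type);
		if (err == CS_ERR_NOT_EXIST) {
			/* The connection behind this key went away between
			 * cmap_iter_next() and cmap_get(); with the fix this
			 * race is harmless, so just skip the key. */
			continue;
		}
		if (err != CS_OK) {
			fprintf(stderr, "cmap_get(%s) failed: %d\n", key_name, err);
			break;
		}
		printf("%s (%zu bytes)\n", key_name, len);
	}

	cmap_iter_finalize(handle, iter);
	cmap_finalize(handle);
	return 0;
}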

See also https://bugzilla.redhat.com/show_bug.cgi?id=1962139#c8 as the same applies here -- the remaining CS_ERR_NOT_EXIST errors are fine.