Bug 1967485
| Summary: | Interleaved stats iterators can cause corosync to crash | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Jan Friesse <jfriesse> |
| Component: | corosync | Assignee: | Jan Friesse <jfriesse> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | CentOS Stream | CC: | bstinson, ccaulfie, cluster-maint, cluster-qe, jfriesse, jwboyer, kgaillot, phagara |
| Target Milestone: | beta | Keywords: | Triaged |
| Target Release: | --- | Flags: | pm-rhel: mirror+ |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | corosync-3.1.4-1.el9 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1962139 | Environment: | |
| Last Closed: | 2021-12-07 21:30:47 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1962139 | | |
| Bug Blocks: | | | |
Description
Jan Friesse
2021-06-03 08:22:31 UTC
See https://bugzilla.redhat.com/show_bug.cgi?id=1962139#c7 for the before-the-fix reproducer.

after fix
=========

> [root@virt-517 ~]# rpm -q corosync
> corosync-3.1.5-1.el9.x86_64

Start these 3 commands in parallel:

* journalctl -f -u corosync
* while corosync-cmapctl -mstats; do continue; done > /dev/null
* while corosync-cmapctl -mstats; do continue; done > /dev/null

Result: corosync does not crash or log any errors (even after letting the corosync-cmapctl processes loop for a few minutes), but both corosync-cmapctl processes occasionally print non-fatal errors (the exit code is always zero/success) such as:

> Can't get value of stats.ipcs.service0.121900.0x562acaa5c070.sent. Error CS_ERR_NOT_EXIST

It is always the IPC service0 in these error messages. Examining the full output reveals the following:

> [root@virt-517 ~]# corosync-cmapctl -mstats stats.ipcs.service0
> stats.ipcs.service0.177855.0x562aca339120.dispatched (u64) = 0
> stats.ipcs.service0.177855.0x562aca339120.flow_control (u32) = 0
> stats.ipcs.service0.177855.0x562aca339120.flow_control_count (u64) = 0
> stats.ipcs.service0.177855.0x562aca339120.invalid_request (u64) = 0
> stats.ipcs.service0.177855.0x562aca339120.overload (u64) = 0
> stats.ipcs.service0.177855.0x562aca339120.procname (str) = corosync-cmapct
> stats.ipcs.service0.177855.0x562aca339120.queued (u32) = 0
> stats.ipcs.service0.177855.0x562aca339120.queueing (i32) = 0
> stats.ipcs.service0.177855.0x562aca339120.recv_retries (u64) = 0
> stats.ipcs.service0.177855.0x562aca339120.requests (u64) = 23
> stats.ipcs.service0.177855.0x562aca339120.responses (u64) = 24
> stats.ipcs.service0.177855.0x562aca339120.send_retries (u64) = 0
> stats.ipcs.service0.177855.0x562aca339120.sent (u32) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.dispatched (u64) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.flow_control (u32) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.flow_control_count (u64) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.invalid_request (u64) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.overload (u64) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.procname (str) = corosync-cmapct
> stats.ipcs.service0.177856.0x562acaa58fc0.queued (u32) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.queueing (i32) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.recv_retries (u64) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.requests (u64) = 146
> stats.ipcs.service0.177856.0x562acaa58fc0.responses (u64) = 146
> stats.ipcs.service0.177856.0x562acaa58fc0.send_retries (u64) = 0
> stats.ipcs.service0.177856.0x562acaa58fc0.sent (u32) = 0

When both of our corosync-cmapctl processes happen to be running at the same time, the IPC service0 entries contain the stats for both; the cmapctl processes see and read each other's IPC connection stats. When process A happens to be in the middle of reading process B's stats and process B exits, corosync cleans up the connection stats before process A can finish reading them, which results in the above-mentioned error message. With this fix, however, the corosync daemon no longer crashes.

See also https://bugzilla.redhat.com/show_bug.cgi?id=1962139#c8 -- the same applies here, and the remaining CS_ERR_NOT_EXIST errors are fine.
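The practical consequence for any consumer of the stats map is that keys under stats.ipcs.* can disappear between enumerating a key name and fetching its value, so a CS_ERR_NOT_EXIST during iteration should be treated as benign. As a rough illustration (not part of this bug report, and assuming the public libcmap API from corosync-devel: cmap_initialize_map, cmap_iter_init/cmap_iter_next, cmap_get_uint64), here is a minimal sketch of a stats-map reader that simply skips keys that vanish mid-iteration:

```c
#include <stdio.h>
#include <inttypes.h>
#include <corosync/cmap.h>

int main(void)
{
	cmap_handle_t handle;
	cmap_iter_handle_t iter;
	char key_name[CMAP_KEYNAME_MAXLEN + 1];
	size_t value_len;
	cmap_value_types_t type;
	cs_error_t err;

	/* Attach to the stats map, as corosync-cmapctl -mstats does. */
	err = cmap_initialize_map(&handle, CMAP_MAP_STATS);
	if (err != CS_OK) {
		fprintf(stderr, "cmap_initialize_map failed: %d\n", err);
		return (1);
	}

	err = cmap_iter_init(handle, "stats.ipcs.", &iter);
	if (err != CS_OK) {
		fprintf(stderr, "cmap_iter_init failed: %d\n", err);
		cmap_finalize(handle);
		return (1);
	}

	while (cmap_iter_next(handle, iter, key_name, &value_len, &type) == CS_OK) {
		uint64_t u64;

		if (type != CMAP_VALUETYPE_UINT64) {
			continue;	/* keep the example to u64 counters only */
		}

		err = cmap_get_uint64(handle, key_name, &u64);
		if (err == CS_ERR_NOT_EXIST) {
			/*
			 * The key vanished between iter_next and get, e.g.
			 * another cmap client exited and corosync removed
			 * its IPC connection stats. Non-fatal: skip it.
			 */
			continue;
		}
		if (err != CS_OK) {
			fprintf(stderr, "cmap_get_uint64(%s): error %d\n", key_name, err);
			break;
		}
		printf("%s (u64) = %" PRIu64 "\n", key_name, u64);
	}

	cmap_iter_finalize(handle, iter);
	cmap_finalize(handle);
	return (0);
}
```

Built with something like `gcc -o stats_reader stats_reader.c $(pkg-config --cflags --libs libcmap)`, two copies of this looping in parallel should exercise the same interleaved-iterator path as the corosync-cmapctl reproducer above.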