Bug 2042367 - pacemaker.service fails to start if enabling corosync CFG shutdown tracker returns CS_ERR_TRY_AGAIN
Summary: pacemaker.service fails to start if enabling corosync CFG shutdown tracker returns CS_ERR_TRY_AGAIN
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: pacemaker
Version: 9.0
Hardware: All
OS: All
Priority: high
Severity: medium
Target Milestone: rc
Target Release: 9.0
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 2042550
 
Reported: 2022-01-19 10:50 UTC by Patrik Hagara
Modified: 2023-07-25 18:16 UTC
CC List: 4 users

Fixed In Version: pacemaker-2.1.2-3.el9
Doc Type: Bug Fix
Doc Text:
Cause: In some network configurations, Pacemaker can fail to register with Corosync's shutdown tracker on the first attempt.
Consequence: Pacemaker exits and is respawned repeatedly until systemd gives up, and manual intervention is required to return the node to the cluster.
Fix: Pacemaker now retries the connection if the first attempt fails.
Result: The node is able to rejoin the cluster automatically.
Clone Of:
Cloned To: 2042550
Environment:
Last Closed: 2022-05-17 12:20:40 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Links:
- Red Hat Issue Tracker KCSOPP-1855 (last updated 2023-07-25 18:16:10 UTC)
- Red Hat Issue Tracker RHELPLAN-108695 (last updated 2022-01-19 10:56:37 UTC)
- Red Hat Product Errata RHBA-2022:2293 (last updated 2022-05-17 12:20:51 UTC)

Comment 1 Ken Gaillot 2022-01-19 15:18:12 UTC
Fixed upstream as of commit e8bf0161
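
The gist of the fix, as a minimal sketch (the literal change is in upstream commit e8bf0161; the retry count and delay below are illustrative assumptions, not the upstream values): treat CS_ERR_TRY_AGAIN from the CFG shutdown tracker registration as transient and retry, instead of exiting fatally on the first failure.

    /* Sketch only: retry enabling the CFG shutdown tracker while corosync
     * reports CS_ERR_TRY_AGAIN (i.e. it is still synchronizing), rather
     * than treating the first failure as fatal. */
    #include <corosync/cfg.h>
    #include <unistd.h>

    static cs_error_t
    enable_shutdown_tracker(corosync_cfg_handle_t handle)
    {
        cs_error_t rc = CS_ERR_TRY_AGAIN;

        for (int attempt = 0;
             (attempt < 10) && (rc == CS_ERR_TRY_AGAIN); attempt++) {
            rc = corosync_cfg_trackstart(handle, 0);
            if (rc == CS_ERR_TRY_AGAIN) {
                sleep(1);   /* corosync still syncing; back off and retry */
            }
        }
        return rc;  /* CS_OK on success; caller handles any other error */
    }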

Comment 10 Patrik Hagara 2022-02-01 13:37:01 UTC
env: 2-node cluster with qdevice/qnetd in LMS mode and a slow heuristics script (e.g. `sleep 10`) configured; see the corosync.conf sketch after the reproducer below

reproducer:
1. pcs cluster stop node2 --wait
2. on node2: systemctl start corosync && sleep 3 && systemctl start pacemaker

(or just fence the node, if fencing is configured to reboot it and cluster services are enabled to start on boot)
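
For reference, a minimal corosync.conf sketch of such a qdevice setup — the qnetd host name and the heuristic name/command here are placeholders, not the exact test configuration:

    quorum {
        provider: corosync_votequorum
        device {
            model: net
            net {
                host: qnetd-server.example.com   # placeholder qnetd host
                algorithm: lms
            }
            heuristics {
                mode: on
                # artificially slow heuristic to widen the window where
                # corosync is up but quorum is not yet established
                exec_slow: /usr/bin/sleep 10
            }
        }
    }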


before fix (pacemaker-2.1.2-2.el9)
==================================

> [root@virt-273 ~]# journalctl -u corosync -u corosync-qdevice -u pacemaker -f
> Feb 01 14:18:29 virt-273 systemd[1]: Starting Corosync Cluster Engine...
> Feb 01 14:18:29 virt-273 corosync[3567]:   [MAIN  ] Corosync Cluster Engine 3.1.5 starting up
> Feb 01 14:18:29 virt-273 corosync[3567]:   [MAIN  ] Corosync built-in features: dbus systemd xmlconf vqsim nozzle snmp pie relro bindnow
> Feb 01 14:18:29 virt-273 corosync[3567]:   [TOTEM ] Initializing transport (Kronosnet).
> Feb 01 14:18:29 virt-273 corosync[3567]:   [TOTEM ] totemknet initialized
> Feb 01 14:18:29 virt-273 corosync[3567]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib64/kronosnet/crypto_nss.so
> Feb 01 14:18:29 virt-273 corosync[3567]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
> Feb 01 14:18:29 virt-273 corosync[3567]:   [QB    ] server name: cmap
> Feb 01 14:18:29 virt-273 corosync[3567]:   [SERV  ] Service engine loaded: corosync configuration service [1]
> Feb 01 14:18:29 virt-273 corosync[3567]:   [QB    ] server name: cfg
> Feb 01 14:18:29 virt-273 corosync[3567]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
> Feb 01 14:18:29 virt-273 corosync[3567]:   [QB    ] server name: cpg
> Feb 01 14:18:29 virt-273 corosync[3567]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
> Feb 01 14:18:29 virt-273 corosync[3567]:   [QUORUM] Using quorum provider corosync_votequorum
> Feb 01 14:18:29 virt-273 corosync[3567]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
> Feb 01 14:18:29 virt-273 corosync[3567]:   [QB    ] server name: votequorum
> Feb 01 14:18:29 virt-273 corosync[3567]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
> Feb 01 14:18:29 virt-273 corosync[3567]:   [QB    ] server name: quorum
> Feb 01 14:18:29 virt-273 corosync[3567]:   [TOTEM ] Configuring link 0
> Feb 01 14:18:29 virt-273 corosync[3567]:   [TOTEM ] Configured link number 0: local addr: 2620:52:0:25a4:1800:ff:fe00:111, port=5405
> Feb 01 14:18:29 virt-273 corosync[3567]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:18:29 virt-273 corosync[3567]:   [KNET  ] host: host: 1 has no active links
> Feb 01 14:18:29 virt-273 corosync[3567]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:18:29 virt-273 corosync[3567]:   [KNET  ] host: host: 1 has no active links
> Feb 01 14:18:29 virt-273 corosync[3567]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:18:29 virt-273 corosync[3567]:   [KNET  ] host: host: 1 has no active links
> Feb 01 14:18:29 virt-273 corosync[3567]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 0)
> Feb 01 14:18:29 virt-273 corosync[3567]:   [KNET  ] host: host: 2 has no active links
> Feb 01 14:18:29 virt-273 corosync[3567]:   [QUORUM] Sync members[1]: 2
> Feb 01 14:18:29 virt-273 corosync[3567]:   [QUORUM] Sync joined[1]: 2
> Feb 01 14:18:29 virt-273 corosync[3567]:   [TOTEM ] A new membership (2.6f) was formed. Members joined: 2
> Feb 01 14:18:29 virt-273 corosync[3567]:   [QUORUM] Members[1]: 2
> Feb 01 14:18:29 virt-273 corosync[3567]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Feb 01 14:18:29 virt-273 systemd[1]: Started Corosync Cluster Engine.
> Feb 01 14:18:31 virt-273 corosync[3567]:   [KNET  ] rx: host: 1 link: 0 is up
> Feb 01 14:18:31 virt-273 corosync[3567]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:18:31 virt-273 corosync[3567]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 1157 to 1365
> Feb 01 14:18:31 virt-273 corosync[3567]:   [KNET  ] pmtud: Global data MTU changed to: 1365
> Feb 01 14:18:31 virt-273 corosync[3567]:   [QUORUM] Sync members[2]: 1 2
> Feb 01 14:18:31 virt-273 corosync[3567]:   [QUORUM] Sync joined[1]: 1
> Feb 01 14:18:31 virt-273 corosync[3567]:   [TOTEM ] A new membership (1.73) was formed. Members joined: 1
> Feb 01 14:18:32 virt-273 systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:32 virt-273 pacemakerd[3577]:  notice: Additional logging available in /var/log/pacemaker/pacemaker.log
> Feb 01 14:18:32 virt-273 pacemakerd[3577]:  notice: Starting Pacemaker 2.1.2-2.el9
> Feb 01 14:18:32 virt-273 pacemakerd[3577]:  crit: Could not enable Corosync CFG shutdown tracker: CS_ERR_TRY_AGAIN
> Feb 01 14:18:32 virt-273 systemd[1]: pacemaker.service: Main process exited, code=exited, status=76/PROTOCOL
> Feb 01 14:18:32 virt-273 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
> Feb 01 14:18:32 virt-273 systemd[1]: pacemaker.service: Scheduled restart job, restart counter is at 1.
> Feb 01 14:18:32 virt-273 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:32 virt-273 systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:32 virt-273 pacemakerd[3579]:  notice: Additional logging available in /var/log/pacemaker/pacemaker.log
> Feb 01 14:18:32 virt-273 pacemakerd[3579]:  notice: Starting Pacemaker 2.1.2-2.el9
> Feb 01 14:18:32 virt-273 pacemakerd[3579]:  crit: Could not enable Corosync CFG shutdown tracker: CS_ERR_TRY_AGAIN
> Feb 01 14:18:32 virt-273 systemd[1]: pacemaker.service: Main process exited, code=exited, status=76/PROTOCOL
> Feb 01 14:18:32 virt-273 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Scheduled restart job, restart counter is at 2.
> Feb 01 14:18:33 virt-273 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:33 virt-273 systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:33 virt-273 pacemakerd[3580]:  notice: Additional logging available in /var/log/pacemaker/pacemaker.log
> Feb 01 14:18:33 virt-273 pacemakerd[3580]:  notice: Starting Pacemaker 2.1.2-2.el9
> Feb 01 14:18:33 virt-273 pacemakerd[3580]:  crit: Could not enable Corosync CFG shutdown tracker: CS_ERR_TRY_AGAIN
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Main process exited, code=exited, status=76/PROTOCOL
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Scheduled restart job, restart counter is at 3.
> Feb 01 14:18:33 virt-273 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:33 virt-273 systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:33 virt-273 pacemakerd[3581]:  notice: Additional logging available in /var/log/pacemaker/pacemaker.log
> Feb 01 14:18:33 virt-273 pacemakerd[3581]:  notice: Starting Pacemaker 2.1.2-2.el9
> Feb 01 14:18:33 virt-273 pacemakerd[3581]:  crit: Could not enable Corosync CFG shutdown tracker: CS_ERR_TRY_AGAIN
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Main process exited, code=exited, status=76/PROTOCOL
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Scheduled restart job, restart counter is at 4.
> Feb 01 14:18:33 virt-273 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:33 virt-273 systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:33 virt-273 pacemakerd[3582]:  notice: Additional logging available in /var/log/pacemaker/pacemaker.log
> Feb 01 14:18:33 virt-273 pacemakerd[3582]:  notice: Starting Pacemaker 2.1.2-2.el9
> Feb 01 14:18:33 virt-273 pacemakerd[3582]:  crit: Could not enable Corosync CFG shutdown tracker: CS_ERR_TRY_AGAIN
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Main process exited, code=exited, status=76/PROTOCOL
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Scheduled restart job, restart counter is at 5.
> Feb 01 14:18:33 virt-273 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Start request repeated too quickly.
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
> Feb 01 14:18:33 virt-273 systemd[1]: Failed to start Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:41 virt-273 corosync[3567]:   [QUORUM] This node is within the primary component and will provide service.
> Feb 01 14:18:41 virt-273 corosync[3567]:   [QUORUM] Members[2]: 1 2
> Feb 01 14:18:41 virt-273 corosync[3567]:   [MAIN  ] Completed service synchronization, ready to provide service.

Result: pacemaker.service repeatedly fails with CS_ERR_TRY_AGAIN when started while corosync is still establishing quorum (delayed here by the slow heuristics script); after five rapid failures systemd hits its start rate limit ("Start request repeated too quickly") and gives up.
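
On the affected build, returning the node to the cluster then takes manual intervention along these lines (once corosync has established quorum; a sketch, not taken from the original report):

    # clear systemd's start-rate-limit failure state, then start pacemaker again
    systemctl reset-failed pacemaker.service
    systemctl start pacemaker.service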


after fix (pacemaker-2.1.2-3.el9)
=================================

> [root@virt-007 ~]# journalctl -u corosync -u corosync-qdevice -u pacemaker -f
> Feb 01 14:34:42 virt-007 systemd[1]: Starting Corosync Cluster Engine...
> Feb 01 14:34:42 virt-007 corosync[129907]:   [MAIN  ] Corosync Cluster Engine 3.1.5 starting up
> Feb 01 14:34:42 virt-007 corosync[129907]:   [MAIN  ] Corosync built-in features: dbus systemd xmlconf vqsim nozzle snmp pie relro bindnow
> Feb 01 14:34:42 virt-007 corosync[129907]:   [TOTEM ] Initializing transport (Kronosnet).
> Feb 01 14:34:43 virt-007 corosync[129907]:   [TOTEM ] totemknet initialized
> Feb 01 14:34:43 virt-007 corosync[129907]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib64/kronosnet/crypto_nss.so
> Feb 01 14:34:43 virt-007 corosync[129907]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
> Feb 01 14:34:43 virt-007 corosync[129907]:   [QB    ] server name: cmap
> Feb 01 14:34:43 virt-007 corosync[129907]:   [SERV  ] Service engine loaded: corosync configuration service [1]
> Feb 01 14:34:43 virt-007 corosync[129907]:   [QB    ] server name: cfg
> Feb 01 14:34:43 virt-007 corosync[129907]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
> Feb 01 14:34:43 virt-007 corosync[129907]:   [QB    ] server name: cpg
> Feb 01 14:34:43 virt-007 corosync[129907]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
> Feb 01 14:34:43 virt-007 corosync[129907]:   [QUORUM] Using quorum provider corosync_votequorum
> Feb 01 14:34:43 virt-007 corosync[129907]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
> Feb 01 14:34:43 virt-007 corosync[129907]:   [QB    ] server name: votequorum
> Feb 01 14:34:43 virt-007 corosync[129907]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
> Feb 01 14:34:43 virt-007 corosync[129907]:   [QB    ] server name: quorum
> Feb 01 14:34:43 virt-007 corosync[129907]:   [TOTEM ] Configuring link 0
> Feb 01 14:34:43 virt-007 corosync[129907]:   [TOTEM ] Configured link number 0: local addr: 2620:52:0:25a4:1800:ff:fe00:7, port=5405
> Feb 01 14:34:43 virt-007 corosync[129907]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:34:43 virt-007 corosync[129907]:   [KNET  ] host: host: 1 has no active links
> Feb 01 14:34:43 virt-007 corosync[129907]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:34:43 virt-007 corosync[129907]:   [KNET  ] host: host: 1 has no active links
> Feb 01 14:34:43 virt-007 corosync[129907]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:34:43 virt-007 corosync[129907]:   [KNET  ] host: host: 1 has no active links
> Feb 01 14:34:43 virt-007 corosync[129907]:   [QUORUM] Sync members[1]: 2
> Feb 01 14:34:43 virt-007 corosync[129907]:   [QUORUM] Sync joined[1]: 2
> Feb 01 14:34:43 virt-007 corosync[129907]:   [TOTEM ] A new membership (2.1b) was formed. Members joined: 2
> Feb 01 14:34:43 virt-007 corosync[129907]:   [QUORUM] Members[1]: 2
> Feb 01 14:34:43 virt-007 corosync[129907]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Feb 01 14:34:43 virt-007 systemd[1]: Started Corosync Cluster Engine.
> Feb 01 14:34:44 virt-007 corosync[129907]:   [KNET  ] rx: host: 1 link: 0 is up
> Feb 01 14:34:44 virt-007 corosync[129907]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:34:45 virt-007 corosync[129907]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 1157 to 1365
> Feb 01 14:34:45 virt-007 corosync[129907]:   [KNET  ] pmtud: Global data MTU changed to: 1365
> Feb 01 14:34:45 virt-007 corosync[129907]:   [QUORUM] Sync members[2]: 1 2
> Feb 01 14:34:45 virt-007 corosync[129907]:   [QUORUM] Sync joined[1]: 1
> Feb 01 14:34:45 virt-007 corosync[129907]:   [TOTEM ] A new membership (1.1f) was formed. Members joined: 1
> Feb 01 14:34:46 virt-007 systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Feb 01 14:34:46 virt-007 pacemakerd[129921]:  notice: Additional logging available in /var/log/pacemaker/pacemaker.log
> Feb 01 14:34:46 virt-007 pacemakerd[129921]:  notice: Starting Pacemaker 2.1.2-3.el9
> Feb 01 14:34:56 virt-007 corosync[129907]:   [QUORUM] This node is within the primary component and will provide service.
> Feb 01 14:34:56 virt-007 corosync[129907]:   [QUORUM] Members[2]: 1 2
> Feb 01 14:34:56 virt-007 corosync[129907]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Feb 01 14:35:01 virt-007 pacemakerd[129921]:  notice: Pacemaker daemon successfully started and accepting connections

Result: pacemaker.service no longer exits with a fatal error while corosync is establishing quorum, but patiently retries until startup succeeds.

Comment 12 errata-xmlrpc 2022-05-17 12:20:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (new packages: pacemaker), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2293

