Bug 2042367

Summary: pacemaker.service fails to start if enabling corosync CFG shutdown tracker returns CS_ERR_TRY_AGAIN
Product: Red Hat Enterprise Linux 9
Component: pacemaker
Version: 9.0
Status: CLOSED ERRATA
Severity: medium
Priority: high
Reporter: Patrik Hagara <phagara>
Assignee: Ken Gaillot <kgaillot>
QA Contact: cluster-qe <cluster-qe>
CC: cluster-maint, msmazova, nwahl, sbradley
Keywords: Triaged
Target Milestone: rc
Target Release: 9.0
Hardware: All
OS: All
Fixed In Version: pacemaker-2.1.2-3.el9
Doc Type: Bug Fix
Doc Text:
Cause: In some network configurations, Pacemaker can fail to register with Corosync's shutdown tracker on the first attempt.
Consequence: Pacemaker exits and is respawned repeatedly until systemd gives up, and manual intervention is required to return the node to the cluster.
Fix: Pacemaker now retries the connection if the first attempt fails.
Result: The node is able to rejoin the cluster automatically.
Type: Bug
Last Closed: 2022-05-17 12:20:40 UTC
Bug Blocks: 2042550

Comment 1 Ken Gaillot 2022-01-19 15:18:12 UTC
Fixed upstream as of commit e8bf0161
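
The fix makes pacemakerd retry instead of treating a transient CS_ERR_TRY_AGAIN from the CFG shutdown tracker as fatal. A minimal sketch of that retry pattern follows (the wrapper name, retry limit, and delay are illustrative assumptions, not the actual upstream code):

    /*
     * Minimal sketch of the retry pattern (not the actual upstream patch):
     * the wrapper name, retry limit, and delay are illustrative assumptions.
     */
    #include <unistd.h>              /* sleep() */
    #include <corosync/corotypes.h>  /* cs_error_t, CS_OK, CS_ERR_TRY_AGAIN */
    #include <corosync/cfg.h>        /* corosync_cfg_trackstart() */

    static cs_error_t
    enable_shutdown_tracker(corosync_cfg_handle_t cfg_handle)
    {
        cs_error_t rc = CS_OK;

        for (int attempt = 0; attempt < 5; attempt++) {
            rc = corosync_cfg_trackstart(cfg_handle, 0);
            if (rc != CS_ERR_TRY_AGAIN) {
                break;   /* success, or an error that retrying won't fix */
            }
            sleep(1);    /* corosync is still synchronizing; back off */
        }
        return rc;       /* caller still treats a final non-CS_OK as fatal */
    }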

Comment 10 Patrik Hagara 2022-02-01 13:37:01 UTC
env: 2-node cluster with qdevice/qnetd in lms mode, with a slow heuristics script (e.g. `sleep 10`) configured
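
For reference, such an environment can be set up roughly as follows in corosync.conf (a sketch; the qnetd host name and script path are illustrative):

    quorum {
        provider: corosync_votequorum
        device {
            model: net
            net {
                # qnetd server name is illustrative
                host: qnetd.example.com
                algorithm: lms
            }
            heuristics {
                # run heuristics during membership sync so they delay quorum
                mode: sync
                # deliberately slow heuristic to widen the reproduction window
                exec_slow: /usr/bin/sleep 10
            }
        }
    }

The sync-mode heuristic delays quorum on membership changes, which widens the window in which pacemakerd's CFG shutdown-tracker registration gets CS_ERR_TRY_AGAIN.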

reproducer:
1. pcs cluster stop node2 --wait
2. on node2: systemctl start corosync && sleep 3 && systemctl start pacemaker

(or just fence the node, if fencing is configured to reboot it and cluster services are enabled to start on boot)


before fix (pacemaker-2.1.2-2.el9)
==================================

> [root@virt-273 ~]# journalctl -u corosync -u corosync-qdevice -u pacemaker -f
> Feb 01 14:18:29 virt-273 systemd[1]: Starting Corosync Cluster Engine...
> Feb 01 14:18:29 virt-273 corosync[3567]:   [MAIN  ] Corosync Cluster Engine 3.1.5 starting up
> Feb 01 14:18:29 virt-273 corosync[3567]:   [MAIN  ] Corosync built-in features: dbus systemd xmlconf vqsim nozzle snmp pie relro bindnow
> Feb 01 14:18:29 virt-273 corosync[3567]:   [TOTEM ] Initializing transport (Kronosnet).
> Feb 01 14:18:29 virt-273 corosync[3567]:   [TOTEM ] totemknet initialized
> Feb 01 14:18:29 virt-273 corosync[3567]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib64/kronosnet/crypto_nss.so
> Feb 01 14:18:29 virt-273 corosync[3567]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
> Feb 01 14:18:29 virt-273 corosync[3567]:   [QB    ] server name: cmap
> Feb 01 14:18:29 virt-273 corosync[3567]:   [SERV  ] Service engine loaded: corosync configuration service [1]
> Feb 01 14:18:29 virt-273 corosync[3567]:   [QB    ] server name: cfg
> Feb 01 14:18:29 virt-273 corosync[3567]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
> Feb 01 14:18:29 virt-273 corosync[3567]:   [QB    ] server name: cpg
> Feb 01 14:18:29 virt-273 corosync[3567]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
> Feb 01 14:18:29 virt-273 corosync[3567]:   [QUORUM] Using quorum provider corosync_votequorum
> Feb 01 14:18:29 virt-273 corosync[3567]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
> Feb 01 14:18:29 virt-273 corosync[3567]:   [QB    ] server name: votequorum
> Feb 01 14:18:29 virt-273 corosync[3567]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
> Feb 01 14:18:29 virt-273 corosync[3567]:   [QB    ] server name: quorum
> Feb 01 14:18:29 virt-273 corosync[3567]:   [TOTEM ] Configuring link 0
> Feb 01 14:18:29 virt-273 corosync[3567]:   [TOTEM ] Configured link number 0: local addr: 2620:52:0:25a4:1800:ff:fe00:111, port=5405
> Feb 01 14:18:29 virt-273 corosync[3567]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:18:29 virt-273 corosync[3567]:   [KNET  ] host: host: 1 has no active links
> Feb 01 14:18:29 virt-273 corosync[3567]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:18:29 virt-273 corosync[3567]:   [KNET  ] host: host: 1 has no active links
> Feb 01 14:18:29 virt-273 corosync[3567]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:18:29 virt-273 corosync[3567]:   [KNET  ] host: host: 1 has no active links
> Feb 01 14:18:29 virt-273 corosync[3567]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 0)
> Feb 01 14:18:29 virt-273 corosync[3567]:   [KNET  ] host: host: 2 has no active links
> Feb 01 14:18:29 virt-273 corosync[3567]:   [QUORUM] Sync members[1]: 2
> Feb 01 14:18:29 virt-273 corosync[3567]:   [QUORUM] Sync joined[1]: 2
> Feb 01 14:18:29 virt-273 corosync[3567]:   [TOTEM ] A new membership (2.6f) was formed. Members joined: 2
> Feb 01 14:18:29 virt-273 corosync[3567]:   [QUORUM] Members[1]: 2
> Feb 01 14:18:29 virt-273 corosync[3567]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Feb 01 14:18:29 virt-273 systemd[1]: Started Corosync Cluster Engine.
> Feb 01 14:18:31 virt-273 corosync[3567]:   [KNET  ] rx: host: 1 link: 0 is up
> Feb 01 14:18:31 virt-273 corosync[3567]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:18:31 virt-273 corosync[3567]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 1157 to 1365
> Feb 01 14:18:31 virt-273 corosync[3567]:   [KNET  ] pmtud: Global data MTU changed to: 1365
> Feb 01 14:18:31 virt-273 corosync[3567]:   [QUORUM] Sync members[2]: 1 2
> Feb 01 14:18:31 virt-273 corosync[3567]:   [QUORUM] Sync joined[1]: 1
> Feb 01 14:18:31 virt-273 corosync[3567]:   [TOTEM ] A new membership (1.73) was formed. Members joined: 1
> Feb 01 14:18:32 virt-273 systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:32 virt-273 pacemakerd[3577]:  notice: Additional logging available in /var/log/pacemaker/pacemaker.log
> Feb 01 14:18:32 virt-273 pacemakerd[3577]:  notice: Starting Pacemaker 2.1.2-2.el9
> Feb 01 14:18:32 virt-273 pacemakerd[3577]:  crit: Could not enable Corosync CFG shutdown tracker: CS_ERR_TRY_AGAIN
> Feb 01 14:18:32 virt-273 systemd[1]: pacemaker.service: Main process exited, code=exited, status=76/PROTOCOL
> Feb 01 14:18:32 virt-273 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
> Feb 01 14:18:32 virt-273 systemd[1]: pacemaker.service: Scheduled restart job, restart counter is at 1.
> Feb 01 14:18:32 virt-273 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:32 virt-273 systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:32 virt-273 pacemakerd[3579]:  notice: Additional logging available in /var/log/pacemaker/pacemaker.log
> Feb 01 14:18:32 virt-273 pacemakerd[3579]:  notice: Starting Pacemaker 2.1.2-2.el9
> Feb 01 14:18:32 virt-273 pacemakerd[3579]:  crit: Could not enable Corosync CFG shutdown tracker: CS_ERR_TRY_AGAIN
> Feb 01 14:18:32 virt-273 systemd[1]: pacemaker.service: Main process exited, code=exited, status=76/PROTOCOL
> Feb 01 14:18:32 virt-273 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Scheduled restart job, restart counter is at 2.
> Feb 01 14:18:33 virt-273 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:33 virt-273 systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:33 virt-273 pacemakerd[3580]:  notice: Additional logging available in /var/log/pacemaker/pacemaker.log
> Feb 01 14:18:33 virt-273 pacemakerd[3580]:  notice: Starting Pacemaker 2.1.2-2.el9
> Feb 01 14:18:33 virt-273 pacemakerd[3580]:  crit: Could not enable Corosync CFG shutdown tracker: CS_ERR_TRY_AGAIN
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Main process exited, code=exited, status=76/PROTOCOL
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Scheduled restart job, restart counter is at 3.
> Feb 01 14:18:33 virt-273 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:33 virt-273 systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:33 virt-273 pacemakerd[3581]:  notice: Additional logging available in /var/log/pacemaker/pacemaker.log
> Feb 01 14:18:33 virt-273 pacemakerd[3581]:  notice: Starting Pacemaker 2.1.2-2.el9
> Feb 01 14:18:33 virt-273 pacemakerd[3581]:  crit: Could not enable Corosync CFG shutdown tracker: CS_ERR_TRY_AGAIN
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Main process exited, code=exited, status=76/PROTOCOL
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Scheduled restart job, restart counter is at 4.
> Feb 01 14:18:33 virt-273 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:33 virt-273 systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:33 virt-273 pacemakerd[3582]:  notice: Additional logging available in /var/log/pacemaker/pacemaker.log
> Feb 01 14:18:33 virt-273 pacemakerd[3582]:  notice: Starting Pacemaker 2.1.2-2.el9
> Feb 01 14:18:33 virt-273 pacemakerd[3582]:  crit: Could not enable Corosync CFG shutdown tracker: CS_ERR_TRY_AGAIN
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Main process exited, code=exited, status=76/PROTOCOL
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Scheduled restart job, restart counter is at 5.
> Feb 01 14:18:33 virt-273 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Start request repeated too quickly.
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
> Feb 01 14:18:33 virt-273 systemd[1]: Failed to start Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:41 virt-273 corosync[3567]:   [QUORUM] This node is within the primary component and will provide service.
> Feb 01 14:18:41 virt-273 corosync[3567]:   [QUORUM] Members[2]: 1 2
> Feb 01 14:18:41 virt-273 corosync[3567]:   [MAIN  ] Completed service synchronization, ready to provide service.

Result: before the fix, pacemaker.service fails immediately with CS_ERR_TRY_AGAIN on every start attempt while corosync is still establishing quorum (delayed by the slow heuristics script); systemd respawns it until the restart counter reaches 5 and then gives up, leaving the node out of the cluster.
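
The five rapid restarts followed by "Start request repeated too quickly" match systemd's default start rate limiting; in unit-file terms the behavior corresponds roughly to the following (illustrative, not necessarily pacemaker's shipped unit file):

    [Unit]
    # systemd defaults: give up after 5 failed starts within 10 seconds
    StartLimitIntervalSec=10s
    StartLimitBurst=5

    [Service]
    Restart=on-failure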


after fix (pacemaker-2.1.2-3.el9)
=================================

> [root@virt-007 ~]# journalctl -u corosync -u corosync-qdevice -u pacemaker -f
> Feb 01 14:34:42 virt-007 systemd[1]: Starting Corosync Cluster Engine...
> Feb 01 14:34:42 virt-007 corosync[129907]:   [MAIN  ] Corosync Cluster Engine 3.1.5 starting up
> Feb 01 14:34:42 virt-007 corosync[129907]:   [MAIN  ] Corosync built-in features: dbus systemd xmlconf vqsim nozzle snmp pie relro bindnow
> Feb 01 14:34:42 virt-007 corosync[129907]:   [TOTEM ] Initializing transport (Kronosnet).
> Feb 01 14:34:43 virt-007 corosync[129907]:   [TOTEM ] totemknet initialized
> Feb 01 14:34:43 virt-007 corosync[129907]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib64/kronosnet/crypto_nss.so
> Feb 01 14:34:43 virt-007 corosync[129907]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
> Feb 01 14:34:43 virt-007 corosync[129907]:   [QB    ] server name: cmap
> Feb 01 14:34:43 virt-007 corosync[129907]:   [SERV  ] Service engine loaded: corosync configuration service [1]
> Feb 01 14:34:43 virt-007 corosync[129907]:   [QB    ] server name: cfg
> Feb 01 14:34:43 virt-007 corosync[129907]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
> Feb 01 14:34:43 virt-007 corosync[129907]:   [QB    ] server name: cpg
> Feb 01 14:34:43 virt-007 corosync[129907]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
> Feb 01 14:34:43 virt-007 corosync[129907]:   [QUORUM] Using quorum provider corosync_votequorum
> Feb 01 14:34:43 virt-007 corosync[129907]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
> Feb 01 14:34:43 virt-007 corosync[129907]:   [QB    ] server name: votequorum
> Feb 01 14:34:43 virt-007 corosync[129907]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
> Feb 01 14:34:43 virt-007 corosync[129907]:   [QB    ] server name: quorum
> Feb 01 14:34:43 virt-007 corosync[129907]:   [TOTEM ] Configuring link 0
> Feb 01 14:34:43 virt-007 corosync[129907]:   [TOTEM ] Configured link number 0: local addr: 2620:52:0:25a4:1800:ff:fe00:7, port=5405
> Feb 01 14:34:43 virt-007 corosync[129907]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:34:43 virt-007 corosync[129907]:   [KNET  ] host: host: 1 has no active links
> Feb 01 14:34:43 virt-007 corosync[129907]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:34:43 virt-007 corosync[129907]:   [KNET  ] host: host: 1 has no active links
> Feb 01 14:34:43 virt-007 corosync[129907]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:34:43 virt-007 corosync[129907]:   [KNET  ] host: host: 1 has no active links
> Feb 01 14:34:43 virt-007 corosync[129907]:   [QUORUM] Sync members[1]: 2
> Feb 01 14:34:43 virt-007 corosync[129907]:   [QUORUM] Sync joined[1]: 2
> Feb 01 14:34:43 virt-007 corosync[129907]:   [TOTEM ] A new membership (2.1b) was formed. Members joined: 2
> Feb 01 14:34:43 virt-007 corosync[129907]:   [QUORUM] Members[1]: 2
> Feb 01 14:34:43 virt-007 corosync[129907]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Feb 01 14:34:43 virt-007 systemd[1]: Started Corosync Cluster Engine.
> Feb 01 14:34:44 virt-007 corosync[129907]:   [KNET  ] rx: host: 1 link: 0 is up
> Feb 01 14:34:44 virt-007 corosync[129907]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:34:45 virt-007 corosync[129907]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 1157 to 1365
> Feb 01 14:34:45 virt-007 corosync[129907]:   [KNET  ] pmtud: Global data MTU changed to: 1365
> Feb 01 14:34:45 virt-007 corosync[129907]:   [QUORUM] Sync members[2]: 1 2
> Feb 01 14:34:45 virt-007 corosync[129907]:   [QUORUM] Sync joined[1]: 1
> Feb 01 14:34:45 virt-007 corosync[129907]:   [TOTEM ] A new membership (1.1f) was formed. Members joined: 1
> Feb 01 14:34:46 virt-007 systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Feb 01 14:34:46 virt-007 pacemakerd[129921]:  notice: Additional logging available in /var/log/pacemaker/pacemaker.log
> Feb 01 14:34:46 virt-007 pacemakerd[129921]:  notice: Starting Pacemaker 2.1.2-3.el9
> Feb 01 14:34:56 virt-007 corosync[129907]:   [QUORUM] This node is within the primary component and will provide service.
> Feb 01 14:34:56 virt-007 corosync[129907]:   [QUORUM] Members[2]: 1 2
> Feb 01 14:34:56 virt-007 corosync[129907]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Feb 01 14:35:01 virt-007 pacemakerd[129921]:  notice: Pacemaker daemon successfully started and accepting connections

Result: pacemaker.service no longer exits with a fatal error while corosync is establishing quorum; it retries patiently and completes startup once quorum is reached.

Comment 12 errata-xmlrpc 2022-05-17 12:20:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (new packages: pacemaker), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2293