Bug 2042367 - pacemaker.service fails to start if enabling corosync CFG shutdown tracker returns CS_ERR_TRY_AGAIN
Summary: pacemaker.service fails to start if enabling corosync CFG shutdown tracker returns CS_ERR_TRY_AGAIN
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: pacemaker
Version: 9.0
Hardware: All
OS: All
Priority: high
Severity: medium
Target Milestone: rc
Target Release: 9.0
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 2042550
 
Reported: 2022-01-19 10:50 UTC by Patrik Hagara
Modified: 2023-07-25 18:16 UTC
CC List: 4 users

Fixed In Version: pacemaker-2.1.2-3.el9
Doc Type: Bug Fix
Doc Text:
Cause: In some network configurations, Pacemaker can fail to register with Corosync's shutdown tracker on the first attempt.
Consequence: Pacemaker exits and is respawned repeatedly until systemd gives up, and manual intervention is required to return the node to the cluster.
Fix: Pacemaker now retries the connection if the first attempt fails.
Result: The node is able to rejoin the cluster automatically.
Clone Of:
Cloned To: 2042550
Environment:
Last Closed: 2022-05-17 12:20:40 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Links:
- Red Hat Issue Tracker KCSOPP-1855 (last updated 2023-07-25 18:16:10 UTC)
- Red Hat Issue Tracker RHELPLAN-108695 (last updated 2022-01-19 10:56:37 UTC)
- Red Hat Product Errata RHBA-2022:2293 (last updated 2022-05-17 12:20:51 UTC)

Comment 1 Ken Gaillot 2022-01-19 15:18:12 UTC
Fixed upstream as of commit e8bf0161
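
The gist of the fix, as a minimal sketch (the literal change is in upstream commit e8bf0161; the retry count and delay below are illustrative assumptions, not the upstream values): treat CS_ERR_TRY_AGAIN from the CFG shutdown tracker registration as transient and retry, instead of exiting fatally on the first failure.

    /* Sketch only: retry enabling the CFG shutdown tracker while corosync
     * reports CS_ERR_TRY_AGAIN (i.e. it is still synchronizing), rather
     * than treating the first failure as fatal. */
    #include <corosync/cfg.h>
    #include <unistd.h>

    static cs_error_t
    enable_shutdown_tracker(corosync_cfg_handle_t handle)
    {
        cs_error_t rc = CS_ERR_TRY_AGAIN;

        for (int attempt = 0;
             (attempt < 10) && (rc == CS_ERR_TRY_AGAIN); attempt++) {
            rc = corosync_cfg_trackstart(handle, 0);
            if (rc == CS_ERR_TRY_AGAIN) {
                sleep(1);   /* corosync still syncing; back off and retry */
            }
        }
        return rc;  /* CS_OK on success; caller handles any other error */
    }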

Comment 10 Patrik Hagara 2022-02-01 13:37:01 UTC
env: 2-node cluster with qdevice/qnetd in LMS mode and a slow heuristics script (e.g. `sleep 10`) configured; see the corosync.conf sketch after the reproducer below

reproducer:
1. pcs cluster stop node2 --wait
2. on node2: systemctl start corosync && sleep 3 && systemctl start pacemaker

(or just fence the node, if fencing is configured to reboot it and cluster services are enabled to start on boot)
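
For reference, a minimal corosync.conf sketch of such a qdevice setup — the qnetd host name and the heuristic name/command here are placeholders, not the exact test configuration:

    quorum {
        provider: corosync_votequorum
        device {
            model: net
            net {
                host: qnetd-server.example.com   # placeholder qnetd host
                algorithm: lms
            }
            heuristics {
                mode: on
                # artificially slow heuristic to widen the window where
                # corosync is up but quorum is not yet established
                exec_slow: /usr/bin/sleep 10
            }
        }
    }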


before fix (pacemaker-2.1.2-2.el9)
==================================

> [root@virt-273 ~]# journalctl -u corosync -u corosync-qdevice -u pacemaker -f
> Feb 01 14:18:29 virt-273 systemd[1]: Starting Corosync Cluster Engine...
> Feb 01 14:18:29 virt-273 corosync[3567]:   [MAIN  ] Corosync Cluster Engine 3.1.5 starting up
> Feb 01 14:18:29 virt-273 corosync[3567]:   [MAIN  ] Corosync built-in features: dbus systemd xmlconf vqsim nozzle snmp pie relro bindnow
> Feb 01 14:18:29 virt-273 corosync[3567]:   [TOTEM ] Initializing transport (Kronosnet).
> Feb 01 14:18:29 virt-273 corosync[3567]:   [TOTEM ] totemknet initialized
> Feb 01 14:18:29 virt-273 corosync[3567]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib64/kronosnet/crypto_nss.so
> Feb 01 14:18:29 virt-273 corosync[3567]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
> Feb 01 14:18:29 virt-273 corosync[3567]:   [QB    ] server name: cmap
> Feb 01 14:18:29 virt-273 corosync[3567]:   [SERV  ] Service engine loaded: corosync configuration service [1]
> Feb 01 14:18:29 virt-273 corosync[3567]:   [QB    ] server name: cfg
> Feb 01 14:18:29 virt-273 corosync[3567]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
> Feb 01 14:18:29 virt-273 corosync[3567]:   [QB    ] server name: cpg
> Feb 01 14:18:29 virt-273 corosync[3567]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
> Feb 01 14:18:29 virt-273 corosync[3567]:   [QUORUM] Using quorum provider corosync_votequorum
> Feb 01 14:18:29 virt-273 corosync[3567]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
> Feb 01 14:18:29 virt-273 corosync[3567]:   [QB    ] server name: votequorum
> Feb 01 14:18:29 virt-273 corosync[3567]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
> Feb 01 14:18:29 virt-273 corosync[3567]:   [QB    ] server name: quorum
> Feb 01 14:18:29 virt-273 corosync[3567]:   [TOTEM ] Configuring link 0
> Feb 01 14:18:29 virt-273 corosync[3567]:   [TOTEM ] Configured link number 0: local addr: 2620:52:0:25a4:1800:ff:fe00:111, port=5405
> Feb 01 14:18:29 virt-273 corosync[3567]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:18:29 virt-273 corosync[3567]:   [KNET  ] host: host: 1 has no active links
> Feb 01 14:18:29 virt-273 corosync[3567]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:18:29 virt-273 corosync[3567]:   [KNET  ] host: host: 1 has no active links
> Feb 01 14:18:29 virt-273 corosync[3567]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:18:29 virt-273 corosync[3567]:   [KNET  ] host: host: 1 has no active links
> Feb 01 14:18:29 virt-273 corosync[3567]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 0)
> Feb 01 14:18:29 virt-273 corosync[3567]:   [KNET  ] host: host: 2 has no active links
> Feb 01 14:18:29 virt-273 corosync[3567]:   [QUORUM] Sync members[1]: 2
> Feb 01 14:18:29 virt-273 corosync[3567]:   [QUORUM] Sync joined[1]: 2
> Feb 01 14:18:29 virt-273 corosync[3567]:   [TOTEM ] A new membership (2.6f) was formed. Members joined: 2
> Feb 01 14:18:29 virt-273 corosync[3567]:   [QUORUM] Members[1]: 2
> Feb 01 14:18:29 virt-273 corosync[3567]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Feb 01 14:18:29 virt-273 systemd[1]: Started Corosync Cluster Engine.
> Feb 01 14:18:31 virt-273 corosync[3567]:   [KNET  ] rx: host: 1 link: 0 is up
> Feb 01 14:18:31 virt-273 corosync[3567]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:18:31 virt-273 corosync[3567]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 1157 to 1365
> Feb 01 14:18:31 virt-273 corosync[3567]:   [KNET  ] pmtud: Global data MTU changed to: 1365
> Feb 01 14:18:31 virt-273 corosync[3567]:   [QUORUM] Sync members[2]: 1 2
> Feb 01 14:18:31 virt-273 corosync[3567]:   [QUORUM] Sync joined[1]: 1
> Feb 01 14:18:31 virt-273 corosync[3567]:   [TOTEM ] A new membership (1.73) was formed. Members joined: 1
> Feb 01 14:18:32 virt-273 systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:32 virt-273 pacemakerd[3577]:  notice: Additional logging available in /var/log/pacemaker/pacemaker.log
> Feb 01 14:18:32 virt-273 pacemakerd[3577]:  notice: Starting Pacemaker 2.1.2-2.el9
> Feb 01 14:18:32 virt-273 pacemakerd[3577]:  crit: Could not enable Corosync CFG shutdown tracker: CS_ERR_TRY_AGAIN
> Feb 01 14:18:32 virt-273 systemd[1]: pacemaker.service: Main process exited, code=exited, status=76/PROTOCOL
> Feb 01 14:18:32 virt-273 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
> Feb 01 14:18:32 virt-273 systemd[1]: pacemaker.service: Scheduled restart job, restart counter is at 1.
> Feb 01 14:18:32 virt-273 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:32 virt-273 systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:32 virt-273 pacemakerd[3579]:  notice: Additional logging available in /var/log/pacemaker/pacemaker.log
> Feb 01 14:18:32 virt-273 pacemakerd[3579]:  notice: Starting Pacemaker 2.1.2-2.el9
> Feb 01 14:18:32 virt-273 pacemakerd[3579]:  crit: Could not enable Corosync CFG shutdown tracker: CS_ERR_TRY_AGAIN
> Feb 01 14:18:32 virt-273 systemd[1]: pacemaker.service: Main process exited, code=exited, status=76/PROTOCOL
> Feb 01 14:18:32 virt-273 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Scheduled restart job, restart counter is at 2.
> Feb 01 14:18:33 virt-273 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:33 virt-273 systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:33 virt-273 pacemakerd[3580]:  notice: Additional logging available in /var/log/pacemaker/pacemaker.log
> Feb 01 14:18:33 virt-273 pacemakerd[3580]:  notice: Starting Pacemaker 2.1.2-2.el9
> Feb 01 14:18:33 virt-273 pacemakerd[3580]:  crit: Could not enable Corosync CFG shutdown tracker: CS_ERR_TRY_AGAIN
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Main process exited, code=exited, status=76/PROTOCOL
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Scheduled restart job, restart counter is at 3.
> Feb 01 14:18:33 virt-273 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:33 virt-273 systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:33 virt-273 pacemakerd[3581]:  notice: Additional logging available in /var/log/pacemaker/pacemaker.log
> Feb 01 14:18:33 virt-273 pacemakerd[3581]:  notice: Starting Pacemaker 2.1.2-2.el9
> Feb 01 14:18:33 virt-273 pacemakerd[3581]:  crit: Could not enable Corosync CFG shutdown tracker: CS_ERR_TRY_AGAIN
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Main process exited, code=exited, status=76/PROTOCOL
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Scheduled restart job, restart counter is at 4.
> Feb 01 14:18:33 virt-273 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:33 virt-273 systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:33 virt-273 pacemakerd[3582]:  notice: Additional logging available in /var/log/pacemaker/pacemaker.log
> Feb 01 14:18:33 virt-273 pacemakerd[3582]:  notice: Starting Pacemaker 2.1.2-2.el9
> Feb 01 14:18:33 virt-273 pacemakerd[3582]:  crit: Could not enable Corosync CFG shutdown tracker: CS_ERR_TRY_AGAIN
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Main process exited, code=exited, status=76/PROTOCOL
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Scheduled restart job, restart counter is at 5.
> Feb 01 14:18:33 virt-273 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Start request repeated too quickly.
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
> Feb 01 14:18:33 virt-273 systemd[1]: Failed to start Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:41 virt-273 corosync[3567]:   [QUORUM] This node is within the primary component and will provide service.
> Feb 01 14:18:41 virt-273 corosync[3567]:   [QUORUM] Members[2]: 1 2
> Feb 01 14:18:41 virt-273 corosync[3567]:   [MAIN  ] Completed service synchronization, ready to provide service.

Result: pacemaker.service repeatedly fails with CS_ERR_TRY_AGAIN when started while corosync is still establishing quorum (delayed here by the slow heuristics script); after five rapid failures systemd hits its start rate limit ("Start request repeated too quickly") and gives up.
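
On the affected build, returning the node to the cluster then takes manual intervention along these lines (once corosync has established quorum; a sketch, not taken from the original report):

    # clear systemd's start-rate-limit failure state, then start pacemaker again
    systemctl reset-failed pacemaker.service
    systemctl start pacemaker.service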


after fix (pacemaker-2.1.2-3.el9)
=================================

> [root@virt-007 ~]# journalctl -u corosync -u corosync-qdevice -u pacemaker -f
> Feb 01 14:34:42 virt-007 systemd[1]: Starting Corosync Cluster Engine...
> Feb 01 14:34:42 virt-007 corosync[129907]:   [MAIN  ] Corosync Cluster Engine 3.1.5 starting up
> Feb 01 14:34:42 virt-007 corosync[129907]:   [MAIN  ] Corosync built-in features: dbus systemd xmlconf vqsim nozzle snmp pie relro bindnow
> Feb 01 14:34:42 virt-007 corosync[129907]:   [TOTEM ] Initializing transport (Kronosnet).
> Feb 01 14:34:43 virt-007 corosync[129907]:   [TOTEM ] totemknet initialized
> Feb 01 14:34:43 virt-007 corosync[129907]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib64/kronosnet/crypto_nss.so
> Feb 01 14:34:43 virt-007 corosync[129907]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
> Feb 01 14:34:43 virt-007 corosync[129907]:   [QB    ] server name: cmap
> Feb 01 14:34:43 virt-007 corosync[129907]:   [SERV  ] Service engine loaded: corosync configuration service [1]
> Feb 01 14:34:43 virt-007 corosync[129907]:   [QB    ] server name: cfg
> Feb 01 14:34:43 virt-007 corosync[129907]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
> Feb 01 14:34:43 virt-007 corosync[129907]:   [QB    ] server name: cpg
> Feb 01 14:34:43 virt-007 corosync[129907]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
> Feb 01 14:34:43 virt-007 corosync[129907]:   [QUORUM] Using quorum provider corosync_votequorum
> Feb 01 14:34:43 virt-007 corosync[129907]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
> Feb 01 14:34:43 virt-007 corosync[129907]:   [QB    ] server name: votequorum
> Feb 01 14:34:43 virt-007 corosync[129907]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
> Feb 01 14:34:43 virt-007 corosync[129907]:   [QB    ] server name: quorum
> Feb 01 14:34:43 virt-007 corosync[129907]:   [TOTEM ] Configuring link 0
> Feb 01 14:34:43 virt-007 corosync[129907]:   [TOTEM ] Configured link number 0: local addr: 2620:52:0:25a4:1800:ff:fe00:7, port=5405
> Feb 01 14:34:43 virt-007 corosync[129907]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:34:43 virt-007 corosync[129907]:   [KNET  ] host: host: 1 has no active links
> Feb 01 14:34:43 virt-007 corosync[129907]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:34:43 virt-007 corosync[129907]:   [KNET  ] host: host: 1 has no active links
> Feb 01 14:34:43 virt-007 corosync[129907]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:34:43 virt-007 corosync[129907]:   [KNET  ] host: host: 1 has no active links
> Feb 01 14:34:43 virt-007 corosync[129907]:   [QUORUM] Sync members[1]: 2
> Feb 01 14:34:43 virt-007 corosync[129907]:   [QUORUM] Sync joined[1]: 2
> Feb 01 14:34:43 virt-007 corosync[129907]:   [TOTEM ] A new membership (2.1b) was formed. Members joined: 2
> Feb 01 14:34:43 virt-007 corosync[129907]:   [QUORUM] Members[1]: 2
> Feb 01 14:34:43 virt-007 corosync[129907]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Feb 01 14:34:43 virt-007 systemd[1]: Started Corosync Cluster Engine.
> Feb 01 14:34:44 virt-007 corosync[129907]:   [KNET  ] rx: host: 1 link: 0 is up
> Feb 01 14:34:44 virt-007 corosync[129907]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:34:45 virt-007 corosync[129907]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 1157 to 1365
> Feb 01 14:34:45 virt-007 corosync[129907]:   [KNET  ] pmtud: Global data MTU changed to: 1365
> Feb 01 14:34:45 virt-007 corosync[129907]:   [QUORUM] Sync members[2]: 1 2
> Feb 01 14:34:45 virt-007 corosync[129907]:   [QUORUM] Sync joined[1]: 1
> Feb 01 14:34:45 virt-007 corosync[129907]:   [TOTEM ] A new membership (1.1f) was formed. Members joined: 1
> Feb 01 14:34:46 virt-007 systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Feb 01 14:34:46 virt-007 pacemakerd[129921]:  notice: Additional logging available in /var/log/pacemaker/pacemaker.log
> Feb 01 14:34:46 virt-007 pacemakerd[129921]:  notice: Starting Pacemaker 2.1.2-3.el9
> Feb 01 14:34:56 virt-007 corosync[129907]:   [QUORUM] This node is within the primary component and will provide service.
> Feb 01 14:34:56 virt-007 corosync[129907]:   [QUORUM] Members[2]: 1 2
> Feb 01 14:34:56 virt-007 corosync[129907]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Feb 01 14:35:01 virt-007 pacemakerd[129921]:  notice: Pacemaker daemon successfully started and accepting connections

Result: pacemaker.service no longer exits with a fatal error while corosync is establishing quorum, but patiently retries until startup succeeds.

Comment 12 errata-xmlrpc 2022-05-17 12:20:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (new packages: pacemaker), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2293

