Bug 2042367
| Summary: | pacemaker.service fails to start if enabling corosync CFG shutdown tracker returns CS_ERR_TRY_AGAIN | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Patrik Hagara <phagara> |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 9.0 | CC: | cluster-maint, msmazova, nwahl, sbradley |
| Target Milestone: | rc | Keywords: | Triaged |
| Target Release: | 9.0 | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | pacemaker-2.1.2-3.el9 | Doc Type: | Bug Fix |
| Doc Text: | Cause: In some network configurations, Pacemaker can fail to register with Corosync's shutdown tracker on the first attempt.<br>Consequence: Pacemaker exits and is respawned repeatedly until systemd gives up, and manual intervention is required to return the node to the cluster.<br>Fix: Pacemaker now retries the connection if the first attempt fails.<br>Result: The node is able to rejoin the cluster automatically. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| | 2042550 | Environment: | |
| Last Closed: | 2022-05-17 12:20:40 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2042550 | | |
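
Per the Doc Text, the fix is simply to retry the shutdown-tracker registration when Corosync answers CS_ERR_TRY_AGAIN, instead of treating the first failure as fatal. Below is a minimal C sketch of that retry pattern against the Corosync CFG API; it is not the actual Pacemaker patch, and the helper name, retry limit, and one-second backoff are illustrative assumptions.

```c
/*
 * Minimal sketch of the retry described in the Doc Text above: treat
 * CS_ERR_TRY_AGAIN from the Corosync CFG shutdown-tracker registration
 * as transient and retry instead of exiting fatally. Illustrative only,
 * not the actual Pacemaker change: the helper name, retry limit, and
 * one-second backoff are assumptions.
 */
#include <unistd.h>         /* sleep() */
#include <corosync/cfg.h>   /* corosync_cfg_trackstart(), cs_error_t */

static cs_error_t
register_shutdown_tracker(corosync_cfg_handle_t cfg_handle, int max_retries)
{
    cs_error_t rc = CS_OK;

    for (int attempt = 0; attempt <= max_retries; attempt++) {
        /* Ask corosync to notify us (and wait for our reply) at shutdown */
        rc = corosync_cfg_trackstart(cfg_handle, 0);

        if (rc != CS_ERR_TRY_AGAIN) {
            break;  /* success, or an error that retrying won't fix */
        }
        /* Corosync is busy (e.g. still synchronizing membership after a
         * quorum change); back off briefly before trying again. */
        sleep(1);
    }
    return rc;
}
```

With this pattern, a pacemakerd start that races with corosync's membership and quorum synchronization (as in the reproducer below) retries until the call succeeds, so pacemaker.service no longer exits with status 76/PROTOCOL and systemd never hits its start-rate limit.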
Comment 1
Ken Gaillot
2022-01-19 15:18:12 UTC
env: 2-node cluster with qdevice/qnetd in lms mode and a slow (e.g. `sleep 10`) heuristics script configured

reproducer:
1. pcs cluster stop node2 --wait
2. on node2: systemctl start corosync && sleep 3 && systemctl start pacemaker
   (or just fence the node, if fencing is configured to reboot and cluster services to start on boot)

before fix (pacemaker-2.1.2-2.el9)
==================================

> [root@virt-273 ~]# journalctl -u corosync -u corosync-qdevice -u pacemaker -f
> Feb 01 14:18:29 virt-273 systemd[1]: Starting Corosync Cluster Engine...
> Feb 01 14:18:29 virt-273 corosync[3567]: [MAIN ] Corosync Cluster Engine 3.1.5 starting up
> Feb 01 14:18:29 virt-273 corosync[3567]: [MAIN ] Corosync built-in features: dbus systemd xmlconf vqsim nozzle snmp pie relro bindnow
> Feb 01 14:18:29 virt-273 corosync[3567]: [TOTEM ] Initializing transport (Kronosnet).
> Feb 01 14:18:29 virt-273 corosync[3567]: [TOTEM ] totemknet initialized
> Feb 01 14:18:29 virt-273 corosync[3567]: [KNET ] common: crypto_nss.so has been loaded from /usr/lib64/kronosnet/crypto_nss.so
> Feb 01 14:18:29 virt-273 corosync[3567]: [SERV ] Service engine loaded: corosync configuration map access [0]
> Feb 01 14:18:29 virt-273 corosync[3567]: [QB ] server name: cmap
> Feb 01 14:18:29 virt-273 corosync[3567]: [SERV ] Service engine loaded: corosync configuration service [1]
> Feb 01 14:18:29 virt-273 corosync[3567]: [QB ] server name: cfg
> Feb 01 14:18:29 virt-273 corosync[3567]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
> Feb 01 14:18:29 virt-273 corosync[3567]: [QB ] server name: cpg
> Feb 01 14:18:29 virt-273 corosync[3567]: [SERV ] Service engine loaded: corosync profile loading service [4]
> Feb 01 14:18:29 virt-273 corosync[3567]: [QUORUM] Using quorum provider corosync_votequorum
> Feb 01 14:18:29 virt-273 corosync[3567]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
> Feb 01 14:18:29 virt-273 corosync[3567]: [QB ] server name: votequorum
> Feb 01 14:18:29 virt-273 corosync[3567]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
> Feb 01 14:18:29 virt-273 corosync[3567]: [QB ] server name: quorum
> Feb 01 14:18:29 virt-273 corosync[3567]: [TOTEM ] Configuring link 0
> Feb 01 14:18:29 virt-273 corosync[3567]: [TOTEM ] Configured link number 0: local addr: 2620:52:0:25a4:1800:ff:fe00:111, port=5405
> Feb 01 14:18:29 virt-273 corosync[3567]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:18:29 virt-273 corosync[3567]: [KNET ] host: host: 1 has no active links
> Feb 01 14:18:29 virt-273 corosync[3567]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:18:29 virt-273 corosync[3567]: [KNET ] host: host: 1 has no active links
> Feb 01 14:18:29 virt-273 corosync[3567]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:18:29 virt-273 corosync[3567]: [KNET ] host: host: 1 has no active links
> Feb 01 14:18:29 virt-273 corosync[3567]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 0)
> Feb 01 14:18:29 virt-273 corosync[3567]: [KNET ] host: host: 2 has no active links
> Feb 01 14:18:29 virt-273 corosync[3567]: [QUORUM] Sync members[1]: 2
> Feb 01 14:18:29 virt-273 corosync[3567]: [QUORUM] Sync joined[1]: 2
> Feb 01 14:18:29 virt-273 corosync[3567]: [TOTEM ] A new membership (2.6f) was formed. Members joined: 2
> Feb 01 14:18:29 virt-273 corosync[3567]: [QUORUM] Members[1]: 2
> Feb 01 14:18:29 virt-273 corosync[3567]: [MAIN ] Completed service synchronization, ready to provide service.
> Feb 01 14:18:29 virt-273 systemd[1]: Started Corosync Cluster Engine.
> Feb 01 14:18:31 virt-273 corosync[3567]: [KNET ] rx: host: 1 link: 0 is up
> Feb 01 14:18:31 virt-273 corosync[3567]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:18:31 virt-273 corosync[3567]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 1157 to 1365
> Feb 01 14:18:31 virt-273 corosync[3567]: [KNET ] pmtud: Global data MTU changed to: 1365
> Feb 01 14:18:31 virt-273 corosync[3567]: [QUORUM] Sync members[2]: 1 2
> Feb 01 14:18:31 virt-273 corosync[3567]: [QUORUM] Sync joined[1]: 1
> Feb 01 14:18:31 virt-273 corosync[3567]: [TOTEM ] A new membership (1.73) was formed. Members joined: 1
> Feb 01 14:18:32 virt-273 systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:32 virt-273 pacemakerd[3577]: notice: Additional logging available in /var/log/pacemaker/pacemaker.log
> Feb 01 14:18:32 virt-273 pacemakerd[3577]: notice: Starting Pacemaker 2.1.2-2.el9
> Feb 01 14:18:32 virt-273 pacemakerd[3577]: crit: Could not enable Corosync CFG shutdown tracker: CS_ERR_TRY_AGAIN
> Feb 01 14:18:32 virt-273 systemd[1]: pacemaker.service: Main process exited, code=exited, status=76/PROTOCOL
> Feb 01 14:18:32 virt-273 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
> Feb 01 14:18:32 virt-273 systemd[1]: pacemaker.service: Scheduled restart job, restart counter is at 1.
> Feb 01 14:18:32 virt-273 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:32 virt-273 systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:32 virt-273 pacemakerd[3579]: notice: Additional logging available in /var/log/pacemaker/pacemaker.log
> Feb 01 14:18:32 virt-273 pacemakerd[3579]: notice: Starting Pacemaker 2.1.2-2.el9
> Feb 01 14:18:32 virt-273 pacemakerd[3579]: crit: Could not enable Corosync CFG shutdown tracker: CS_ERR_TRY_AGAIN
> Feb 01 14:18:32 virt-273 systemd[1]: pacemaker.service: Main process exited, code=exited, status=76/PROTOCOL
> Feb 01 14:18:32 virt-273 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Scheduled restart job, restart counter is at 2.
> Feb 01 14:18:33 virt-273 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:33 virt-273 systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:33 virt-273 pacemakerd[3580]: notice: Additional logging available in /var/log/pacemaker/pacemaker.log
> Feb 01 14:18:33 virt-273 pacemakerd[3580]: notice: Starting Pacemaker 2.1.2-2.el9
> Feb 01 14:18:33 virt-273 pacemakerd[3580]: crit: Could not enable Corosync CFG shutdown tracker: CS_ERR_TRY_AGAIN
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Main process exited, code=exited, status=76/PROTOCOL
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Scheduled restart job, restart counter is at 3.
> Feb 01 14:18:33 virt-273 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:33 virt-273 systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:33 virt-273 pacemakerd[3581]: notice: Additional logging available in /var/log/pacemaker/pacemaker.log
> Feb 01 14:18:33 virt-273 pacemakerd[3581]: notice: Starting Pacemaker 2.1.2-2.el9
> Feb 01 14:18:33 virt-273 pacemakerd[3581]: crit: Could not enable Corosync CFG shutdown tracker: CS_ERR_TRY_AGAIN
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Main process exited, code=exited, status=76/PROTOCOL
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Scheduled restart job, restart counter is at 4.
> Feb 01 14:18:33 virt-273 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:33 virt-273 systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:33 virt-273 pacemakerd[3582]: notice: Additional logging available in /var/log/pacemaker/pacemaker.log
> Feb 01 14:18:33 virt-273 pacemakerd[3582]: notice: Starting Pacemaker 2.1.2-2.el9
> Feb 01 14:18:33 virt-273 pacemakerd[3582]: crit: Could not enable Corosync CFG shutdown tracker: CS_ERR_TRY_AGAIN
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Main process exited, code=exited, status=76/PROTOCOL
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Scheduled restart job, restart counter is at 5.
> Feb 01 14:18:33 virt-273 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Start request repeated too quickly.
> Feb 01 14:18:33 virt-273 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
> Feb 01 14:18:33 virt-273 systemd[1]: Failed to start Pacemaker High Availability Cluster Manager.
> Feb 01 14:18:41 virt-273 corosync[3567]: [QUORUM] This node is within the primary component and will provide service.
> Feb 01 14:18:41 virt-273 corosync[3567]: [QUORUM] Members[2]: 1 2
> Feb 01 14:18:41 virt-273 corosync[3567]: [MAIN ] Completed service synchronization, ready to provide service.

Result: pacemaker.service repeatedly fails with CS_ERR_TRY_AGAIN immediately after a start attempt while corosync is in the process of establishing quorum (due to the slow heuristics script).

after fix (pacemaker-2.1.2-3.el9)
=================================

> [root@virt-007 ~]# journalctl -u corosync -u corosync-qdevice -u pacemaker -f
> Feb 01 14:34:42 virt-007 systemd[1]: Starting Corosync Cluster Engine...
> Feb 01 14:34:42 virt-007 corosync[129907]: [MAIN ] Corosync Cluster Engine 3.1.5 starting up
> Feb 01 14:34:42 virt-007 corosync[129907]: [MAIN ] Corosync built-in features: dbus systemd xmlconf vqsim nozzle snmp pie relro bindnow
> Feb 01 14:34:42 virt-007 corosync[129907]: [TOTEM ] Initializing transport (Kronosnet).
> Feb 01 14:34:43 virt-007 corosync[129907]: [TOTEM ] totemknet initialized
> Feb 01 14:34:43 virt-007 corosync[129907]: [KNET ] common: crypto_nss.so has been loaded from /usr/lib64/kronosnet/crypto_nss.so
> Feb 01 14:34:43 virt-007 corosync[129907]: [SERV ] Service engine loaded: corosync configuration map access [0]
> Feb 01 14:34:43 virt-007 corosync[129907]: [QB ] server name: cmap
> Feb 01 14:34:43 virt-007 corosync[129907]: [SERV ] Service engine loaded: corosync configuration service [1]
> Feb 01 14:34:43 virt-007 corosync[129907]: [QB ] server name: cfg
> Feb 01 14:34:43 virt-007 corosync[129907]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
> Feb 01 14:34:43 virt-007 corosync[129907]: [QB ] server name: cpg
> Feb 01 14:34:43 virt-007 corosync[129907]: [SERV ] Service engine loaded: corosync profile loading service [4]
> Feb 01 14:34:43 virt-007 corosync[129907]: [QUORUM] Using quorum provider corosync_votequorum
> Feb 01 14:34:43 virt-007 corosync[129907]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
> Feb 01 14:34:43 virt-007 corosync[129907]: [QB ] server name: votequorum
> Feb 01 14:34:43 virt-007 corosync[129907]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
> Feb 01 14:34:43 virt-007 corosync[129907]: [QB ] server name: quorum
> Feb 01 14:34:43 virt-007 corosync[129907]: [TOTEM ] Configuring link 0
> Feb 01 14:34:43 virt-007 corosync[129907]: [TOTEM ] Configured link number 0: local addr: 2620:52:0:25a4:1800:ff:fe00:7, port=5405
> Feb 01 14:34:43 virt-007 corosync[129907]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:34:43 virt-007 corosync[129907]: [KNET ] host: host: 1 has no active links
> Feb 01 14:34:43 virt-007 corosync[129907]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:34:43 virt-007 corosync[129907]: [KNET ] host: host: 1 has no active links
> Feb 01 14:34:43 virt-007 corosync[129907]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:34:43 virt-007 corosync[129907]: [KNET ] host: host: 1 has no active links
> Feb 01 14:34:43 virt-007 corosync[129907]: [QUORUM] Sync members[1]: 2
> Feb 01 14:34:43 virt-007 corosync[129907]: [QUORUM] Sync joined[1]: 2
> Feb 01 14:34:43 virt-007 corosync[129907]: [TOTEM ] A new membership (2.1b) was formed. Members joined: 2
> Feb 01 14:34:43 virt-007 corosync[129907]: [QUORUM] Members[1]: 2
> Feb 01 14:34:43 virt-007 corosync[129907]: [MAIN ] Completed service synchronization, ready to provide service.
> Feb 01 14:34:43 virt-007 systemd[1]: Started Corosync Cluster Engine.
> Feb 01 14:34:44 virt-007 corosync[129907]: [KNET ] rx: host: 1 link: 0 is up
> Feb 01 14:34:44 virt-007 corosync[129907]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 14:34:45 virt-007 corosync[129907]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 1157 to 1365
> Feb 01 14:34:45 virt-007 corosync[129907]: [KNET ] pmtud: Global data MTU changed to: 1365
> Feb 01 14:34:45 virt-007 corosync[129907]: [QUORUM] Sync members[2]: 1 2
> Feb 01 14:34:45 virt-007 corosync[129907]: [QUORUM] Sync joined[1]: 1
> Feb 01 14:34:45 virt-007 corosync[129907]: [TOTEM ] A new membership (1.1f) was formed. Members joined: 1
> Feb 01 14:34:46 virt-007 systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Feb 01 14:34:46 virt-007 pacemakerd[129921]: notice: Additional logging available in /var/log/pacemaker/pacemaker.log
> Feb 01 14:34:46 virt-007 pacemakerd[129921]: notice: Starting Pacemaker 2.1.2-3.el9
> Feb 01 14:34:56 virt-007 corosync[129907]: [QUORUM] This node is within the primary component and will provide service.
> Feb 01 14:34:56 virt-007 corosync[129907]: [QUORUM] Members[2]: 1 2
> Feb 01 14:34:56 virt-007 corosync[129907]: [MAIN ] Completed service synchronization, ready to provide service.
> Feb 01 14:35:01 virt-007 pacemakerd[129921]: notice: Pacemaker daemon successfully started and accepting connections

Result: pacemaker.service does not exit with a fatal error while corosync is establishing quorum, but patiently retries.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (new packages: pacemaker), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2293