Bug 2042550
| Summary: | pacemaker.service fails to start if enabling corosync CFG shutdown tracker returns CS_ERR_TRY_AGAIN | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Ken Gaillot <kgaillot> |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | ||
| Version: | 8.6 | CC: | cluster-maint, cluster-qe, msmazova, nwahl, phagara, sbradley |
| Target Milestone: | rc | Keywords: | Triaged |
| Target Release: | 8.6 | Flags: | pm-rhel: mirror+ |
| Hardware: | All | ||
| OS: | All | ||
| Fixed In Version: | pacemaker-2.1.2-3.el8 | Doc Type: | Bug Fix |
| Doc Text: | Cause: In some network configurations, Pacemaker can fail to register with Corosync's shutdown tracker on the first attempt. Consequence: Pacemaker exits and is respawned repeatedly until systemd gives up, and manual intervention is required to return the node to the cluster. Fix: Pacemaker now retries the connection if the first attempt fails. Result: The node is able to rejoin the cluster automatically. | Story Points: | --- |
| Clone Of: | 2042367 | Environment: | |
| Last Closed: | 2022-05-10 14:10:03 UTC | Type: | Bug |
| Bug Depends On: | 2042367 | ||
Description
Ken Gaillot
2022-01-19 17:08:26 UTC
env: 2-node cluster with qdevice/qnetd in lms mode with slow heuristics (e.g. `sleep 10`) configured

reproducer:
1. pcs cluster stop node2 --wait
2. pcs cluster start node2 --wait

before fix (pacemaker-2.1.2-2.el8)
==================================

> [root@virt-058 ~]# journalctl -u corosync -u corosync-qdevice -u pacemaker -f
> Feb 01 13:19:22 virt-058 systemd[1]: Starting Corosync Cluster Engine...
> Feb 01 13:19:22 virt-058 corosync[113248]: [MAIN ] Corosync Cluster Engine 3.1.5 starting up
> Feb 01 13:19:22 virt-058 corosync[113248]: [MAIN ] Corosync built-in features: dbus systemd xmlconf vqsim nozzle snmp pie relro bindnow
> Feb 01 13:19:22 virt-058 corosync[113248]: [TOTEM ] Initializing transport (Kronosnet).
> Feb 01 13:19:22 virt-058 corosync[113248]: [TOTEM ] totemknet initialized
> Feb 01 13:19:22 virt-058 corosync[113248]: [KNET ] common: crypto_nss.so has been loaded from /usr/lib64/kronosnet/crypto_nss.so
> Feb 01 13:19:22 virt-058 corosync[113248]: [SERV ] Service engine loaded: corosync configuration map access [0]
> Feb 01 13:19:22 virt-058 corosync[113248]: [QB ] server name: cmap
> Feb 01 13:19:22 virt-058 corosync[113248]: [SERV ] Service engine loaded: corosync configuration service [1]
> Feb 01 13:19:22 virt-058 corosync[113248]: [QB ] server name: cfg
> Feb 01 13:19:22 virt-058 corosync[113248]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
> Feb 01 13:19:22 virt-058 corosync[113248]: [QB ] server name: cpg
> Feb 01 13:19:22 virt-058 corosync[113248]: [SERV ] Service engine loaded: corosync profile loading service [4]
> Feb 01 13:19:22 virt-058 corosync[113248]: [QUORUM] Using quorum provider corosync_votequorum
> Feb 01 13:19:22 virt-058 corosync[113248]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
> Feb 01 13:19:22 virt-058 corosync[113248]: [QB ] server name: votequorum
> Feb 01 13:19:22 virt-058 corosync[113248]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
> Feb 01 13:19:22 virt-058 corosync[113248]: [QB ] server name: quorum
> Feb 01 13:19:22 virt-058 corosync[113248]: [TOTEM ] Configuring link 0
> Feb 01 13:19:22 virt-058 corosync[113248]: [TOTEM ] Configured link number 0: local addr: 2620:52:0:25a4:1800:ff:fe00:3a, port=5405
> Feb 01 13:19:22 virt-058 corosync[113248]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 13:19:22 virt-058 corosync[113248]: [KNET ] host: host: 1 has no active links
> Feb 01 13:19:22 virt-058 corosync[113248]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 13:19:22 virt-058 corosync[113248]: [KNET ] host: host: 1 has no active links
> Feb 01 13:19:22 virt-058 corosync[113248]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 13:19:22 virt-058 corosync[113248]: [KNET ] host: host: 1 has no active links
> Feb 01 13:19:22 virt-058 corosync[113248]: [QUORUM] Sync members[1]: 2
> Feb 01 13:19:22 virt-058 corosync[113248]: [QUORUM] Sync joined[1]: 2
> Feb 01 13:19:22 virt-058 corosync[113248]: [TOTEM ] A new membership (2.43) was formed. Members joined: 2
> Feb 01 13:19:22 virt-058 corosync[113248]: [QUORUM] Members[1]: 2
> Feb 01 13:19:22 virt-058 corosync[113248]: [MAIN ] Completed service synchronization, ready to provide service.
> Feb 01 13:19:22 virt-058 systemd[1]: Started Corosync Cluster Engine.
> Feb 01 13:19:22 virt-058 systemd[1]: Starting Corosync Qdevice daemon...
> Feb 01 13:19:24 virt-058 corosync[113248]: [KNET ] rx: host: 1 link: 0 is up
> Feb 01 13:19:24 virt-058 corosync[113248]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 13:19:24 virt-058 corosync[113248]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 1157 to 1365
> Feb 01 13:19:24 virt-058 corosync[113248]: [KNET ] pmtud: Global data MTU changed to: 1365
> Feb 01 13:19:24 virt-058 corosync[113248]: [QUORUM] Sync members[2]: 1 2
> Feb 01 13:19:24 virt-058 corosync[113248]: [QUORUM] Sync joined[1]: 1
> Feb 01 13:19:24 virt-058 corosync[113248]: [TOTEM ] A new membership (1.47) was formed. Members joined: 1
> Feb 01 13:19:35 virt-058 systemd[1]: Started Corosync Qdevice daemon.
> Feb 01 13:19:35 virt-058 systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Feb 01 13:19:35 virt-058 pacemakerd[113270]: notice: Additional logging available in /var/log/pacemaker/pacemaker.log
> Feb 01 13:19:35 virt-058 pacemakerd[113270]: notice: Starting Pacemaker 2.1.2-2.el8
> Feb 01 13:19:35 virt-058 pacemakerd[113270]: crit: Could not enable Corosync CFG shutdown tracker: CS_ERR_TRY_AGAIN
> Feb 01 13:19:35 virt-058 systemd[1]: pacemaker.service: Main process exited, code=exited, status=76/PROTOCOL
> Feb 01 13:19:35 virt-058 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
> Feb 01 13:19:35 virt-058 systemd[1]: pacemaker.service: Service RestartSec=100ms expired, scheduling restart.
> Feb 01 13:19:35 virt-058 systemd[1]: pacemaker.service: Scheduled restart job, restart counter is at 1.
> Feb 01 13:19:35 virt-058 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
> Feb 01 13:19:35 virt-058 systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Feb 01 13:19:35 virt-058 pacemakerd[113273]: notice: Additional logging available in /var/log/pacemaker/pacemaker.log
> Feb 01 13:19:35 virt-058 pacemakerd[113273]: notice: Starting Pacemaker 2.1.2-2.el8
> Feb 01 13:19:35 virt-058 pacemakerd[113273]: crit: Could not enable Corosync CFG shutdown tracker: CS_ERR_TRY_AGAIN
> Feb 01 13:19:35 virt-058 systemd[1]: pacemaker.service: Main process exited, code=exited, status=76/PROTOCOL
> Feb 01 13:19:35 virt-058 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
> Feb 01 13:19:35 virt-058 corosync[113248]: [QUORUM] This node is within the primary component and will provide service.
> Feb 01 13:19:35 virt-058 corosync[113248]: [QUORUM] Members[2]: 1 2
> Feb 01 13:19:35 virt-058 corosync[113248]: [MAIN ] Completed service synchronization, ready to provide service.
> Feb 01 13:19:35 virt-058 systemd[1]: pacemaker.service: Service RestartSec=100ms expired, scheduling restart.
> Feb 01 13:19:35 virt-058 systemd[1]: pacemaker.service: Scheduled restart job, restart counter is at 2.
> Feb 01 13:19:35 virt-058 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
> Feb 01 13:19:35 virt-058 systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Feb 01 13:19:35 virt-058 pacemakerd[113275]: notice: Additional logging available in /var/log/pacemaker/pacemaker.log
> Feb 01 13:19:35 virt-058 pacemakerd[113275]: notice: Starting Pacemaker 2.1.2-2.el8
> Feb 01 13:19:35 virt-058 pacemakerd[113275]: notice: Pacemaker daemon successfully started and accepting connections

Result: pacemaker.service fails 2 times with "crit: Could not enable Corosync CFG shutdown tracker: CS_ERR_TRY_AGAIN" while corosync is establishing quorum, and starts successfully only on the third attempt.

after fix (pacemaker-2.1.2-3.el8)
=================================

> [root@virt-569 ~]# journalctl -u corosync -u corosync-qdevice -u pacemaker -f
> Feb 01 13:39:47 virt-569 systemd[1]: Starting Corosync Cluster Engine...
> Feb 01 13:39:47 virt-569 corosync[126292]: [MAIN ] Corosync Cluster Engine 3.1.5 starting up
> Feb 01 13:39:47 virt-569 corosync[126292]: [MAIN ] Corosync built-in features: dbus systemd xmlconf vqsim nozzle snmp pie relro bindnow
> Feb 01 13:39:47 virt-569 corosync[126292]: [TOTEM ] Initializing transport (Kronosnet).
> Feb 01 13:39:47 virt-569 corosync[126292]: [TOTEM ] totemknet initialized
> Feb 01 13:39:47 virt-569 corosync[126292]: [KNET ] common: crypto_nss.so has been loaded from /usr/lib64/kronosnet/crypto_nss.so
> Feb 01 13:39:47 virt-569 corosync[126292]: [SERV ] Service engine loaded: corosync configuration map access [0]
> Feb 01 13:39:47 virt-569 corosync[126292]: [QB ] server name: cmap
> Feb 01 13:39:47 virt-569 corosync[126292]: [SERV ] Service engine loaded: corosync configuration service [1]
> Feb 01 13:39:47 virt-569 corosync[126292]: [QB ] server name: cfg
> Feb 01 13:39:47 virt-569 corosync[126292]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
> Feb 01 13:39:47 virt-569 corosync[126292]: [QB ] server name: cpg
> Feb 01 13:39:47 virt-569 corosync[126292]: [SERV ] Service engine loaded: corosync profile loading service [4]
> Feb 01 13:39:47 virt-569 corosync[126292]: [QUORUM] Using quorum provider corosync_votequorum
> Feb 01 13:39:47 virt-569 corosync[126292]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
> Feb 01 13:39:47 virt-569 corosync[126292]: [QB ] server name: votequorum
> Feb 01 13:39:47 virt-569 corosync[126292]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
> Feb 01 13:39:47 virt-569 corosync[126292]: [QB ] server name: quorum
> Feb 01 13:39:47 virt-569 corosync[126292]: [TOTEM ] Configuring link 0
> Feb 01 13:39:47 virt-569 corosync[126292]: [TOTEM ] Configured link number 0: local addr: 2620:52:0:25a4:1800:ff:fe00:239, port=5405
> Feb 01 13:39:47 virt-569 corosync[126292]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 13:39:47 virt-569 corosync[126292]: [KNET ] host: host: 1 has no active links
> Feb 01 13:39:47 virt-569 corosync[126292]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 13:39:47 virt-569 corosync[126292]: [KNET ] host: host: 1 has no active links
> Feb 01 13:39:47 virt-569 corosync[126292]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 13:39:47 virt-569 systemd[1]: Started Corosync Cluster Engine.
> Feb 01 13:39:47 virt-569 corosync[126292]: [KNET ] host: host: 1 has no active links
> Feb 01 13:39:47 virt-569 corosync[126292]: [QUORUM] Sync members[1]: 2
> Feb 01 13:39:47 virt-569 corosync[126292]: [QUORUM] Sync joined[1]: 2
> Feb 01 13:39:47 virt-569 corosync[126292]: [TOTEM ] A new membership (2.2d) was formed. Members joined: 2
> Feb 01 13:39:47 virt-569 corosync[126292]: [QUORUM] Members[1]: 2
> Feb 01 13:39:47 virt-569 corosync[126292]: [MAIN ] Completed service synchronization, ready to provide service.
> Feb 01 13:39:47 virt-569 systemd[1]: Starting Corosync Qdevice daemon...
> Feb 01 13:39:49 virt-569 corosync[126292]: [KNET ] rx: host: 1 link: 0 is up
> Feb 01 13:39:49 virt-569 corosync[126292]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 01 13:39:49 virt-569 corosync[126292]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 1157 to 1365
> Feb 01 13:39:49 virt-569 corosync[126292]: [KNET ] pmtud: Global data MTU changed to: 1365
> Feb 01 13:39:50 virt-569 corosync[126292]: [QUORUM] Sync members[2]: 1 2
> Feb 01 13:39:50 virt-569 corosync[126292]: [QUORUM] Sync joined[1]: 1
> Feb 01 13:39:50 virt-569 corosync[126292]: [TOTEM ] A new membership (1.31) was formed. Members joined: 1
> Feb 01 13:40:00 virt-569 systemd[1]: Started Corosync Qdevice daemon.
> Feb 01 13:40:00 virt-569 systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Feb 01 13:40:00 virt-569 pacemakerd[126314]: notice: Additional logging available in /var/log/pacemaker/pacemaker.log
> Feb 01 13:40:00 virt-569 pacemakerd[126314]: notice: Starting Pacemaker 2.1.2-3.el8
> Feb 01 13:40:00 virt-569 corosync[126292]: [QUORUM] This node is within the primary component and will provide service.
> Feb 01 13:40:00 virt-569 corosync[126292]: [QUORUM] Members[2]: 1 2
> Feb 01 13:40:00 virt-569 corosync[126292]: [MAIN ] Completed service synchronization, ready to provide service.
> Feb 01 13:40:01 virt-569 pacemakerd[126314]: notice: Pacemaker daemon successfully started and accepting connections

Result: pacemaker.service does not exit with a fatal error while corosync is establishing quorum, but patiently retries.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1885
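
For readers who want the shape of the fix described in the Doc Text (retry instead of exiting when the shutdown tracker answers "try again"), here is a minimal sketch in Python. It is not Pacemaker's code, which is written in C against the Corosync CFG library. The status strings, `enable_with_retry`, `make_flaky_tracker`, the attempt limit, and the delay are all hypothetical names and values chosen for illustration; the bug report does not state how many times or how often Pacemaker retries.

```python
import time

# Hypothetical status values for illustration only. The real Corosync API
# reports cs_error_t codes such as CS_ERR_TRY_AGAIN (seen in the log above).
OK = "OK"
TRY_AGAIN = "TRY_AGAIN"


def enable_with_retry(enable, max_attempts=30, delay_s=1.0, sleep=time.sleep):
    """Call enable() until it stops returning TRY_AGAIN or attempts run out.

    Returns a (status, attempts_made) tuple. The limit and delay are
    assumptions, not values taken from Pacemaker.
    """
    status = TRY_AGAIN
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        status = enable()
        if status != TRY_AGAIN:
            break  # success or a non-transient error: stop retrying
        if attempts < max_attempts:
            sleep(delay_s)  # give the peer time to finish establishing quorum
    return status, attempts


def make_flaky_tracker(busy_calls):
    """Build a stand-in tracker that answers TRY_AGAIN busy_calls times, then OK."""
    remaining = [busy_calls]

    def enable():
        if remaining[0] > 0:
            remaining[0] -= 1
            return TRY_AGAIN
        return OK

    return enable


if __name__ == "__main__":
    # Mirrors the "before fix" log: two transient failures, then success.
    status, attempts = enable_with_retry(make_flaky_tracker(2), delay_s=0.0)
    print(status, attempts)
```

With a bounded loop like this, a transient failure costs a short wait inside the daemon rather than a process exit, so systemd's restart counter is not consumed while quorum is still being established.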