Bug 1989292
Summary: | On a three-node cluster if two nodes are hard-reset, sometimes the cluster ends up with unremovable transient attributes | |
---|---|---|---|
Product: | Red Hat Enterprise Linux 9 | Reporter: | Ken Gaillot <kgaillot> |
Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | cluster-qe <cluster-qe> |
Severity: | high | Docs Contact: | |
Priority: | high | |
Version: | 9.0 | CC: | cluster-maint, cluster-qe, eolivare, jeckersb, lmiccini, michele, msmazova, pkomarov |
Target Milestone: | beta | Keywords: | Triaged |
Target Release: | 9.0 Beta | |
Hardware: | All | |
OS: | All | |
Whiteboard: | | |
Fixed In Version: | pacemaker-2.1.0-9.el9 | Doc Type: | Bug Fix |
Doc Text: |
Cause: If the DC and another node leave the cluster at the same time, either node might be listed first in the notification from Corosync, and Pacemaker will process them in order.
Consequence: If the non-DC node is listed and processed first, its transient node attributes will not be cleared, leading to potential problems with resource agents or unfencing.
Fix: Pacemaker now sorts the Corosync notification so that the DC node is always first (a brief sketch of this ordering follows the metadata table below).
Result: Transient attributes are properly cleared when a node leaves the cluster, even if the DC leaves at the same time.
|
Story Points: | --- | |
Clone Of: | 1986998 | Environment: | |
Last Closed: | 2021-12-07 21:57:54 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | 2.1.2 |
Embargoed: | | |
Bug Depends On: | 1986998 | |
Bug Blocks: | 1983952 | |
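
To make the Doc Text above concrete, here is a minimal, hypothetical sketch (C with GLib, not code taken from the Pacemaker source) of the ordering idea: sort the list of peers reported as lost by Corosync so that the DC's departure is handled first. The departed_node_t struct, the node names, and compare_departed() are assumptions for illustration only.

    /* Illustrative sketch only: order a list of departed nodes so that the
     * DC is processed first, as described in the Doc Text above. */
    #include <glib.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct {
        const char *uname;   /* node name (hypothetical example data) */
        gboolean is_dc;      /* was this node the DC when it left? */
    } departed_node_t;

    /* GCompareFunc: the DC sorts before any non-DC node; ties fall back to
     * a name comparison just to keep the ordering deterministic here. */
    static gint
    compare_departed(gconstpointer a, gconstpointer b)
    {
        const departed_node_t *na = a;
        const departed_node_t *nb = b;

        if (na->is_dc != nb->is_dc) {
            return na->is_dc ? -1 : 1;   /* DC first */
        }
        return strcmp(na->uname, nb->uname);
    }

    int
    main(void)
    {
        /* Corosync might list the lower-ID non-DC node (virt-487) before
         * the DC (virt-492), as in the reproducer below. */
        departed_node_t n1 = { "virt-487", FALSE };
        departed_node_t n2 = { "virt-492", TRUE };

        GList *left = NULL;
        left = g_list_append(left, &n1);
        left = g_list_append(left, &n2);

        /* Sort so the DC's departure is handled before the other node's. */
        left = g_list_sort(left, compare_departed);

        for (GList *iter = left; iter != NULL; iter = iter->next) {
            const departed_node_t *n = iter->data;
            printf("process leave: %s%s\n", n->uname, n->is_dc ? " (DC)" : "");
        }
        g_list_free(left);
        return 0;
    }

With the DC ordered first, by the time the other node's departure is processed the cluster already knows the old DC is gone, so that node's transient attributes get cleared as described in the Result line above.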
Description
Ken Gaillot
2021-08-02 20:39:31 UTC
Fixed upstream as of commit ee7eba6

QA: RHOSP QA will test the 8.4.z equivalent of this bz, so this bz can be tested for regressions only. If you do want to reproduce it, it's straightforward:

1. Configure a cluster with at least 5 nodes (so quorum is retained if 2 are lost).
2. Choose two nodes: the DC node and a node with a lower Corosync node ID (if the DC has the lowest ID, just restart the cluster on that node, and another node will be elected DC).
3. Set a transient attribute on the non-DC node that was selected.
4. Kill both nodes, and wait for the cluster to fence them.

Before this change, the transient attribute on the non-DC node will persist across the reboot of that node. After this change, it will not.

Since this was originally a RHOSP-related bz, RHOSP has verified the corresponding 8.4.z Bug 1989622, and this can get sanity-only testing after the fix.

----------

> [root@virt-495 ~]# rpm -q pacemaker
> pacemaker-2.1.0-11.el9.x86_64

Set up a 5-node cluster:

> [root@virt-495 ~]# pcs status
> Cluster name: STSRHTS2503
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-492 (version 2.1.0-11.el9-7c3f660707) - partition with quorum
>   * Last updated: Wed Sep 22 15:59:31 2021
>   * Last change: Wed Sep 22 15:58:56 2021 by root via cibadmin on virt-487
>   * 5 nodes configured
>   * 5 resource instances configured
> Node List:
>   * Online: [ virt-487 virt-492 virt-493 virt-494 virt-495 ]
> Full List of Resources:
>   * fence-virt-487 (stonith:fence_xvm): Started virt-487
>   * fence-virt-492 (stonith:fence_xvm): Started virt-492
>   * fence-virt-493 (stonith:fence_xvm): Started virt-493
>   * fence-virt-494 (stonith:fence_xvm): Started virt-494
>   * fence-virt-495 (stonith:fence_xvm): Started virt-495
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled

Create a transient attribute on the node (virt-487) that has a lower Corosync node ID than the DC node (virt-492):

> [root@virt-495 ~]# crm_attribute --node virt-487 --name test_attribute --update test_1 --lifetime=reboot

> [root@virt-495 ~]# pcs status --full
> Cluster name: STSRHTS2503
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-492 (2) (version 2.1.0-11.el9-7c3f660707) - partition with quorum
>   * Last updated: Wed Sep 22 16:05:57 2021
>   * Last change: Wed Sep 22 15:58:56 2021 by root via cibadmin on virt-487
>   * 5 nodes configured
>   * 5 resource instances configured
> Node List:
>   * Online: [ virt-487 (1) virt-492 (2) virt-493 (3) virt-494 (4) virt-495 (5) ]
> Full List of Resources:
>   * fence-virt-487 (stonith:fence_xvm): Started virt-487
>   * fence-virt-492 (stonith:fence_xvm): Started virt-492
>   * fence-virt-493 (stonith:fence_xvm): Started virt-493
>   * fence-virt-494 (stonith:fence_xvm): Started virt-494
>   * fence-virt-495 (stonith:fence_xvm): Started virt-495
> Node Attributes:
>   * Node: virt-487 (1):
>     * test_attribute : test_1
> Migration Summary:
> Tickets:
> PCSD Status:
>   virt-487: Online
>   virt-492: Online
>   virt-493: Online
>   virt-494: Online
>   virt-495: Online
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled

Check for transient attributes on node virt-487:

> [root@virt-495 ~]# cibadmin --query --xpath '/cib/status/node_state[@uname="virt-487"]/transient_attributes'
> <transient_attributes id="1">
>   <instance_attributes id="status-1">
>     <nvpair id="status-1-test_attribute" name="test_attribute" value="test_1"/>
>   </instance_attributes>
> </transient_attributes>

Kill node virt-487 and DC node virt-492 at the same time:

> [root@virt-487 ~]# echo b > /proc/sysrq-trigger
> [root@virt-492 ~]# echo b > /proc/sysrq-trigger

Nodes are fenced:

> [root@virt-495 ~]# pcs status
> Cluster name: STSRHTS2503
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-494 (version 2.1.0-11.el9-7c3f660707) - partition with quorum
>   * Last updated: Wed Sep 22 16:07:25 2021
>   * Last change: Wed Sep 22 15:58:56 2021 by root via cibadmin on virt-487
>   * 5 nodes configured
>   * 5 resource instances configured
> Node List:
>   * Node virt-487: UNCLEAN (offline)
>   * Node virt-492: UNCLEAN (offline)
>   * Online: [ virt-493 virt-494 virt-495 ]
> Full List of Resources:
>   * fence-virt-487 (stonith:fence_xvm): Starting [ virt-487 virt-493 ]
>   * fence-virt-492 (stonith:fence_xvm): Started [ virt-494 virt-492 ]
>   * fence-virt-493 (stonith:fence_xvm): Started virt-493
>   * fence-virt-494 (stonith:fence_xvm): Started virt-494
>   * fence-virt-495 (stonith:fence_xvm): Started virt-495
> Pending Fencing Actions:
>   * reboot of virt-492 pending: client=pacemaker-controld.1135303, origin=virt-494
>   * reboot of virt-487 pending: client=pacemaker-controld.1135303, origin=virt-494
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled

> [root@virt-495 ~]# pcs status --full
> Cluster name: STSRHTS2503
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-494 (4) (version 2.1.0-11.el9-7c3f660707) - partition with quorum
>   * Last updated: Wed Sep 22 16:07:49 2021
>   * Last change: Wed Sep 22 15:58:56 2021 by root via cibadmin on virt-487
>   * 5 nodes configured
>   * 5 resource instances configured
> Node List:
>   * Online: [ virt-493 (3) virt-494 (4) virt-495 (5) ]
>   * OFFLINE: [ virt-487 (1) virt-492 (2) ]
> Full List of Resources:
>   * fence-virt-487 (stonith:fence_xvm): Started virt-493
>   * fence-virt-492 (stonith:fence_xvm): Started virt-494
>   * fence-virt-493 (stonith:fence_xvm): Started virt-493
>   * fence-virt-494 (stonith:fence_xvm): Started virt-494
>   * fence-virt-495 (stonith:fence_xvm): Started virt-495
> Migration Summary:
> Failed Fencing Actions:
>   * reboot of virt-492 failed: delegate=virt-494, client=pacemaker-controld.1135303, origin=virt-494, completed='2021-09-22 16:07:27 +02:00' (a later attempt succeeded)
> Fencing History:
>   * reboot of virt-492 successful: delegate=virt-494, client=pacemaker-controld.1135303, origin=virt-494, completed='2021-09-22 16:07:29 +02:00'
>   * reboot of virt-487 successful: delegate=virt-493, client=pacemaker-controld.1135303, origin=virt-494, completed='2021-09-22 16:07:27 +02:00'
> Tickets:
> PCSD Status:
>   virt-487: Offline
>   virt-492: Offline
>   virt-493: Online
>   virt-494: Online
>   virt-495: Online
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled

Wait until the nodes are rebooted:

> [root@virt-495 ~]# pcs status --full
> Cluster name: STSRHTS2503
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-494 (4) (version 2.1.0-11.el9-7c3f660707) - partition with quorum
>   * Last updated: Wed Sep 22 16:16:15 2021
>   * Last change: Wed Sep 22 15:58:56 2021 by root via cibadmin on virt-487
>   * 5 nodes configured
>   * 5 resource instances configured
> Node List:
>   * Online: [ virt-487 (1) virt-492 (2) virt-493 (3) virt-494 (4) virt-495 (5) ]
> Full List of Resources:
>   * fence-virt-487 (stonith:fence_xvm): Started virt-493
>   * fence-virt-492 (stonith:fence_xvm): Started virt-494
>   * fence-virt-493 (stonith:fence_xvm): Started virt-493
>   * fence-virt-494 (stonith:fence_xvm): Started virt-494
>   * fence-virt-495 (stonith:fence_xvm): Started virt-495
> Migration Summary:
> Failed Fencing Actions:
>   * reboot of virt-492 failed: delegate=virt-494, client=pacemaker-controld.1135303, origin=virt-494, completed='2021-09-22 16:07:27 +02:00' (a later attempt succeeded)
> Fencing History:
>   * reboot of virt-492 successful: delegate=virt-494, client=pacemaker-controld.1135303, origin=virt-494, completed='2021-09-22 16:07:29 +02:00'
>   * reboot of virt-487 successful: delegate=virt-493, client=pacemaker-controld.1135303, origin=virt-494, completed='2021-09-22 16:07:27 +02:00'
> Tickets:
> PCSD Status:
>   virt-487: Online
>   virt-492: Online
>   virt-493: Online
>   virt-494: Online
>   virt-495: Online
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled

Check for transient attributes on node virt-487:

> [root@virt-495 ~]# cibadmin --query --xpath '/cib/status/node_state[@uname="virt-487"]/transient_attributes'
> Call cib_query failed (-6): No such device or address

Transient attributes were removed.

Verified as Sanity Only in pacemaker-2.1.0-11.el9