Bug 1989292

Summary: On a three-node cluster if two nodes are hard-reset, sometimes the cluster ends up with unremovable transient attributes
Product: Red Hat Enterprise Linux 9
Reporter: Ken Gaillot <kgaillot>
Component: pacemaker
Assignee: Ken Gaillot <kgaillot>
Status: CLOSED CURRENTRELEASE
QA Contact: cluster-qe <cluster-qe>
Severity: high
Docs Contact:
Priority: high
Version: 9.0
CC: cluster-maint, cluster-qe, eolivare, jeckersb, lmiccini, michele, msmazova, pkomarov
Target Milestone: beta
Keywords: Triaged
Target Release: 9.0 Beta   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: pacemaker-2.1.0-9.el9
Doc Type: Bug Fix
Doc Text:
Cause: If the DC and another node leave the cluster at the same time, either node might be listed first in the notification from Corosync, and Pacemaker will process them in order.
Consequence: If the non-DC node is listed and processed first, its transient node attributes will not be cleared, leading to potential problems with resource agents or unfencing.
Fix: Pacemaker now sorts the Corosync notification so that the DC node is always first.
Result: Transient attributes are properly cleared when a node leaves the cluster, even if the DC leaves at the same time.
Story Points: ---
Clone Of: 1986998
Environment:
Last Closed: 2021-12-07 21:57:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version: 2.1.2
Embargoed:
Bug Depends On: 1986998    
Bug Blocks: 1983952    

Description Ken Gaillot 2021-08-02 20:39:31 UTC
+++ This bug was initially created as a clone of Bug #1986998 +++

Description of problem:
One of our QE tests simulates a hard reset of two out of three cluster nodes. The expectation is that the cluster and its services eventually recover (we're aware that, depending on how long the network takes to recover, an additional fence event might be triggered). For the most part this works, but in some cases we noticed that rabbitmq would not come up at all, and we think we finally root-caused it today (thanks to Eck and Luca!).

rabbitmq-cluster uses a transient (-l reboot) attribute called rmq-node-attr-rabbitmq, in which it records where rabbit is running. If rabbit is not running on the node, the agent removes the attribute via "crm_attribute --verbose -N $NODENAME -l reboot -D --name rmq-node-attr-rabbitmq". The problem is that the crm_attribute -D command is run, but the attribute is still there and never gets removed. To prove this, we instrumented the RA to query attrd after the removal, and we still see the attribute there:
Jul 28 15:27:25 controller-1 rabbitmq-cluster(rabbitmq)[13275]: INFO: crm_attribute -D called: controller-1 --> rmq-node-attr-rabbitmq: -> 0
Jul 28 15:27:25 controller-1 rabbitmq-cluster(rabbitmq)[13286]: INFO: crm_attribute --query output: controller-1 --> scope=status name=rmq-node-attr-rabbitmq value=rabbit@controller-1 
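For reference, the attribute lifecycle the agent goes through looks roughly like this; a minimal sketch rather than the agent's exact code, with $NODENAME standing for the local node name:

    # set the attribute when rabbit starts on this node (transient, reboot lifetime)
    crm_attribute -N "$NODENAME" -l reboot --name rmq-node-attr-rabbitmq --update "rabbit@$NODENAME"

    # remove it when rabbit stops -- this is the call that appears to have no effect
    crm_attribute --verbose -N "$NODENAME" -l reboot -D --name rmq-node-attr-rabbitmq

    # query it afterwards to confirm it is gone (what the instrumentation above does)
    crm_attribute -N "$NODENAME" -l reboot --query --name rmq-node-attr-rabbitmq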

On a hosed system, running crm_attribute -D by hand also did not fix anything (we tried from all three nodes); the value is somehow stuck in there. pcs resource cleanup and pcs resource restart rabbitmq-bundle do not fix it either.
The only thing that unblocks this state is to update the CIB (we used an additional bind mount on the rabbitmq-bundle) and push it into the cluster. After that everything is unblocked and working.
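
The workaround described above boils down to pushing a modified CIB. A minimal sketch with pcs; the file path and the particular edit (for example an extra bind mount on the bundle) are up to you:

    # dump the current CIB to a file
    pcs cluster cib > /tmp/cib.xml

    # edit /tmp/cib.xml (the change itself does not matter much; an extra bind
    # mount on the rabbitmq-bundle was used here), then push it back
    pcs cluster cib-push /tmp/cib.xml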

<snip>

--- Additional comment from Ken Gaillot on 2021-08-02 17:34:52 UTC ---

When a non-DC node leaves the cluster, the DC clears its transient attributes. If the DC leaves the cluster, all nodes clear the DC's transient attributes.

The problem can occur if both the DC and another node leave at the same time. Pacemaker processes the node exit notification list from Corosync one by one. If a non-DC node happens to be listed before the DC node, Pacemaker on the surviving node(s) will process the non-DC node exit first, and won't be aware yet that the DC has left, so it will assume the DC is handling the clearing for that node.

The fix should be straightforward: sort the exit list so that the DC, if present, is always first.
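
To see which node is currently the DC and how the Corosync node IDs compare (the ordering that matters here), the IDs are shown in parentheses in the full status output; for example:

    # the DC and the per-node Corosync IDs appear in parentheses
    pcs status --full | grep -E 'Current DC|Online:'

    # the DC is also visible in a one-shot crm_mon run
    crm_mon -1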

Comment 1 Ken Gaillot 2021-08-10 16:34:11 UTC
Fixed upstream as of commit ee7eba6

Comment 2 Ken Gaillot 2021-08-12 14:40:00 UTC
QA: RHOSP QA will test the 8.4.z equivalent of this bz, so this bz can be tested for regressions only.

If you do want to reproduce it, it's straightforward (a command-level sketch follows these steps):
1. Configure a cluster with at least 5 nodes (so quorum is retained if 2 are lost).
2. Choose two nodes: the DC node and a node with a lower Corosync node ID (if the DC has the lowest ID, just restart the cluster on that node, and another node will be elected DC).
3. Set a transient attribute on the non-DC node that was selected.
4. Kill both nodes and wait for the cluster to fence them.

Before this change, the transient attribute on the non-DC node will persist across the reboot of that node. After this change, it will not.
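
A command-level sketch of steps 3 and 4 and the final check, with node2 as a placeholder for the chosen non-DC node (the DC and the node IDs for step 2 can be checked as in the earlier sketch):

    # step 3: set a transient (reboot-lifetime) attribute on the chosen non-DC node
    crm_attribute --node node2 --name test_attribute --update test_1 --lifetime=reboot

    # step 4: hard-reset that node and the DC at roughly the same time
    # (run on node2 and on the DC node, respectively)
    echo b > /proc/sysrq-trigger

    # once both nodes have been fenced and have rejoined, the attribute should be
    # gone; with the fix, the query below fails with "No such device or address"
    cibadmin --query --xpath '/cib/status/node_state[@uname="node2"]/transient_attributes'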

Comment 7 Ken Gaillot 2021-09-10 20:10:16 UTC
Since this was originally a RHOSP-related bz, RHOSP has verified the corresponding 8.4.z Bug 1989622, and this one can get sanity-only testing.

Comment 8 Markéta Smazová 2021-09-22 14:52:26 UTC
after fix
----------

>   [root@virt-495 ~]# rpm -q pacemaker
>   pacemaker-2.1.0-11.el9.x86_64


Set up a 5-node cluster:

>   [root@virt-495 ~]# pcs status
>   Cluster name: STSRHTS2503
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-492 (version 2.1.0-11.el9-7c3f660707) - partition with quorum
>     * Last updated: Wed Sep 22 15:59:31 2021
>     * Last change:  Wed Sep 22 15:58:56 2021 by root via cibadmin on virt-487
>     * 5 nodes configured
>     * 5 resource instances configured

>   Node List:
>     * Online: [ virt-487 virt-492 virt-493 virt-494 virt-495 ]

>   Full List of Resources:
>     * fence-virt-487	(stonith:fence_xvm):	 Started virt-487
>     * fence-virt-492	(stonith:fence_xvm):	 Started virt-492
>     * fence-virt-493	(stonith:fence_xvm):	 Started virt-493
>     * fence-virt-494	(stonith:fence_xvm):	 Started virt-494
>     * fence-virt-495	(stonith:fence_xvm):	 Started virt-495

>   Daemon Status:
>     corosync: active/disabled
>     pacemaker: active/disabled
>     pcsd: active/enabled


Create a transient attribute on node virt-487, which has a lower Corosync node ID than the DC node (virt-492):

>   [root@virt-495 ~]# crm_attribute --node virt-487 --name test_attribute --update test_1 --lifetime=reboot

>   [root@virt-495 ~]# pcs status --full
>   Cluster name: STSRHTS2503
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-492 (2) (version 2.1.0-11.el9-7c3f660707) - partition with quorum
>     * Last updated: Wed Sep 22 16:05:57 2021
>     * Last change:  Wed Sep 22 15:58:56 2021 by root via cibadmin on virt-487
>     * 5 nodes configured
>     * 5 resource instances configured

>   Node List:
>     * Online: [ virt-487 (1) virt-492 (2) virt-493 (3) virt-494 (4) virt-495 (5) ]

>   Full List of Resources:
>     * fence-virt-487	(stonith:fence_xvm):	 Started virt-487
>     * fence-virt-492	(stonith:fence_xvm):	 Started virt-492
>     * fence-virt-493	(stonith:fence_xvm):	 Started virt-493
>     * fence-virt-494	(stonith:fence_xvm):	 Started virt-494
>     * fence-virt-495	(stonith:fence_xvm):	 Started virt-495

>   Node Attributes:
>     * Node: virt-487 (1):
>       * test_attribute                  	: test_1    

>   Migration Summary:

>   Tickets:

>   PCSD Status:
>     virt-487: Online
>     virt-492: Online
>     virt-493: Online
>     virt-494: Online
>     virt-495: Online

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled


Check for transient attributes on node virt-487:

>   [root@virt-495 ~]# cibadmin --query --xpath '/cib/status/node_state[@uname="virt-487"]/transient_attributes'
>   <transient_attributes id="1">
>     <instance_attributes id="status-1">
>       <nvpair id="status-1-test_attribute" name="test_attribute" value="test_1"/>
>     </instance_attributes>
>   </transient_attributes>


Kill node virt-487 and DC node virt-492 at the same time:

>   [root@virt-487 ~]# echo b > /proc/sysrq-trigger
>   [root@virt-492 ~]# echo b > /proc/sysrq-trigger


Nodes are fenced:

>   [root@virt-495 ~]# pcs status
>   Cluster name: STSRHTS2503
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-494 (version 2.1.0-11.el9-7c3f660707) - partition with quorum
>     * Last updated: Wed Sep 22 16:07:25 2021
>     * Last change:  Wed Sep 22 15:58:56 2021 by root via cibadmin on virt-487
>     * 5 nodes configured
>     * 5 resource instances configured

>   Node List:
>     * Node virt-487: UNCLEAN (offline)
>     * Node virt-492: UNCLEAN (offline)
>     * Online: [ virt-493 virt-494 virt-495 ]

>   Full List of Resources:
>     * fence-virt-487	(stonith:fence_xvm):	 Starting [ virt-487 virt-493 ]
>     * fence-virt-492	(stonith:fence_xvm):	 Started [ virt-494 virt-492 ]
>     * fence-virt-493	(stonith:fence_xvm):	 Started virt-493
>     * fence-virt-494	(stonith:fence_xvm):	 Started virt-494
>     * fence-virt-495	(stonith:fence_xvm):	 Started virt-495

>   Pending Fencing Actions:
>     * reboot of virt-492 pending: client=pacemaker-controld.1135303, origin=virt-494
>     * reboot of virt-487 pending: client=pacemaker-controld.1135303, origin=virt-494

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled


>   [root@virt-495 ~]# pcs status --full
>   Cluster name: STSRHTS2503
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-494 (4) (version 2.1.0-11.el9-7c3f660707) - partition with quorum
>     * Last updated: Wed Sep 22 16:07:49 2021
>     * Last change:  Wed Sep 22 15:58:56 2021 by root via cibadmin on virt-487
>     * 5 nodes configured
>     * 5 resource instances configured

>   Node List:
>     * Online: [ virt-493 (3) virt-494 (4) virt-495 (5) ]
>     * OFFLINE: [ virt-487 (1) virt-492 (2) ]

>   Full List of Resources:
>     * fence-virt-487	(stonith:fence_xvm):	 Started virt-493
>     * fence-virt-492	(stonith:fence_xvm):	 Started virt-494
>     * fence-virt-493	(stonith:fence_xvm):	 Started virt-493
>     * fence-virt-494	(stonith:fence_xvm):	 Started virt-494
>     * fence-virt-495	(stonith:fence_xvm):	 Started virt-495

>   Migration Summary:

>   Failed Fencing Actions:
>     * reboot of virt-492 failed: delegate=virt-494, client=pacemaker-controld.1135303, origin=virt-494, completed='2021-09-22 16:07:27 +02:00' (a later attempt succeeded)

>   Fencing History:
>     * reboot of virt-492 successful: delegate=virt-494, client=pacemaker-controld.1135303, origin=virt-494, completed='2021-09-22 16:07:29 +02:00'
>     * reboot of virt-487 successful: delegate=virt-493, client=pacemaker-controld.1135303, origin=virt-494, completed='2021-09-22 16:07:27 +02:00'

>   Tickets:

>   PCSD Status:
>     virt-487: Offline
>     virt-492: Offline
>     virt-493: Online
>     virt-494: Online
>     virt-495: Online

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled


Wait until nodes are rebooted:

>   [root@virt-495 ~]# pcs status --full
>   Cluster name: STSRHTS2503
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-494 (4) (version 2.1.0-11.el9-7c3f660707) - partition with quorum
>     * Last updated: Wed Sep 22 16:16:15 2021
>     * Last change:  Wed Sep 22 15:58:56 2021 by root via cibadmin on virt-487
>     * 5 nodes configured
>     * 5 resource instances configured

>   Node List:
>     * Online: [ virt-487 (1) virt-492 (2) virt-493 (3) virt-494 (4) virt-495 (5) ]

>   Full List of Resources:
>     * fence-virt-487	(stonith:fence_xvm):	 Started virt-493
>     * fence-virt-492	(stonith:fence_xvm):	 Started virt-494
>     * fence-virt-493	(stonith:fence_xvm):	 Started virt-493
>     * fence-virt-494	(stonith:fence_xvm):	 Started virt-494
>     * fence-virt-495	(stonith:fence_xvm):	 Started virt-495

>   Migration Summary:

>   Failed Fencing Actions:
>     * reboot of virt-492 failed: delegate=virt-494, client=pacemaker-controld.1135303, origin=virt-494, completed='2021-09-22 16:07:27 +02:00' (a later attempt succeeded)

>   Fencing History:
>     * reboot of virt-492 successful: delegate=virt-494, client=pacemaker-controld.1135303, origin=virt-494, completed='2021-09-22 16:07:29 +02:00'
>     * reboot of virt-487 successful: delegate=virt-493, client=pacemaker-controld.1135303, origin=virt-494, completed='2021-09-22 16:07:27 +02:00'

>   Tickets:

>   PCSD Status:
>     virt-487: Online
>     virt-492: Online
>     virt-493: Online
>     virt-494: Online
>     virt-495: Online

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled


Check for transient attributes on node virt-487:

>   [root@virt-495 ~]# cibadmin --query --xpath '/cib/status/node_state[@uname="virt-487"]/transient_attributes'
>   Call cib_query failed (-6): No such device or address

Transient attributes were removed.


Verified as Sanity Only in pacemaker-2.1.0-11.el9