Bug 1989292 - On a three-node cluster if two nodes are hard-reset, sometimes the cluster ends up with unremovable transient attributes
Summary: On a three-node cluster if two nodes are hard-reset, sometimes the cluster ends up with unremovable transient attributes
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: pacemaker
Version: 9.0
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: beta
Target Release: 9.0 Beta
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On: 1986998
Blocks: 1983952
 
Reported: 2021-08-02 20:39 UTC by Ken Gaillot
Modified: 2021-12-07 22:02 UTC
CC List: 8 users

Fixed In Version: pacemaker-2.1.0-9.el9
Doc Type: Bug Fix
Doc Text:
Cause: If the DC and another node leave the cluster at the same time, either node might be listed first in the notification from Corosync, and Pacemaker will process them in order.
Consequence: If the non-DC node is listed and processed first, its transient node attributes will not be cleared, leading to potential problems with resource agents or unfencing.
Fix: Pacemaker now sorts the Corosync notification so that the DC node is always first.
Result: Transient attributes are properly cleared when a node leaves the cluster, even if the DC leaves at the same time.
Clone Of: 1986998
Environment:
Last Closed: 2021-12-07 21:57:54 UTC
Type: Bug
Target Upstream Version: 2.1.2
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-91983 0 None None None 2021-08-02 20:40:38 UTC

Description Ken Gaillot 2021-08-02 20:39:31 UTC
+++ This bug was initially created as a clone of Bug #1986998 +++

Description of problem:
One of our QE tests simulates the hard reset of two out of three cluster nodes. The expectation is that the cluster and its services eventually recover (we're aware that, depending on how long the network takes to recover, an additional fence event might be triggered). For the most part this works, but in some cases we noticed that rabbitmq would not come up at all, and we think we finally root-caused it today (thanks to Eck and Luca!).

rabbitmq-cluster uses a transient (-l reboot) attribute called rmq-node-attr-rabbitmq, in which it records the nodes where rabbit is running. If rabbit is not running on a node, the RA removes the attribute via "crm_attribute --verbose -N $NODENAME -l reboot -D --name rmq-node-attr-rabbitmq". The problem is that the crm_attribute -D command runs, yet the attribute is still there and never gets removed. To prove it, we instrumented the RA to query attrd after the removal, and the attribute is still present:
Jul 28 15:27:25 controller-1 rabbitmq-cluster(rabbitmq)[13275]: INFO: crm_attribute -D called: controller-1 --> rmq-node-attr-rabbitmq: -> 0
Jul 28 15:27:25 controller-1 rabbitmq-cluster(rabbitmq)[13286]: INFO: crm_attribute --query output: controller-1 --> scope=status name=rmq-node-attr-rabbitmq value=rabbit@controller-1 

On a hosed system, running crm_attribute -D by hand did not fix anything either (we tried from all three nodes); the value is somehow stuck. pcs resource cleanup and pcs resource restart rabbitmq-bundle do not fix it either.
The only thing that unblocks this state is to update the CIB (we used an additional bind-mount on the rabbitmq-bundle) and push it to the cluster. After that, everything is unblocked and working.
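
For reference, the transient-attribute lifecycle the RA relies on can be exercised by hand with crm_attribute. This is only a minimal sketch; the node name and value are illustrative, and the attribute name is the one the RA uses:

    # Set a reboot-lifetime (transient) attribute on a node.
    crm_attribute -N controller-1 -l reboot --name rmq-node-attr-rabbitmq --update rabbit@controller-1

    # Query it back; this should print the value just set.
    crm_attribute -N controller-1 -l reboot --name rmq-node-attr-rabbitmq --query

    # Delete it. On an affected cluster this call appears to succeed,
    # but the query above still returns the old value afterwards.
    crm_attribute -N controller-1 -l reboot --name rmq-node-attr-rabbitmq -D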

<snip>

--- Additional comment from Ken Gaillot on 2021-08-02 17:34:52 UTC ---

When a non-DC node leaves the cluster, the DC clears its transient attributes. If the DC leaves the cluster, all nodes clear the DC's transient attributes.

The problem can occur if both the DC and another node leave at the same time. Pacemaker processes the node exit notification list from Corosync one by one. If a non-DC node happens to be listed before the DC node, Pacemaker on the surviving node(s) will process the non-DC node exit first, and won't be aware yet that the DC has left, so it will assume the DC is handling the clearing for that node.

The fix should be straightforward: sort the exit list so that the DC, if present, is always first.
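
Whether a given cluster can hit this ordering can be checked from any node. The commands below are only a sketch for inspecting the relevant state (they do not demonstrate the fix itself): they show the Corosync node IDs and which node is currently the DC.

    # List cluster nodes with their Corosync node IDs
    # (typical output is "<id> <uname> <state>"; the exact format may vary by version).
    crm_node --list

    # Show which node is currently the DC.
    pcs status | grep "Current DC"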

Comment 1 Ken Gaillot 2021-08-10 16:34:11 UTC
Fixed upstream as of commit ee7eba6

Comment 2 Ken Gaillot 2021-08-12 14:40:00 UTC
QA: RHOSP QA will test the 8.4.z equivalent of this bz, so this bz can be tested for regressions only.

If you do want to reproduce it, it's straightforward:
1. Configure a cluster with at least 5 nodes (so quorum is retained if 2 are lost).
2. Choose two nodes: the DC node and a node with a lower Corosync node ID (if the DC has the lowest ID, just restart the cluster on that node, and another node will be elected DC).
3. Set a transient attribute on the non-DC node that was selected.
4. Kill both the nodes, and wait for the cluster to fence them.

Before this change, the transient attribute on the non-DC node will persist across the reboot of that node. After this change, it will not.
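
For reference, the reproduction condenses to the commands below. This is a sketch only; virt-487 stands for the chosen non-DC node and virt-492 for the DC (the same names as in the verification later in this bug).

    # Step 3: set a transient attribute on the chosen non-DC node.
    crm_attribute --node virt-487 --name test_attribute --update test_1 --lifetime=reboot

    # Step 4: hard-reset both the chosen node and the DC at the same time,
    # then wait for fencing and for both nodes to rejoin.
    echo b > /proc/sysrq-trigger    # run on virt-487
    echo b > /proc/sysrq-trigger    # run on virt-492 (the DC)

    # After the nodes rejoin, check whether the attribute survived; with the
    # fix, this query fails with "No such device or address".
    cibadmin --query --xpath '/cib/status/node_state[@uname="virt-487"]/transient_attributes'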

Comment 7 Ken Gaillot 2021-09-10 20:10:16 UTC
Since this was originally a RHOSP-related bz, RHOSP has verified the corresponding 8.4.z Bug 1989622, and this can get sanity-only testing.

Comment 8 Markéta Smazová 2021-09-22 14:52:26 UTC
after fix
----------

>   [root@virt-495 ~]# rpm -q pacemaker
>   pacemaker-2.1.0-11.el9.x86_64


Setup 5 node cluster:

>   [root@virt-495 ~]# pcs status
>   Cluster name: STSRHTS2503
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-492 (version 2.1.0-11.el9-7c3f660707) - partition with quorum
>     * Last updated: Wed Sep 22 15:59:31 2021
>     * Last change:  Wed Sep 22 15:58:56 2021 by root via cibadmin on virt-487
>     * 5 nodes configured
>     * 5 resource instances configured

>   Node List:
>     * Online: [ virt-487 virt-492 virt-493 virt-494 virt-495 ]

>   Full List of Resources:
>     * fence-virt-487	(stonith:fence_xvm):	 Started virt-487
>     * fence-virt-492	(stonith:fence_xvm):	 Started virt-492
>     * fence-virt-493	(stonith:fence_xvm):	 Started virt-493
>     * fence-virt-494	(stonith:fence_xvm):	 Started virt-494
>     * fence-virt-495	(stonith:fence_xvm):	 Started virt-495

>   Daemon Status:
>     corosync: active/disabled
>     pacemaker: active/disabled
>     pcsd: active/enabled


Create a transient attribute on node virt-487, which has a lower Corosync node ID than the DC node (virt-492):

>   [root@virt-495 ~]# crm_attribute --node virt-487 --name test_attribute --update test_1 --lifetime=reboot

>   [root@virt-495 ~]# pcs status --full
>   Cluster name: STSRHTS2503
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-492 (2) (version 2.1.0-11.el9-7c3f660707) - partition with quorum
>     * Last updated: Wed Sep 22 16:05:57 2021
>     * Last change:  Wed Sep 22 15:58:56 2021 by root via cibadmin on virt-487
>     * 5 nodes configured
>     * 5 resource instances configured

>   Node List:
>     * Online: [ virt-487 (1) virt-492 (2) virt-493 (3) virt-494 (4) virt-495 (5) ]

>   Full List of Resources:
>     * fence-virt-487	(stonith:fence_xvm):	 Started virt-487
>     * fence-virt-492	(stonith:fence_xvm):	 Started virt-492
>     * fence-virt-493	(stonith:fence_xvm):	 Started virt-493
>     * fence-virt-494	(stonith:fence_xvm):	 Started virt-494
>     * fence-virt-495	(stonith:fence_xvm):	 Started virt-495

>   Node Attributes:
>     * Node: virt-487 (1):
>       * test_attribute                  	: test_1    

>   Migration Summary:

>   Tickets:

>   PCSD Status:
>     virt-487: Online
>     virt-492: Online
>     virt-493: Online
>     virt-494: Online
>     virt-495: Online

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled


Check for transient attributes on node virt-487:

>   [root@virt-495 ~]# cibadmin --query --xpath '/cib/status/node_state[@uname="virt-487"]/transient_attributes'
>   <transient_attributes id="1">
>     <instance_attributes id="status-1">
>       <nvpair id="status-1-test_attribute" name="test_attribute" value="test_1"/>
>     </instance_attributes>
>   </transient_attributes>


Kill node virt-487 and DC node virt-492 at the same time:

>   [root@virt-487 ~]# echo b > /proc/sysrq-trigger
>   [root@virt-492 ~]# echo b > /proc/sysrq-trigger


Nodes are fenced:

>   [root@virt-495 ~]# pcs status
>   Cluster name: STSRHTS2503
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-494 (version 2.1.0-11.el9-7c3f660707) - partition with quorum
>     * Last updated: Wed Sep 22 16:07:25 2021
>     * Last change:  Wed Sep 22 15:58:56 2021 by root via cibadmin on virt-487
>     * 5 nodes configured
>     * 5 resource instances configured

>   Node List:
>     * Node virt-487: UNCLEAN (offline)
>     * Node virt-492: UNCLEAN (offline)
>     * Online: [ virt-493 virt-494 virt-495 ]

>   Full List of Resources:
>     * fence-virt-487	(stonith:fence_xvm):	 Starting [ virt-487 virt-493 ]
>     * fence-virt-492	(stonith:fence_xvm):	 Started [ virt-494 virt-492 ]
>     * fence-virt-493	(stonith:fence_xvm):	 Started virt-493
>     * fence-virt-494	(stonith:fence_xvm):	 Started virt-494
>     * fence-virt-495	(stonith:fence_xvm):	 Started virt-495

>   Pending Fencing Actions:
>     * reboot of virt-492 pending: client=pacemaker-controld.1135303, origin=virt-494
>     * reboot of virt-487 pending: client=pacemaker-controld.1135303, origin=virt-494

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled


>   [root@virt-495 ~]# pcs status --full
>   Cluster name: STSRHTS2503
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-494 (4) (version 2.1.0-11.el9-7c3f660707) - partition with quorum
>     * Last updated: Wed Sep 22 16:07:49 2021
>     * Last change:  Wed Sep 22 15:58:56 2021 by root via cibadmin on virt-487
>     * 5 nodes configured
>     * 5 resource instances configured

>   Node List:
>     * Online: [ virt-493 (3) virt-494 (4) virt-495 (5) ]
>     * OFFLINE: [ virt-487 (1) virt-492 (2) ]

>   Full List of Resources:
>     * fence-virt-487	(stonith:fence_xvm):	 Started virt-493
>     * fence-virt-492	(stonith:fence_xvm):	 Started virt-494
>     * fence-virt-493	(stonith:fence_xvm):	 Started virt-493
>     * fence-virt-494	(stonith:fence_xvm):	 Started virt-494
>     * fence-virt-495	(stonith:fence_xvm):	 Started virt-495

>   Migration Summary:

>   Failed Fencing Actions:
>     * reboot of virt-492 failed: delegate=virt-494, client=pacemaker-controld.1135303, origin=virt-494, completed='2021-09-22 16:07:27 +02:00' (a later attempt succeeded)

>   Fencing History:
>     * reboot of virt-492 successful: delegate=virt-494, client=pacemaker-controld.1135303, origin=virt-494, completed='2021-09-22 16:07:29 +02:00'
>     * reboot of virt-487 successful: delegate=virt-493, client=pacemaker-controld.1135303, origin=virt-494, completed='2021-09-22 16:07:27 +02:00'

>   Tickets:

>   PCSD Status:
>     virt-487: Offline
>     virt-492: Offline
>     virt-493: Online
>     virt-494: Online
>     virt-495: Online

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled


Wait until the nodes have rebooted:

>   [root@virt-495 ~]# pcs status --full
>   Cluster name: STSRHTS2503
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-494 (4) (version 2.1.0-11.el9-7c3f660707) - partition with quorum
>     * Last updated: Wed Sep 22 16:16:15 2021
>     * Last change:  Wed Sep 22 15:58:56 2021 by root via cibadmin on virt-487
>     * 5 nodes configured
>     * 5 resource instances configured

>   Node List:
>     * Online: [ virt-487 (1) virt-492 (2) virt-493 (3) virt-494 (4) virt-495 (5) ]

>   Full List of Resources:
>     * fence-virt-487	(stonith:fence_xvm):	 Started virt-493
>     * fence-virt-492	(stonith:fence_xvm):	 Started virt-494
>     * fence-virt-493	(stonith:fence_xvm):	 Started virt-493
>     * fence-virt-494	(stonith:fence_xvm):	 Started virt-494
>     * fence-virt-495	(stonith:fence_xvm):	 Started virt-495

>   Migration Summary:

>   Failed Fencing Actions:
>     * reboot of virt-492 failed: delegate=virt-494, client=pacemaker-controld.1135303, origin=virt-494, completed='2021-09-22 16:07:27 +02:00' (a later attempt succeeded)

>   Fencing History:
>     * reboot of virt-492 successful: delegate=virt-494, client=pacemaker-controld.1135303, origin=virt-494, completed='2021-09-22 16:07:29 +02:00'
>     * reboot of virt-487 successful: delegate=virt-493, client=pacemaker-controld.1135303, origin=virt-494, completed='2021-09-22 16:07:27 +02:00'

>   Tickets:

>   PCSD Status:
>     virt-487: Online
>     virt-492: Online
>     virt-493: Online
>     virt-494: Online
>     virt-495: Online

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled


Check for transient attributes on node virt-487:

>   [root@virt-495 ~]# cibadmin --query --xpath '/cib/status/node_state[@uname="virt-487"]/transient_attributes'
>   Call cib_query failed (-6): No such device or address

Transient attributes were removed.


Verified as Sanity Only in pacemaker-2.1.0-11.el9

