1961857 – pacemaker seems to end up in an unfence loop

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1961857 - pacemaker seems to end up in an unfence loop

Summary: pacemaker seems to end up in an unfence loop

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 8
Classification:	Red Hat
Component:	pacemaker
Sub Component:
Version:	8.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	8.5
Assignee:	Ken Gaillot
QA Contact:	cluster-qe@redhat.com
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1972273
TreeView+	depends on / blocked

Reported:	2021-05-18 20:22 UTC by Michele Baldessari
Modified:	2024-10-01 18:15 UTC (History)
CC List:	11 users (show)
Fixed In Version:	pacemaker-2.1.0-3.el8
Doc Type:	Bug Fix
Doc Text:	Cause: Pacemaker's controller on a node might be elected the Designated Controller (DC) before its attribute manager learned an already-active remote node is remote. Consequence: The node's scheduler would not see any of the remote node's node attributes. If the cluster used unfencing, this could result in an unfencing loop. Fix: The attribute manager can now learn a remote node is remote via additional events, including the initial attribute sync at start-up. Result: No unfencing loop occurs, regardless of which node is elected DC.
Clone Of:
Clones:	1972273 (view as bug list)
Environment:
Last Closed:	2021-11-09 18:44:54 UTC
Type:	Bug
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	6534751	0	None	Closed	Scale down failure on the cluster	2022-06-23 09:11:33 UTC
Red Hat Product Errata	RHEA-2021:4267	0	None	None	None	2021-11-09 18:45:16 UTC

Description Michele Baldessari 2021-05-18 20:22:03 UTC

Description of problem:
So I was testing some new constraints around a resource called ovn-dbs-bundle in this Instance HA environment and by pure chance I noticed that the cluster seems to be stuck in a constant loop of unfencing the two compute nodes.

Namely I keep seeing the following:
May 18 20:13:14 messaging-0.redhat.local pacemaker-controld[141595]:  notice: State transition S_IDLE -> S_POLICY_ENGINE                                                                                                      [422/1436]
May 18 20:13:14 messaging-0.redhat.local pacemaker-schedulerd[141594]:  error: Stopping compute-unfence-trigger:0 until compute-0 can be unfenced
May 18 20:13:14 messaging-0.redhat.local pacemaker-schedulerd[141594]:  error: Stopping compute-unfence-trigger:1 until compute-1 can be unfenced
May 18 20:13:14 messaging-0.redhat.local pacemaker-schedulerd[141594]:  notice:  * Fence (on) compute-0 'required by compute-unfence-trigger:0 start'
May 18 20:13:14 messaging-0.redhat.local pacemaker-schedulerd[141594]:  notice:  * Fence (on) compute-1 'required by compute-unfence-trigger:1 start'
May 18 20:13:14 messaging-0.redhat.local pacemaker-schedulerd[141594]:  notice:  * Restart    compute-unfence-trigger:0              (                          compute-0 )   due to required stonith
May 18 20:13:14 messaging-0.redhat.local pacemaker-schedulerd[141594]:  notice:  * Restart    compute-unfence-trigger:1              (                          compute-1 )   due to required stonith
May 18 20:13:14 messaging-0.redhat.local pacemaker-schedulerd[141594]:  error: Calculated transition 71 (with errors), saving inputs in /var/lib/pacemaker/pengine/pe-error-42.bz2
May 18 20:13:14 messaging-0.redhat.local pacemaker-controld[141595]:  notice: Initiating stop operation compute-unfence-trigger_stop_0 on compute-0
May 18 20:13:14 messaging-0.redhat.local pacemaker-controld[141595]:  notice: Initiating stop operation compute-unfence-trigger_stop_0 on compute-1
May 18 20:13:14 messaging-0.redhat.local pacemaker-controld[141595]:  notice: Requesting fencing (on) of node compute-0
May 18 20:13:14 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: Client pacemaker-controld.141595.362e0079 wants to fence (on) 'compute-0' with device '(any)'
May 18 20:13:14 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: Requesting peer fencing (on) targeting compute-0
May 18 20:13:14 messaging-0.redhat.local pacemaker-controld[141595]:  notice: Requesting fencing (on) of node compute-1
May 18 20:13:14 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: Client pacemaker-controld.141595.362e0079 wants to fence (on) 'compute-1' with device '(any)'
May 18 20:13:14 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: Requesting peer fencing (on) targeting compute-1
May 18 20:13:14 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: stonith-fence_ipmilan-5254006197b0 is not eligible to fence (on) compute-0: static-list
May 18 20:13:14 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: stonith-fence_ipmilan-52540061e462 is not eligible to fence (on) compute-0: static-list
May 18 20:13:14 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: stonith-fence_ipmilan-5254002f3ee2 is not eligible to fence (on) compute-0: static-list
May 18 20:13:14 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: stonith-fence_ipmilan-52540099d9d4 is not eligible to fence (on) compute-0: static-list
May 18 20:13:14 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: stonith-fence_ipmilan-5254004abb25 is not eligible to fence (on) compute-0: static-list
May 18 20:13:14 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: stonith-fence_ipmilan-5254007f35e5 is not eligible to fence (on) compute-0: static-list
May 18 20:13:14 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: stonith-fence_ipmilan-52540091b7d5 is not eligible to fence (on) compute-0: static-list
May 18 20:13:14 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: stonith-fence_ipmilan-5254005fe973 is eligible to fence (on) compute-0: static-list
May 18 20:13:14 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: stonith-fence_ipmilan-525400e04d3b is not eligible to fence (on) compute-0: static-list
May 18 20:13:14 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: stonith-fence_ipmilan-52540033a68d is not eligible to fence (on) compute-0: static-list
May 18 20:13:14 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: stonith-fence_ipmilan-5254006197b0 is eligible to fence (on) compute-1: static-list
May 18 20:13:14 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: stonith-fence_ipmilan-52540061e462 is not eligible to fence (on) compute-1: static-list
May 18 20:13:14 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: stonith-fence_ipmilan-5254002f3ee2 is not eligible to fence (on) compute-1: static-list
May 18 20:13:14 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: stonith-fence_ipmilan-52540099d9d4 is not eligible to fence (on) compute-1: static-list
May 18 20:13:14 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: stonith-fence_ipmilan-5254004abb25 is not eligible to fence (on) compute-1: static-list
May 18 20:13:14 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: stonith-fence_ipmilan-5254007f35e5 is not eligible to fence (on) compute-1: static-list
May 18 20:13:14 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: stonith-fence_ipmilan-52540091b7d5 is not eligible to fence (on) compute-1: static-list
May 18 20:13:14 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: stonith-fence_ipmilan-5254005fe973 is not eligible to fence (on) compute-1: static-list
May 18 20:13:14 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: stonith-fence_ipmilan-525400e04d3b is not eligible to fence (on) compute-1: static-list
May 18 20:13:14 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: stonith-fence_ipmilan-52540033a68d is not eligible to fence (on) compute-1: static-list
May 18 20:13:14 messaging-0.redhat.local runuser[291621]: pam_unix(runuser:session): session closed for user rabbitmq
May 18 20:13:19 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: Requesting that database-1 perform 'on' action targeting compute-1 using 'stonith-fence_ipmilan-5254006197b0'
May 18 20:13:19 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: Action 'on' targeting compute-1 using stonith-fence_ipmilan-5254006197b0 on behalf of pacemaker-controld.141595@messaging-0: OK
May 18 20:13:19 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: Requesting that database-2 perform 'on' action targeting compute-1 using 'stonith-fence_compute-fence-nova'
May 18 20:13:19 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: Requesting that messaging-0 perform 'on' action targeting compute-0 using 'stonith-fence_ipmilan-5254005fe973'
May 18 20:13:19 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: Operation 'on' [291676] (call 114 from pacemaker-controld.141595) for host 'compute-0' with device 'stonith-fence_ipmilan-5254005fe973' returned: 0 (OK)
May 18 20:13:19 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: Action 'on' targeting compute-0 using stonith-fence_ipmilan-5254005fe973 on behalf of pacemaker-controld.141595@messaging-0: OK
May 18 20:13:19 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: Requesting that database-2 perform 'on' action targeting compute-0 using 'stonith-fence_compute-fence-nova'
May 18 20:13:21 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: Action 'on' targeting compute-1 using stonith-fence_compute-fence-nova on behalf of pacemaker-controld.141595@messaging-0: OK
May 18 20:13:21 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: Operation 'on' targeting compute-1 on database-2 for pacemaker-controld.141595: OK
May 18 20:13:21 messaging-0.redhat.local pacemaker-controld[141595]:  notice: Stonith operation 115/85:71:0:e660d42a-7190-450b-a5f7-d5491a657fd1: OK (0)
May 18 20:13:21 messaging-0.redhat.local pacemaker-controld[141595]:  notice: compute-1 was successfully unfenced by database-2 (at the request of messaging-0)
May 18 20:13:21 messaging-0.redhat.local pacemaker-controld[141595]:  notice: Initiating start operation compute-unfence-trigger_start_0 on compute-1
May 18 20:13:21 messaging-0.redhat.local pacemaker-attrd[141593]:  notice: Setting #node-unfenced[compute-1]: 1621368733 -> 1621368801
May 18 20:13:21 messaging-0.redhat.local pacemaker-attrd[141593]:  notice: Cannot update #node-unfenced[compute-0]=1621368731 because peer UUID not known (will retry if learned)
May 18 20:13:21 messaging-0.redhat.local pacemaker-attrd[141593]:  notice: Cannot update #node-unfenced[compute-1]=1621368801 because peer UUID not known (will retry if learned)
May 18 20:13:21 messaging-0.redhat.local pacemaker-controld[141595]:  notice: Initiating monitor operation compute-unfence-trigger_monitor_10000 on compute-1
May 18 20:13:23 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: Action 'on' targeting compute-0 using stonith-fence_compute-fence-nova on behalf of pacemaker-controld.141595@messaging-0: OK
May 18 20:13:23 messaging-0.redhat.local pacemaker-fenced[141591]:  notice: Operation 'on' targeting compute-0 on database-2 for pacemaker-controld.141595: OK
May 18 20:13:23 messaging-0.redhat.local pacemaker-controld[141595]:  notice: Stonith operation 114/86:71:0:e660d42a-7190-450b-a5f7-d5491a657fd1: OK (0)
May 18 20:13:23 messaging-0.redhat.local pacemaker-controld[141595]:  notice: compute-0 was successfully unfenced by database-2 (at the request of messaging-0)


I suspect that the root cause is around these two messages:
May 18 20:13:21 messaging-0.redhat.local pacemaker-attrd[141593]:  notice: Cannot update #node-unfenced[compute-0]=1621368731 because peer UUID not known (will retry if learned)
May 18 20:13:21 messaging-0.redhat.local pacemaker-attrd[141593]:  notice: Cannot update #node-unfenced[compute-1]=1621368801 because peer UUID not known (will retry if learned)


Version-Release number of selected component (if applicable):
pacemaker-2.0.5-9.el8.x86_64

How reproducible:
Not sure yet. Still investigating. Full sosreports + /var/{lib,log}/pacemaker from the DC messaging-0 at http://file.rdu.redhat.com/~mbaldess/pcmk-unfence-loop/

I'll try to reproduce as well in the next days.

Comment 1 Ken Gaillot 2021-05-18 23:05:28 UTC

> I suspect that the root cause is around these two messages:
> May 18 20:13:21 messaging-0.redhat.local pacemaker-attrd[141593]:  notice: Cannot update #node-unfenced[compute-0]=1621368731 because peer UUID not known (will retry if learned)
> May 18 20:13:21 messaging-0.redhat.local pacemaker-attrd[141593]:  notice: Cannot update #node-unfenced[compute-1]=1621368801 because peer UUID not known (will retry if learned)

You are correct. The attribute manager cannot record the node attribute indicating that the node was unfenced into the CIB until it knows the node UUID, so the scheduler never realizes that the unfencing has already been done.

I'll have to investigate the sosreports to see why the UUID is unknown in this particular case, but I'm thinking we can close this as a duplicate of Bug 1905965 -- it's not the same problem, but that fix should take care of this, too. (We would track node-unfenced in the CIB node_state entry rather than as a node attribute.)

Comment 2 Ken Gaillot 2021-06-04 21:23:54 UTC

The fix for Bug 1905965 would help somewhat but would not be a complete fix in this case. The problem is that the node attribute manager doesn't know that the compute nodes are remote nodes.

Normally, all cluster nodes learn that a node is a remote node when the remote connection comes up. In this case, messaging-0 was down when the compute nodes came up, so it never learned that.

When a node comes up, the other nodes send it all known attributes, so messaging-0 did get a copy of the compute nodes' unfencing-related attributes. However, this sync does not include whether the relevant nodes are cluster nodes or remote nodes, so messaging-0 assumed the compute nodes were unseen cluster nodes, and couldn't record their node attributes.

When messaging-0 later became DC, nothing had yet happened that would make its attribute manager learn the remote nodes, so it still hadn't recorded those attributes, and its scheduler thought unfencing was still needed.

The fix will be to pass along the cluster vs remote info in the sync with newly joining nodes.

In the meantime, the only workaround I see in this situation is restarting the cluster on the affected node, so another node becomes DC.

How urgent is this, and what RHEL versions would you need the fix in?

Comment 3 Michele Baldessari 2021-06-07 06:18:42 UTC

Heya Ken,

so I don't think we've managed to reproduce this in the last month, so for now I'll cautiously say that we can aim for RHEL-9 for this for the time being. I'll update here if we spot this in the wild some more.

thanks,
Michele

Comment 6 Ken Gaillot 2021-06-14 22:33:47 UTC

@michele I'm narrowing down the circumstances under which this occurs. I see this after controller-2 left and rejoined:

May 18 18:36:05 controller-2.redhat.local pacemaker-based     [312491] (cib_process_request)    info: Forwarding cib_delete operation for section //cluster_property_set/nvpair[@name='tripleo-shutdown-lock' and @value='controller-2:1621363508']/.. to all (origin=local/cibadmin/2)

Are you calling cibadmin immediately after starting an upgraded node? If so, when/how/why?

Comment 7 Michele Baldessari 2021-06-15 05:37:16 UTC

Hi Ken,

I'll let Damien comment more in detail, but yes there is a script we use to coordinate certain operations across the control plane that uses some attributes to implement a distributed lock [1]

[1] https://github.com/openstack/tripleo-heat-templates/blob/master/container_config_scripts/pacemaker_resource_lock.sh

Comment 8 Damien Ciabrini 2021-06-15 06:45:14 UTC

(In reply to Ken Gaillot from comment #6)
> Are you calling cibadmin immediately after starting an upgraded node? If so,
> when/how/why?

This is an idiomatic way for us to control the stop/restart of pacemaker nodes while upgrades are being executed concurrently.
The context is that we during upgrade of our Openstack control plane, we want to avoid shutting down several pacemaker nodes at once to avoid some spurious node shutdown [1,2]. More specifically, while some pacemaker nodes are upgraded sequentially to avoid service disruption (e.g. database nodes), there can be different type of nodes being shut down at the same time (a database node and a networker node), and we want to avoid that otherwise we would hit [1] or [2].

So the idea is to add a sort of coordination between nodes:
  . If a node wants to stop the pacemaker cluster locally, it first needs to set attribute 'tripleo-shutdown-lock' in the CIB and then stop the cluster.
  . If the attribute is set in the CIB, no other node is allowed to stop a pacemaker cluster node. Node requiring a shutdown needs to wait until the attribute is removed from the CIB, and then try to set it back to signal its own shutdown.
  . Once a node has stopped pacemaker, and finished its upgrade, it restarts pacemaker locally and clear up the attribute from the CIB.  

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1791841
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1872404

Comment 9 Ken Gaillot 2021-06-15 14:35:40 UTC

OK, I understand what happened. The combination of restarting all the cluster nodes one by one, and running cibadmin immediately after they are restarted, triggers a CIB conflict.

Pacemaker has an algorithm to keep the CIB synchronized across all cluster nodes. For example, if a cluster is partitioned, the CIB is modified differently in each partition, and then the partition heals, Pacemaker has to pick one of the new CIBs to keep and trash the other. The algorithm gives a preference to changes from cibadmin since they are made by the user.

You're using cibadmin to remove the tripleo lock immediately after the node starts the cluster, which creates a timing issue. The cibadmin change can be made effective locally before the node has rejoined the cluster's pacemaker-controld layer. When the node does rejoin, its CIB now looks "better" (because it came from cibadmin) than the cluster's current CIB and overwrites it, losing some information in the process, including transient node attributes.

Normally that's not too much of a problem, because the current attribute writer (a role similar to DC, and usually the same node) will write out all transient attributes back to the CIB. However, because every cluster node has been restarted at this point, and the remote nodes have not been restarted, the attribute writer doesn't know that the remote nodes are remote nodes, and can't write out their attributes. This includes unfencing information, so the DC's scheduler now thinks unfencing needs to be redone, and the attributes can never be written, so it keeps looping.

I have a fix ready for testing, but because this is a timing issue, a reliable reproducer might be difficult. I'm thinking it should be possible to bring the last cluster node back up with its corosync ports blocked, run cibadmin manually on it, then unblock corosync, but I haven't tried that yet.

A workaround (and better solution, really) would be to wait until the node has fully rejoined the cluster before running cibadmin.

Comment 12 Damien Ciabrini 2021-06-15 15:02:29 UTC

(In reply to Ken Gaillot from comment #9)
> A workaround (and better solution, really) would be to wait until the node
> has fully rejoined the cluster before running cibadmin.

Thanks for the insight and the great analysis!
In parallel, I think we can work out a fix on our side to wait until the node has rejoined the cluster before trying to touch the CIB, so we'll test that out and report shortly.

Comment 13 Ken Gaillot 2021-06-22 16:40:51 UTC

Fixed upstream as of commit 540d7413

Comment 21 Ken Gaillot 2021-07-13 15:00:46 UTC

FYI, a workaround just occurred to me, though I haven't tested it:

When a remote connection starts or migrates, the node hosting the connection learns the node is remote. Remote connections are live-migratable (no resources on the remote node should be affected by a migration). So, if we live-migrate the remote connection to each cluster node in turn, they all should learn its remoteness.

That can be accomplished by setting a location preference, e.g.

    pcs constraint location <remote-resource> prefers <node1>
    # wait until connection is finished migrating
    pcs constraint location show --full
    # get the constraint ID from the output of above
    pcs constraint remove <constraint-id>
    # repeat for each cluster node

That could be done for just the current DC to fix the immediate problem, but it should be done on any cluster node that has restarted since the remote nodes were up, in case they later become DC.

Comment 22 Ken Gaillot 2021-07-13 15:06:11 UTC

(In reply to Ken Gaillot from comment #21)
>     pcs constraint location <remote-resource> prefers <node1>
>     # wait until connection is finished migrating
>     pcs constraint location show --full
>     # get the constraint ID from the output of above
>     pcs constraint remove <constraint-id>
>     # repeat for each cluster node

A more convenient way of doing the same thing would be:

    pcs resource move <remote-resource> <node1> --wait
    pcs resource clear <remote-resource> --wait
    # repeat for each cluster node

Comment 23 Julia Marciano 2021-08-04 16:00:16 UTC

Verified. 
Based on verification procedure of pacemaker-2.0.5-9.el8_4.2 on OSP16.2/RHEL8.4. Please see https://bugzilla.redhat.com/show_bug.cgi?id=1972273#c3

Comment 24 Julia Marciano 2021-08-04 16:02:35 UTC

Verified. 
Based on verification procedure of pacemaker-2.0.5-9.el8_4.2 on OSP16.2/RHEL8.4. Please see https://bugzilla.redhat.com/show_bug.cgi?id=1972273#c3

Comment 27 errata-xmlrpc 2021-11-09 18:44:54 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:4267

Note You need to log in before you can comment on or make changes to this bug.