Bug 1787749 - Pending self-fence actions remain after hard reset of the DC node [RHEL 7]
Summary: Pending self-fence actions remain after hard reset of the DC node [RHEL 7]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.7
Hardware: All
OS: Linux
Priority: high
Severity: medium
Target Milestone: rc
Target Release: 7.9
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1787751 1870663
 
Reported: 2020-01-04 23:42 UTC by Reid Wahl
Modified: 2024-06-13 22:21 UTC
CC List: 7 users

Fixed In Version: pacemaker-1.1.23-1.el7
Doc Type: Bug Fix
Doc Text:
Cause: If the cluster node acting as DC scheduled its own fencing, it would track two copies of the fencing action, one of its own creation, and another from the executing node's resulting fence device query, but the fencing result would only complete one of them. Consequence: If a DC fencing action failed, it would get "stuck" in status displays as a pending action and could not be cleaned up. Fix: Pacemaker identifies the two operations as the same and tracks just one copy. Result: Pending fencing actions for the DC no longer appear as pending in status once they complete, even if they fail.
Clone Of:
: 1787751 (view as bug list)
Environment:
Last Closed: 2020-09-29 20:03:57 UTC
Target Upstream Version:
Embargoed:




Links
System                              ID              Last Updated
Cluster Labs                        5401            2020-01-04 23:42:24 UTC
Red Hat Knowledge Base (Solution)   4713471         2020-01-05 21:55:28 UTC
Red Hat Product Errata              RHEA-2020:3951  2020-09-29 20:04:18 UTC

Description Reid Wahl 2020-01-04 23:42:25 UTC
Description of problem:

If the DC node is hard-reset while a self-fencing action is still in pending state, the pending action cannot be cleared from the display after the node boots back up and rejoins the cluster. This is problematic because if the node has been rebooted, fence actions that were pending prior to the reboot should be cleared. This seems to be only a display issue that does not impact behavior.

There may be other ways to reproduce the issue of pending fencing actions lingering post-reboot besides this scenario.

-----

Version-Release number of selected component (if applicable):

pacemaker-1.1.20-5.el7_7.1

-----

How reproducible:

Always

-----

Steps to Reproduce:
1. One approach: Force a resource to fail to stop on the DC node, triggering a self-fencing action.
2. Hard-reset the node while the fence action is still in pending state. (Artificially adding latency to the fence script may help.)
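
One possible command sequence (a sketch only, not taken verbatim from any comment: it triggers self-fencing via a failed monitor with on-fail=fence, as in comment 13 below, rather than a failed stop; $DC and the resource name are placeholders, and the state-file path follows the pattern used there):

   pcs resource create repro-dummy ocf:pacemaker:Dummy op monitor interval=10s on-fail=fence
   pcs constraint location repro-dummy prefers $DC
   rm /run/Dummy-repro-dummy.state    # on $DC: make the monitor fail, scheduling fencing of the DC
   echo b > /proc/sysrq-trigger       # on $DC: hard-reset while the fencing is still pending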

-----

Actual results:

After reboot (no cleanup actions help):

Pending Fencing Actions:
* reboot of fastvm-rhel-7-6-21 pending: client=crmd.1604, origin=fastvm-rhel-7-6-21
* reboot of fastvm-rhel-7-6-21 pending: client=crmd.16708, origin=fastvm-rhel-7-6-22
* reboot of fastvm-rhel-7-6-21 pending: client=stonith-api.18068, origin=fastvm-rhel-7-6-22

-----

Expected results:

No pending fencing actions against the rebooted node after reboot.

-----

Additional info:

See also upstream Bug 5401 - [Problem] Pending Fencing Actions accumulate. (https://bugs.clusterlabs.org/show_bug.cgi?id=5401)

Comment 2 Reid Wahl 2020-01-05 00:19:13 UTC
The pending action list in the example above shows two pending actions from the other node, so this might not require a self-fencing action to trigger the behavior. Not sure what all conditions can trigger it.

Comment 3 Klaus Wenninger 2020-02-21 12:48:48 UTC
This isn't just cosmetic:
There is no way to manually clear pending actions - with good reason, though, as clearing pending actions in general influences behaviour and not just history logging.
Given that history recording is limited to 500 entries, pending actions piling up over time can lead to successes/failures not being recorded at all anymore.

Comment 4 Patrik Hagara 2020-03-23 11:54:55 UTC
qa_ack+, reproducer in description

Comment 6 Patrik Hagara 2020-05-06 09:29:53 UTC
re-qacking, as fix is expected to be merged upstream soon

Comment 7 Ken Gaillot 2020-05-14 15:09:08 UTC
Fixed upstream by commit df71a07 in the 2.0 branch (which will be in RHEL 8.3 via Bug 1787751), backported to the 1.1 branch as commit cae1b8d (which will be in RHEL 7.9)
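
To inspect those fixes, the commits can be viewed in a clone of the upstream repository (the GitHub URL is assumed to be the public ClusterLabs repository; the abbreviated hashes are the ones quoted above):

   git clone https://github.com/ClusterLabs/pacemaker.git
   cd pacemaker
   git show df71a07    # fix on the 2.0 branch
   git show cae1b8d    # backport on the 1.1 branch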

Comment 11 Patrik Hagara 2020-07-08 15:10:35 UTC
Unable to reproduce the issue, either on 7.7 with pacemaker-1.1.20-5.el7 or on 7.8 with pacemaker-1.1.21-4.el7. The pending fence operation always gets automatically resolved by another (successful) fence attempt. I've tried every reproducer variation I could think of (sysrq-halting the DC node, killing stonithd, letting the fence op time out, multi-level fencing, ...), always with the same result.

@Ken: Please let me know if you think of some other way to reproduce this issue. I couldn't see anything obvious in the customer cases.

For now, I've verified that the fix is included in 7.9's pacemaker-1.1.23-1.el7 package and have also tried (unsuccessfully) to reproduce the issue on this fixed version.

Marking SanityOnly verified.

Comment 12 Ken Gaillot 2020-07-16 17:09:58 UTC
@Patrik: I believe this will reproduce it:

* Copy /usr/share/pacemaker/tests/cts/fence_dummy to /usr/sbin on all nodes
* Configure a device using fence_dummy with mode=fail and delay=10 (it will always fail), with pcmk_host_list limited to a single target node
* Configure the target node to have a fencing topology consisting of a single level with the dummy device as the first device and the real fencing device as the second
* Ensure the target node becomes DC (e.g. by rebooting all other nodes)
* Cause the target node to require fencing while still alive (e.g. by configuring a dummy resource with on-fail="fence" for the monitor, constrained to that node, and causing the monitor to fail)
* Wait to see the pending fencing in pcs status
* Hard-reset the node
* Reconfigure the dummy fence device with mode=pass, and the node should be fenced successfully after that

Comment 13 Ken Gaillot 2020-09-22 17:28:34 UTC
The reproducer in Comment 12 is incorrect. I can't remember how to reproduce an increasing list of "stuck" pending actions, but I am now able to reproduce a single "stuck" pending action.

The key elements of the situation are:
* There is a fencing topology for the DC node
* Fencing is required for the DC (while the node remains DC, which means it has not left the cluster)
* While the fencing is pending, the DC node leaves the cluster (but does *not* reboot or restart cluster services)
* While the DC node is out of the cluster, the pending fencing fails
* The DC node rejoins the cluster

The reproducer I used was:

1. Configure a cluster of at least two nodes with a real fencing device named "fencing-real".
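   (Illustration only: the real device depends on your environment. For virtual test nodes with fence_virtd already configured, something like the following could serve, with node and guest names as placeholders.)
   pcs stonith create fencing-real fence_xvm pcmk_host_map="node1:guest1;node2:guest2"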

2. Copy /usr/share/pacemaker/tests/cts/fence_dummy to /usr/sbin on all nodes.
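   (For example, on each node; the chmod is an assumption, in case the execute bit is not preserved by the copy.)
   cp /usr/share/pacemaker/tests/cts/fence_dummy /usr/sbin/fence_dummy
   chmod +x /usr/sbin/fence_dummy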

3. Using the current DC's node name instead of $DC:
   pcs stonith create fencing-dummy fence_dummy mode=fail delay=10 pcmk_host_list=$DC
   pcs stonith level add 1 $DC fencing-dummy fencing-real
   pcs resource create resource-dummy ocf:pacemaker:Dummy op monitor interval=10s on-fail=fence
   pcs constraint location resource-dummy prefers $DC

4. Have "crm_mon" running on some node other than $DC, and wait until all resources are started.

5. On $DC, make the dummy resource fail, which will trigger fencing:
   rm /run/Dummy-resource-dummy.state

6. Watch crm_mon and wait until it shows that the resource is failed and fencing of the DC is pending. While the fencing is still pending, block corosync on $DC to make the node leave the cluster:
   firewall-cmd --direct --add-rule ipv4 filter OUTPUT 2 -p udp --dport=5405 -j DROP

8. Watch crm_mon and wait until it shows that the DC fencing failed, and a new fencing operation is pending.

9. On $DC, unblock corosync so the node rejoins the cluster:
   firewall-cmd --direct --remove-rule ipv4 filter OUTPUT 2 -p udp --dport=5405 -j DROP

10. The old DC will be re-elected DC. Cluster status will look different on $DC and on any other node. $DC will reschedule fencing of itself but it will continue to fail at this point.

11. Change the dummy fencing device to succeed:
    pcs stonith update fencing-dummy mode=pass

12. Watch crm_mon and $DC, and wait (potentially a long time) until $DC reboots and rejoins the cluster (depending on how you configured everything, you may have to manually power the node back on after it is fenced and then start cluster services). With the old code, the fencing will still be listed as pending even though it successfully completed (and even if you run "pcs stonith history cleanup" to erase the history); with the new code, no pending fencing will be shown.
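
A quick way to check for the stale entry once $DC has rejoined (using "pcs stonith history", which is consistent with the "pcs stonith history cleanup" call mentioned in step 12):

    pcs status
    pcs stonith history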

Comment 15 errata-xmlrpc 2020-09-29 20:03:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:3951

