Bug 1721603 - RFE: new pacemaker constraint for fencing actions that depend on regular resources being active
Summary: RFE: new pacemaker constraint for fencing actions that depend on regular resources being active
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pacemaker
Version: 8.0
Hardware: All
OS: All
Priority: low
Severity: low
Target Milestone: pre-dev-freeze
Target Release: ---
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-06-18 16:52 UTC by Ken Gaillot
Modified: 2021-02-01 14:55 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-01 07:41:34 UTC
Type: Feature Request
Target Upstream Version:
Embargoed:


Attachments: none


Links
System: Cluster Labs   ID: 5466   Last Updated: 2021-02-01 14:54:54 UTC

Description Ken Gaillot 2019-06-18 16:52:23 UTC
Description of problem: It is possible for a particular fence device to be usable for fencing only if a normal (non-fencing) cluster resource is active. (This is acceptable only if the fence device is configured such that it cannot target any nodes allowed to run the normal resource.)

OpenStack is an example where this can occur: if the controller nodes are pacemaker cluster nodes, and the compute nodes are pacemaker remote nodes, then unfencing the remote nodes with fence_compute requires access to the keystone authentication IP running as a pacemaker resource on a controller node.

Currently, this can result in a deadlock if both a dependent target node and the node running the normal resource must be fenced in the same cluster transition, the dependent resource is not functional, and the cluster serializes the normal resource's node fencing after the fencing of the dependent target. That serialization can happen if the concurrent-fencing cluster property is false, or if the resource's node is the DC, which implies it is scheduling itself for fencing for some reason other than complete node loss (such as a failed resource stop).

The proposed solution is a new constraint syntax that would specify the fencing device and the normal resource it depends on.

The most straightforward syntax would be to use pacemaker's current "rsc_order" constraint, with "first" set to the normal resource, "then" set to the fencing resource, and "then-action" set to the new value "fence". (Alternatively "then-action" could be set to "on", "off", or "reboot", but it seems more likely all fence actions would have the same requirement.)
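
For illustration only, such a constraint might look roughly like this in CIB XML. Note that "then-action=fence" is the new value being proposed here and does not exist in current pacemaker, and the resource IDs are made-up examples:

    <!-- Illustrative sketch of the proposed syntax: "then-action=fence" is not
         implemented; "keystone-ip" (normal resource) and "fence-compute" (fence
         device) are example IDs. -->
    <rsc_order id="order-keystone-before-fence-compute"
               first="keystone-ip"   first-action="start"
               then="fence-compute"  then-action="fence"/>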

Steps to Reproduce:
1. Configure a cluster such that 2 nodes can be fenced at the same time (e.g. 5 cluster nodes, or 2 cluster nodes plus a remote node).

2. Configure real fencing for all nodes.

3. Configure a normal resource, constrained to a single node (-INFINITY constraints on all other nodes); see the configuration sketch after these steps.

4. Configure a fence device that fails if the normal resource is not active. (Dummy agents can be modified for this purpose, or fence_compute can be configured with the keystone IP as the normal resource.) The fence device should be configured in a topology with the real fencing to target some particular node other than the one that runs the normal resource.

5. Set concurrent-fencing=false for ease of testing.

6. Cause the normal resource to become nonfunctional, and cause both the node running the normal resource and the node targeted by the fence device to require fencing (e.g. kill power on both simultaneously).
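
For reference, a rough CIB sketch of the configuration in steps 3 and 5; the node and resource names are examples only:

    <!-- Step 3: keep the normal resource off every node except the one allowed to
         run it; "keystone-ip", "node2", and "node3" are example names. -->
    <rsc_location id="loc-keystone-ip-not-node2" rsc="keystone-ip" node="node2" score="-INFINITY"/>
    <rsc_location id="loc-keystone-ip-not-node3" rsc="keystone-ip" node="node3" score="-INFINITY"/>

    <!-- Step 5: disable concurrent fencing via the cluster options in crm_config. -->
    <cluster_property_set id="cib-bootstrap-options">
      <nvpair id="cib-bootstrap-options-concurrent-fencing"
              name="concurrent-fencing" value="false"/>
    </cluster_property_set>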

Actual results: If the cluster serializes the normal resource's node fencing last, the cluster will get stuck in a loop with the dependent fencing action repeatedly failing, and be unable to recover.

Expected results: The cluster eventually recovers properly.

Comment 1 Ken Gaillot 2019-06-18 17:04:35 UTC
A possible implementation would be for the scheduler to attach to fence actions a list of fence devices that do not have all their constraints satisfied. The controller would pass this along to the fencer with the fence request, and the fencer would treat such devices as disabled. (This is likely a better approach than having the fencer load the CIB and run a simulation to determine this on its own, due to the performance overhead and potential issues in mixed-version clusters.)

Comment 4 RHEL Program Management 2021-02-01 07:41:34 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Comment 5 Ken Gaillot 2021-02-01 14:55:39 UTC
This issue has been reported upstream for exposure to a wider audience.

