2059638 – Allow resource meta-attribute to exempt resource from node health restrictions

Bug 2059638 - Allow resource meta-attribute to exempt resource from node health restrictions

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 8
Classification:	Red Hat
Component:	pacemaker
Sub Component:
Version:	8.6
Hardware:	All
OS:	All
Priority:	high
Severity:	low
Target Milestone:	rc
Target Release:	8.7
Assignee:	Ken Gaillot
QA Contact:	cluster-qe@redhat.com
Docs Contact:	Steven J. Levine
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-03-01 15:28 UTC by Ken Gaillot
Modified:	2022-11-14 12:30 UTC
CC List:	4 users
Fixed In Version:	pacemaker-2.1.3-1.el8
Doc Type:	Enhancement
Doc Text:	.New `allow-unhealthy-nodes` Pacemaker resource meta-attribute
Pacemaker now supports the `allow-unhealthy-nodes` resource meta-attribute. When this meta-attribute is set to `true`, the resource is not forced off a node due to degraded node health. When health resources have this attribute set, the cluster can automatically detect if the node's health recovers and move resources back to it.
Clone Of:
Environment:
Last Closed:	2022-11-08 09:42:25 UTC
Type:	Feature Request
Target Upstream Version:
Embargoed:
Dependent Products:

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	RHELPLAN-114131	None	None	None	2022-03-01 15:32:55 UTC
Red Hat Knowledge Base (Solution)	6985056	None	None	None	2022-11-14 12:30:34 UTC
Red Hat Product Errata	RHBA-2022:7573	None	None	None	2022-11-08 09:42:42 UTC

Description Ken Gaillot 2022-03-01 15:28:46 UTC

Description of problem: Pacemaker's node health feature lets certain OCF resource agents set node health attributes. If a node becomes unhealthy and the appropriate health settings have been configured, Pacemaker moves all resources away from the node. Because the health agent itself is also moved away, the cluster never learns when the node becomes healthy again, and the relevant node health attributes must be cleared manually before the node can be used again.
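
For illustration, the manual clean-up described above might look like the following sketch. It assumes the health agent stores its state in a transient node attribute, taken here to be `#health-cpu` for ocf:pacemaker:HealthCPU; that name is an assumption and should be checked on the cluster before use. The helper names `run` and `clear_node_health` and the `DRY_RUN` switch are hypothetical and exist only for this example.

```shell
#!/bin/sh
# Sketch: manually clear a node health attribute so the node can be used again.
# Assumptions (not from the report): the attribute name "#health-cpu" and the
# helper names run / clear_node_health. DRY_RUN=1 prints instead of executing.

run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

clear_node_health() {
    node=$1
    attr=${2:-"#health-cpu"}   # assumed default attribute name
    # attrd_updater manages transient node attributes.
    run attrd_updater --name "$attr" --delete --node "$node"
}

# Example (on a cluster node): clear_node_health virt-537
```

This is the step that the requested enhancement is meant to make unnecessary.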


Version-Release number of selected component (if applicable): All


How reproducible: Trivially


Steps to Reproduce:
1. Create a cluster with at least two nodes, fencing, and a resource.
2. pcs property set node-health-strategy=migrate-on-red
3. Configure a health monitor using a cloned health agent (e.g. ocf:pacemaker:HealthCPU)
4. On one node, create the relevant condition to trigger the health monitor (e.g. run an infinite loop in a shell to max out the CPU)

Actual results: All resources, including the health monitor, are banned from the node. Once the condition ends, the resources do not move back.

Expected results: All resources except the health monitor are banned from the node. Once the condition ends, the health monitor detects it, and resources are allowed to move back.
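
Steps 2 to 4 of the reproducer can be sketched as follows. The resource name `resource_cpu` and the helpers `run` and `configure_health_monitor` are illustrative, and `DRY_RUN=1` only prints the commands. The agent is created with its default parameters, which may need tuning for a given machine.

```shell
#!/bin/sh
# Sketch of reproducer steps 2-4. Helper names and DRY_RUN are hypothetical.

run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

configure_health_monitor() {
    # Step 2: move resources away from any node whose health is red.
    run pcs property set node-health-strategy=migrate-on-red
    # Step 3: run the CPU health agent on every node as a clone.
    run pcs resource create resource_cpu ocf:pacemaker:HealthCPU clone
}

# Step 4 (run by hand on one node, stop with Ctrl-C):
#   while true; do :; done
# On a multi-CPU node, one such loop per CPU may be needed to trigger the agent.
```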

Comment 1 Ken Gaillot 2022-04-19 19:37:24 UTC

Feature merged upstream as of commit cc8ed479

Comment 7 jrehova 2022-07-11 12:44:03 UTC

* 2-node cluster
* dummy fence agent installed on one of the nodes as /usr/sbin/fence_bz1978010: https://github.com/ClusterLabs/fence-agents/blob/master/agents/dummy/fence_dummy.py
* node-health-strategy=migrate-on-red
* cloned health agent ocf:pacemaker:HealthCPU on both nodes

Status of cluster:

> [root@virt-537 ~]# pcs status
> Cluster name: STSRHTS21465
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-538 (version 2.1.4-3.el8-dc6eb4362e) - partition with quorum
>   * Last updated: Mon Jul 11 15:39:23 2022
>   * Last change:  Mon Jul 11 15:39:17 2022 by root via cibadmin on virt-537
>   * 2 nodes configured
>   * 5 resource instances configured
> 
> Node List:
>   * Online: [ virt-537 virt-538 ]
> 
> Full List of Resources:
>   * fence-virt-537	(stonith:fence_xvm):	 Started virt-537
>   * fence-virt-538	(stonith:fence_xvm):	 Started virt-538
>   * Clone Set: resource_cpu-clone [resource_cpu]:
>     * Started: [ virt-537 virt-538 ]
>   * resource_dummy	(ocf::pacemaker:Dummy):	 Started virt-537
> 
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled

Version of pacemaker:

> [root@virt-537 ~]# rpm -q pacemaker
> pacemaker-2.1.4-3.el8.x86_64

Setting the cluster health strategy and updating the resource:

> [root@virt-537 ~]# pcs property set node-health-strategy="migrate-on-red"
> [root@virt-537 ~]# pcs resource update resource_cpu meta allow-unhealthy-nodes=true
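
As a hedged sketch, the meta-attribute can be set and then read back to confirm that it was stored. The helpers `run` and `exempt_from_node_health` and the `DRY_RUN` switch are hypothetical; the `pcs` command is the one used above, and `crm_resource` is used only to query the value.

```shell
#!/bin/sh
# Sketch: set allow-unhealthy-nodes on a resource and read the value back.
# Helper names and DRY_RUN are hypothetical, for illustration only.

run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

exempt_from_node_health() {
    rsc=$1
    # Exempt the resource from node health restrictions.
    run pcs resource update "$rsc" meta allow-unhealthy-nodes=true
    # Query the stored meta-attribute value.
    run crm_resource --resource "$rsc" --meta --get-parameter allow-unhealthy-nodes
}

# Example (on a cluster node): exempt_from_node_health resource_cpu
```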

Condition:

> [root@virt-537 ~]# while true; do echo -n "test_cpu"; done
> cputest_cputest_cputest_cputest_cpu [repeated output truncated]

Status of cluster and migration to node virt-538:

> [root@virt-537 ~]# pcs status
> Cluster name: STSRHTS21465
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-538 (version 2.1.4-3.el8-dc6eb4362e) - partition with quorum
>   * Last updated: Mon Jul 11 16:34:19 2022
>   * Last change:  Mon Jul 11 16:31:04 2022 by root via cibadmin on virt-537
>   * 2 nodes configured
>   * 5 resource instances configured
> 
> Node List:
>   * Node virt-537: online (health is RED)
>   * Online: [ virt-538 ]
> 
> Full List of Resources:
>   * fence-virt-537	(stonith:fence_xvm):	 Started virt-538
>   * fence-virt-538	(stonith:fence_xvm):	 Started virt-538
>   * Clone Set: resource_cpu-clone [resource_cpu]:
>     * Started: [ virt-538 ]
>     * Stopped: [ virt-537 ]
>   * resource_dummy	(ocf::pacemaker:Dummy):	 Started virt-538
> 
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
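
When repeating this check, the nodes reported as unhealthy can be pulled out of the status text with a small filter. This is a sketch that assumes the `Node <name>: online (health is RED)` line format shown in the output above; other Pacemaker versions may print it differently. The function name `red_nodes` is hypothetical.

```shell
#!/bin/sh
# Sketch: list nodes that `pcs status` reports with red health.
# Assumes the "Node <name>: online (health is RED)" format seen in this report.

red_nodes() {
    # Reads status text on stdin and prints one node name per line.
    sed -n 's/^.*\* Node \([^:]*\): .*(health is RED).*$/\1/p'
}

# Example (on a cluster node): pcs status | red_nodes
```

The filter prints nothing when no node is marked red, so it can be polled until the node recovers.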

Comment 11 errata-xmlrpc 2022-11-08 09:42:25 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:7573
