Bug 2059638 - Allow resource meta-attribute to exempt resource from node health restrictions
Summary: Allow resource meta-attribute to exempt resource from node health restrictions
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pacemaker
Version: 8.6
Hardware: All
OS: All
Priority: high
Severity: low
Target Milestone: rc
Target Release: 8.7
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
Docs Contact: Steven J. Levine
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-03-01 15:28 UTC by Ken Gaillot
Modified: 2022-11-14 12:30 UTC
CC List: 4 users

Fixed In Version: pacemaker-2.1.3-1.el8
Doc Type: Enhancement
Doc Text:
.New `allow-unhealthy-nodes` Pacemaker resource meta-attribute
Pacemaker now supports the `allow-unhealthy-nodes` resource meta-attribute. When this meta-attribute is set to `true`, the resource is not forced off a node because of degraded node health. When the node health agent resources have this attribute set, the cluster can automatically detect when the node's health recovers and move resources back to it.
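The effect of the meta-attribute on placement can be illustrated with a simplified, hypothetical model. This is not Pacemaker's scheduler code. It assumes the `migrate-on-red` strategy, where any node health attribute with the value `red` gives the node a score of -INFINITY and `yellow` or `green` give 0. The attribute name `#health-cpu`, the `Resource` type, and the helper functions are illustrative only.

```python
from dataclasses import dataclass
from typing import Dict, List

NEG_INFINITY = float("-inf")


@dataclass
class Resource:
    name: str
    # Models the allow-unhealthy-nodes meta-attribute (default: false).
    allow_unhealthy_nodes: bool = False


def node_health_score(health_attrs: Dict[str, str]) -> float:
    """migrate-on-red (simplified): -INFINITY if any health attribute is red, else 0."""
    return NEG_INFINITY if "red" in health_attrs.values() else 0.0


def allowed_nodes(resource: Resource, nodes: Dict[str, Dict[str, str]]) -> List[str]:
    """Nodes the resource may run on, judged by node health alone."""
    return [
        node
        for node, attrs in nodes.items()
        # An exempt resource ignores the node health score entirely.
        if resource.allow_unhealthy_nodes or node_health_score(attrs) > NEG_INFINITY
    ]


# Example: virt-537 is red, virt-538 is green (attribute name is illustrative).
nodes = {"virt-537": {"#health-cpu": "red"}, "virt-538": {"#health-cpu": "green"}}
print(allowed_nodes(Resource("resource_dummy"), nodes))                           # ['virt-538']
print(allowed_nodes(Resource("resource_cpu", allow_unhealthy_nodes=True), nodes))  # ['virt-537', 'virt-538']
```

A real scheduler combines the health score with location constraints, stickiness, and other scores. The sketch isolates only the health component.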
Clone Of:
Environment:
Last Closed: 2022-11-08 09:42:25 UTC
Type: Feature Request
Target Upstream Version:




Links
Red Hat Issue Tracker RHELPLAN-114131 (last updated 2022-03-01 15:32:55 UTC)
Red Hat Knowledge Base (Solution) 6985056 (last updated 2022-11-14 12:30:34 UTC)
Red Hat Product Errata RHBA-2022:7573 (last updated 2022-11-08 09:42:42 UTC)

Description Ken Gaillot 2022-03-01 15:28:46 UTC
Description of problem: Pacemaker has a node health feature that allows certain OCF resource agents to be used to set node health attributes. If the node becomes unhealthy, and the appropriate health settings have been configured, Pacemaker will move all resources away from the node. The health agent itself will be moved away from the node, so the cluster will never learn when the node becomes healthy again, and the relevant node health attributes must be manually cleared to allow the node to be used again.


Version-Release number of selected component (if applicable): All


How reproducible: Trivially


Steps to Reproduce:
1. Create a cluster with at least two nodes, fencing, and a resource.
2. pcs property set node-health-strategy=migrate-on-red
3. Configure a health monitor using a cloned health agent (e.g. ocf:pacemaker:HealthCPU)
4. On one node, create the relevant condition to trigger the health monitor (e.g. run an infinite loop in a shell to max out the CPU)
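Any sustained CPU load can trigger step 4. A single shell loop usually keeps only one core busy. The minimal Python sketch below starts one spinning process per CPU for a fixed duration instead. `burn_cpu` is a hypothetical helper. It assumes a Linux host because it uses the `fork` start method. Whether the load actually crosses the HealthCPU agent's red threshold depends on that agent's configured limits, which this report does not specify.

```python
import multiprocessing as mp
import time
from typing import Optional


def _spin(seconds: float) -> None:
    # Busy-wait in this process until the duration has elapsed.
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        pass


def burn_cpu(seconds: float, workers: Optional[int] = None) -> int:
    """Hypothetical helper: keep `workers` CPUs (default: all) busy for `seconds`."""
    workers = workers or mp.cpu_count()
    # Linux-only assumption: "fork" avoids re-importing the main module in children.
    ctx = mp.get_context("fork")
    procs = [ctx.Process(target=_spin, args=(seconds,)) for _ in range(workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return workers
```

For example, `burn_cpu(300)` would hold every CPU on the node busy for five minutes and then let the load drop, which exercises both the RED transition and the recovery.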

Actual results: All resources, including the health monitor, are banned from the node. Once the condition ends, the resources do not move back.

Expected results: All resources except the health monitor are banned from the node. Once the condition ends, the health monitor detects it and resources are allowed to move back.
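A simplified, hypothetical simulation can show the feedback loop behind the actual and expected results. This is not Pacemaker code. The model assumes three things:

- There is one node and one health attribute.
- Only a running health agent can update that attribute.
- The `migrate-on-red` rule bans every resource not marked with `allow-unhealthy-nodes` from a red node.

Manually clearing the attribute is not modeled.

```python
from typing import List


def simulate(cpu_busy_steps: List[bool], agent_exempt: bool) -> List[str]:
    """Return the node's health attribute value after each step."""
    health = "green"      # node health attribute, e.g. one set by HealthCPU
    agent_running = True  # the health agent's clone instance on this node
    history = []
    for cpu_busy in cpu_busy_steps:
        if agent_running:
            # Only a running health agent can update the attribute.
            health = "red" if cpu_busy else "green"
        if health == "red" and not agent_exempt:
            # migrate-on-red bans every non-exempt resource, including the agent.
            agent_running = False
        history.append(health)
    return history


steps = [False, True, False]  # normal -> CPU overloaded -> load removed
print(simulate(steps, agent_exempt=False))  # ['green', 'red', 'red']   (actual results)
print(simulate(steps, agent_exempt=True))   # ['green', 'red', 'green'] (expected results)
```

Without the exemption, the agent is stopped at the same moment the node turns red. Nothing is left to report the recovery, so the attribute stays red.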

Comment 1 Ken Gaillot 2022-04-19 19:37:24 UTC
Feature merged upstream as of commit cc8ed479

Comment 7 jrehova 2022-07-11 12:44:03 UTC
* 2-node cluster
* dummy fence agent installed on one of the nodes as /usr/sbin/fence_bz1978010: https://github.com/ClusterLabs/fence-agents/blob/master/agents/dummy/fence_dummy.py
* node-health-strategy=migrate-on-red
* cloned health agent ocf:pacemaker:HealthCPU on both nodes

Status of cluster:

> [root@virt-537 ~]# pcs status
> Cluster name: STSRHTS21465
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-538 (version 2.1.4-3.el8-dc6eb4362e) - partition with quorum
>   * Last updated: Mon Jul 11 15:39:23 2022
>   * Last change:  Mon Jul 11 15:39:17 2022 by root via cibadmin on virt-537
>   * 2 nodes configured
>   * 5 resource instances configured
> 
> Node List:
>   * Online: [ virt-537 virt-538 ]
> 
> Full List of Resources:
>   * fence-virt-537	(stonith:fence_xvm):	 Started virt-537
>   * fence-virt-538	(stonith:fence_xvm):	 Started virt-538
>   * Clone Set: resource_cpu-clone [resource_cpu]:
>     * Started: [ virt-537 virt-538 ]
>   * resource_dummy	(ocf::pacemaker:Dummy):	 Started virt-537
> 
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled

Version of pacemaker:

> [root@virt-537 ~]# rpm -q pacemaker
> pacemaker-2.1.4-3.el8.x86_64

Setting the cluster's health strategy and updating the resource:

> [root@virt-537 ~]# pcs property set node-health-strategy="migrate-on-red"
> [root@virt-537 ~]# pcs resource update resource_cpu meta allow-unhealthy-nodes=true

Condition:

> [root@virt-537 ~]# while true; do echo -n "test_cpu"; done
> cputest_cputest_cputest_cputest_cputest_cpu... (repeated output truncated)

Cluster status after migration to node virt-538:

> [root@virt-537 ~]# pcs status
> Cluster name: STSRHTS21465
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-538 (version 2.1.4-3.el8-dc6eb4362e) - partition with quorum
>   * Last updated: Mon Jul 11 16:34:19 2022
>   * Last change:  Mon Jul 11 16:31:04 2022 by root via cibadmin on virt-537
>   * 2 nodes configured
>   * 5 resource instances configured
> 
> Node List:
>   * Node virt-537: online (health is RED)
>   * Online: [ virt-538 ]
> 
> Full List of Resources:
>   * fence-virt-537	(stonith:fence_xvm):	 Started virt-538
>   * fence-virt-538	(stonith:fence_xvm):	 Started virt-538
>   * Clone Set: resource_cpu-clone [resource_cpu]:
>     * Started: [ virt-538 ]
>     * Stopped: [ virt-537 ]
>   * resource_dummy	(ocf::pacemaker:Dummy):	 Started virt-538
> 
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
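A script can check this outcome instead of a person reading the screen. The minimal sketch below extracts the RED nodes and a clone's Started/Stopped lists from `pcs status` text. It assumes the output layout shown in this comment (pacemaker 2.1.4 with the pcs version used here). Other versions may format these lines differently. The regular expressions and function names are hypothetical.

```python
import re
from typing import Dict, List

RED_NODE_RE = re.compile(r"\* Node ([^\s:]+): online \(health is RED\)")
STATE_RE = re.compile(r"^\s*\* (Started|Stopped): \[ (.*?) \]\s*$")


def red_nodes(status: str) -> List[str]:
    """Nodes reported as 'online (health is RED)'."""
    return RED_NODE_RE.findall(status)


def clone_states(status: str, clone_id: str) -> Dict[str, List[str]]:
    """Map 'Started'/'Stopped' to node names for one clone set."""
    lines = status.splitlines()
    states: Dict[str, List[str]] = {}
    for i, line in enumerate(lines):
        if f"Clone Set: {clone_id} " in line:
            # Read the indented Started/Stopped lines directly under the header.
            for follow in lines[i + 1:]:
                match = STATE_RE.match(follow)
                if not match:
                    break
                states[match.group(1)] = match.group(2).split()
            break
    return states


# Excerpt of the status output from this comment, with the "> " prefixes removed.
SAMPLE = """\
Node List:
  * Node virt-537: online (health is RED)
  * Online: [ virt-538 ]

Full List of Resources:
  * Clone Set: resource_cpu-clone [resource_cpu]:
    * Started: [ virt-538 ]
    * Stopped: [ virt-537 ]
  * resource_dummy\t(ocf::pacemaker:Dummy):\t Started virt-538
"""

print(red_nodes(SAMPLE))                           # ['virt-537']
print(clone_states(SAMPLE, "resource_cpu-clone"))  # {'Started': ['virt-538'], 'Stopped': ['virt-537']}
```

Running the same check before and after the load is removed shows whether the health agent stayed active on the unhealthy node.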

Comment 11 errata-xmlrpc 2022-11-08 09:42:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:7573

