Bug 2059638

Summary: Allow resource meta-attribute to exempt resource from node health restrictions
Product: Red Hat Enterprise Linux 8
Reporter: Ken Gaillot <kgaillot>
Component: pacemaker
Assignee: Ken Gaillot <kgaillot>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: low
Docs Contact: Steven J. Levine <slevine>
Priority: high
Version: 8.6
CC: cluster-maint, jrehova, msmazova, pzimek
Target Milestone: rc
Keywords: FutureFeature, Triaged
Target Release: 8.7
Hardware: All
OS: All
Fixed In Version: pacemaker-2.1.3-1.el8
Doc Type: Enhancement
Doc Text:
.New `allow-unhealthy-nodes` Pacemaker resource meta-attribute
Pacemaker now supports the `allow-unhealthy-nodes` resource meta-attribute. When this meta-attribute is set to `true`, the resource is not forced off a node because of degraded node health. When health resources have this attribute set, the cluster can automatically detect whether the node's health recovers and move resources back to it.
Last Closed: 2022-11-08 09:42:25 UTC
Type: Feature Request

Description Ken Gaillot 2022-03-01 15:28:46 UTC
Description of problem: Pacemaker's node health feature allows certain OCF resource agents to set node health attributes. If a node becomes unhealthy and the appropriate health settings are configured, Pacemaker moves all resources away from that node. However, the health agent itself is also moved away, so the cluster never learns when the node becomes healthy again, and the relevant node health attributes must be cleared manually before the node can be used again.
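A minimal sketch of that manual clean-up, assuming the ocf:pacemaker:HealthCPU agent and its transient '#health-cpu' attribute (the node name is a placeholder, not taken from this report):

crm_attribute --node <nodename> --name '#health-cpu' --lifetime reboot --query    # show the current health value
crm_attribute --node <nodename> --name '#health-cpu' --lifetime reboot --delete   # clear it so the node can host resources again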


Version-Release number of selected component (if applicable): All


How reproducible: Trivially


Steps to Reproduce:
1. Create a cluster with at least two nodes, fencing, and a resource.
2. pcs property set node-health-strategy=migrate-on-red
3. Configure a health monitor using a cloned health agent (e.g. ocf:pacemaker:HealthCPU)
4. On one node, create the relevant condition to trigger the health monitor (e.g. run an infinite loop in a shell to max out the CPU)
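
A minimal sketch of steps 2-4 as shell commands; the clone name health-cpu and the monitor interval are placeholders rather than values from this report:

pcs property set node-health-strategy=migrate-on-red
pcs resource create health-cpu ocf:pacemaker:HealthCPU op monitor interval=10s clone
# on the node under test, keep a core busy until the health agent reports red:
while :; do :; done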

Actual results: All resources, including the health monitor, are banned from the node; once the condition ends, the resources do not move back.

Expected results: All resources except the health monitor are banned from the node; once the condition ends, the health monitor detects the recovery and resources are allowed to move back.

Comment 1 Ken Gaillot 2022-04-19 19:37:24 UTC
Feature merged upstream as of commit cc8ed479

Comment 7 jrehova 2022-07-11 12:44:03 UTC
* 2-node cluster
* a dummy fence agent (https://github.com/ClusterLabs/fence-agents/blob/master/agents/dummy/fence_dummy.py) installed on one of the nodes as /usr/sbin/fence_bz1978010
* node-health-strategy=migrate-on-red
* cloned health agent ocf:pacemaker:HealthCPU on both nodes

Status of cluster:

> [root@virt-537 ~]# pcs status
> Cluster name: STSRHTS21465
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-538 (version 2.1.4-3.el8-dc6eb4362e) - partition with quorum
>   * Last updated: Mon Jul 11 15:39:23 2022
>   * Last change:  Mon Jul 11 15:39:17 2022 by root via cibadmin on virt-537
>   * 2 nodes configured
>   * 5 resource instances configured
> 
> Node List:
>   * Online: [ virt-537 virt-538 ]
> 
> Full List of Resources:
>   * fence-virt-537	(stonith:fence_xvm):	 Started virt-537
>   * fence-virt-538	(stonith:fence_xvm):	 Started virt-538
>   * Clone Set: resource_cpu-clone [resource_cpu]:
>     * Started: [ virt-537 virt-538 ]
>   * resource_dummy	(ocf::pacemaker:Dummy):	 Started virt-537
> 
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled

Version of pacemaker:

> [root@virt-537 ~]# rpm -q pacemaker
> pacemaker-2.1.4-3.el8.x86_64

Setting the cluster health strategy and updating the resource:

> [root@virt-537 ~]# pcs property set node-health-strategy="migrate-on-red"
> [root@virt-537 ~]# pcs resource update resource_cpu meta allow-unhealthy-nodes=true
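
One way to confirm that the meta-attribute was applied (generic commands, not output captured for this bug):

pcs resource config resource_cpu
crm_resource --resource resource_cpu --meta --get-parameter allow-unhealthy-nodes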

Triggering the condition (maxing out the CPU on virt-537):

> [root@virt-537 ~]# while true; do echo -n "test_cpu"; done
> cputest_cputest_cputest_cputest_cputest_cputest_cputest_cputest_cputest_cputest_cputest_cpu [output repeats until interrupted]
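
While the loop runs, the HealthCPU agent should eventually set the node's health attribute to red; one way to watch that (the '#health-cpu' attribute name is assumed from the HealthCPU agent, and its output is not part of this comment):

attrd_updater --name '#health-cpu' --query --node virt-537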

Cluster status, showing migration to node virt-538:

> [root@virt-537 ~]# pcs status
> Cluster name: STSRHTS21465
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-538 (version 2.1.4-3.el8-dc6eb4362e) - partition with quorum
>   * Last updated: Mon Jul 11 16:34:19 2022
>   * Last change:  Mon Jul 11 16:31:04 2022 by root via cibadmin on virt-537
>   * 2 nodes configured
>   * 5 resource instances configured
> 
> Node List:
>   * Node virt-537: online (health is RED)
>   * Online: [ virt-538 ]
> 
> Full List of Resources:
>   * fence-virt-537	(stonith:fence_xvm):	 Started virt-538
>   * fence-virt-538	(stonith:fence_xvm):	 Started virt-538
>   * Clone Set: resource_cpu-clone [resource_cpu]:
>     * Started: [ virt-538 ]
>     * Stopped: [ virt-537 ]
>   * resource_dummy	(ocf::pacemaker:Dummy):	 Started virt-538
> 
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
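
The remaining check would be to stop the CPU loop and confirm that the node's health returns to green and resources are allowed back onto virt-537; a sketch of those commands (the '#health-cpu' attribute name is assumed from the HealthCPU agent):

attrd_updater --name '#health-cpu' --query --node virt-537
pcs status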

Comment 11 errata-xmlrpc 2022-11-08 09:42:25 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:7573