2065818 – crm_resource --why should indicate when a resource is stopped due to a node's health

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 8
Classification:	Red Hat
Component:	pacemaker
Sub Component:
Version:	8.6
Hardware:	All
OS:	All
Priority:	medium
Severity:	low
Target Milestone:	rc
Target Release:	8.7
Assignee:	Ken Gaillot
QA Contact:	cluster-qe@redhat.com
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-03-18 19:46 UTC by Ken Gaillot
Modified:	2022-11-15 14:56 UTC
CC List:	4 users
Fixed In Version:	pacemaker-2.1.4-4.el8
Doc Type:	Enhancement
Doc Text:	Feature: The crm_resource command's --why option, and pcs resource cleanup, now indicate when a resource remains stopped on a node because the node's health is degraded. Reason: If a user runs "pcs resource cleanup" for a resource and the resource is still not running, it can be unclear what to look at next. The command already indicated a few common conditions, such as the resource being disabled, but gave no indication when the resource remained stopped because the node's health was degraded. Result: It is now easier to tell what to investigate if a resource is stopped due to a node's health being degraded.
Clone Of:
Environment:
Last Closed:	2022-11-08 09:42:30 UTC
Type:	Feature Request
Target Upstream Version:
Embargoed:
Dependent Products:

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	RHELPLAN-116156	None	None	None	2022-03-18 19:51:57 UTC
Red Hat Knowledge Base (Solution)	6985482	None	None	None	2022-11-15 14:56:52 UTC
Red Hat Product Errata	RHBA-2022:7573	None	None	None	2022-11-08 09:42:40 UTC

Description Ken Gaillot 2022-03-18 19:46:14 UTC

Description of problem: Pacemaker's crm_resource command has a --why option that gives a human-friendly reason why a particular resource is stopped. It currently handles a limited number of conditions, including target-role and whether the resource is managed or shutdown-locked. If a node is given on the command line (the node is optional), the command should also check whether that node is currently banned due to its health status. If no node is given, the command could check whether all of the resource's allowed nodes are banned, but that might be overkill and nontrivial to implement.
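
As a rough illustration of the kind of check being requested (this is not crm_resource's actual implementation), the sketch below maps a node-health-strategy value and a single health attribute value to whether the node would be excluded. The function name node_excluded_by_health is hypothetical. The strategy semantics are assumed from the Pacemaker documentation: migrate-on-red excludes red nodes, only-green excludes red and yellow nodes, and progressive and custom depend on configurable scores. A real check would also have to combine all of a node's #health attributes.

```shell
# node_excluded_by_health STRATEGY VALUE
# Hypothetical helper, for illustration only.
# Prints "yes" if a health attribute VALUE (red, yellow or green) would keep
# resources off the node under STRATEGY, "no" if it would not, and "depends"
# for strategies whose scores are configurable.
node_excluded_by_health() {
    case "$1:$2" in
        migrate-on-red:red)                echo yes ;;
        only-green:red|only-green:yellow)  echo yes ;;
        progressive:*|custom:*)            echo depends ;;
        *)                                 echo no ;;
    esac
}
```

For example, node_excluded_by_health migrate-on-red red prints "yes", which is the case the reproduction steps in this report exercise.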

Steps to Reproduce:
1. Configure and start a cluster.
2. Enable node health monitoring, for example: pcs property set node-health-strategy="migrate-on-red"
3. Configure a resource and ban it from all nodes except one.
4. Simulate a node health monitor detecting a degraded condition by setting a node attribute appropriately for the node running the test resource, for example (replacing node name as appropriate): pcs node attribute $NODE '#health-test=red'
5. On any node, run (replacing resource and node names as appropriate): crm_resource --why --resource $RESOURCE --node $NODE
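
Steps 2 through 5 can be collected into a small script. This is a minimal sketch that assumes the cluster from step 1 is already running and the resource already exists. The variable names, the default values, the run helper, and the DRY_RUN switch are all made up for illustration. Step 3 is expressed here with "pcs constraint location ... avoids", which is one possible way to ban the resource from the other nodes. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch of reproduction steps 2-5. All names below are placeholders.
NODE="${NODE:-node1}"               # node whose health is degraded
OTHER_NODES="${OTHER_NODES:-node2}" # space-separated nodes the resource must avoid
RESOURCE="${RESOURCE:-dummy}"       # resource under test
DRY_RUN="${DRY_RUN:-1}"             # 1 = print commands only, 0 = run them

# Print the command in dry-run mode, otherwise execute it.
# Note: the printed form loses shell quoting (e.g. around '#health-test=red').
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

reproduce() {
    # Step 2: enable node health handling.
    run pcs property set node-health-strategy=migrate-on-red
    # Step 3: keep the resource off every node except $NODE.
    for other in $OTHER_NODES; do
        run pcs constraint location "$RESOURCE" avoids "$other"
    done
    # Step 4: simulate a health monitor reporting a degraded node.
    run pcs node attribute "$NODE" '#health-test=red'
    # Step 5: ask why the resource is not running on that node.
    run crm_resource --why --resource "$RESOURCE" --node "$NODE"
}

reproduce
```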

Actual results: A message saying only that "Resource dummy is not running on host $NODE"

Expected results: An additional line explaining why, something like "$NODE health score is $N (red) and node-health-strategy=migrate-on-red"

Additional info: pcs resource cleanup/refresh $RESOURCE uses the same code to show a reason if the resource remains stopped after the cleanup, so its output should be updated too.

Comment 2 Ken Gaillot 2022-07-07 15:08:09 UTC

Fixed in upstream main branch as of commit 6630e55

Comment 6 jrehova 2022-08-04 12:19:25 UTC

* 2-node cluster

Version of pacemaker:

> [root@virt-008 ~]# rpm -q pacemaker
> pacemaker-2.1.4-4.el8.x86_64

Enabling node health monitoring:

> [root@virt-008 ~]# pcs property set node-health-strategy="migrate-on-red"
> [root@virt-008 ~]# pcs property
> Cluster Properties:
>  cluster-infrastructure: corosync
>  cluster-name: STSRHTS26647
>  dc-version: 2.1.4-4.el8-dc6eb4362e
>  have-watchdog: false
>  node-health-strategy: migrate-on-red

Create a resource:

> [root@virt-008 ~]# pcs resource create resource_dummy ocf:pacemaker:Dummy
> [root@virt-008 ~]# pcs status
> Cluster name: STSRHTS26647
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: virt-008 (version 2.1.4-4.el8-dc6eb4362e) - partition with quorum
>   * Last updated: Wed Aug  3 16:58:06 2022
>   * Last change:  Wed Aug  3 16:58:00 2022 by root via cibadmin on virt-008
>   * 2 nodes configured
>   * 3 resource instances configured
> 
> Node List:
>   * Online: [ virt-008 virt-009 ]
> 
> Full List of Resources:
>   * fence-virt-008  (stonith:fence_xvm):     Started virt-008
>   * fence-virt-009  (stonith:fence_xvm):     Started virt-009
>   * resource_dummy  (ocf::pacemaker:Dummy):  Started virt-008
> 
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled

Simulating a node health degraded condition:

> [root@virt-008 ~]# pcs node attribute virt-008 '#health-test=red'

Result message:

> [root@virt-008 ~]# crm_resource --why --resource resource_dummy --node virt-008
> Resource resource_dummy is not running on host virt-008
> 'resource_dummy' cannot run on unhealthy nodes due to node-health-strategy='migrate-on-red'

Comment 8 errata-xmlrpc 2022-11-08 09:42:30 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:7573