Bug 2123326

Summary: 'pcs status' should show why resource is not running
Product: Red Hat Enterprise Linux 9 Reporter: michal novacek <mnovacek>
Component: pcs Assignee: Tomas Jelinek <tojeline>
Status: NEW --- QA Contact: cluster-qe <cluster-qe>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 9.0 CC: cluster-maint, idevat, mlisik, mpospisi, omular, phagara, tojeline
Target Milestone: rc Flags: tojeline: needinfo? (mnovacek)
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug

Description michal novacek 2022-09-01 12:14:20 UTC
Description of problem:

Currently, the only reason the cluster gives for a resource being in the stopped state is that it is disabled. No other reason is shown when a resource is stopped.

This makes managing complex clusters with several tens of resources and complex ordering constraints very difficult and VERY user-unfriendly when some of the resources are stopped although they should be running.

Making this easy to understand from 'pcs status' output would greatly improve users' understanding of the cluster and decrease the frustration level, resulting in more love for HA.


Actual results:

When a resource is not running (shown as Stopped), the user must either manually display failcounts to see whether the resource is stopped because of the number of failures or, if that is not the cause, grep the log files to find the real reason it stopped.


Expected results:

Show in 'pcs status' (with an option to turn it off) why a resource is not running. For now, it would be a great help to show at least:

  * "failcount" when a start of the resource was attempted but failed
  * "blocked" when the resource is not running due to an ordering constraint on another resource
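
To illustrate the request, here is a minimal sketch of how such a reason label could be chosen and appended to a status line. Everything in it is hypothetical: the function names, the input fields and the output format are assumptions made for illustration, not actual pcs or pacemaker code or output.

```python
from typing import Optional


def stopped_reason(disabled: bool, fail_count: int, blocked_by: Optional[str]) -> str:
    """Pick a short reason label for a stopped resource (hypothetical labels)."""
    if disabled:
        # The one reason that is already reported today.
        return "disabled"
    if fail_count > 0:
        # A start was attempted but failed.
        return f"failcount={fail_count}"
    if blocked_by is not None:
        # An ordering constraint on another resource prevents the start.
        return f"blocked by {blocked_by}"
    return "unknown"


def status_line(resource_id: str, reason: str) -> str:
    """Format one line of output (illustrative format, not the real pcs layout)."""
    return f"{resource_id}: Stopped ({reason})"


if __name__ == "__main__":
    print(status_line("vip", stopped_reason(False, 3, None)))
    print(status_line("web", stopped_reason(False, 0, "vip")))
```

The order of the checks is a design choice in this sketch: a disabled resource is reported as such even if it also has failures recorded.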

Comment 1 Patrik Hagara 2022-09-01 12:51:58 UTC
FWIW, there is a `crm_resource --why` pacemaker command which can detect and report the reason(s) why a resource is not running, in some very specific cases (target role == stopped, unmanaged, locked, stopped due to node health attribute [1]). This was added in bz#1298581.

AFAIK, it is not currently integrated into pcs' status output, but it sure would be nice!

[1] https://github.com/ClusterLabs/pacemaker/blob/344b73b73d1097464744ac75f471c77276df0394/tools/crm_resource_runtime.c#L972-L983
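
For reference, a rough sketch of how a tool could call `crm_resource --why` and pass on its answer. The helper names (`build_why_command`, `why_not_running`) and the injectable `runner` parameter are assumptions made for illustration; this is not how pcs is implemented. The `--resource` and `--node` filters are the standard `crm_resource` options.

```python
import subprocess
from typing import Callable, List, Optional


def build_why_command(resource_id: Optional[str] = None,
                      node: Optional[str] = None) -> List[str]:
    """Build the crm_resource command line, optionally filtered by resource and node."""
    cmd = ["crm_resource", "--why"]
    if resource_id:
        cmd += ["--resource", resource_id]
    if node:
        cmd += ["--node", node]
    return cmd


def why_not_running(resource_id: Optional[str] = None,
                    node: Optional[str] = None,
                    runner: Callable = subprocess.run) -> str:
    """Run crm_resource --why and return its text output (stdout, else stderr)."""
    result = runner(build_why_command(resource_id, node),
                    capture_output=True, text=True, check=False)
    return (result.stdout or result.stderr).strip()


if __name__ == "__main__":
    # "my-resource" is a placeholder; this needs a node with pacemaker installed.
    print(why_not_running("my-resource"))
```

The `runner` argument exists only so the command can be exercised without a running cluster.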

Comment 2 Tomas Jelinek 2022-09-05 13:34:15 UTC
If you really want this info to be present in 'pcs status' output, then it needs to be done in pacemaker.
If you are fine with adding a pcs command to expose the `crm_resource --why` functionality, then that can be done in pcs.
Let us know which variant you are asking for.

Thanks.