Description of problem:

In general, a pacemaker resource can be stopped for various reasons: manually stopped, failed to start, failed during normal operation, or having nowhere to run due to constraint restrictions (the allowed node is down, etc.).

RFE: Pacemaker should track the reason why a resource is stopped, so that it can pass that information to pcs, which could then either display it in `pcs status` output or filter out resources stopped for a specific reason. Some customers would like the possibility to not display resources that are stopped only because constraints currently leave them nowhere to run. The main reason behind this request is to make `pcs status` output more readable when there are many resources (e.g. OpenStack clusters with many cloned resources).

I already discussed this with the pcs developers; apparently pcs currently has no way to figure out why a resource is stopped, so the information needs to come from pacemaker first. pcs could then, for example, implement a parameter to hide constrained resources from the output.

EXAMPLE:
========
For cloned cluster resources there are situations in which they cannot run on some of the nodes due to constraints configured in the cluster. However, the output of 'pcs status' shows them as 'Stopped' on those nodes, which makes the output unclear when there is a larger number of nodes. Using 'pcs status --hide-inactive' doesn't solve this problem, as it hides all stopped resources regardless of why they are stopped. We would like an option that hides the 'Stopped' nodes only if the current constraints are clearly prohibiting that node from running the resource.

== Environment:

Pacemaker cluster.

# pcs property list
...
Node Attributes:
 cnodep01: osp-controler=false
 cnodep02: osp-controler=false
 cnodep03: osp-controler=false
 cnodep04: osp-controler=false
 osconp01: osp-controler=true
 osconp02: osp-controler=true
 osconp03: osp-controler=true
...

# pcs constraint list
...
  Resource: nova-compute-clone
    Constraint: location-nova-compute-clone-1
      Rule: score=0
        Expression: #kind eq remote
    Constraint: location-nova-compute-clone (resource-discovery=exclusive)
      Rule: score=0
        Expression: osp-controler eq false
...

== Current state:

# pcs status
...
 Clone Set: nova-compute-clone [nova-compute]
     Started: [ cnodep01 cnodep02 cnodep03 cnodep04 ]
     Stopped: [ osconp01 osconp02 osconp03 ]
...

== Expected results:

## when all nodes are OK
## when some nodes that are controllers (osp-controler=true) are down/faulty
# pcs status --hide-stopped-constrained
...
 Clone Set: nova-compute-clone [nova-compute]
     Started: [ cnodep01 cnodep02 cnodep03 cnodep04 ]
...

## when some nodes that are not controllers (osp-controler=false) are down/faulty
## when some nodes, both controllers and non-controllers, are down/faulty
# pcs status --hide-stopped-constrained
...
 Clone Set: nova-compute-clone [nova-compute]
     Started: [ cnodep01 cnodep04 ]
     Stopped: [ cnodep02 ]
     FAILED: [ cnodep03 ]
...
============
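To make the requested behavior concrete, here is a minimal Python sketch of the filtering being asked for. Everything in it is hypothetical: the clone_state and banned_nodes data are copied by hand from the example above, because neither pcs nor pacemaker currently exposes a per-node "banned by constraint" flag; providing that information is the point of this RFE.

#!/usr/bin/env python3
# Hypothetical sketch of --hide-stopped-constrained filtering; the data
# below is taken from the example, not queried from a live cluster.

# Per-node state of the clone, as pcs status reports it today.
clone_state = {
    "cnodep01": "Started", "cnodep02": "Started",
    "cnodep03": "Started", "cnodep04": "Started",
    "osconp01": "Stopped", "osconp02": "Stopped", "osconp03": "Stopped",
}

# Nodes where constraints clearly prohibit the clone from running
# (the osp-controler=true nodes in the example).
banned_nodes = {"osconp01", "osconp02", "osconp03"}

def filtered_status(state, banned):
    """Group nodes by state, dropping 'Stopped' only where banned."""
    shown = {}
    for node, st in state.items():
        if st == "Stopped" and node in banned:
            continue  # hidden by the proposed --hide-stopped-constrained
        shown.setdefault(st, []).append(node)
    return shown

for st, nodes in sorted(filtered_status(clone_state, banned_nodes).items()):
    print("%s: [ %s ]" % (st, " ".join(sorted(nodes))))

Run against the example data, this prints only the Started line, matching the first expected-results block above.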
Will be investigated in the 7.5 timeframe.
Due to a short time frame and limited capacity, this will not make 7.5.
A bit of further detail: Pacemaker already has a way to figure out why resources are stopped in simple cases, and this information is already displayed in crm_mon (pcs status) and available to users and tools via crm_resource --why (or automatically displayed after a crm_resource --cleanup, etc.).

I see two areas that could be covered by this bz:

* enhancing the "why" detection to cover more cases
* adding an option to crm_mon to hide banned resources

I do not think it is feasible for pcs to parse "why" reasons to implement filters there, nor to have wholly encompassing "why" detection, because (much like AI inputs and outputs) there are often multiple, complex, interacting factors that go into a stop decision, which simply aren't translatable into any easily followed natural language.

Looking outside this bz to more general concerns, some other areas that could be addressed:

* Static analysis of user configuration (conceptually an expanded version of crm_verify): this would be intended to detect configuration issues that are likely the result of user mistakes. There is an upstream bug, CLBZ#5277, asking for detection of constraint loops, which would fit into this idea (see the sketch after this comment). A similar issue would be conflicting constraints that prevent a resource from starting. This would cover a significant part of the problem space this bz is intended to address.

* Greatly enhanced use of crm_mon interactive mode (and/or crm_mon HTML output, and/or higher-level GUI tools) to greatly reduce the amount of information displayed, with the ability to expand items on demand. For example, the initial display might show just a few lines (roughly one for each display section today); selecting a line would expand it into a list (with various modifiers available in a menu to filter the list), and selecting a list item would give detail on that item.
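As a rough illustration of the static-analysis idea (the constraint-loop case from CLBZ#5277 in the first bullet), a minimal Python sketch follows. It assumes the CIB can be read with "cibadmin --query" and only looks at simple rsc_order constraints with first/then attributes; the helper names order_edges and find_cycle are invented for the sketch and are not part of any pacemaker or pcs API.

#!/usr/bin/env python3
# Sketch only: detect a loop among ordering constraints, as an example of
# the kind of static configuration check described above (CLBZ#5277).
import subprocess
import xml.etree.ElementTree as ET

def order_edges(cib_xml):
    """Yield (first, then) pairs from simple rsc_order constraints."""
    for c in ET.fromstring(cib_xml).iter("rsc_order"):
        first, then = c.get("first"), c.get("then")
        if first and then:
            yield first, then

def find_cycle(edges):
    """Return one resource that is part of an ordering loop, or None."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
    visiting, done = set(), set()

    def dfs(node):
        visiting.add(node)
        for nxt in graph.get(node, ()):
            if nxt in visiting:
                return nxt          # back edge: nxt lies on a loop
            if nxt not in done:
                hit = dfs(nxt)
                if hit:
                    return hit
        visiting.discard(node)
        done.add(node)
        return None

    for start in list(graph):
        if start not in done:
            hit = dfs(start)
            if hit:
                return hit
    return None

if __name__ == "__main__":
    cib = subprocess.run(["cibadmin", "--query"], check=True,
                         capture_output=True, text=True).stdout
    loop = find_cycle(order_edges(cib))
    if loop:
        print("possible ordering-constraint loop involving", loop)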
(In reply to Ken Gaillot from comment #7)

> * enhancing the "why" detection to cover more cases

I definitely agree that in many cases there isn't going to be a simple, single, easy-to-describe reason. As far as I understand it, the current "why" implementation checks for 3 cases. When extending this, I'm afraid of eventually reimplementing the pengine (and the two implementations would probably not stay fully in sync on top of that). That is why I rather had the idea of using the pengine itself, e.g. having it record some additional information on a side track while calculating a transition (this could be conveyed with the transition, and possibly be switchable to reduce the impact on a non-interactive run).

> * adding an option to crm_mon to hide banned resources

Neglecting the initial text a little and just going for the example: a simple search for negative node weights in placement constraints might indeed be enough to cope with that example. This should be easy to do in crm_mon, with the effects directly visible in pcs.
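A minimal Python sketch of what such a search for negative node weights might look like, reading location constraints via "cibadmin --query --scope constraints". It only handles plain rsc_location constraints carrying rsc/node/score attributes; rule-based constraints like the ones in the example configuration would need the evaluated node weights from the pengine rather than the raw CIB, which is exactly the hard part discussed above. The helper name banned_pairs is invented for this illustration.

#!/usr/bin/env python3
# Rough sketch of searching the CIB for negative node weights.
import subprocess
import xml.etree.ElementTree as ET

def banned_pairs():
    """Yield (resource, node) pairs banned by simple location constraints."""
    out = subprocess.run(
        ["cibadmin", "--query", "--scope", "constraints"],
        check=True, capture_output=True, text=True).stdout
    for loc in ET.fromstring(out).iter("rsc_location"):
        rsc, node, score = loc.get("rsc"), loc.get("node"), loc.get("score", "")
        # A negative score (including -INFINITY) counts as a ban here.
        if rsc and node and score.startswith("-"):
            yield rsc, node

if __name__ == "__main__":
    for rsc, node in banned_pairs():
        print("%s is banned from %s" % (rsc, node))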
Looking more deeply at this, I think this BZ definitely needs to focus on the new crm_mon option in order to have any chance of getting into 7.7. Adding more cases to the general "why" detection would have to be a separate bz, if we want to pursue that (note that a completely general, human-friendly, machine-parseable "why" as originally requested will never be possible, due to the complexity of factors that can be involved in a decision).

For the interface, I'm thinking of expanding crm_mon's current --inactive argument to take an optional value, going from the current:

 -r, --inactive           Display inactive resources

to:

 -r, --inactive[=value]   Whether to show inactive resources
                          0 = do not show inactive (default if not specified)
                          1 = show inactive unless banned
                          2 = show all inactive (default if specified without value)

(Note that the default for crm_mon and pcs is swapped -- crm_mon does not show inactive resources by default, but pcs does. pcs currently calls crm_mon --inactive; it would use --inactive=1 for this BZ's purpose.)

Unfortunately, the implementation looks more complicated than initially expected. It partly goes back to the same complexity -- when a constraint is applied, pacemaker doesn't necessarily know whether the resource will end up active or not (a ban can be outweighed by some other consideration), and by the end of resource allocation, pacemaker knows whether a resource is active, but the "reason" is a cumulative score summed up from many considerations, not a list of individual considerations.
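To illustrate the intended 0/1/2 semantics, here is a small Python sketch of the three filtering levels. The Instance record and its "banned" flag are invented for the example (real crm_mon operates on pacemaker's internal resource objects in C), so this only shows how the three values of --inactive would be interpreted, not how they would be implemented.

#!/usr/bin/env python3
# Hypothetical sketch of the proposed 0/1/2 semantics for --inactive.
from dataclasses import dataclass

SHOW_NONE, SHOW_UNLESS_BANNED, SHOW_ALL = 0, 1, 2

@dataclass
class Instance:
    resource: str
    node: str
    active: bool
    banned: bool   # would have to be provided by pacemaker, per this bz

def visible(inst, inactive_level):
    """Decide whether to display a resource instance for a given level."""
    if inst.active:
        return True                   # active resources are always shown
    if inactive_level == SHOW_NONE:
        return False                  # current crm_mon default
    if inactive_level == SHOW_UNLESS_BANNED:
        return not inst.banned        # hide only constraint-banned instances
    return True                       # SHOW_ALL: current --inactive behavior

instances = [
    Instance("nova-compute", "cnodep01", True, False),
    Instance("nova-compute", "osconp01", False, True),
    Instance("nova-compute", "cnodep02", False, False),
]
for level in (SHOW_NONE, SHOW_UNLESS_BANNED, SHOW_ALL):
    shown = [i.node for i in instances if visible(i, level)]
    print("--inactive=%d shows %s" % (level, shown))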
Due to this project's unexpected complexity and time constraints, this will not make 7.7. I am moving this to RHEL 8 because RHEL 7 will not get new features after 7.7.
Filed an upstream bug that can be used for tracking if the stale rule kicks in again