Bug 2062358 - [RFE] Additional configurable failure recovery options for pacemaker managed resources
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pacemaker
Version: 8.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Ken Gaillot
QA Contact: cluster-qe
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-03-09 15:56 UTC by Shane Bradley
Modified: 2023-08-10 15:41 UTC

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Feature Request
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Issue Tracker | RHELPLAN-115047 | 0 | None | None | None | 2022-03-09 16:01:38 UTC
Red Hat Knowledge Base (Solution) | 6804701 | 0 | None | None | None | 2022-03-09 16:12:11 UTC

Description Shane Bradley 2022-03-09 15:56:53 UTC
Description of problem:
Requesting additional configurable failure recovery options for pacemaker managed resources.

For example, a customer requested:
  "RFE to add something like a retry and/or retry_attempts option for pacemaker 
   resource monitor operations."


Version-Release number of selected component (if applicable):
Latest 8.5 pacemaker

How reproducible:
Does not apply

Steps to Reproduce:
Does not apply

Actual results:
Currently, a monitor failure of a resource causes pacemaker to take the action configured by the operation's "on-fail" option (restart, ignore, fence, etc.).

Expected results:
Provide more options to pacemaker to handle monitor resource failures such as "retry X times before considering the resource monitor a failure". 

Additional info:

We spoke with engineering about this issue, and they pointed to other Bugzilla reports that are related to this RFE:

  - 1747559 – Allow operation failure timeouts to be configured per operation in Pacemaker 
    https://bugzilla.redhat.com/show_bug.cgi?id=1747559

  - 1328448 – RFE: start-failure-is-fatal as per-resource parameter instead of global property 
    https://bugzilla.redhat.com/show_bug.cgi?id=1328448

Comment 2 Ken Gaillot 2022-03-09 17:53:09 UTC
FYI, the currently proposed design is to replace the "on-fail", "migration-threshold" and "start-failure-is-fatal" options with new operation-specific options.

Each operation would take the new options "max-fail-ignore", "max-fail-restart", and "fail-escalation".

The first "max-fail-ignore" failures on a node would be reported but ignored. This would cover the "retry" concept in the Description.

If further failures occurred, the next "max-fail-restart" failures would be handled by attempting to restart the resource. As is the case currently, it is not guaranteed that the resource would be restarted on the same node -- the node would be determined by the usual means, and could be a different node if configurations or conditions have changed since the resource was initially placed. As an example, if the only reason the resource remained on its current node was due to stickiness, the stop will clear the stickiness, and the resource could be started on another node.

If the resource did stay on the same node, and another failure occurred, the handling specified by "fail-escalation" would be taken. This option would accept the current "on-fail" values other than "restart", plus a new "ban" value to force the resource off the node.
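
As a rough illustration only, the three stages described above (ignore, then restart, then escalate) can be modeled as a small decision function. This is a sketch of one literal reading of the proposal, not Pacemaker code: the function name `action_for_failure`, the 1-based failure count, and the use of `math.inf` as a stand-in for INFINITY are all assumptions made for the example, and the design itself may still change.

```python
import math

# Assumption: math.inf stands in for Pacemaker's INFINITY in this sketch.
INFINITY = math.inf


def action_for_failure(nth_failure, max_fail_ignore=0,
                       max_fail_restart=INFINITY, fail_escalation="ban"):
    """Return the handling for the nth failure (1-based) of one operation
    on one node, following a literal reading of the proposed options.

    Hypothetical helper: it only models the counting, not node placement.
    A restart may still land the resource on a different node.
    """
    if nth_failure < 1:
        raise ValueError("nth_failure is 1-based")
    if fail_escalation == "restart":
        # The proposal excludes "restart" as an escalation value.
        raise ValueError('"restart" is not a valid fail-escalation value')

    # Stage 1: the first max-fail-ignore failures are reported but ignored.
    if nth_failure <= max_fail_ignore:
        return "ignore"
    # Stage 2: the next max-fail-restart failures trigger a restart attempt.
    if nth_failure <= max_fail_ignore + max_fail_restart:
        return "restart"
    # Stage 3: any further failure gets the escalation handling.
    return fail_escalation
```

For instance, under this reading `action_for_failure(3, max_fail_ignore=2)` returns `"restart"`, and `action_for_failure(5, max_fail_restart=4, fail_escalation="stop")` returns `"stop"`.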

The defaults would be chosen to keep the current default behavior: "max-fail-ignore" would default to 0, "max-fail-restart" would default to 0 for stop and start and INFINITY for other operations, and "fail-escalation" would default to block or fence for stop and ban for other operations.
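
To make the per-operation defaults above easier to scan, here is a minimal sketch that returns them as a dictionary. The function name `proposed_defaults` and the `fencing_enabled` flag are hypothetical. The flag is an assumption about how "block or fence" would be decided for stop, based on how the current "on-fail" default for stop depends on whether fencing is enabled.

```python
import math

# Assumption: math.inf stands in for Pacemaker's INFINITY in this sketch.
INFINITY = math.inf


def proposed_defaults(operation, fencing_enabled=True):
    """Return the proposed default options for an operation name such as
    "start", "stop", or "monitor". Hypothetical helper for illustration."""
    options = {
        # No failures are ignored by default.
        "max-fail-ignore": 0,
        # Start and stop failures escalate immediately; other operations
        # keep restarting the resource.
        "max-fail-restart": 0 if operation in ("start", "stop") else INFINITY,
    }
    if operation == "stop":
        # Assumption: fence when fencing is enabled, otherwise block.
        options["fail-escalation"] = "fence" if fencing_enabled else "block"
    else:
        options["fail-escalation"] = "ban"
    return options
```

With these values, a failed start would go straight to "ban", which appears to correspond to today's start-failure-is-fatal=true behavior, while a failed monitor would keep triggering restarts.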

Examples of how current options would translate to the new ones:

    on-fail=ignore -> max-fail-ignore=INFINITY
    migration-threshold=3 -> max-fail-restart=3
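
The two translations above could be expressed as a small mapping helper, sketched below. Only the mappings stated in this comment are implemented. The name `translate_options` is hypothetical, and any option without a stated translation is returned separately rather than guessed at.

```python
import math

# Assumption: math.inf stands in for Pacemaker's INFINITY in this sketch.
INFINITY = math.inf


def translate_options(current):
    """Translate current option strings to the proposed ones.

    Returns (translated, untranslated). Hypothetical helper: only the two
    mappings given in the comment are covered.
    """
    translated, untranslated = {}, {}
    for name, value in current.items():
        if name == "on-fail" and value == "ignore":
            # on-fail=ignore -> max-fail-ignore=INFINITY
            translated["max-fail-ignore"] = INFINITY
        elif name == "migration-threshold":
            # migration-threshold=N -> max-fail-restart=N
            translated["max-fail-restart"] = (
                INFINITY if value == "INFINITY" else int(value)
            )
        else:
            # No translation was stated for this option; leave it alone.
            untranslated[name] = value
    return translated, untranslated
```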

An example of attempting 4 restarts then leaving the resource stopped would be:

    max-fail-restart=4 fail-escalation=stop

An example of ignoring the first 2 failures then trying restart would be:

    max-fail-ignore=2 max-fail-restart=INFINITY

This is very early in the design stage, so any of the above could change before the final version.

