Bug 2062358
| Summary: | [RFE] Additional configurable failure recovery options for pacemaker managed resources | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Shane Bradley <sbradley> |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | NEW --- | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 8.6 | CC: | cluster-maint, nwahl, sbradley |
| Target Milestone: | rc | Keywords: | FutureFeature, Triaged |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Enhancement | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | Type: | Feature Request | |
FYI, the currently proposed design is to replace the "on-fail", "migration-threshold" and "start-failure-is-fatal" options with new operation-specific options.
Each operation would take the new options "max-fail-ignore", "max-fail-restart", and "fail-escalation".
The first "max-fail-ignore" failures on a node would be reported but ignored. This would cover the "retry" concept in the Description.
If further failures occurred, the next "max-fail-restart" failures would be handled by attempting to restart the resource. As is the case currently, it is not guaranteed that the resource would be restarted on the same node -- the node would be determined by the usual means, and could be a different node if configurations or conditions have changed since the resource was initially placed. As an example, if the only reason the resource remained on its current node was due to stickiness, the stop will clear the stickiness, and the resource could be started on another node.
If the resource did stay on the same node, and another failure occurred, the handling specified by "fail-escalation" would be taken. This option would accept the current "on-fail" values except "restart", and would add "ban" to force the resource off the node.
The defaults would be chosen to keep the current default behavior: "max-fail-ignore" would default to 0, "max-fail-restart" would default to 0 for stop and start and INFINITY for other operations, and "fail-escalation" would default to block or fence for stop and ban for other operations.
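To make the ordering of the three options concrete, here is a minimal Python sketch of the decision as described above. It is not Pacemaker code: the function names, the use of math.inf for INFINITY, and the rule "fence if fencing is enabled, otherwise block" for stop are assumptions made for illustration, and the design itself may change.

```python
import math

# Stand-in for Pacemaker's INFINITY (assumption of this sketch).
INFINITY = math.inf


def proposed_defaults(operation, fencing_enabled=True):
    """Return the proposed defaults for an operation, as listed above.

    The choice between "fence" and "block" for stop is assumed to depend
    on whether fencing is enabled; the proposal only says "block or fence".
    """
    if operation == "stop":
        escalation = "fence" if fencing_enabled else "block"
        return {"max_fail_ignore": 0, "max_fail_restart": 0,
                "fail_escalation": escalation}
    if operation == "start":
        return {"max_fail_ignore": 0, "max_fail_restart": 0,
                "fail_escalation": "ban"}
    # All other operations (for example, monitor)
    return {"max_fail_ignore": 0, "max_fail_restart": INFINITY,
            "fail_escalation": "ban"}


def response_to_failure(failure_number, max_fail_ignore=0,
                        max_fail_restart=INFINITY, fail_escalation="ban"):
    """Return the handling for the Nth failure (1-based) on a node."""
    if failure_number <= max_fail_ignore:
        return "ignore"       # reported but otherwise ignored
    if failure_number <= max_fail_ignore + max_fail_restart:
        return "restart"      # restart; may land on another node
    return fail_escalation    # restarts exhausted


if __name__ == "__main__":
    for op in ("monitor", "start", "stop"):
        print(op, response_to_failure(1, **proposed_defaults(op)))
```

With the proposed defaults, a first monitor failure maps to a restart, while a first start failure goes straight to the escalation (ban).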
Examples of how current options would translate to the new ones:
on-fail=ignore -> max-fail-ignore=INFINITY
migration-threshold=3 -> max-fail-restart=3
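As an illustration only, the two translations above could be expressed as a small Python helper. The function name and the dictionary format are hypothetical; only the two mappings given in this comment are implemented, and anything else is returned untranslated rather than guessed.

```python
def translate_legacy_options(legacy):
    """Map current option values to the proposed ones.

    Only the two translations stated above are covered. Returns a tuple
    (new_options, untranslated) so nothing is silently dropped.
    """
    new_options = {}
    untranslated = {}
    for name, value in legacy.items():
        if name == "on-fail" and value == "ignore":
            new_options["max-fail-ignore"] = "INFINITY"
        elif name == "migration-threshold":
            new_options["max-fail-restart"] = str(value)
        else:
            # No mapping given in the proposal text; leave it to the caller.
            untranslated[name] = value
    return new_options, untranslated


if __name__ == "__main__":
    print(translate_legacy_options({"migration-threshold": 3}))
    print(translate_legacy_options({"on-fail": "ignore"}))
```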
An example of attempting 4 restarts then leaving the resource stopped would be:
max-fail-restart=4 fail-escalation=stop
An example of ignoring the first 2 failures then trying restart would be:
max-fail-ignore=2 max-fail-restart=INFINITY
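The two example configurations can be traced failure by failure with a short Python sketch. This assumes repeated failures on the same node and reads the proposal as: ignore first, then restart, then escalate. The function name is hypothetical, and math.inf stands in for INFINITY.

```python
import math


def failure_timeline(failures, max_fail_ignore=0,
                     max_fail_restart=math.inf, fail_escalation="ban"):
    """List the handling for each of the first `failures` failures on a node."""
    timeline = []
    for n in range(1, failures + 1):
        if n <= max_fail_ignore:
            timeline.append("ignore")
        elif n <= max_fail_ignore + max_fail_restart:
            timeline.append("restart")
        else:
            timeline.append(fail_escalation)
    return timeline


if __name__ == "__main__":
    # max-fail-restart=4 fail-escalation=stop
    print(failure_timeline(5, max_fail_restart=4, fail_escalation="stop"))
    # max-fail-ignore=2 max-fail-restart=INFINITY
    print(failure_timeline(4, max_fail_ignore=2, max_fail_restart=math.inf))
```

The first call yields four restarts followed by a stop; the second yields two ignored failures followed by restarts.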
This is very early in the design stage, so any of the above could change before the final version.
Description of problem:
Requesting additional configurable failure recovery options for pacemaker managed resources. For example, a customer requested: "RFE to add something like a retry and/or retry_attempts option for pacemaker resource monitor operations."

Version-Release number of selected component (if applicable):
Latest 8.5 pacemaker

How reproducible:
Does not apply

Steps to Reproduce:
Does not apply

Actual results:
Currently, a monitor failure of a resource results in pacemaker performing the action given by the "on-fail" value (restart, ignore, fence, etc).

Expected results:
Provide more options for pacemaker to handle resource monitor failures, such as "retry X times before considering the resource monitor a failure".

Additional info:
We spoke with engineering about this issue and they stated that some other bugzillas are related to this RFE:
- 1747559 – Allow operation failure timeouts to be configured per operation in Pacemaker https://bugzilla.redhat.com/show_bug.cgi?id=1747559
- 1328448 – RFE: start-failure-is-fatal as per-resource parameter instead of global property https://bugzilla.redhat.com/show_bug.cgi?id=1328448