Bug 1569615
Summary: Difference in behavior between the latest pacemaker package > 1.1.18-11.el7 compared to the previous version 1.1.16-12.el7_4.8

| Product: | Red Hat Enterprise Linux 7 | Reporter: | Jamie Reding <jamiere> |
|---|---|---|---|
| Component: | doc-High_Availability_Add-On_Reference | Assignee: | Steven J. Levine <slevine> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | medium | Priority: | unspecified |
| Version: | 7.4 | Target Milestone: | rc |
| Target Release: | --- | Keywords: | Documentation |
| Hardware: | Unspecified | OS: | Linux |
| Doc Type: | If docs needed, set a value | Type: | Bug |
| CC: | abeekhof, amit.banerjee, arsing, cluster-maint, dkinkead, kgaillot, rhel-docs | Last Closed: | 2018-08-27 20:58:56 UTC |
Description (Jamie Reding, 2018-04-19 15:10:08 UTC)
The new behavior is actually correct. Pacemaker always schedules promote actions after *all* pending start actions for the clone have completed (regardless of whether they are on the same node or not).

The 7.4 behavior arises from Bug 1519379, in which Pacemaker mistakenly schedules a stop action instead of full recovery when a demote action fails. (The test agent used here fails the demote as well as the master monitor in this incident.) Because there is no start action immediately scheduled on the original master, the promote is allowed to proceed.

Interestingly, the fix for Bug 1519379 is planned for 7.6 and is not in the 7.5 packages, so 7.5 is coincidentally arriving at the correct behavior for other reasons. In 7.5, it does initially schedule a stop instead of full recovery -- but after the stop action completes and before the promote action can be initiated, the transition is interrupted by the result of clone notifications from the previous transition. Thus, a new transition is calculated which does schedule a start and a promote, and the promote is unable to proceed due to the failed start.

While the observed behavior is not a bug, I am keeping this bz open for two unresolved questions:

* Why are late clone notifications interrupting the transition in 7.5 and not 7.4?
* The promote-after-all-starts behavior is intentional because it is the safest course of action. However, the situation is very similar to the purpose of the "interleave" clone meta-attribute, which controls whether another clone that depends on this clone must wait for all instances of the original clone to be up, or can start after just the local instance on the same node is up. It may make sense to make interleave apply to this situation as well, allowing promotion to occur after the local instance is started, regardless of pending starts of other instances.

To give a bit more information on the sequence of events here:

In both 7.4 and 7.5, when the master monitor fails, these actions are scheduled:

* demote test1-master on the original master
* promote test1-master on another node

The bug is that the original master is scheduled for a demotion and a full stop, rather than demote-stop-start. Since there is no start, the promote is allowed to proceed in parallel with the demote.

In both versions, the demote fails before the promote can be attempted. At this point, due to the bug, both versions schedule:

* stop test1-master on the original master
* promote test1-master on another node

In 7.4, the stop succeeds, and the promote succeeds. test1-master will not be started (in slave mode) on the original master until the next natural transition (due to some unrelated event or the cluster recheck interval).

In 7.5, the stop succeeds, but before the promote can be initiated, late notification results abort the transition. Since the demote failure has been handled, the bug no longer applies, and the correct sequence is now scheduled:

* start test1-master on the original master
* promote test1-master on another node

Since the start never succeeds, the promote never happens. This is the intended behavior, as Pacemaker knows nothing about how the cloned service functions and cannot assume that promotion is safe while another node is starting in slave mode. (As mentioned before, the interleave option is probably a good way of allowing the user to say that it is safe, to avoid situations like this.)
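For context, a minimal sketch of how a clone's interleave meta-attribute is set with pcs today, using the test1-master name from the example above. Note that, per this discussion, interleave currently governs ordering between clones that depend on each other; applying it to promote-after-start ordering within a single clone is only a proposal here.

```sh
# Set the interleave meta-attribute on the existing master/slave clone.
# Today this only affects ordering relative to other clones that depend
# on this one; it does not (yet) change promotion ordering within the clone.
pcs resource meta test1-master interleave=true

# Inspect the clone's configuration, including its meta-attributes
pcs resource show test1-master
```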
(In reply to Ken Gaillot from comment #2)
> * Why are late clone notifications interrupting the transition in 7.5 and
> not 7.4?

This was a tricky one: a long-existing bug looks like a regression because a new feature triggers it.

As of 7.5, the fix for Bug 1461976 switched the default for the record-pending property to true. This allows status displays to show actions that are in progress, rather than just their last known state.

The newly found bug treats the record indicating a pending notify action as indicating its successful completion. Thus, when the notify action actually completes, it is treated as a "late" result that interrupts the transition.

I will file a separate bz for that issue, and this bz can focus on the question of whether interleave should affect whether promotion waits for starts on other nodes.

(In reply to Ken Gaillot from comment #4)
> The newly found bug treats the record indicating a pending notify action as
> indicating its successful completion. Thus, when the notify action actually
> completes, it is treated as a "late" result that interrupts the transition.
>
> I will file a separate bz for that issue

Created Bug 1570130

After discussing the issue with Andrew Beekhof, we will continue to order promotes after all starts. A start action is likely to change master scores, requiring a new calculation of which node will be master. Waiting for all pending starts to complete avoids unnecessarily moving the master repeatedly, and avoids resource agents having to carry intelligence for handling simultaneous promotion on one node and starting on another node, which could have race conditions.

The behavior that arose here is a known risk of setting start-failure-is-fatal to false. When that is the case, it allows one faulty node (unable to start a resource) to hold up all dependent actions. That is why start-failure-is-fatal defaults to true. The risk of start-failure-is-fatal=false can be mitigated by setting a low migration-threshold so that other actions can proceed after that many failures. (Currently, migration-threshold is per resource, so unfortunately you can't set one threshold for start failures and another for monitor failures, but that capability is planned as the feature Bug 1371576.)

The separate Bug 1570130 that created the timing issue that exposed this behavior has been fixed and is planned for the next z-stream batch as Bug 1570618. Be aware that this will restore the 7.4 behavior, but that is the result of Bug 1519379, which will be fixed in 7.6, which will bring us back to the behavior described here.

I'm reassigning this bz as a documentation issue.

Docs: Where we document the behavior of start-failure-is-fatal, we should discuss the risk described in this comment.
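As a reference sketch of the configuration under discussion: start-failure-is-fatal=false as the customer sets it, combined with the migration-threshold mitigation described above. The resource name test1 and the threshold value are illustrative, not taken from the actual cluster configuration.

```sh
# Cluster-wide property; defaults to true, shown here set to false as in
# the customer configuration being discussed
pcs property set start-failure-is-fatal=false

# Per-resource mitigation: move the resource off a node after this many
# failures (currently one threshold covers start and monitor failures alike)
pcs resource meta test1 migration-threshold=3
```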
As I had said earlier, we intentionally ask customers to set start-failure-is-fatal to false. The resource may have failed to start now, but it might also succeed in a few minutes. We don't want start failures to prevent the replica from ever starting up on that node again until manual intervention. This is especially important since many customers run with the bare minimum number of cluster nodes (three), so they can only tolerate one replica being permanently removed in this fashion. So the recommendation to set start-failure-is-fatal to true does not work for our customers. Similarly, migration-threshold=N also permanently blocks the resource after N start failures, so it does not work for our customers.

The only alternative I can see is that we change our resource agent to make `start` a no-op success and *actually* start the resource in `monitor`, or tell our customers to run a cron job with `pcs resource cleanup` every minute, or some other similarly hacky workaround.

I don't disagree that promotes should be ordered after starts, since starts can change the promotion score, but if a start has failed I don't see why Pacemaker needs to block the promote on it. So while the behavior of 7.4 and earlier was the result of a bug, it was also one that made more intuitive sense.

Alternatively, perhaps you can add a new exit code for the start action that opts in to the "I failed, but I promise I'm fine with you promoting someone else" behavior.

(In reply to Arnav Singh from comment #7)
> As I had said earlier, we intentionally ask customers to set
> start-failure-is-fatal to false. The resource may have failed to start now,
> but it might also succeed in a few minutes. We don't want start failures to
> prevent the replica from ever starting up on that node again until manual
> intervention.

One does not necessarily imply the other. We encourage start-failure-is-fatal=true _in_combination_with_ setting a failure timeout. This allows the cluster to attempt restarting the resource after some suitable period of time (during which the cluster is able to make progress promoting or demoting other copies). We use this combination extensively in other deployments.

Thanks. I need to confirm this on the older versions of pacemaker that we support, but at least for 1.1.16 and 1.1.18 on RHEL the combination of start-failure-is-fatal=true + failure-timeout=30s does work as you say.

I know we looked at failure-timeout during development last year, but I'm not sure why we discounted it. It *might* have been because the doc at https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_failure_response.html talks about it in the context of migration-threshold, then explains how migration-threshold does not apply if the resource fails the start action.

Updated info is on the Portal in table 12.1: https://doc-stage.usersys.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/high_availability_add-on_reference/#tb-clusterprops-HAAR
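For reference, a minimal sketch of the start-failure-is-fatal=true plus failure-timeout combination confirmed above on 1.1.16 and 1.1.18. The resource name test1 is illustrative; the 30s value mirrors the failure-timeout mentioned in the comment.

```sh
# Keep the default of treating a start failure as fatal on that node...
pcs property set start-failure-is-fatal=true

# ...but let the failure expire so the cluster retries the start after 30s,
# while promotion and demotion of other copies can proceed in the meantime
pcs resource meta test1 failure-timeout=30s
```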