Bug 1200849
Summary: crmd: Resource marked with failcount=INFINITY on start failure with on-fail=ignore
Product: Red Hat Enterprise Linux 7
Component: pacemaker
Version: 7.1
Hardware: All
OS: Linux
Severity: medium
Priority: medium
Status: CLOSED ERRATA
Reporter: John Ruemker <jruemker>
Assignee: Ken Gaillot <kgaillot>
QA Contact: cluster-qe <cluster-qe>
CC: abeekhof, cluster-maint, fdinitto, jherrman, jkortus, jruemker, kgaillot, ovasik
Target Milestone: rc
Fixed In Version: pacemaker-1.1.13-3.el7
Doc Type: Bug Fix
Doc Text: When a resource in a Pacemaker cluster failed to start, Pacemaker updated the resource's last failure time and incremented its fail count even if the "on-fail=ignore" option was used. This in some cases caused unintended resource migrations when a resource start failure occurred. Now, Pacemaker does not update the fail count when "on-fail=ignore" is used. As a result, the failure is displayed in the cluster status output, but is properly ignored and thus does not cause resource migration.
Type: Bug
Last Closed: 2015-11-19 12:12:47 UTC
Bug Blocks: 1200853, 1205796
Description
John Ruemker
2015-03-11 13:58:56 UTC
I couldn't come up with a good solution here. It seemed best for crmd to avoid setting failcount in the first place if on-fail=ignore, but there wasn't an immediately obvious way to make this setting available to it for the result it was processing. If you need me to test anything or provide any additional info, just let me know.

We have the start-failure-is-fatal property, but that's a global setting. Is there a specific reason for using on-fail=ignore on a start operation, or are you just experimenting?

The customer was looking for a way to essentially treat this as a non-critical resource: if it fails to start, it shouldn't move and bring other resources associated with it along for the ride. They just want to attempt to start it and, if something goes wrong with that operation, take no action. Now I'm realizing as I'm writing this that I didn't ask how they expected the follow-up monitor operation to be handled, whether they thought the resource would eventually be seen as running or whether they were also going to ignore monitor failures as well. I'll have to follow up on that.

Actually, in re-reading their comments, I think they may have miscommunicated what they want. Their earlier requests in the case, before I got it, state that if the resource fails, they just want it to "mark the resource as failed and not attempt to start the resource on another node", which of course on-fail=stop would be the more appropriate setting for. Their latest comments confused the matter a bit more, so I'm not really sure at this point. I'm going to follow up, so if you want to wait on that, that's fine. I do still think it's a problem in general that on-fail=ignore doesn't really produce the behavior you'd think it should, so my preference would be for us to fix this anyway. But without a customer asking for it, if this is deemed not worthy of a change, I won't object.

It is a decent amount of work, but I would agree that on-fail=ignore should work consistently for all operations, including start.

Ken: If on-fail=ignore is set for an action, do not set a target-rc value. In the crmd, treat the operation as a success (no updating of failcount) if target-rc is not present. If action=start, do not observe on-fail=ignore; log a config error. If the failed operation currently shows up in crm_mon, we should let it keep doing so, and might need to revisit the above change.

After further investigation, the issue appears to be that, even with on-fail=ignore, the crmd will still update the fail-count and last-failure time. If the cluster-wide option start-failure-is-fatal is true (which is the default), it will update the fail-count to INFINITY, which will force an immediate migration away. If start-failure-is-fatal is false, it will only increment the fail-count, but that could still eventually trigger a migration depending on migration-threshold.

The planned fix is no longer what's in Comment 6. Instead, with on-fail=ignore the crmd will only update the last-failure time (allowing the failure to still be reported by crm_mon) and not the fail-count (allowing the cluster to ignore the error). To clarify the relationship between on-fail (a per-operation setting) and start-failure-is-fatal (a cluster-wide setting), start-failure-is-fatal will be treated as a global default, and setting on-fail=ignore for a specific operation will override it for that operation.

Created an LSB resource that fails on one node (virt-031) and set a constraint to prefer that node.
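The badstarter script and the exact constraint command are not recorded in this report. Purely as a sketch of how such a setup might be reproduced (the script contents, its path, and the pcs constraint invocation below are assumptions, not taken from the bug):

  #!/bin/sh
  # /etc/init.d/badstarter -- hypothetical LSB init script used only to force a
  # start failure on one node; the real script used here is not attached.
  case "$1" in
      start)
          # Fail the start only on virt-031 so Pacemaker records a failed start there
          [ "$(uname -n)" = "virt-031" ] && exit 1
          exit 0
          ;;
      stop)
          exit 0
          ;;
      status)
          exit 3   # LSB status code 3: program is not running
          ;;
      *)
          echo "Usage: $0 {start|stop|status}"
          exit 2
          ;;
  esac

The location constraint could then be created with something like "pcs constraint location bs prefers virt-031" once the resource exists.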
pcs resource create bs lsb:badstarter op start on-fail=ignore

Full list of resources:

 fence-virt-024 (stonith:fence_xvm): Started virt-024
 fence-virt-025 (stonith:fence_xvm): Started virt-025
 fence-virt-031 (stonith:fence_xvm): Started virt-031
 bs (lsb:badstarter): Stopped (failure ignored)

Failed actions:
 bs_start_0 on virt-031 'unknown error' (1): call=47, status=complete, exit-reason='none', last-rc-change='Tue Sep 15 15:32:28 2015', queued=0ms, exec=8ms

Without on-fail=ignore:

Full list of resources:

 fence-virt-024 (stonith:fence_xvm): Started virt-024
 fence-virt-025 (stonith:fence_xvm): Started virt-025
 fence-virt-031 (stonith:fence_xvm): Started virt-031
 bs (lsb:badstarter): Started virt-024

Failed actions:
 bs_start_0 on virt-031 'unknown error' (1): call=55, status=complete, exit-reason='none', last-rc-change='Tue Sep 15 15:38:12 2015', queued=0ms, exec=9ms

[root@virt-031 init.d]# pcs resource failcount show bs
Failcounts for bs
 virt-031: INFINITY

Without on-fail=ignore and with "pcs property set start-failure-is-fatal=false":

Full list of resources:

 fence-virt-024 (stonith:fence_xvm): Started virt-024
 fence-virt-025 (stonith:fence_xvm): Started virt-025
 fence-virt-031 (stonith:fence_xvm): Started virt-031
 bs (lsb:badstarter): FAILED virt-031

Failed actions:
 bs_start_0 on virt-031 'unknown error' (1): call=609, status=complete, exit-reason='none', last-rc-change='Tue Sep 15 16:01:56 2015', queued=0ms, exec=8ms

[root@virt-031 init.d]# pcs resource failcount show bs
Failcounts for bs
 virt-031: 125

Works as described in comment 12. Verified with pacemaker-1.1.12-22.el7.x86_64.

Previous comment was with the old version only; the new one is as follows:

Created an LSB resource that fails on one node (virt-031) and set a constraint to prefer that node.

pcs resource create bs lsb:badstarter op start on-fail=ignore

Full list of resources:

 bs (lsb:badstarter): Started virt-077 (failure ignored)

Failed Actions:
* bs_start_0 on virt-077 'unknown error' (1): call=1946, status=complete, exitreason='none', last-rc-change='Tue Sep 15 16:18:22 2015', queued=0ms, exec=22ms

Without on-fail=ignore:

Full list of resources:

 bs (lsb:badstarter): Started virt-075

Failed Actions:
* bs_start_0 on virt-077 'unknown error' (1): call=1966, status=complete, exitreason='none', last-rc-change='Tue Sep 15 16:20:39 2015', queued=0ms, exec=17ms

Without on-fail=ignore and with "pcs property set start-failure-is-fatal=false":

 bs (lsb:badstarter): FAILED virt-077

Failed Actions:
* bs_start_0 on virt-077 'unknown error' (1): call=3186, status=complete, exitreason='none', last-rc-change='Tue Sep 15 16:21:59 2015', queued=0ms, exec=4ms

[root@virt-077 ~]# pcs resource failcount show bs
Failcounts for bs
 virt-077: 503

NOW it works as expected. Verified with pacemaker-1.1.13-6.el7.x86_64.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-2383.html