Bug 1200849
Summary: crmd: Resource marked with failcount=INFINITY on start failure with on-fail=ignore
Product: Red Hat Enterprise Linux 7
Component: pacemaker
Version: 7.1
Hardware: All
OS: Linux
Severity: medium
Priority: medium
Status: CLOSED ERRATA
Reporter: John Ruemker <jruemker>
Assignee: Ken Gaillot <kgaillot>
QA Contact: cluster-qe <cluster-qe>
CC: abeekhof, cluster-maint, fdinitto, jherrman, jkortus, jruemker, kgaillot, ovasik
Target Milestone: rc
Fixed In Version: pacemaker-1.1.13-3.el7
Doc Type: Bug Fix
Doc Text: When a resource in a Pacemaker cluster failed to start, Pacemaker updated the resource's last failure time and incremented its fail count even if the "on-fail=ignore" option was used. This in some cases caused unintended resource migrations when a resource start failure occurred. Now, Pacemaker does not update the fail count when "on-fail=ignore" is used. As a result, the failure is displayed in the cluster status output, but is properly ignored and thus does not cause resource migration.
Type: Bug
Last Closed: 2015-11-19 12:12:47 UTC
Bug Blocks: 1200853, 1205796
Description
John Ruemker
2015-03-11 13:58:56 UTC
I couldn't come up with a good solution here. It seemed best for crmd to avoid setting failcount in the first place if on-fail=ignore, but there wasn't an immediately obvious way to make this setting available to it for the result it was processing. If you need me to test anything or provide any additional info, just let me know.

We have the start-failure-is-fatal property, but that's a global setting. Is there a specific reason for using on-fail=ignore on a start operation, or are you just experimenting?

The customer was looking for a way to essentially treat this as a non-critical resource: if it fails to start, it shouldn't move and bring other resources associated with it along for the ride. They just want to attempt to start it and, if something goes wrong with that operation, take no action. Now I'm realizing as I'm writing this that I didn't ask how they expected the follow-up monitor operation to be handled, whether they thought the resource would eventually be seen as running or whether they were also going to ignore monitor failures as well. I'll have to follow up on that.

Actually, in re-reading their comments, I think they may have miscommunicated what they want. Their earlier requests in the case, before I got it, state that if the resource fails, they just want it to "mark the resource as failed and not attempt to start the resource on another node", which of course on-fail=stop would be the more appropriate setting for. Their latest comments confused the matter a bit more, so I'm not really sure at this point. I'm going to follow up, so if you want to wait on that, that's fine. I do still think it's a problem in general that on-fail=ignore doesn't really produce the behavior you'd think it should, so my preference would be for us to fix this anyway. But without a customer asking for it, if this is deemed not worthy of a change, I won't object.

It is a decent amount of work, but I would agree that on-fail=ignore should work consistently for all operations, including start.

Ken: If on-fail=ignore is set for an action, do not set a target-rc value. In the crmd, treat the operation as a success (no updating of failcount) if target-rc is not present. If action=start, do not observe on-fail=ignore; log a config error. If the failed operation currently shows up in crm_mon, we should let it keep doing so, and might need to revisit the above change.

After further investigation, the issue appears to be that, even with on-fail=ignore, the crmd will still update the fail-count and last-failure time. If the cluster-wide option start-failure-is-fatal is true (which is the default), it will update the fail-count to INFINITY, which will force an immediate migration away. If start-failure-is-fatal is false, it will only increment the fail-count, but that could still eventually trigger a migration depending on migration-threshold.

The planned fix is no longer what's in Comment 6. Instead, with on-fail=ignore the crmd will only update the last-failure time (allowing the failure to still be reported by crm_mon) and not the fail-count (allowing the cluster to ignore the error). To clarify the relationship between on-fail (a per-operation setting) and start-failure-is-fatal (a cluster-wide setting), start-failure-is-fatal will be treated as a global default, and setting on-fail=ignore for a specific operation will override it for that operation.

Created an LSB resource that fails on one node (virt-031) and set a constraint to prefer that node.
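The badstarter script and the exact constraint command are not recorded in this report. Purely as a sketch of how such a setup might be reproduced (the script contents, its path, and the pcs constraint invocation below are assumptions, not taken from the bug):

  #!/bin/sh
  # /etc/init.d/badstarter -- hypothetical LSB init script used only to force a
  # start failure on one node; the real script used here is not attached.
  case "$1" in
      start)
          # Fail the start only on virt-031 so Pacemaker records a failed start there
          [ "$(uname -n)" = "virt-031" ] && exit 1
          exit 0
          ;;
      stop)
          exit 0
          ;;
      status)
          exit 3   # LSB status code 3: program is not running
          ;;
      *)
          echo "Usage: $0 {start|stop|status}"
          exit 2
          ;;
  esac

The location constraint could then be created with something like "pcs constraint location bs prefers virt-031" once the resource exists.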
pcs resource create bs lsb:badstarter op start on-fail=ignore

Full list of resources:

 fence-virt-024 (stonith:fence_xvm): Started virt-024
 fence-virt-025 (stonith:fence_xvm): Started virt-025
 fence-virt-031 (stonith:fence_xvm): Started virt-031
 bs (lsb:badstarter): Stopped (failure ignored)

Failed actions:
 bs_start_0 on virt-031 'unknown error' (1): call=47, status=complete, exit-reason='none', last-rc-change='Tue Sep 15 15:32:28 2015', queued=0ms, exec=8ms

Without on-fail=ignore:

Full list of resources:

 fence-virt-024 (stonith:fence_xvm): Started virt-024
 fence-virt-025 (stonith:fence_xvm): Started virt-025
 fence-virt-031 (stonith:fence_xvm): Started virt-031
 bs (lsb:badstarter): Started virt-024

Failed actions:
 bs_start_0 on virt-031 'unknown error' (1): call=55, status=complete, exit-reason='none', last-rc-change='Tue Sep 15 15:38:12 2015', queued=0ms, exec=9ms

[root@virt-031 init.d]# pcs resource failcount show bs
Failcounts for bs
 virt-031: INFINITY

Without on-fail=ignore and with "pcs property set start-failure-is-fatal=false":

Full list of resources:

 fence-virt-024 (stonith:fence_xvm): Started virt-024
 fence-virt-025 (stonith:fence_xvm): Started virt-025
 fence-virt-031 (stonith:fence_xvm): Started virt-031
 bs (lsb:badstarter): FAILED virt-031

Failed actions:
 bs_start_0 on virt-031 'unknown error' (1): call=609, status=complete, exit-reason='none', last-rc-change='Tue Sep 15 16:01:56 2015', queued=0ms, exec=8ms

[root@virt-031 init.d]# pcs resource failcount show bs
Failcounts for bs
 virt-031: 125

Works as described in comment 12. Verified with pacemaker-1.1.12-22.el7.x86_64.

Previous comment was with the old version only; the new one is as follows:

Created an LSB resource that fails on one node (virt-031) and set a constraint to prefer that node.

pcs resource create bs lsb:badstarter op start on-fail=ignore

Full list of resources:

 bs (lsb:badstarter): Started virt-077 (failure ignored)

Failed Actions:
* bs_start_0 on virt-077 'unknown error' (1): call=1946, status=complete, exitreason='none', last-rc-change='Tue Sep 15 16:18:22 2015', queued=0ms, exec=22ms

Without on-fail=ignore:

Full list of resources:

 bs (lsb:badstarter): Started virt-075

Failed Actions:
* bs_start_0 on virt-077 'unknown error' (1): call=1966, status=complete, exitreason='none', last-rc-change='Tue Sep 15 16:20:39 2015', queued=0ms, exec=17ms

Without on-fail=ignore and with "pcs property set start-failure-is-fatal=false":

 bs (lsb:badstarter): FAILED virt-077

Failed Actions:
* bs_start_0 on virt-077 'unknown error' (1): call=3186, status=complete, exitreason='none', last-rc-change='Tue Sep 15 16:21:59 2015', queued=0ms, exec=4ms

[root@virt-077 ~]# pcs resource failcount show bs
Failcounts for bs
 virt-077: 503

NOW it works as expected. Verified with pacemaker-1.1.13-6.el7.x86_64.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-2383.html