Bug 1200849 - crmd: Resource marked with failcount=INFINITY on start failure with on-fail=ignore
Summary: crmd: Resource marked with failcount=INFINITY on start failure with on-fail=ignore
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.1
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1200853 1205796
 
Reported: 2015-03-11 13:58 UTC by John Ruemker
Modified: 2019-10-10 09:40 UTC
CC List: 8 users

Fixed In Version: pacemaker-1.1.13-3.el7
Doc Type: Bug Fix
Doc Text:
When a resource in a Pacemaker cluster failed to start, Pacemaker updated the resource's last failure time and incremented its fail count even if the "on-fail=ignore" option was used. This in some cases caused unintended resource migrations when a resource start failure occurred. Now, Pacemaker does not update the fail count when "on-fail=ignore" is used. As a result, the failure is displayed in the cluster status output, but is properly ignored and thus does not cause resource migration.
Clone Of:
Cloned to: 1200853
Environment:
Last Closed: 2015-11-19 12:12:47 UTC
Target Upstream Version:
Embargoed:




Links
  Red Hat Knowledge Base (Solution): 1361003
  Red Hat Product Errata: RHSA-2015:2383 (SHIPPED_LIVE) - Moderate: pacemaker security, bug fix, and enhancement update (2015-11-19 10:49:49 UTC)

Description John Ruemker 2015-03-11 13:58:56 UTC
Description of problem: When a resource is configured with 'op start on-fail=ignore', pengine does attempt to ignore a start failure, but crmd still sets the failcount to INFINITY, which eventually causes pengine to move the resource away anyway. This effectively means there is no way to ignore start failures.
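On an affected cluster, the resulting score can be checked with the failcount command used later in this report (a sketch; <resource> is a placeholder):

  # pcs resource failcount show <resource>

This reports INFINITY for the node where the start failed, matching the update_failcount messages in the log below.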

Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: te_rsc_command: Initiating action 6: start hascript_start_0 on jrummy7-1.usersys.redhat.com (local)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: process_lrm_event: Operation hascript_start_0: unknown error (node=jrummy7-1.usersys.redhat.com, call=29, rc=1, cib-update=169, confirmed=true)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: status_from_rc: Action 6 (hascript_start_0) on jrummy7-1.usersys.redhat.com failed (target: 0 vs. rc: 1): Error
Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: update_failcount: Updating failcount for hascript on jrummy7-1.usersys.redhat.com after failed start: rc=1 (update=INFINITY, time=1426081566)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: abort_transition_graph: Transition aborted by hascript_start_0 'modify' on jrummy7-1.usersys.redhat.com: Event failed (magic=0:1;6:97:0:9a793c1c-8193-4258-88cd-7f3d9ca48848, cib=0.17.1, source=match_graph_event:350, 0)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: update_failcount: Updating failcount for hascript on jrummy7-1.usersys.redhat.com after failed start: rc=1 (update=INFINITY, time=1426081566)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: status_from_rc: Action 6 (hascript_start_0) on jrummy7-1.usersys.redhat.com failed (target: 0 vs. rc: 1): Error
Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: update_failcount: Updating failcount for hascript on jrummy7-1.usersys.redhat.com after failed start: rc=1 (update=INFINITY, time=1426081566)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: update_failcount: Updating failcount for hascript on jrummy7-1.usersys.redhat.com after failed start: rc=1 (update=INFINITY, time=1426081566)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: run_graph: Transition 97 (Complete=1, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3077.bz2): Stopped
Mar 11 09:46:06 jrummy7-1 pengine[5686]: warning: unpack_rsc_op: Pretending the failure of hascript_start_0 (rc=1) on jrummy7-1.usersys.redhat.com succeeded
Mar 11 09:46:06 jrummy7-1 pengine[5686]: warning: unpack_rsc_op: Pretending the failure of hascript_start_0 (rc=1) on jrummy7-1.usersys.redhat.com succeeded
Mar 11 09:46:06 jrummy7-1 pengine[5686]: notice: process_pe_message: Calculated Transition 98: /var/lib/pacemaker/pengine/pe-input-3078.bz2
Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: te_rsc_command: Initiating action 8: monitor hascript_monitor_60000 on jrummy7-1.usersys.redhat.com (local)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: abort_transition_graph: Transition aborted by status-1-fail-count-hascript, fail-count-hascript=INFINITY: Transient attribute change (create cib=0.17.2, source=te_update_diff:391, path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1'], 0)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: process_lrm_event: Operation hascript_monitor_60000: not running (node=jrummy7-1.usersys.redhat.com, call=30, rc=7, cib-update=171, confirmed=false)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: status_from_rc: Action 8 (hascript_monitor_60000) on jrummy7-1.usersys.redhat.com failed (target: 0 vs. rc: 7): Error
Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: update_failcount: Updating failcount for hascript on jrummy7-1.usersys.redhat.com after failed monitor: rc=7 (update=value++, time=1426081566)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: update_failcount: Updating failcount for hascript on jrummy7-1.usersys.redhat.com after failed monitor: rc=7 (update=value++, time=1426081566)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: run_graph: Transition 98 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3078.bz2): Complete
Mar 11 09:46:06 jrummy7-1 pengine[5686]: warning: unpack_rsc_op: Pretending the failure of hascript_start_0 (rc=1) on jrummy7-1.usersys.redhat.com succeeded
Mar 11 09:46:06 jrummy7-1 pengine[5686]: warning: unpack_rsc_op_failure: Processing failed op monitor for hascript on jrummy7-1.usersys.redhat.com: not running (7)
Mar 11 09:46:06 jrummy7-1 pengine[5686]: warning: common_apply_stickiness: Forcing hascript away from jrummy7-1.usersys.redhat.com after 1000000 failures (max=1000000)


Version-Release number of selected component (if applicable): pacemaker-1.1.12-22.el7, although RHEL 7.0 releases are affected as well


How reproducible: Easily


Steps to Reproduce:
1. Create a resource that can fail on start, but then still be functional afterwards and return success on monitor. I use a simple LSB script (a sketch of such a script follows these steps).

2. Set op start on-fail=ignore:

  # pcs resource update <resource> op start on-fail=ignore

3. Set a location constraint to prefer one node so you can control the conditions to cause it to fail on start on that node, but be successful on the other node (if your test resource requires such a step).

4. Enable the resource such that it will fail on start, but will actually be running and its monitor returns success.
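
For reference, here is a minimal sketch of what such a script could look like (hypothetical; it would be installed as an executable init script such as /etc/init.d/badstarter on every node, similar to the lsb:badstarter resource used in the verification comments below). The start action actually brings the "service" up but still exits non-zero, while status reports it as running:

  #!/bin/sh
  # Hypothetical LSB script for reproducing this bug: "start" brings the
  # service up (creates the state file) but still exits non-zero, so the
  # cluster sees a start failure; "status" then reports it as running.
  STATEFILE=/var/run/badstarter.started

  case "$1" in
      start)
          touch "$STATEFILE"    # the "service" is really up...
          exit 1                # ...but the start is reported as failed
          ;;
      stop)
          rm -f "$STATEFILE"
          exit 0
          ;;
      status)
          # LSB status codes: 0 = running, 3 = not running
          if [ -f "$STATEFILE" ]; then exit 0; else exit 3; fi
          ;;
      *)
          echo "Usage: $0 {start|stop|status}" >&2
          exit 2
          ;;
  esac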

Actual results: Resource will move to another node


Expected results: Resource stays put as long as monitor operations are successful


Additional info: Issue affects RHEL 6 as well.

Comment 1 John Ruemker 2015-03-11 14:02:58 UTC
I couldn't come up with a good solution here. It seemed best for crmd to avoid setting the failcount in the first place if on-fail=ignore is set, but there wasn't an immediately obvious way to make that setting available to crmd for the result it was processing.

If you need me to test anything or provide any additional info, just let me know.

Comment 2 Andrew Beekhof 2015-03-31 01:14:37 UTC
We have the start-failure-is-fatal property, but that's a global setting.

Is there a specific reason for using on-fail=ignore on a start operation, or are you just experimenting?

Comment 3 John Ruemker 2015-03-31 13:45:26 UTC
The customer was looking for a way to treat this as essentially a non-critical resource: if it fails to start, it should not move and bring the other resources associated with it along for the ride. They just want to attempt to start it and, if something goes wrong with that operation, take no action. Now I'm realizing as I write this that I didn't ask how they expected the follow-up monitor operation to be handled, i.e. whether they thought the resource would eventually be seen as running or whether they were also going to ignore monitor failures. I'll have to follow up on that.

Comment 4 John Ruemker 2015-03-31 13:50:14 UTC
Actually, in re-reading their comments, I think they may have miscommunicated what they want. Their earlier requests in the case, before I got it, state that if the resource fails, they want it to "just mark the resource as failed and not attempt to start the resource on another node", for which on-fail=stop would of course be the more appropriate setting. Their latest comments confused the matter a bit more, so I'm not really sure at this point. I'm going to follow up, so if you want to wait on that, that's fine.

I do still think it's a problem in general that on-fail=ignore doesn't really produce the behavior you'd think it should, so my preference would be for us to fix this anyway. But without a customer asking for it, if this is deemed not worthy of a change, I won't object.

Comment 5 Andrew Beekhof 2015-04-07 02:37:31 UTC
It is a decent amount of work, but I would agree that on-fail=ignore should work consistently for all operations, including start.

Comment 6 Andrew Beekhof 2015-04-08 21:27:05 UTC
Ken:

If on-fail=ignore is set for an action, do not set a target-rc value.
In the crmd, then treat the operation as a success (no updating of failcount) if target-rc is not present.

If action=start, do not observe on-fail=ignore. Log a config error.

If the failed operation currently shows up in crm_mon, we should let it keep doing so and might need to revisit the above change.

Comment 12 Ken Gaillot 2015-04-17 17:30:41 UTC
After further investigation, the issue appears to be that, even with on-fail=ignore, the crmd will still update the fail-count and last-failure time. If the cluster-wide option start-failure-is-fatal is true (which is the default), it will update the fail-count to INFINITY, which will force an immediate migration away. If start-failure-is-fatal is false, it will only increment the fail-count, but that could still eventually trigger a migration depending on migration-threshold.

The planned fix is no longer what's in Comment 6. Instead, with on-fail=ignore the crmd will only update the last-failure time (allowing the failure to still be reported by crm_mon) and not the fail-count (allowing the cluster to ignore the error).
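
With that change, the distinction should be directly observable with the tools already used in this report (a sketch, assuming the test resource "bs" from the verification comments below):

  # crm_mon -1
  # pcs resource failcount show bs

crm_mon still lists the failed start under "Failed actions", while the fail count for the resource is not updated.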

Comment 13 Ken Gaillot 2015-04-17 17:32:14 UTC
To clarify the relationship between on-fail (per-operation setting) and start-failure-is-fatal (cluster-wide setting), start-failure-is-fatal will be treated as a global default, and setting on-fail=ignore for a specific operation will override it for that operation.
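
For example, the two levels could be combined with the pcs commands already shown in this report (a sketch; <resource> is a placeholder):

  # pcs property set start-failure-is-fatal=false
  # pcs resource update <resource> op start on-fail=ignore

The first command sets the cluster-wide default for how start failures are scored; the second makes the cluster ignore start failures entirely for that one resource, regardless of the global setting.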

Comment 15 Jaroslav Kortus 2015-09-15 14:04:49 UTC
Created an lsb resource that fails on one node (virt-031) and set a constraint to prefer that node.

pcs resource create bs lsb:badstarter op start on-fail=ignore

Full list of resources:

 fence-virt-024 (stonith:fence_xvm):    Started virt-024
 fence-virt-025 (stonith:fence_xvm):    Started virt-025
 fence-virt-031 (stonith:fence_xvm):    Started virt-031
 bs     (lsb:badstarter):	Stopped (failure ignored)

Failed actions:
    bs_start_0 on virt-031 'unknown error' (1): call=47, status=complete, exit-reason='none', last-rc-change='Tue Sep 15 15:32:28
2015', queued=0ms, exec=8ms


Without on-fail=ignore:

Full list of resources:

 fence-virt-024 (stonith:fence_xvm):    Started virt-024
 fence-virt-025 (stonith:fence_xvm):    Started virt-025
 fence-virt-031 (stonith:fence_xvm):    Started virt-031
 bs     (lsb:badstarter):	Started virt-024

Failed actions:
    bs_start_0 on virt-031 'unknown error' (1): call=55, status=complete, exit-reason='none', last-rc-change='Tue Sep 15 15:38:12
2015', queued=0ms, exec=9ms


[root@virt-031 init.d]# pcs resource failcount show bs
Failcounts for bs
 virt-031: INFINITY


Without on-fail=ignore and with pcs property set start-failure-is-fatal=false:

Full list of resources:

 fence-virt-024 (stonith:fence_xvm):    Started virt-024
 fence-virt-025 (stonith:fence_xvm):    Started virt-025
 fence-virt-031 (stonith:fence_xvm):    Started virt-031
 bs     (lsb:badstarter):	FAILED virt-031

Failed actions:
    bs_start_0 on virt-031 'unknown error' (1): call=609, status=complete, exit-reason='none', last-rc-change='Tue Sep 15 16:01:56
 2015', queued=0ms, exec=8ms
    bs_start_0 on virt-031 'unknown error' (1): call=609, status=complete, exit-reason='none', last-rc-change='Tue Sep 15 16:01:56
 2015', queued=0ms, exec=8ms

[root@virt-031 init.d]# pcs resource failcount show bs
Failcounts for bs
 virt-031: 125


Works as described in comment 12.
Verified with pacemaker-1.1.12-22.el7.x86_64.

Comment 16 Jaroslav Kortus 2015-09-15 14:23:30 UTC
Previous comment was with the old version only; the new one is as follows:

Created an lsb resource that fails on one node (virt-031) and set a constraint to prefer that node.

pcs resource create bs lsb:badstarter op start on-fail=ignore

Full list of resources:

 bs     (lsb:badstarter):       Started virt-077 (failure ignored)

Failed Actions:
* bs_start_0 on virt-077 'unknown error' (1): call=1946, status=complete, exitreason='none',
    last-rc-change='Tue Sep 15 16:18:22 2015', queued=0ms, exec=22ms



Without on-fail=ignore:

Full list of resources:

 bs     (lsb:badstarter):       Started virt-075

Failed Actions:
* bs_start_0 on virt-077 'unknown error' (1): call=1966, status=complete, exitreason='none',
    last-rc-change='Tue Sep 15 16:20:39 2015', queued=0ms, exec=17ms



Without on-fail=ignore and with pcs property set start-failure-is-fatal=false:


 bs     (lsb:badstarter):       FAILED virt-077

Failed Actions:
* bs_start_0 on virt-077 'unknown error' (1): call=3186, status=complete, exitreason='none',
    last-rc-change='Tue Sep 15 16:21:59 2015', queued=0ms, exec=4ms



[root@virt-077 ~]# pcs resource failcount show bs
Failcounts for bs
 virt-077: 503


NOW it works as expected.
Verified with pacemaker-1.1.13-6.el7.x86_64.

Comment 17 errata-xmlrpc 2015-11-19 12:12:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-2383.html

