Bug 1200853 - crmd: Resource marked with failcount=INFINITY on start failure with on-fail=ignore
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: pacemaker
Version: 6.6
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 6.8
Assignee: Andrew Beekhof
QA Contact: cluster-qe@redhat.com
Docs Contact: Steven J. Levine
URL:
Whiteboard:
Keywords:
Depends On: 1200849
Blocks: 1172231
Reported: 2015-03-11 14:03 UTC by John Ruemker
Modified: 2016-05-10 23:51 UTC (History)
CC: 10 users

Doc Text:
Pacemaker does not update the fail count when `on-fail=ignore` is used

When a resource in a Pacemaker cluster failed to start, Pacemaker updated the resource's last failure time and fail count, even if the `on-fail=ignore` option was used. This could cause unwanted resource migrations. Now, Pacemaker does not update the fail count when `on-fail=ignore` is used. As a result, the failure is displayed in the cluster status output, but is properly ignored and thus does not cause resource migration.
Clone Of: 1200849
Last Closed: 2016-05-10 23:51:24 UTC




External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:0856 normal SHIPPED_LIVE pacemaker bug fix and enhancement update 2016-05-10 22:44:25 UTC
Red Hat Knowledge Base (Solution) 1361003 None None None Never

Description John Ruemker 2015-03-11 14:03:36 UTC
+++ This bug was initially created as a clone of Bug #1200849 +++

Description of problem: When a resource is set with 'op start on-fail=ignore', pengine does attempt to ignore a start failure, but crmd still sets the failcount to INFINITY, which eventually causes pengine to move the resource away anyway. This effectively means there is no way to ignore start failures.
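For reference, a minimal sketch of where `on-fail=ignore` sits in the CIB XML. This is illustrative only: the `hascript` resource name mirrors the logs below, but the `id` values, class, and monitor interval are assumptions, not taken from the affected cluster.

```xml
<primitive id="hascript" class="lsb" type="hascript">
  <operations>
    <!-- Intent: a failed start should be ignored rather than counted -->
    <op id="hascript-start" name="start" interval="0" on-fail="ignore"/>
    <op id="hascript-monitor" name="monitor" interval="60s"/>
  </operations>
</primitive>
```

With this configuration, pengine honors the ignore ("Pretending the failure ... succeeded" in the logs), but before the fix crmd still wrote fail-count=INFINITY into the status section, so the next transition moved the resource anyway.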

Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: te_rsc_command: Initiating action 6: start hascript_start_0 on jrummy7-1.usersys.redhat.com (local)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: process_lrm_event: Operation hascript_start_0: unknown error (node=jrummy7-1.usersys.redhat.com, call=29, rc=1, cib-update=169, confirmed=true)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: status_from_rc: Action 6 (hascript_start_0) on jrummy7-1.usersys.redhat.com failed (target: 0 vs. rc: 1): Error
Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: update_failcount: Updating failcount for hascript on jrummy7-1.usersys.redhat.com after failed start: rc=1 (update=INFINITY, time=1426081566)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: abort_transition_graph: Transition aborted by hascript_start_0 'modify' on jrummy7-1.usersys.redhat.com: Event failed (magic=0:1;6:97:0:9a793c1c-8193-4258-88cd-7f3d9ca48848, cib=0.17.1, source=match_graph_event:350, 0)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: update_failcount: Updating failcount for hascript on jrummy7-1.usersys.redhat.com after failed start: rc=1 (update=INFINITY, time=1426081566)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: status_from_rc: Action 6 (hascript_start_0) on jrummy7-1.usersys.redhat.com failed (target: 0 vs. rc: 1): Error
Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: update_failcount: Updating failcount for hascript on jrummy7-1.usersys.redhat.com after failed start: rc=1 (update=INFINITY, time=1426081566)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: update_failcount: Updating failcount for hascript on jrummy7-1.usersys.redhat.com after failed start: rc=1 (update=INFINITY, time=1426081566)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: run_graph: Transition 97 (Complete=1, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3077.bz2): Stopped
Mar 11 09:46:06 jrummy7-1 pengine[5686]: warning: unpack_rsc_op: Pretending the failure of hascript_start_0 (rc=1) on jrummy7-1.usersys.redhat.com succeeded
Mar 11 09:46:06 jrummy7-1 pengine[5686]: warning: unpack_rsc_op: Pretending the failure of hascript_start_0 (rc=1) on jrummy7-1.usersys.redhat.com succeeded
Mar 11 09:46:06 jrummy7-1 pengine[5686]: notice: process_pe_message: Calculated Transition 98: /var/lib/pacemaker/pengine/pe-input-3078.bz2
Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: te_rsc_command: Initiating action 8: monitor hascript_monitor_60000 on jrummy7-1.usersys.redhat.com (local)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: abort_transition_graph: Transition aborted by status-1-fail-count-hascript, fail-count-hascript=INFINITY: Transient attribute change (create cib=0.17.2, source=te_update_diff:391, path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1'], 0)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: process_lrm_event: Operation hascript_monitor_60000: not running (node=jrummy7-1.usersys.redhat.com, call=30, rc=7, cib-update=171, confirmed=false)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: status_from_rc: Action 8 (hascript_monitor_60000) on jrummy7-1.usersys.redhat.com failed (target: 0 vs. rc: 7): Error
Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: update_failcount: Updating failcount for hascript on jrummy7-1.usersys.redhat.com after failed monitor: rc=7 (update=value++, time=1426081566)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: update_failcount: Updating failcount for hascript on jrummy7-1.usersys.redhat.com after failed monitor: rc=7 (update=value++, time=1426081566)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: run_graph: Transition 98 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3078.bz2): Complete
Mar 11 09:46:06 jrummy7-1 pengine[5686]: warning: unpack_rsc_op: Pretending the failure of hascript_start_0 (rc=1) on jrummy7-1.usersys.redhat.com succeeded
Mar 11 09:46:06 jrummy7-1 pengine[5686]: warning: unpack_rsc_op_failure: Processing failed op monitor for hascript on jrummy7-1.usersys.redhat.com: not running (7)
Mar 11 09:46:06 jrummy7-1 pengine[5686]: warning: common_apply_stickiness: Forcing hascript away from jrummy7-1.usersys.redhat.com after 1000000 failures (max=1000000)


Version-Release number of selected component (if applicable): pacemaker-1.1.12-22.el7, although RHEL 7.0 releases are affected as well


How reproducible: Easily


Steps to Reproduce:
1. Create a resource that can fail on start, but then still be functional afterwards and return success on monitor. I use a simple LSB script.

2. Set op start on-fail=ignore:

  # pcs resource update <resource> op start on-fail=ignore

3. Set a location constraint to prefer one node, so you can arrange for the start to fail on that node but succeed on the other (if your test resource requires such a step).

4. Enable the resource such that it fails on start but is actually running afterwards, and its monitor operation returns success.
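The failing-start resource from step 1 can be sketched as a minimal LSB-style script. This is a hedged illustration, not the reporter's actual script: the `hascript` name and the state-file path are assumptions.

```shell
#!/bin/sh
# Minimal LSB-style agent sketch: "start" leaves the service running
# (state file present) yet reports failure, so a later monitor succeeds.
# The script name and state-file path are illustrative.
STATE="${STATE:-/tmp/hascript.state}"

do_start() {
    touch "$STATE"    # the service actually comes up...
    return 1          # ...but start still reports an error (rc=1)
}

do_stop() {
    rm -f "$STATE"
}

do_status() {
    # LSB status convention: 0 = running, 3 = stopped
    if [ -f "$STATE" ]; then return 0; else return 3; fi
}

case "$1" in
    start)  do_start ;;
    stop)   do_stop ;;
    status) do_status ;;
esac
```

Installed under /etc/init.d and added as an lsb-class resource, every start "fails" while status keeps reporting success, which reproduces the condition described above.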

Actual results: Resource will move to another node


Expected results: Resource stays put as long as monitor operations are successful


Additional info: Issue affects RHEL 6 as well.

--- Additional comment from John Ruemker on 2015-03-11 10:02:58 EDT ---

I couldn't come up with a good solution here. It seemed best for crmd to avoid setting failcount in the first place if on-fail=ignore, but there wasn't an immediately obvious way to make this setting available to it for the result it was processing.

If you need me to test anything or provide any additional info, just let me know.

Comment 1 John Ruemker 2015-03-11 14:04:10 UTC
Filing for RHEL 6 as it is affected as well, and customer needs the solution for this release.

Comment 2 Ken Gaillot 2015-07-29 19:14:52 UTC
Fixed upstream as of commit 9470f07.

Comment 10 errata-xmlrpc 2016-05-10 23:51:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0856.html

