Bug 1200853 - crmd: Resource marked with failcount=INFINITY on start failure with on-fail=ignore
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: pacemaker
Version: 6.6
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 6.8
Assignee: Andrew Beekhof
QA Contact: cluster-qe@redhat.com
Docs Contact: Steven J. Levine
URL:
Whiteboard:
Keywords:
Depends On: 1200849
Blocks: 1172231
Reported: 2015-03-11 14:03 UTC by John Ruemker
Modified: 2016-05-10 23:51 UTC (History)
CC: 10 users

Doc Text:
Pacemaker does not update the fail count when `on-fail=ignore` is used

When a resource in a Pacemaker cluster failed to start, Pacemaker updated the resource's last failure time and fail count, even if the `on-fail=ignore` option was used. This could cause unwanted resource migrations. Now, Pacemaker does not update the fail count when `on-fail=ignore` is used. As a result, the failure is displayed in the cluster status output, but is properly ignored and thus does not cause resource migration.
Clone Of: 1200849
Last Closed: 2016-05-10 23:51:24 UTC




External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:0856 normal SHIPPED_LIVE pacemaker bug fix and enhancement update 2016-05-10 22:44:25 UTC
Red Hat Knowledge Base (Solution) 1361003 None None None Never

Description John Ruemker 2015-03-11 14:03:36 UTC
+++ This bug was initially created as a clone of Bug #1200849 +++

Description of problem: When a resource is set with 'op start on-fail=ignore', pengine does attempt to ignore a start failure, but crmd still sets the failcount to INFINITY, which eventually causes pengine to move the resource away anyway. This effectively means there is no way to ignore start failures.
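For reference, a minimal sketch of where `on-fail=ignore` sits in the CIB XML. This is illustrative only: the `hascript` resource name mirrors the logs below, but the `id` values, class, and monitor interval are assumptions, not taken from the affected cluster.

```xml
<primitive id="hascript" class="lsb" type="hascript">
  <operations>
    <!-- Intent: a failed start should be ignored rather than counted -->
    <op id="hascript-start" name="start" interval="0" on-fail="ignore"/>
    <op id="hascript-monitor" name="monitor" interval="60s"/>
  </operations>
</primitive>
```

With this configuration, pengine honors the ignore ("Pretending the failure ... succeeded" in the logs), but before the fix crmd still wrote fail-count=INFINITY into the status section, so the next transition moved the resource anyway.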

Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: te_rsc_command: Initiating action 6: start hascript_start_0 on jrummy7-1.usersys.redhat.com (local)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: process_lrm_event: Operation hascript_start_0: unknown error (node=jrummy7-1.usersys.redhat.com, call=29, rc=1, cib-update=169, confirmed=true)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: status_from_rc: Action 6 (hascript_start_0) on jrummy7-1.usersys.redhat.com failed (target: 0 vs. rc: 1): Error
Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: update_failcount: Updating failcount for hascript on jrummy7-1.usersys.redhat.com after failed start: rc=1 (update=INFINITY, time=1426081566)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: abort_transition_graph: Transition aborted by hascript_start_0 'modify' on jrummy7-1.usersys.redhat.com: Event failed (magic=0:1;6:97:0:9a793c1c-8193-4258-88cd-7f3d9ca48848, cib=0.17.1, source=match_graph_event:350, 0)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: update_failcount: Updating failcount for hascript on jrummy7-1.usersys.redhat.com after failed start: rc=1 (update=INFINITY, time=1426081566)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: status_from_rc: Action 6 (hascript_start_0) on jrummy7-1.usersys.redhat.com failed (target: 0 vs. rc: 1): Error
Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: update_failcount: Updating failcount for hascript on jrummy7-1.usersys.redhat.com after failed start: rc=1 (update=INFINITY, time=1426081566)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: update_failcount: Updating failcount for hascript on jrummy7-1.usersys.redhat.com after failed start: rc=1 (update=INFINITY, time=1426081566)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: run_graph: Transition 97 (Complete=1, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3077.bz2): Stopped
Mar 11 09:46:06 jrummy7-1 pengine[5686]: warning: unpack_rsc_op: Pretending the failure of hascript_start_0 (rc=1) on jrummy7-1.usersys.redhat.com succeeded
Mar 11 09:46:06 jrummy7-1 pengine[5686]: warning: unpack_rsc_op: Pretending the failure of hascript_start_0 (rc=1) on jrummy7-1.usersys.redhat.com succeeded
Mar 11 09:46:06 jrummy7-1 pengine[5686]: notice: process_pe_message: Calculated Transition 98: /var/lib/pacemaker/pengine/pe-input-3078.bz2
Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: te_rsc_command: Initiating action 8: monitor hascript_monitor_60000 on jrummy7-1.usersys.redhat.com (local)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: abort_transition_graph: Transition aborted by status-1-fail-count-hascript, fail-count-hascript=INFINITY: Transient attribute change (create cib=0.17.2, source=te_update_diff:391, path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1'], 0)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: process_lrm_event: Operation hascript_monitor_60000: not running (node=jrummy7-1.usersys.redhat.com, call=30, rc=7, cib-update=171, confirmed=false)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: status_from_rc: Action 8 (hascript_monitor_60000) on jrummy7-1.usersys.redhat.com failed (target: 0 vs. rc: 7): Error
Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: update_failcount: Updating failcount for hascript on jrummy7-1.usersys.redhat.com after failed monitor: rc=7 (update=value++, time=1426081566)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: update_failcount: Updating failcount for hascript on jrummy7-1.usersys.redhat.com after failed monitor: rc=7 (update=value++, time=1426081566)
Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: run_graph: Transition 98 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3078.bz2): Complete
Mar 11 09:46:06 jrummy7-1 pengine[5686]: warning: unpack_rsc_op: Pretending the failure of hascript_start_0 (rc=1) on jrummy7-1.usersys.redhat.com succeeded
Mar 11 09:46:06 jrummy7-1 pengine[5686]: warning: unpack_rsc_op_failure: Processing failed op monitor for hascript on jrummy7-1.usersys.redhat.com: not running (7)
Mar 11 09:46:06 jrummy7-1 pengine[5686]: warning: common_apply_stickiness: Forcing hascript away from jrummy7-1.usersys.redhat.com after 1000000 failures (max=1000000)


Version-Release number of selected component (if applicable): pacemaker-1.1.12-22.el7, although RHEL 7.0 releases are affected as well


How reproducible: Easily


Steps to Reproduce:
1. Create a resource that can fail on start, but then still be functional afterwards and return success on monitor. I use a simple LSB script.

2. Set op start on-fail=ignore:

  # pcs resource update <resource> op start on-fail=ignore

3. Set a location constraint to prefer one node, so you can arrange for the start to fail on that node but succeed on the other (if your test resource requires such a step).

4. Enable the resource such that it fails on start but is actually running afterwards, and its monitor operation returns success.
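The failing-start resource from step 1 can be sketched as a minimal LSB-style script. This is a hedged illustration, not the reporter's actual script: the `hascript` name and the state-file path are assumptions.

```shell
#!/bin/sh
# Minimal LSB-style agent sketch: "start" leaves the service running
# (state file present) yet reports failure, so a later monitor succeeds.
# The script name and state-file path are illustrative.
STATE="${STATE:-/tmp/hascript.state}"

do_start() {
    touch "$STATE"    # the service actually comes up...
    return 1          # ...but start still reports an error (rc=1)
}

do_stop() {
    rm -f "$STATE"
}

do_status() {
    # LSB status convention: 0 = running, 3 = stopped
    if [ -f "$STATE" ]; then return 0; else return 3; fi
}

case "$1" in
    start)  do_start ;;
    stop)   do_stop ;;
    status) do_status ;;
esac
```

Installed under /etc/init.d and added as an lsb-class resource, every start "fails" while status keeps reporting success, which reproduces the condition described above.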

Actual results: Resource will move to another node


Expected results: Resource stays put as long as monitor operations are successful


Additional info: Issue affects RHEL 6 as well.

--- Additional comment from John Ruemker on 2015-03-11 10:02:58 EDT ---

I couldn't come up with a good solution here. It seemed best for crmd to avoid setting failcount in the first place if on-fail=ignore, but there wasn't an immediately obvious way to make this setting available to it for the result it was processing.

If you need me to test anything or provide any additional info, just let me know.

Comment 1 John Ruemker 2015-03-11 14:04:10 UTC
Filing for RHEL 6 as it is affected as well, and customer needs the solution for this release.

Comment 2 Ken Gaillot 2015-07-29 19:14:52 UTC
Fixed upstream as of commit 9470f07.

Comment 10 errata-xmlrpc 2016-05-10 23:51:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0856.html

