Hide Forgot
+++ This bug was initially created as a clone of Bug #1200849 +++ Description of problem: When a resource is set with 'op start on-fail=ignore', pengine does attempt to ignore a start failure, but crmd still sets the failcount to INFINITY which eventually causes pengine to move the resource away anyways. This effectively means there's no way to ignore start failures. Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: te_rsc_command: Initiating action 6: start hascript_start_0 on jrummy7-1.usersys.redhat.com (local) Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: process_lrm_event: Operation hascript_start_0: unknown error (node=jrummy7-1.usersys.redhat.com, call=29, rc=1, cib-update=169, confirmed=true) Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: status_from_rc: Action 6 (hascript_start_0) on jrummy7-1.usersys.redhat.com failed (target: 0 vs. rc: 1): Error Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: update_failcount: Updating failcount for hascript on jrummy7-1.usersys.redhat.com after failed start: rc=1 (update=INFINITY, time=1426081566) Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: abort_transition_graph: Transition aborted by hascript_start_0 'modify' on jrummy7-1.usersys.redhat.com: Event failed (magic=0:1;6:97:0:9a793c1c-8193-4258-88cd-7f3d9ca48848, cib=0.17.1, source=match_graph_event:350, 0) Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: update_failcount: Updating failcount for hascript on jrummy7-1.usersys.redhat.com after failed start: rc=1 (update=INFINITY, time=1426081566) Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: status_from_rc: Action 6 (hascript_start_0) on jrummy7-1.usersys.redhat.com failed (target: 0 vs. rc: 1): Error Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: update_failcount: Updating failcount for hascript on jrummy7-1.usersys.redhat.com after failed start: rc=1 (update=INFINITY, time=1426081566) Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: update_failcount: Updating failcount for hascript on jrummy7-1.usersys.redhat.com after failed start: rc=1 (update=INFINITY, time=1426081566) Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: run_graph: Transition 97 (Complete=1, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3077.bz2): Stopped Mar 11 09:46:06 jrummy7-1 pengine[5686]: warning: unpack_rsc_op: Pretending the failure of hascript_start_0 (rc=1) on jrummy7-1.usersys.redhat.com succeeded Mar 11 09:46:06 jrummy7-1 pengine[5686]: warning: unpack_rsc_op: Pretending the failure of hascript_start_0 (rc=1) on jrummy7-1.usersys.redhat.com succeeded Mar 11 09:46:06 jrummy7-1 pengine[5686]: notice: process_pe_message: Calculated Transition 98: /var/lib/pacemaker/pengine/pe-input-3078.bz2 Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: te_rsc_command: Initiating action 8: monitor hascript_monitor_60000 on jrummy7-1.usersys.redhat.com (local) Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: abort_transition_graph: Transition aborted by status-1-fail-count-hascript, fail-count-hascript=INFINITY: Transient attribute change (create cib=0.17.2, source=te_update_diff:391, path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1'], 0) Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: process_lrm_event: Operation hascript_monitor_60000: not running (node=jrummy7-1.usersys.redhat.com, call=30, rc=7, cib-update=171, confirmed=false) Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: status_from_rc: Action 8 (hascript_monitor_60000) on jrummy7-1.usersys.redhat.com failed (target: 0 vs. rc: 7): Error Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: update_failcount: Updating failcount for hascript on jrummy7-1.usersys.redhat.com after failed monitor: rc=7 (update=value++, time=1426081566) Mar 11 09:46:06 jrummy7-1 crmd[5687]: warning: update_failcount: Updating failcount for hascript on jrummy7-1.usersys.redhat.com after failed monitor: rc=7 (update=value++, time=1426081566) Mar 11 09:46:06 jrummy7-1 crmd[5687]: notice: run_graph: Transition 98 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3078.bz2): Complete Mar 11 09:46:06 jrummy7-1 pengine[5686]: warning: unpack_rsc_op: Pretending the failure of hascript_start_0 (rc=1) on jrummy7-1.usersys.redhat.com succeeded Mar 11 09:46:06 jrummy7-1 pengine[5686]: warning: unpack_rsc_op_failure: Processing failed op monitor for hascript on jrummy7-1.usersys.redhat.com: not running (7) Mar 11 09:46:06 jrummy7-1 pengine[5686]: warning: common_apply_stickiness: Forcing hascript away from jrummy7-1.usersys.redhat.com after 1000000 failures (max=1000000) Version-Release number of selected component (if applicable): pacemaker-1.1.12-22.el7, although RHEL 7.0 releases are affected as well How reproducible: Easily Steps to Reproduce: 1. Create a resource that can fail on start, but then still be functional afterwards and return success on monitor. I use a simple lsb script. 2. Set op start on-fail=ignore: # pcs resource update <resource> op start on-fail=ignore 3. Set a location constraint to prefer one node so you can control the conditions to cause it to fail on start on that node, but be successful on the other node (if your test resource requires such a step). 4. Enable resource such that it will fail on start, but will actually be running and monitor returns success. Actual results: Resource will move to another node Expected results: Resource stays put as long as monitor operations are successful Additional info: Issue affects RHEL 6 as well. --- Additional comment from John Ruemker on 2015-03-11 10:02:58 EDT --- I couldn't come up with a good solution here. It seemed best for crmd to avoid setting failcount in the first place if on-fail=ignore, but there wasn't an immediately obvious way to make this setting available to it for the result it was processing. If you need me to test anything or provide any additional info, just let me know.
Filing for RHEL 6 as it is affected as well, and customer needs the solution for this release.
Fixed upstream as of commit 9470f07.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0856.html