Description of problem:

If a resource fails an operation and then the fail-count for that operation is cleared, the last_failure op history object never gets removed through pe__clear_failcount. So neither `pcs resource cleanup` nor a failure-timeout expiration causes the last_failure to be removed.

As a result, if the resource has the failure-timeout meta attribute set, messages like the following appear at every transition:

Mar 10 18:04:41 fastvm-rhel-8-0-23 pacemaker-schedulerd[45842]: notice: Rescheduling dummy1_monitor_10000 after failure expired on node1

This issue is not known to cause any impact.

-----

Demonstration:

# pcs resource config dummy1
 Resource: dummy1 (class=ocf provider=heartbeat type=Dummy)
  Meta Attrs: failure-timeout=20s migration-threshold=1
  Operations: migrate_from interval=0s timeout=20s (dummy1-migrate_from-interval-0s)
              migrate_to interval=0s timeout=20s (dummy1-migrate_to-interval-0s)
              monitor interval=10s timeout=20s (dummy1-monitor-interval-10s)
              start interval=0s timeout=20s (dummy1-start-interval-0s)
              stop interval=0s timeout=20s (dummy1-stop-interval-0s)

Mar 10 18:10:32 fastvm-rhel-8-0-23 pacemaker-attrd[45841]: notice: Setting fail-count-dummy1#monitor_10000[node1]: (unset) -> 1
Mar 10 18:10:32 fastvm-rhel-8-0-23 pacemaker-attrd[45841]: notice: Setting last-failure-dummy1#monitor_10000[node1]: 1615428241 -> 1615428632
Mar 10 18:10:32 fastvm-rhel-8-0-23 pacemaker-schedulerd[45842]: warning: Unexpected result (error) was recorded for monitor of dummy1 on node1 at Mar 10 18:10:32 2021
Mar 10 18:10:32 fastvm-rhel-8-0-23 pacemaker-schedulerd[45842]: notice: * Recover dummy1 ( node1 )
Mar 10 18:10:32 fastvm-rhel-8-0-23 pacemaker-schedulerd[45842]: notice: Calculated transition 19, saving inputs in /var/lib/pacemaker/pengine/pe-input-2736.bz2
Mar 10 18:10:32 fastvm-rhel-8-0-23 pacemaker-schedulerd[45842]: warning: Unexpected result (error) was recorded for monitor of dummy1 on node1 at Mar 10 18:10:32 2021
Mar 10 18:10:32 fastvm-rhel-8-0-23 pacemaker-schedulerd[45842]: warning: Forcing dummy1 away from node1 after 1 failures (max=1)
Mar 10 18:10:32 fastvm-rhel-8-0-23 pacemaker-schedulerd[45842]: notice: * Stop dummy1 ( node1 ) due to node availability
Mar 10 18:10:32 fastvm-rhel-8-0-23 pacemaker-schedulerd[45842]: notice: Calculated transition 20, saving inputs in /var/lib/pacemaker/pengine/pe-input-2737.bz2
Mar 10 18:10:32 fastvm-rhel-8-0-23 pacemaker-controld[45843]: notice: Initiating stop operation dummy1_stop_0 locally on node1
Mar 10 18:10:32 fastvm-rhel-8-0-23 pacemaker-controld[45843]: notice: Result of stop operation for dummy1 on node1: ok
Mar 10 18:10:32 fastvm-rhel-8-0-23 pacemaker-controld[45843]: notice: Transition 20 (Complete=3, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2737.bz2): Complete
Mar 10 18:10:32 fastvm-rhel-8-0-23 pacemaker-controld[45843]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
#
# Here is where I manually cleared the fail-count attribute
Mar 10 18:10:44 fastvm-rhel-8-0-23 pacemaker-attrd[45841]: notice: Setting fail-count-dummy1#monitor_10000[node1]: 1 -> (unset)
Mar 10 18:10:44 fastvm-rhel-8-0-23 pacemaker-controld[45843]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Mar 10 18:10:44 fastvm-rhel-8-0-23 pacemaker-schedulerd[45842]: warning: Unexpected result (error) was recorded for monitor of dummy1 on node1 at Mar 10 18:10:32 2021
Mar 10 18:10:44 fastvm-rhel-8-0-23 pacemaker-schedulerd[45842]: notice: * Start dummy1 ( node1 )
Mar 10 18:10:44 fastvm-rhel-8-0-23 pacemaker-schedulerd[45842]: notice: Calculated transition 21, saving inputs in /var/lib/pacemaker/pengine/pe-input-2738.bz2
Mar 10 18:10:44 fastvm-rhel-8-0-23 pacemaker-controld[45843]: notice: Initiating start operation dummy1_start_0 locally on node1
Mar 10 18:10:44 fastvm-rhel-8-0-23 pacemaker-controld[45843]: notice: Result of start operation for dummy1 on node1: ok
Mar 10 18:10:44 fastvm-rhel-8-0-23 pacemaker-controld[45843]: notice: Initiating monitor operation dummy1_monitor_10000 locally on node1
Mar 10 18:10:44 fastvm-rhel-8-0-23 pacemaker-controld[45843]: notice: Result of monitor operation for dummy1 on node1: ok
Mar 10 18:10:44 fastvm-rhel-8-0-23 pacemaker-controld[45843]: notice: Transition 21 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2738.bz2): Complete
Mar 10 18:10:44 fastvm-rhel-8-0-23 pacemaker-controld[45843]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
#
# Thereafter, every transition contains a "Rescheduling <op> after failure expired" message
#
# (with a 30-second cluster-recheck-interval below)
Mar 10 18:11:14 fastvm-rhel-8-0-23 pacemaker-schedulerd[45842]: notice: Rescheduling dummy1_monitor_10000 after failure expired on node1
Mar 10 18:11:44 fastvm-rhel-8-0-23 pacemaker-schedulerd[45842]: notice: Rescheduling dummy1_monitor_10000 after failure expired on node1
...

The CIB contains the following:

          <lrm_rsc_op id="dummy1_last_failure_0" operation_key="dummy1_monitor_10000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.4.1" transition-key="11:6:0:ec20383e-c05a-46bf-8269-2975f953f0d1" transition-magic="0:1;11:6:0:ec20383e-c05a-46bf-8269-2975f953f0d1" exit-reason="" on_node="node1" call-id="24" rc-code="1" op-status="0" interval="10000" last-rc-change="1615428632" exec-time="0" queue-time="0" op-digest="4811cef7f7f94e3a35a70be7916cb2fd"/>

      <transient_attributes id="1">
        <instance_attributes id="status-1">
          <nvpair id="status-1-last-failure-dummy1.monitor_10000" name="last-failure-dummy1#monitor_10000" value="1615428632"/>
        </instance_attributes>
      </transient_attributes>

Running `pcs resource cleanup` does **not** cause the last_failure op history object to be removed. The failure is also not cleaned up automatically when the failure-timeout expires.

However, if you re-add the fail-count attribute (e.g., `crm_attribute --type status -n 'fail-count-dummy1#monitor_10000' -v 1 -N node1`) and then run `pcs resource cleanup` again, it **does** remove the last_failure op history item.

-----

Version-Release number of selected component (if applicable):

pacemaker-2.0.4-6.el8_3.1

-----

How reproducible:

Always

-----

Steps to Reproduce:
1. Allow a resource operation to fail. I used a recurring monitor. If you want to observe the repeated "Rescheduling" message, set the resource's failure-timeout meta attribute.
2. Delete the fail-count attribute on the node where the failure occurred. For example:
   # crm_attribute --type status -n 'fail-count-dummy1#monitor_10000' -D -N node1
3. Run `pcs resource cleanup`.

-----

Actual results:

The last-failure attribute and the last_failure op history object both still exist in the CIB. If the failure-timeout meta attribute is set, then a "Rescheduling <op> after failure expired" message appears in pacemaker.log during every transition after the failure-timeout has expired.

-----

Expected results:

The last-failure attribute and the last_failure op history object have been cleared from the CIB. There are no repetitive "Rescheduling" messages in the logs.
-----

Additional info:

`pcs resource cleanup` does remove the last-failure **attribute** but not the last_failure **op history**. `pcs resource cleanup` doesn't remove the last_failure because crm_resource --cleanup relies on data_set->failed, which is empty in this case.

The last_failure doesn't get removed automatically when the failure-timeout expires because check_operation_expiry calls pe__clear_failcount for an expired op **only if** pe_get_failcount finds a failcount or remote_reconnect_ms is set. In this case, pe_get_failcount doesn't find a failcount since the attribute has been cleared.
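As a rough illustration of that conditional, here is a minimal, self-contained C model of the decision described above. All names here (struct expired_op, clear_reason) are hypothetical stand-ins, not Pacemaker's real types or source; it only mirrors the behavior as described:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of the check_operation_expiry decision. */
struct expired_op {
    bool expired;                 /* failure-timeout has elapsed             */
    int fail_count;               /* 0 once the attribute is deleted by hand */
    unsigned remote_reconnect_ms; /* nonzero only for remote connections     */
};

/* Returns the reason the failcount would be cleared with, or NULL when
 * pe__clear_failcount would never be called -- the scenario in this bug. */
static const char *clear_reason(const struct expired_op *op)
{
    if (!op->expired) {
        return NULL;
    }
    if (op->fail_count > 0) {        /* pe_get_failcount found a failcount */
        return "it expired";
    }
    if (op->remote_reconnect_ms > 0) {
        return "reconnect interval expired";
    }
    /* Failcount already cleared and not a remote connection: the expired
     * last_failure op history entry is left in the CIB, and the scheduler
     * logs "Rescheduling ... after failure expired" on every transition. */
    return NULL;
}

int main(void)
{
    struct expired_op op = { .expired = true, .fail_count = 0,
                             .remote_reconnect_ms = 0 };
    const char *reason = clear_reason(&op);

    printf("%s\n", reason ? reason : "last_failure never cleaned up");
    return 0;
}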
crm_resource --cleanup should remove last failures without any fail count, but it currently doesn't. The workaround is to use --refresh (aka pcs resource refresh).
(In reply to Ken Gaillot from comment #2)
> crm_resource --cleanup should remove last failures without any fail count,
> but it currently doesn't.
> 
> The workaround is to use --refresh (aka pcs resource refresh).

Am I missing something?

[root@fastvm-rhel-8-0-23 pacemaker]# pcs cluster cib | grep dummy1_last_failure
        <lrm_rsc_op id="dummy1_last_failure_0" operation_key="dummy1_monitor_10000" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_set="3.7.2" transition-key="12:44:0:7e3f2ca6-9ec9-421a-8bcc-8a81d8da9e40" transition-magic="0:1;12:44:0:7e3f2ca6-9ec9-421a-8bcc-8a81d8da9e40" exit-reason="" on_node="node1" call-id="98" rc-code="1" op-status="0" interval="10000" last-rc-change="1615492659" exec-time="0" queue-time="0" op-digest="4811cef7f7f94e3a35a70be7916cb2fd"/>

[root@fastvm-rhel-8-0-23 pacemaker]# pcs resource refresh
Waiting for 1 reply from the controller
... got reply (done)

[root@fastvm-rhel-8-0-23 pacemaker]# pcs cluster cib | grep dummy1_last_failure
        <lrm_rsc_op id="dummy1_last_failure_0" operation_key="dummy1_monitor_0" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.7.2" transition-key="4:96:7:7e3f2ca6-9ec9-421a-8bcc-8a81d8da9e40" transition-magic="0:0;4:96:7:7e3f2ca6-9ec9-421a-8bcc-8a81d8da9e40" exit-reason="" on_node="node1" call-id="120" rc-code="0" op-status="0" interval="0" last-rc-change="1615492808" last-run="1615492808" exec-time="14" queue-time="0" op-digest="f2317cad3d54cec5d7d7aa7d0bf35cf8"/>
(In reply to Reid Wahl from comment #3)
> (In reply to Ken Gaillot from comment #2)
> > crm_resource --cleanup should remove last failures without any fail count,
> > but it currently doesn't.
> > 
> > The workaround is to use --refresh (aka pcs resource refresh).
> 
> Am I missing something?

Yes

> [root@fastvm-rhel-8-0-23 pacemaker]# pcs cluster cib | grep
> dummy1_last_failure
>         <lrm_rsc_op id="dummy1_last_failure_0"
> operation_key="dummy1_monitor_10000" operation="monitor"
                                ^^^^^
> crm-debug-origin="build_active_RAs" crm_feature_set="3.7.2"
> transition-key="12:44:0:7e3f2ca6-9ec9-421a-8bcc-8a81d8da9e40"
> transition-magic="0:1;12:44:0:7e3f2ca6-9ec9-421a-8bcc-8a81d8da9e40"
> exit-reason="" on_node="node1" call-id="98" rc-code="1" op-status="0"
                                                       ^
> interval="10000" last-rc-change="1615492659" exec-time="0" queue-time="0"
                                   ^^^^^^^^^^
> op-digest="4811cef7f7f94e3a35a70be7916cb2fd"/>
> 
> [root@fastvm-rhel-8-0-23 pacemaker]# pcs resource refresh
> Waiting for 1 reply from the controller
> ... got reply (done)
> 
> [root@fastvm-rhel-8-0-23 pacemaker]# pcs cluster cib | grep
> dummy1_last_failure
>         <lrm_rsc_op id="dummy1_last_failure_0"
> operation_key="dummy1_monitor_0" operation="monitor"
                                ^
> crm-debug-origin="do_update_resource" crm_feature_set="3.7.2"
> transition-key="4:96:7:7e3f2ca6-9ec9-421a-8bcc-8a81d8da9e40"
> transition-magic="0:0;4:96:7:7e3f2ca6-9ec9-421a-8bcc-8a81d8da9e40"
> exit-reason="" on_node="node1" call-id="120" rc-code="0" op-status="0"
                                                        ^
> interval="0" last-rc-change="1615492808" last-run="1615492808"
                              ^^^^^^^^^^
> exec-time="14" queue-time="0" op-digest="f2317cad3d54cec5d7d7aa7d0bf35cf8"/>

The probe after the refresh is considered a "failure" since probes are always expected to find "not running". That's somewhat unfortunate but everything works correctly.
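To make that probe semantics concrete, here is a tiny self-contained C sketch of the behavior described above. The helper name is hypothetical; only the OCF return-code values (0 = running/OK, 7 = not running) are standard:

#include <stdbool.h>
#include <stdio.h>

/* Standard OCF resource agent return codes. */
#define OCF_OK          0
#define OCF_NOT_RUNNING 7

/* Hypothetical helper: a probe (interval=0 monitor) is expected to report
 * "not running", so any other rc is recorded as an unexpected result and
 * becomes the resource's last_failure entry -- even rc 0 from a probe that
 * correctly found the resource running after a refresh. */
static bool probe_rc_is_unexpected(int rc)
{
    return rc != OCF_NOT_RUNNING;
}

int main(void)
{
    int rc = OCF_OK; /* the post-refresh probe: resource is running as expected */
    printf("recorded as last_failure: %s\n",
           probe_rc_is_unexpected(rc) ? "yes (benign)" : "no");
    return 0;
}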
Ahhhh okay :)