Bug 1937579

Summary: last_failure op never gets cleared if fail-count has been cleared

Product: Red Hat Enterprise Linux 8
Component: pacemaker
Version: 8.3
Hardware: All
OS: Linux
Status: NEW
Severity: low
Priority: low
Reporter: Reid Wahl <nwahl>
Assignee: Ken Gaillot <kgaillot>
QA Contact: cluster-qe <cluster-qe>
CC: chaekim, cluster-maint, kgaillot
Target Milestone: rc
Target Release: 8.5
Keywords: Triaged
Type: Bug

Description Reid Wahl 2021-03-11 03:07:48 UTC
Description of problem:

If a resource fails an operation and then the fail-count for that operation is cleared, the last_failure op history object never gets removed through pe__clear_failcount. So neither `pcs resource cleanup` nor a failure timeout expiration causes the last_failure to be removed.

As a result, if the resource has the failure-timeout meta attribute set, messages like the following appear at every transition:

    Mar 10 18:04:41 fastvm-rhel-8-0-23 pacemaker-schedulerd[45842]: notice: Rescheduling dummy1_monitor_10000 after failure expired on node1


Apart from the repeated log messages, this issue is not known to cause any functional impact.

-----

Demonstration:

# pcs resource config dummy1
 Resource: dummy1 (class=ocf provider=heartbeat type=Dummy)
  Meta Attrs: failure-timeout=20s migration-threshold=1
  Operations: migrate_from interval=0s timeout=20s (dummy1-migrate_from-interval-0s)
              migrate_to interval=0s timeout=20s (dummy1-migrate_to-interval-0s)
              monitor interval=10s timeout=20s (dummy1-monitor-interval-10s)
              start interval=0s timeout=20s (dummy1-start-interval-0s)
              stop interval=0s timeout=20s (dummy1-stop-interval-0s)

Mar 10 18:10:32 fastvm-rhel-8-0-23 pacemaker-attrd[45841]: notice: Setting fail-count-dummy1#monitor_10000[node1]: (unset) -> 1
Mar 10 18:10:32 fastvm-rhel-8-0-23 pacemaker-attrd[45841]: notice: Setting last-failure-dummy1#monitor_10000[node1]: 1615428241 -> 1615428632
Mar 10 18:10:32 fastvm-rhel-8-0-23 pacemaker-schedulerd[45842]: warning: Unexpected result (error) was recorded for monitor of dummy1 on node1 at Mar 10 18:10:32 2021
Mar 10 18:10:32 fastvm-rhel-8-0-23 pacemaker-schedulerd[45842]: notice:  * Recover    dummy1      (     node1 )
Mar 10 18:10:32 fastvm-rhel-8-0-23 pacemaker-schedulerd[45842]: notice: Calculated transition 19, saving inputs in /var/lib/pacemaker/pengine/pe-input-2736.bz2
Mar 10 18:10:32 fastvm-rhel-8-0-23 pacemaker-schedulerd[45842]: warning: Unexpected result (error) was recorded for monitor of dummy1 on node1 at Mar 10 18:10:32 2021
Mar 10 18:10:32 fastvm-rhel-8-0-23 pacemaker-schedulerd[45842]: warning: Forcing dummy1 away from node1 after 1 failures (max=1)
Mar 10 18:10:32 fastvm-rhel-8-0-23 pacemaker-schedulerd[45842]: notice:  * Stop       dummy1      (     node1 )   due to node availability
Mar 10 18:10:32 fastvm-rhel-8-0-23 pacemaker-schedulerd[45842]: notice: Calculated transition 20, saving inputs in /var/lib/pacemaker/pengine/pe-input-2737.bz2
Mar 10 18:10:32 fastvm-rhel-8-0-23 pacemaker-controld[45843]: notice: Initiating stop operation dummy1_stop_0 locally on node1
Mar 10 18:10:32 fastvm-rhel-8-0-23 pacemaker-controld[45843]: notice: Result of stop operation for dummy1 on node1: ok
Mar 10 18:10:32 fastvm-rhel-8-0-23 pacemaker-controld[45843]: notice: Transition 20 (Complete=3, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2737.bz2): Complete
Mar 10 18:10:32 fastvm-rhel-8-0-23 pacemaker-controld[45843]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE

# # Here is where I manually cleared the fail-count attribute
Mar 10 18:10:44 fastvm-rhel-8-0-23 pacemaker-attrd[45841]: notice: Setting fail-count-dummy1#monitor_10000[node1]: 1 -> (unset)
Mar 10 18:10:44 fastvm-rhel-8-0-23 pacemaker-controld[45843]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Mar 10 18:10:44 fastvm-rhel-8-0-23 pacemaker-schedulerd[45842]: warning: Unexpected result (error) was recorded for monitor of dummy1 on node1 at Mar 10 18:10:32 2021
Mar 10 18:10:44 fastvm-rhel-8-0-23 pacemaker-schedulerd[45842]: notice:  * Start      dummy1      (     node1 )
Mar 10 18:10:44 fastvm-rhel-8-0-23 pacemaker-schedulerd[45842]: notice: Calculated transition 21, saving inputs in /var/lib/pacemaker/pengine/pe-input-2738.bz2
Mar 10 18:10:44 fastvm-rhel-8-0-23 pacemaker-controld[45843]: notice: Initiating start operation dummy1_start_0 locally on node1
Mar 10 18:10:44 fastvm-rhel-8-0-23 pacemaker-controld[45843]: notice: Result of start operation for dummy1 on node1: ok
Mar 10 18:10:44 fastvm-rhel-8-0-23 pacemaker-controld[45843]: notice: Initiating monitor operation dummy1_monitor_10000 locally on node1
Mar 10 18:10:44 fastvm-rhel-8-0-23 pacemaker-controld[45843]: notice: Result of monitor operation for dummy1 on node1: ok
Mar 10 18:10:44 fastvm-rhel-8-0-23 pacemaker-controld[45843]: notice: Transition 21 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2738.bz2): Complete
Mar 10 18:10:44 fastvm-rhel-8-0-23 pacemaker-controld[45843]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE


# # Thereafter, every transition contains a "Rescheduling <op> after failure expired" message
# # (with a 30-second cluster-recheck-interval below)
Mar 10 18:11:14 fastvm-rhel-8-0-23 pacemaker-schedulerd[45842]: notice: Rescheduling dummy1_monitor_10000 after failure expired on node1
Mar 10 18:11:44 fastvm-rhel-8-0-23 pacemaker-schedulerd[45842]: notice: Rescheduling dummy1_monitor_10000 after failure expired on node1
...


The CIB contains the following:

            <lrm_rsc_op id="dummy1_last_failure_0" operation_key="dummy1_monitor_10000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.4.1" transition-key="11:6:0:ec20383e-c05a-46bf-8269-2975f953f0d1" transition-magic="0:1;11:6:0:ec20383e-c05a-46bf-8269-2975f953f0d1" exit-reason="" on_node="node1" call-id="24" rc-code="1" op-status="0" interval="10000" last-rc-change="1615428632" exec-time="0" queue-time="0" op-digest="4811cef7f7f94e3a35a70be7916cb2fd"/>

      <transient_attributes id="1">
        <instance_attributes id="status-1">
          <nvpair id="status-1-last-failure-dummy1.monitor_10000" name="last-failure-dummy1#monitor_10000" value="1615428632"/>
        </instance_attributes>
      </transient_attributes>


Running `pcs resource cleanup` does **not** cause the last_failure op history object to be removed. The failure is also not cleaned up automatically when the failure-timeout expires.

However, if you re-add the fail-count attribute (e.g., `crm_attribute --type status -n 'fail-count-dummy1#monitor_10000' -v 1 -N node1`) and then run `pcs resource cleanup` again, it **does** remove the last_failure op history item.

-----

Version-Release number of selected component (if applicable):

pacemaker-2.0.4-6.el8_3.1

-----

How reproducible:

Always

-----

Steps to Reproduce:

1. Allow a resource operation to fail. I used a recurring monitor. If you want to observe the repeated "Rescheduling" message, set the resource's failure-timeout meta attribute.

2. Delete the fail-count attribute on the node where the failure occurred. For example:

    # crm_attribute --type status -n 'fail-count-dummy1#monitor_10000' -D -N node1

3. Run `pcs resource cleanup`.

-----

Actual results:

The last-failure attribute and the last_failure op history object both still exist in the CIB. If the failure-timeout meta attribute is set, then a "Rescheduling <op> after failure expired" message appears in pacemaker.log during every transition after the failure-timeout has expired.

-----

Expected results:

The last-failure attribute and the last_failure op history object have been cleared from the CIB. There are no repetitive "Rescheduling" messages in the logs.

-----

Additional info:

`pcs resource cleanup` does remove the last-failure **attribute** but not the last_failure **op history**.

`pcs resource cleanup` doesn't remove the last_failure because crm_resource --cleanup relies on data_set->failed, which is empty in this case.
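For illustration, here is a minimal self-contained model of that filtering (stand-in types and data only, not the actual crm_resource source; the real code walks the failed-operations list that the scheduler builds):

    /* Minimal model of the cleanup selection described above.
     * Stand-in types and data for illustration only -- not the actual
     * crm_resource source. Cleanup derives its work list from
     * data_set->failed, which the scheduler populates only with
     * operations it still counts as failed. */
    #include <stdio.h>

    struct failed_op {
        const char *op_key;   /* e.g. "dummy1_monitor_10000" */
    };

    int main(void)
    {
        /* With the fail-count attribute cleared, the scheduler reports
         * no failed operations, so the "list" handed to cleanup is empty. */
        struct failed_op failed[1];
        int n_failed = 0;

        for (int i = 0; i < n_failed; i++) {
            printf("erasing history for %s\n", failed[i].op_key); /* never runs */
        }
        /* The loop body is never reached, so the stale last_failure
         * entry in the CIB is left untouched. */
        return 0;
    }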

The last_failure doesn't get removed automatically when the failure-timeout expires because check_operation_expiry calls pe__clear_failcount for an expired op **only if** pe_get_failcount finds a failcount or remote_reconnect_ms is set. In this case, pe_get_failcount doesn't find a failcount since the attribute has been cleared.
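A self-contained sketch of that gating logic (condensed stubs standing in for the cited functions, not the verbatim Pacemaker source):

    /* Condensed model of the expiry gate in check_operation_expiry.
     * The stub functions stand in for the real pe_get_failcount() and
     * pe__clear_failcount(); this illustrates the control flow only,
     * not the verbatim Pacemaker source. */
    #include <stdbool.h>
    #include <stdio.h>

    static int  fail_count = 0;              /* attribute cleared by hand */
    static long remote_reconnect_ms = 0;     /* not a remote connection */
    static bool last_failure_in_cib = true;  /* stale lrm_rsc_op entry */

    static int  get_failcount_stub(void) { return fail_count; }
    static void clear_failcount_stub(void) { last_failure_in_cib = false; }

    int main(void)
    {
        bool expired = true;           /* failure-timeout has elapsed */
        const char *clear_reason = NULL;

        /* The gate: cleanup is scheduled only if a fail count is still
         * recorded, or if remote_reconnect_ms is set. Neither holds here. */
        if (expired && get_failcount_stub() > 0) {
            clear_reason = "it expired";
        } else if (remote_reconnect_ms > 0) {
            clear_reason = "remote reconnect interval";
        }

        if (clear_reason != NULL) {
            clear_failcount_stub();    /* never reached in this scenario */
        }

        printf("stale last_failure still in CIB: %s\n",
               last_failure_in_cib ? "yes" : "no"); /* prints "yes" */
        return 0;
    }

Because the entry is never erased, each subsequent transition re-detects the expired failure and logs the "Rescheduling" message again.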

Comment 2 Ken Gaillot 2021-03-11 14:53:14 UTC
crm_resource --cleanup should remove last failures without any fail count, but it currently doesn't.

The workaround is to use --refresh (aka pcs resource refresh).

Comment 3 Reid Wahl 2021-03-11 20:00:50 UTC
(In reply to Ken Gaillot from comment #2)
> crm_resource --cleanup should remove last failures without any fail count,
> but it currently doesn't.
> 
> The workaround is to use --refresh (aka pcs resource refresh).

Am I missing something?

[root@fastvm-rhel-8-0-23 pacemaker]# pcs cluster cib | grep dummy1_last_failure
            <lrm_rsc_op id="dummy1_last_failure_0" operation_key="dummy1_monitor_10000" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_set="3.7.2" transition-key="12:44:0:7e3f2ca6-9ec9-421a-8bcc-8a81d8da9e40" transition-magic="0:1;12:44:0:7e3f2ca6-9ec9-421a-8bcc-8a81d8da9e40" exit-reason="" on_node="node1" call-id="98" rc-code="1" op-status="0" interval="10000" last-rc-change="1615492659" exec-time="0" queue-time="0" op-digest="4811cef7f7f94e3a35a70be7916cb2fd"/>

[root@fastvm-rhel-8-0-23 pacemaker]# pcs resource refresh
Waiting for 1 reply from the controller
... got reply (done)

[root@fastvm-rhel-8-0-23 pacemaker]# pcs cluster cib | grep dummy1_last_failure
            <lrm_rsc_op id="dummy1_last_failure_0" operation_key="dummy1_monitor_0" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.7.2" transition-key="4:96:7:7e3f2ca6-9ec9-421a-8bcc-8a81d8da9e40" transition-magic="0:0;4:96:7:7e3f2ca6-9ec9-421a-8bcc-8a81d8da9e40" exit-reason="" on_node="node1" call-id="120" rc-code="0" op-status="0" interval="0" last-rc-change="1615492808" last-run="1615492808" exec-time="14" queue-time="0" op-digest="f2317cad3d54cec5d7d7aa7d0bf35cf8"/>

Comment 4 Ken Gaillot 2021-03-11 20:09:20 UTC
(In reply to Reid Wahl from comment #3)
> (In reply to Ken Gaillot from comment #2)
> > crm_resource --cleanup should remove last failures without any fail count,
> > but it currently doesn't.
> > 
> > The workaround is to use --refresh (aka pcs resource refresh).
> 
> Am I missing something?

Yes

> [root@fastvm-rhel-8-0-23 pacemaker]# pcs cluster cib | grep
> dummy1_last_failure
>             <lrm_rsc_op id="dummy1_last_failure_0"
> operation_key="dummy1_monitor_10000" operation="monitor"
                                ^^^^^
> crm-debug-origin="build_active_RAs" crm_feature_set="3.7.2"
> transition-key="12:44:0:7e3f2ca6-9ec9-421a-8bcc-8a81d8da9e40"
> transition-magic="0:1;12:44:0:7e3f2ca6-9ec9-421a-8bcc-8a81d8da9e40"
> exit-reason="" on_node="node1" call-id="98" rc-code="1" op-status="0"
                                                       ^
> interval="10000" last-rc-change="1615492659" exec-time="0" queue-time="0"
                                   ^^^^^^^^^^
> op-digest="4811cef7f7f94e3a35a70be7916cb2fd"/>
> 
> [root@fastvm-rhel-8-0-23 pacemaker]# pcs resource refresh
> Waiting for 1 reply from the controller
> ... got reply (done)
> 
> [root@fastvm-rhel-8-0-23 pacemaker]# pcs cluster cib | grep
> dummy1_last_failure
>             <lrm_rsc_op id="dummy1_last_failure_0"
> operation_key="dummy1_monitor_0" operation="monitor"
                                ^
> crm-debug-origin="do_update_resource" crm_feature_set="3.7.2"
> transition-key="4:96:7:7e3f2ca6-9ec9-421a-8bcc-8a81d8da9e40"
> transition-magic="0:0;4:96:7:7e3f2ca6-9ec9-421a-8bcc-8a81d8da9e40"
> exit-reason="" on_node="node1" call-id="120" rc-code="0" op-status="0"
                                                        ^
> interval="0" last-rc-change="1615492808" last-run="1615492808"
                               ^^^^^^^^^^
> exec-time="14" queue-time="0" op-digest="f2317cad3d54cec5d7d7aa7d0bf35cf8"/>

The probe after the refresh is considered a "failure" since probes are always expected to find "not running". That's somewhat unfortunate but everything works correctly.

Comment 5 Reid Wahl 2021-03-11 22:05:02 UTC
Ahhhh okay :)