Description of problem:

NOTE: This seems highly similar, perhaps identical, to CLBZ#5454:
  - Bug 5454 - Lrmd does not report repeated failures, which may cause non-scheduling problems, when the first failure is not handled due to failure aging or other reasons
    (https://bugs.clusterlabs.org/show_bug.cgi?id=5454)

If a monitor failure expires (failure-timeout is set and migration-threshold is hit) and a new transition runs (which reschedules the recurring monitor) before the resource stop operation runs, Pacemaker continues running the recurring monitor but ignores the results. The resource remains in Started state despite any future monitor failures. If the monitor operation continues to fail, no action is taken.

In the below configuration, note the dummy_attr resource and its negative colocation constraint. In a two-node cluster, when dummy is scheduled to move due to migration-threshold, dummy_attr is scheduled to move as well (so that they remain on different nodes). dummy_attr's stop operation updates an attribute, which aborts the transition. Note also stop_delay, which just simulates a resource after dummy in the group that takes a while to stop (like a third-party application).

 Group: testgrp
  Resource: dummy (class=ocf provider=heartbeat type=Dummy)
   Meta Attrs: failure-timeout=10s migration-threshold=1
  Resource: stop_delay (class=ocf provider=heartbeat type=Delay)
   Attributes: mondelay=0 startdelay=0 stopdelay=20
 Resource: dummy_attr (class=ocf provider=pacemaker type=attribute)
  Attributes: name=test

Colocation Constraints:
  dummy_attr with dummy (score:-5000) (id:colocation-dummy_attr-dummy--5000)

Initial state:

  * Resource Group: testgrp:
    * dummy       (ocf::heartbeat:Dummy):       Started node1
    * stop_delay  (ocf::heartbeat:Delay):       Started node1
  * dummy_attr    (ocf::pacemaker:attribute):   Started node2

Now we update the ocf:heartbeat:Dummy resource agent on node 1 so that it always returns 7 for monitor operations:

  #monitor)       dummy_monitor;;
  monitor)        exit $OCF_NOT_RUNNING;;

Shortly after making the edit, we see the below. The dummy resource fails as expected. We schedule dummy and stop_delay to move over to node 2, and we schedule dummy_attr to move to node 1. The stop_delay stop and the dummy_attr stop run concurrently. dummy_attr stops immediately, causing a transition abort.
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Result of monitor operation for dummy on node1: not running
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Transition 11 action 13 (dummy_monitor_10000 on node1): expected 'ok' but got 'not running'
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-attrd[19222]: notice: Setting fail-count-dummy#monitor_10000[node1]: (unset) -> 1
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-attrd[19222]: notice: Setting last-failure-dummy#monitor_10000[node1]: (unset) -> 1626734349
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: warning: Unexpected result (not running) was recorded for monitor of dummy on node1 at Jul 19 15:39:09 2021
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice: * Recover dummy ( node1 )
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice: * Restart stop_delay ( node1 ) due to required dummy start
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice: Calculated transition 16, saving inputs in /var/lib/pacemaker/pengine/pe-input-3651.bz2
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: warning: Unexpected result (not running) was recorded for monitor of dummy on node1 at Jul 19 15:39:09 2021
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: warning: Forcing dummy away from node1 after 1 failures (max=1)
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice: * Recover dummy ( node1 -> node2 )
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice: * Move stop_delay ( node1 -> node2 )
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice: * Move dummy_attr ( node2 -> node1 )
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice: Calculated transition 17, saving inputs in /var/lib/pacemaker/pengine/pe-input-3652.bz2
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Initiating stop operation stop_delay_stop_0 locally on node1
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Requesting local execution of stop operation for stop_delay on node1
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Initiating stop operation dummy_attr_stop_0 on node2
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-attrd[19222]: notice: Setting test[node2]: 1 -> 0
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Transition 17 aborted by status-2-test doing modify test=0: Transient attribute change

20 seconds later, stop_delay finishes stopping. The new transition is calculated. The failure has expired, so the dummy recurring monitor is rescheduled. dummy no longer needs to stop, since the failure is gone. So we start the stop_delay and dummy_attr resources in place and go on about our day.
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Result of stop operation for stop_delay on node1: ok
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Transition 17 (Complete=3, Pending=0, Fired=0, Skipped=2, Incomplete=10, Source=/var/lib/pacemaker/pengine/pe-input-3652.bz2): Stopped
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice: Clearing failure of dummy on node1 because it expired
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice: Clearing failure of dummy on node1 because it expired
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice: Clearing failure of dummy on node1 because it expired
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice: Rescheduling dummy_monitor_10000 after failure expired on node1
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice: * Start stop_delay ( node1 )
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice: * Start dummy_attr ( node2 )
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice: Calculated transition 18, saving inputs in /var/lib/pacemaker/pengine/pe-input-3653.bz2
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Initiating start operation stop_delay_start_0 locally on node1
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Requesting local execution of start operation for stop_delay on node1
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Initiating start operation dummy_attr_start_0 on node2
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-attrd[19222]: notice: Setting last-failure-dummy#monitor_10000[node1]: 1626734349 -> (unset)
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-attrd[19222]: notice: Setting fail-count-dummy#monitor_10000[node1]: 1 -> (unset)
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Transition 18 aborted by deletion of lrm_rsc_op[@id='dummy_last_failure_0']: Resource operation removal
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Result of start operation for stop_delay on node1: ok
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-attrd[19222]: notice: Setting test[node2]: 0 -> 1
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Transition 18 (Complete=5, Pending=0, Fired=0, Skipped=2, Incomplete=2, Source=/var/lib/pacemaker/pengine/pe-input-3653.bz2): Stopped
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice: Calculated transition 19, saving inputs in /var/lib/pacemaker/pengine/pe-input-3654.bz2
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Initiating monitor operation stop_delay_monitor_10000 locally on node1
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Requesting local execution of monitor operation for stop_delay on node1
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Initiating monitor operation dummy_attr_monitor_10000 on node2
Jul 19 15:39:30 fastvm-rhel-8-0-23 Delay(stop_delay)[20526]: INFO: Delay is running OK
Jul 19 15:39:30 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Result of monitor operation for stop_delay on node1: ok
Jul 19 15:39:30 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Transition 19 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3654.bz2): Complete
Jul 19 15:39:30 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Here's the problem: I haven't reverted the ocf:heartbeat:Dummy agent on node 1. It's **still** returning 7 on every monitor operation. Pacemaker just doesn't care. The monitor result isn't getting recorded. This means the monitor will continue to fail indefinitely, and Pacemaker will take no action in response.

-----

Version-Release number of selected component (if applicable):

pacemaker-2.0.5-9.el8_4.1

-----

How reproducible:

Always

-----

Steps to Reproduce:
1. Create the dummy, stop_delay, dummy_attr resources and colocation constraints as shown in the BZ description.
2. On the node where testgrp is running, modify /usr/lib/ocf/resource.d/heartbeat/Dummy so that the monitor operation always returns 7.

  #monitor)       dummy_monitor;;
  monitor)        exit $OCF_NOT_RUNNING;;

-----

Actual results:

The dummy resource's monitor operation fails only once on its initial node. Then the failure expires and the dummy resource goes back to Started state. The dummy resource's recurring monitor operation continues to fail, but all future monitor failures are ignored/not recorded.

-----

Expected results:

Pacemaker initiates recovery every time the dummy resource's monitor operation fails.

-----

Additional info:

This can affect SAP NetWeaver clusters. In these clusters, the ERS instance resource is negatively colocated with the ASCS instance resource and updates an attribute during start/stop. It was observed with failure-timeout=60s for the ASCS instance resource, when the ASCS instance resource failed and a resource later in the group took 65 seconds to finish stopping.

One can (and likely should) argue that the failure-timeout was too short for that particular configuration. But on the other hand, we encourage customers to use the exact parameter settings given in the documentation for SAP clusters, and this behavior is unexpected. If we show the resource in Started state and continue running the recurring monitor, then we should respond to the result of the recurring monitor.
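For reference, a rough set of pcs commands that should recreate the test configuration above (a sketch based on the description; exact pcs syntax may differ slightly between versions):

  # dummy and stop_delay form the group; stop_delay's 20-second stop simulates a
  # slow third-party application that stops after dummy.
  pcs resource create dummy ocf:heartbeat:Dummy \
      meta failure-timeout=10s migration-threshold=1 --group testgrp
  pcs resource create stop_delay ocf:heartbeat:Delay \
      mondelay=0 startdelay=0 stopdelay=20 --group testgrp

  # dummy_attr updates the "test" node attribute on start/stop and is kept away
  # from dummy by the negative colocation constraint.
  pcs resource create dummy_attr ocf:pacemaker:attribute name=test
  pcs constraint colocation add dummy_attr with dummy -5000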
When the failure expires and the recurring monitor is rescheduled, the dummy_last_failure_0 lrm_rsc_op entry (with rc-code="7") is removed from the CIB status section, but the dummy_monitor_10000 entry (with rc-code="0") remains and never gets updated with a new rc-code, even though the dummy recurring monitor keeps running and failing. Of course, `pcs resource refresh` cleans out the operation history and causes Pacemaker to take note of the monitor failures again.
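For anyone checking this state on a live cluster, something along these lines should show the stale history entry (illustrative commands; the XPath is just one way to narrow the query):

  # Dump dummy's recorded operation history from the CIB status section for node1.
  # In the broken state, only the old dummy_monitor_10000 entry (rc-code="0")
  # remains, and it never changes again.
  cibadmin --query --xpath "//node_state[@uname='node1']//lrm_resource[@id='dummy']"

  # As noted above, clearing the history makes Pacemaker notice the ongoing
  # monitor failures again.
  pcs resource refresh dummy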
General FYIs to readers:

- Even if this is fixed, it may still be necessary to set a higher failure-timeout to prevent a loop of "failure -> expiration -> failure", depending on the transition details and the operation timings of resources that depend on each other (a one-line example follows this list).

- See also Bug 1658501 (RFE: forced restart of dependant collocated resources), which discusses a related idea of causing a stop sequence to complete after a resource fails and then recovers on its own during stops of dependent resources.
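For example, on the test configuration above that could be as simple as the following (illustrative value only; the right number depends on how long the dependent resources take to stop):

  # Hypothetical example: make failures outlive the slowest dependent stop (20s here)
  pcs resource meta dummy failure-timeout=120s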
As a starting point, I think Pacemaker's current recovery plan when the failure expires (start the other resources in place) is correct. Even if somehow we forced the originally scheduled transition to continue, the failure would then be expired and everything could move back (depending on stickiness etc.).

Also, I agree that the failure-timeout is unreasonably low here. I think we should come up with a formula for a minimum value, and either enforce it or document it (and log a warning if unmet). That should probably be its own bz, or a non-bz upstream improvement.

That leaves the problem that the executor can arrive at a different understanding of a recurring monitor result than the controller/scheduler do. The executor does not report results that are unchanged from the previous run. This relies on the controller/scheduler responding to the initial result with a stop, which will cancel the monitor. If the transition with that stop is aborted, and the stop is not required in the new transition, then we arrive in this condition.

Off the top of my head, a potential fix might be to make monitor cancellations explicit rather than rely on them being implicitly done with stops, and order cancellations before stops. That probably wouldn't eliminate the possibility entirely (the transition might be aborted for external reasons before the cancellations are initiated), but it would make the window very small.
(In reply to Ken Gaillot from comment #3)

Agreed on the first two points.

I can see cases where it might be useful to force the originally scheduled transition to continue (at least up to the point of recovering the failed resource), but even that might be better handled through a resource agent improvement rather than a Pacemaker parameter or default behavior change. For example, tweak the resource agent so that it cannot return success if it still needs to be restarted for cleanup purposes. It could return a failure or a degraded state.

> That leaves the problem that the executor can arrive at a different
> understanding of a recurring monitor result than the controller/scheduler
> do. The executor does not report results that are unchanged from the
> previous run. This relies on the controller/scheduler responding to the
> initial result with a stop, which will cancel the monitor. If the transition
> with that stop is aborted, and the stop is not required in the new
> transition, then we arrive in this condition.
> 
> Off the top of my head, a potential fix might be to make monitor
> cancellations explicit rather than rely on them being implicitly done with
> stops, and order cancellations before stops. That probably wouldn't
> eliminate the possibility entirely (the transition might be aborted for
> external reasons before the cancellations are initiated), but it would make
> the window very small.

In this example, the dummy cancellation would have to be ordered before the stop_delay resource's stop. Not sure if that's what you had in mind (order all cancellations before stops), or if you only meant order a dummy cancellation before a dummy stop.

Half-baked thought: For the case of failure expiration on a resource's **current** node, where the migration threshold had been reached -- at the point where we clear the failure, could we request notification from the executor? I.e., tell it, "hey, we need you to report about this operation again"? Something to check on the resource so that we don't keep quietly failing with no report.

That approach might not be general enough even if it's viable. 5454 mentions "or other reasons" besides failure timeout. I'm not sure how well this type of approach would cover the other related scenarios besides failure timeout.

This suggestion is similar to the reporter's suggestion in 5454. Instead of "force_send_notify" being timer-based, it would be based on a request from the controller or scheduler.

Also, depending on how it's done (all the time or only in certain cases), cancelling monitors before stops might change the behavior in BZ 1658501 (for better or for worse). I think it would prevent us from detecting that a resource in the middle of the group has recovered while we're stopping resources farther down the group.
(In reply to Reid Wahl from comment #4)
> (In reply to Ken Gaillot from comment #3)
> 
> Agreed on the first two points.
> 
> I can see cases where it might be useful to force the originally scheduled
> transition to continue (at least up to the point of recovering the failed
> resource), but even that might be better handled through a resource agent
> improvement rather than a Pacemaker parameter or default behavior change.
> For example, tweak the resource agent so that it cannot return success if it
> still needs to be restarted for cleanup purposes. It could return a failure
> or a degraded state.

Agreed, pacemaker shouldn't do it, since the user explicitly asked to ignore failures after a certain time.

> > That leaves the problem that the executor can arrive at a different
> > understanding of a recurring monitor result than the controller/scheduler
> > do. The executor does not report results that are unchanged from the
> > previous run. This relies on the controller/scheduler responding to the
> > initial result with a stop, which will cancel the monitor. If the transition
> > with that stop is aborted, and the stop is not required in the new
> > transition, then we arrive in this condition.
> > 
> > Off the top of my head, a potential fix might be to make monitor
> > cancellations explicit rather than rely on them being implicitly done with
> > stops, and order cancellations before stops. That probably wouldn't
> > eliminate the possibility entirely (the transition might be aborted for
> > external reasons before the cancellations are initiated), but it would make
> > the window very small.
> 
> In this example, the dummy cancellation would have to be ordered before the
> stop_delay resource's stop. Not sure if that's what you had in mind (order
> all cancellations before stops), or if you only meant order a dummy
> cancellation before a dummy stop.

I don't think we should order all cancellations before all stops. It wouldn't be difficult; we could create a pseudo-op for "cancellations done", order all cancellations before that, and order all stops after it. But we'd have to deal with corner cases involving remote connections that need to be recovered (the connection restart has to be done before we can ask it to cancel monitors).

I think it would be OK to order cancelling a resource's recurring monitors before stopping that resource. Cancellations wouldn't have any prerequisites (aside from those on remote nodes, which would already be ordered after the connection start), so all cancellations would be first in the transition. In the example here, the cancellation for dummy wouldn't have to wait for the stop of stop_delay (which it currently does since it's part of the dummy stop). I suppose there could be an issue if actions cannot be parallelized, the cancellation for stop_delay gets executed first, and the controller chooses to initiate the stop_delay stop instead of the dummy cancellation next (I'm not sure offhand whether the controller's graph processing could actually do that or not).

I suppose either approach would leave the problem still possible for failed resources on remote nodes whose connection is being recovered, combined with some long-running operation on a cluster node. Hmm ...

> Half-baked thought: For the case of failure expiration on a resource's
> **current** node, where the migration threshold had been reached -- at the
> point where we clear the failure, could we request notification from the
> executor?
> I.e., tell it, "hey, we need you to report about this operation
> again"? Something to check on the resource so that we don't keep quietly
> failing with no report.

I just noticed this:

Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice: Rescheduling dummy_monitor_10000 after failure expired on node1

That is intended to prevent issues like this. It breaks the operation digest on the monitor so it gets cancelled and reinitiated (using pe_action_reschedule). However I'm not sure where the cancellation happens, and maybe that's what's going wrong here.

> That approach might not be general enough even if it's viable. 5454 mentions
> "or other reasons" besides failure timeout. I'm not sure how well this type
> of approach would cover the other related scenarios besides failure timeout.
> 
> This suggestion is similar to the reporter's suggestion in 5454. Instead of
> "force_send_notify" being timer-based, it would be based on a request from
> the controller or scheduler.
> 
> 
> Also, depending on how it's done (all the time or only in certain cases),
> cancelling monitors before stops might change the behavior in BZ 1658501
> (for better or for worse). I think it would prevent us from detecting that a
> resource in the middle of the group has recovered while we're stopping
> resources farther down the group.
(In reply to Ken Gaillot from comment #5)

> I think it would be OK to order cancelling a resource's recurring monitors
> before stopping that resource. Cancellations wouldn't have any prerequisites
> (aside from those on remote nodes, which would already be ordered after the
> connection start), so all cancellations would be first in the transition. In
> the example here, the cancellation for dummy wouldn't have to wait for the
> stop of stop_delay (which it currently does since it's part of the dummy
> stop).

Cool, that addresses my main concern. I thought that if we ordered the dummy cancellation before the dummy stop (and not before the stop_delay stop), we would hit the same issue because the cancellation wouldn't happen before the new transition. If the dummy cancellation runs in parallel with the stop_delay stop (subject to your next statements), that's not an issue.

> I just noticed this:
> 
> Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice:
> Rescheduling dummy_monitor_10000 after failure expired on node1
> 
> That is intended to prevent issues like this.

I would have thought so.

> It breaks the operation digest
> on the monitor so it gets cancelled and reinitiated (using
> pe_action_reschedule). However I'm not sure where the cancellation happens,
> and maybe that's what's going wrong here.

I haven't examined closely yet, but here's a debug pastebin of one event:

http://pastebin.test.redhat.com/981195
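Side note for anyone replaying these: the saved scheduler inputs referenced in the logs can be re-run offline to see what each transition planned, e.g. something along these lines:

  # Replay a saved transition input and show the actions the scheduler would take
  crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-input-3652.bz2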
(In reply to Reid Wahl from comment #6)
> > I just noticed this:
> > 
> > Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice:
> > Rescheduling dummy_monitor_10000 after failure expired on node1
> > 
> > That is intended to prevent issues like this.
> 
> I would have thought so.
> 
> > It breaks the operation digest
> > on the monitor so it gets cancelled and reinitiated (using
> > pe_action_reschedule). However I'm not sure where the cancellation happens,
> > and maybe that's what's going wrong here.
> 
> I haven't examined closely yet, but here's a debug pastebin of one event:
> 
> http://pastebin.test.redhat.com/981195

In those logs, the cancellation does happen:

Jul 21 13:24:41 fastvm-rhel-8-0-23 pacemaker-execd [389511] (cancel_recurring_action) info: Cancelling ocf operation dummy_monitor_10000
...
Jul 21 13:24:41 fastvm-rhel-8-0-23 pacemaker-controld [389514] (process_lrm_event) info: Result of monitor operation for dummy on node1: Cancelled | call=131 key=dummy_monitor_10000 confirmed=true

and the monitor is rescheduled:

Jul 21 13:24:41 fastvm-rhel-8-0-23 pacemaker-controld [389514] (process_lrm_event) notice: Result of monitor operation for dummy on node1: not running | rc=7 call=135 key=dummy_monitor_10000 confirmed=false cib-update=350
...
Jul 21 13:24:41 fastvm-rhel-8-0-23 pacemaker-schedulerd[389513] (determine_op_status) debug: dummy_monitor_10000 on node1: expected 0 (ok), got 7 (not running)

Are you sure this was one of the bad events?

Separately, this is a bit of a red flag:

Jul 21 13:24:41 fastvm-rhel-8-0-23 pacemaker-schedulerd[389513] (unpack_rsc_op) notice: Rescheduling dummy_monitor_10000 after failure expired on node1 | actual=193 expected=0 magic=-1:193;11:50:0:df562f33-ab06-469c-9fa1-4bddfe9f25c4

unpack_rsc_op() is wrongly rescheduling monitors if they're pending for longer than failure-timeout (193 is the placeholder exit status for pending). That will be an easy fix.
(In reply to Ken Gaillot from comment #7)
> Are you sure this was one of the bad events?

Hmm, nope. I'm using the same configuration as far as I can tell (compared pe-input files and everything), but now I'm having a hard time reproducing the issue. I'll see if I can sort it out.
(In reply to Reid Wahl from comment #8)
> Hmm, nope. I'm using the same configuration as far as I can tell (compared
> pe-input files and everything), but now I'm having a hard time reproducing
> the issue. I'll see if I can sort it out.

Ah, I had a lingering ban constraint that prevented the moves. Sorry.

10-second debug log snippet from a valid test. No cancellation.
http://pastebin.test.redhat.com/981211
(In reply to Reid Wahl from comment #9)
> (In reply to Reid Wahl from comment #8)
> > Hmm, nope. I'm using the same configuration as far as I can tell (compared
> > pe-input files and everything), but now I'm having a hard time reproducing
> > the issue. I'll see if I can sort it out.
> 
> Ah, I had a lingering ban constraint that prevented the moves. Sorry.
> 
> 10-second debug log snippet from a valid test. No cancellation.
> http://pastebin.test.redhat.com/981211

Comparing the two, the bad case is missing:

Jul 21 13:24:41 fastvm-rhel-8-0-23 pacemaker-schedulerd[389513] (rsc_action_digest_cmp) info: Parameters to 10000ms-interval monitor action for dummy on node1 changed: hash was calculated-failure-timeout vs. now 4811cef7f7f94e3a35a70be7916cb2fd (restart:3.7.1) 0:7;11:50:0:df562f33-ab06-469c-9fa1-4bddfe9f25c4

In the initial transition, because dummy is *moving* while its failure is being cleared, RecurringOp() checks whether any configured monitor is active on the *new* node, and pe_action_reschedule is never set because there isn't. Instead, it relies on the stop on the old node to cancel the monitor there. However, the first transition is interrupted before the stop can be done, and the new transition neither needs the stop nor has an expired failure (it was already cleared), so the monitor is neither cancelled nor rescheduled.

I think the fix will be to schedule an explicit monitor cancellation before clearing a recurring monitor failure for failure-timeout. We need to stop or reschedule the monitor, and cancelling it will be needed either way. We just need to make sure there's no ordering loop if the failure is behind a remote connection that may be starting, stopping, restarting, or moving.
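For whoever ends up testing a fix, one quick sanity check is whether the fail count keeps getting updated while the broken agent keeps returning 7 (illustrative command):

  # In the buggy case this goes back to 0/unset after the failure expires and never
  # changes again; with a fix, each newly reported monitor failure should bump it.
  crm_failcount --query --resource dummy --node node1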