Bug 1983830
| Summary: | Resource monitor results are ignored if a failure expires and the recurring operation gets rescheduled before recovery | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Reid Wahl <nwahl> |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | NEW --- | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 8.4 | CC: | cluster-maint |
| Target Milestone: | beta | Keywords: | Triaged |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
|
Description
Reid Wahl
2021-07-19 23:02:52 UTC
When the failure expires and the recurring monitor is rescheduled, the dummy_last_failure_0 object (with rc-code="7") is removed, but the dummy_monitor_10000 object (with rc-code="0") remains and never gets updated with a new rc-code, even though the dummy recurring monitor keeps running and failing. Of course, `pcs resource refresh` cleans out the history and causes Pacemaker to take note of the monitor failures again.

General FYIs to readers:

- Even if this is fixed, it may still be necessary to set a higher failure-timeout to prevent a loop of "failure -> expiration -> failure", depending on the transition details and the operation timings of resources that depend on each other.
- See also Bug 1658501 (RFE: forced restart of dependant collocated resources), which discusses a related idea of causing a stop sequence to complete after a resource fails and then recovers on its own during stops of dependent resources.

Comment 3
Ken Gaillot

As a starting point, I think Pacemaker's current recovery plan when the failure expires (start the other resources in place) is correct. Even if we somehow forced the originally scheduled transition to continue, the failure would then be expired and everything could move back (depending on stickiness etc.).

Also, I agree that the failure-timeout is unreasonably low here. I think we should come up with a formula for a minimum value, and either enforce it or document it (and log a warning if unmet). That should probably be its own bz, or a non-bz upstream improvement.

That leaves the problem that the executor can arrive at a different understanding of a recurring monitor result than the controller/scheduler do. The executor does not report results that are unchanged from the previous run. This relies on the controller/scheduler responding to the initial result with a stop, which will cancel the monitor. If the transition with that stop is aborted, and the stop is not required in the new transition, then we arrive in this condition.

Off the top of my head, a potential fix might be to make monitor cancellations explicit rather than rely on them being implicitly done with stops, and order cancellations before stops. That probably wouldn't eliminate the possibility entirely (the transition might be aborted for external reasons before the cancellations are initiated), but it would make the window very small.
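To make the reporting gap concrete, here is a minimal Python sketch of the executor behavior described above. This is a toy model with invented names (Executor, SchedulerView, run_monitor), not actual Pacemaker code or APIs:

```python
# Toy model of "the executor does not report results that are unchanged
# from the previous run" and of the stale scheduler view that results
# when the cancelling stop never happens. Not Pacemaker code.

class Executor:
    def __init__(self):
        self.last_reported = None  # last rc sent to the controller

    def run_monitor(self, rc, report):
        # Report a recurring monitor result only when it changes.
        if rc != self.last_reported:
            report(rc)
            self.last_reported = rc

class SchedulerView:
    def __init__(self):
        self.known_rc = None  # what the controller/scheduler believe

    def receive(self, rc):
        self.known_rc = rc

executor, scheduler = Executor(), SchedulerView()

executor.run_monitor(0, scheduler.receive)  # initial success: reported
executor.run_monitor(7, scheduler.receive)  # failure: reported; a stop is
                                            # planned, which would cancel
                                            # the monitor

# The transition carrying that stop is aborted (the failure expired) and
# the new transition needs no stop, so the monitor keeps running:
executor.run_monitor(7, scheduler.receive)  # unchanged: not reported
executor.run_monitor(7, scheduler.receive)  # unchanged: not reported

# Expiring the failure removes dummy_last_failure_0, so the scheduler
# falls back to the stale dummy_monitor_10000 entry (rc-code=0):
scheduler.known_rc = 0
print(scheduler.known_rc)  # 0 -- healthy in the CIB, failing in reality
```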
Comment 4
Reid Wahl

(In reply to Ken Gaillot from comment #3)

Agreed on the first two points.

I can see cases where it might be useful to force the originally scheduled transition to continue (at least up to the point of recovering the failed resource), but even that might be better handled through a resource agent improvement rather than a Pacemaker parameter or default behavior change. For example, tweak the resource agent so that it cannot return success if it still needs to be restarted for cleanup purposes. It could return a failure or a degraded state.

> That leaves the problem that the executor can arrive at a different
> understanding of a recurring monitor result than the controller/scheduler
> do. The executor does not report results that are unchanged from the
> previous run. This relies on the controller/scheduler responding to the
> initial result with a stop, which will cancel the monitor. If the transition
> with that stop is aborted, and the stop is not required in the new
> transition, then we arrive in this condition.
>
> Off the top of my head, a potential fix might be to make monitor
> cancellations explicit rather than rely on them being implicitly done with
> stops, and order cancellations before stops. That probably wouldn't
> eliminate the possibility entirely (the transition might be aborted for
> external reasons before the cancellations are initiated), but it would make
> the window very small.

In this example, the dummy cancellation would have to be ordered before the stop_delay resource's stop. Not sure if that's what you had in mind (order all cancellations before stops), or if you only meant order a dummy cancellation before a dummy stop.

Half-baked thought: For the case of failure expiration on a resource's **current** node, where the migration threshold had been reached -- at the point where we clear the failure, could we request notification from the executor? I.e., tell it, "hey, we need you to report about this operation again"? Something to check on the resource so that we don't keep quietly failing with no report.

That approach might not be general enough even if it's viable. 5454 mentions "or other reasons" besides failure timeout. I'm not sure how well this type of approach would cover the other related scenarios besides failure timeout.

This suggestion is similar to the reporter's suggestion in 5454. Instead of "force_send_notify" being timer-based, it would be based on a request from the controller or scheduler.

Also, depending on how it's done (all the time or only in certain cases), cancelling monitors before stops might change the behavior in BZ 1658501 (for better or for worse). I think it would prevent us from detecting that a resource in the middle of the group has recovered while we're stopping resources farther down the group.
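A minimal sketch of the "request notification" idea above, extending the toy Executor model from the earlier sketch. request_notification and force_report are invented names, loosely mirroring the "force_send_notify" suggestion from 5454; this is not a real executor API:

```python
# Toy model: the controller asks the executor to report the next result
# of a recurring monitor even if it is unchanged. Not a real API.

class Executor:
    def __init__(self):
        self.last_reported = None
        self.force_report = False

    def request_notification(self):
        # Called when an expired failure is cleared: "we need you to
        # report about this operation again."
        self.force_report = True

    def run_monitor(self, rc, report):
        # Suppress unchanged results unless a report was requested.
        if self.force_report or rc != self.last_reported:
            report(rc)
            self.last_reported = rc
            self.force_report = False

seen = []
ex = Executor()
ex.run_monitor(7, seen.append)  # failure: reported
ex.run_monitor(7, seen.append)  # unchanged: suppressed
ex.request_notification()       # failure-timeout clears the failure
ex.run_monitor(7, seen.append)  # unchanged, but reported on request
print(seen)                     # [7, 7] -- the still-failing monitor is
                                # heard again after the clear
```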
Comment 5
Ken Gaillot

(In reply to Reid Wahl from comment #4)
> In this example, the dummy cancellation would have to be ordered before the
> stop_delay resource's stop. Not sure if that's what you had in mind (order
> all cancellations before stops), or if you only meant order a dummy
> cancellation before a dummy stop.

I don't think we should order all cancellations before all stops. It wouldn't be difficult; we could create a pseudo-op for "cancellations done", order all cancellations before that, and order all stops after it. But we'd have to deal with corner cases involving remote connections that need to be recovered (the connection restart has to be done before we can ask it to cancel monitors).

I think it would be OK to order cancelling a resource's recurring monitors before stopping that resource. Cancellations wouldn't have any prerequisites (aside from those on remote nodes, which would already be ordered after the connection start), so all cancellations would be first in the transition. In the example here, the cancellation for dummy wouldn't have to wait for the stop of stop_delay (which it currently does, since it's part of the dummy stop).

I suppose there could be an issue if actions cannot be parallelized, the cancellation for stop_delay gets executed first, and the controller chooses to initiate the stop_delay stop instead of the dummy cancellation next (I'm not sure offhand whether the controller's graph processing could actually do that or not).

I suppose either approach would leave the problem still possible for failed resources on remote nodes whose connection is being recovered, combined with some long-running operation on a cluster node. Hmm ...

> Half-baked thought: For the case of failure expiration on a resource's
> **current** node, where the migration threshold had been reached -- at the
> point where we clear the failure, could we request notification from the
> executor? I.e., tell it, "hey, we need you to report about this operation
> again"? Something to check on the resource so that we don't keep quietly
> failing with no report.

I just noticed this:

Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice: Rescheduling dummy_monitor_10000 after failure expired on node1

That is intended to prevent issues like this. It breaks the operation digest on the monitor so it gets cancelled and reinitiated (using pe_action_reschedule). However, I'm not sure where the cancellation happens, and maybe that's what's going wrong here.

> That approach might not be general enough even if it's viable. 5454 mentions
> "or other reasons" besides failure timeout. I'm not sure how well this type
> of approach would cover the other related scenarios besides failure timeout.
>
> This suggestion is similar to the reporter's suggestion in 5454. Instead of
> "force_send_notify" being timer-based, it would be based on a request from
> the controller or scheduler.
>
> Also, depending on how it's done (all the time or only in certain cases),
> cancelling monitors before stops might change the behavior in BZ 1658501
> (for better or for worse). I think it would prevent us from detecting that a
> resource in the middle of the group has recovered while we're stopping
> resources farther down the group.
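A toy model of the per-resource ordering preferred here (cancel a resource's recurring monitors before that resource's own stop), using Python's graphlib. The action names are invented for this example, and Pacemaker's transition graph is not built this way; the sketch only shows the dependency shape:

```python
from graphlib import TopologicalSorter

# Order each resource's monitor cancellation before that resource's own
# stop, instead of bundling the cancellation into the stop.
# add(node, *predecessors) records that node runs after its predecessors.
graph = TopologicalSorter()
graph.add("dummy_stop", "dummy_cancel_monitor_10000")      # cancel, then stop
graph.add("stop_delay_stop", "stop_delay_cancel_monitor")  # cancel, then stop
graph.add("dummy_stop", "stop_delay_stop")  # dummy stops after stop_delay

print(list(graph.static_order()))
# One valid order: ['dummy_cancel_monitor_10000', 'stop_delay_cancel_monitor',
#                   'stop_delay_stop', 'dummy_stop']
# Both cancellations have no prerequisites, so they come first in the
# transition; dummy's cancellation no longer waits on stop_delay's stop.
```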
Comment 6
Reid Wahl

(In reply to Ken Gaillot from comment #5)
> I think it would be OK to order cancelling a resource's recurring monitors
> before stopping that resource. Cancellations wouldn't have any prerequisites
> (aside from those on remote nodes, which would already be ordered after the
> connection start), so all cancellations would be first in the transition. In
> the example here, the cancellation for dummy wouldn't have to wait for the
> stop of stop_delay (which it currently does, since it's part of the dummy
> stop).

Cool, that addresses my main concern. I thought that if we ordered the dummy cancellation before the dummy stop (and not before the stop_delay stop), we would hit the same issue, because the cancellation wouldn't happen before the new transition. If the dummy cancellation runs parallel to the stop_delay stop (subject to your next statements), that's not an issue.

> I just noticed this:
>
> Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice:
> Rescheduling dummy_monitor_10000 after failure expired on node1
>
> That is intended to prevent issues like this.

I would have thought so.

> It breaks the operation digest
> on the monitor so it gets cancelled and reinitiated (using
> pe_action_reschedule). However, I'm not sure where the cancellation happens,
> and maybe that's what's going wrong here.

I haven't examined closely yet, but here's a debug pastebin of one event:

http://pastebin.test.redhat.com/981195

Comment 7
Ken Gaillot

(In reply to Reid Wahl from comment #6)
> > I just noticed this:
> >
> > Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice:
> > Rescheduling dummy_monitor_10000 after failure expired on node1
> >
> > That is intended to prevent issues like this.
>
> I would have thought so.
>
> > It breaks the operation digest
> > on the monitor so it gets cancelled and reinitiated (using
> > pe_action_reschedule). However, I'm not sure where the cancellation happens,
> > and maybe that's what's going wrong here.
>
> I haven't examined closely yet, but here's a debug pastebin of one event:
> http://pastebin.test.redhat.com/981195

In those logs, the cancellation does happen:

Jul 21 13:24:41 fastvm-rhel-8-0-23 pacemaker-execd [389511] (cancel_recurring_action) info: Cancelling ocf operation dummy_monitor_10000
...
Jul 21 13:24:41 fastvm-rhel-8-0-23 pacemaker-controld [389514] (process_lrm_event) info: Result of monitor operation for dummy on node1: Cancelled | call=131 key=dummy_monitor_10000 confirmed=true

and the monitor is rescheduled:

Jul 21 13:24:41 fastvm-rhel-8-0-23 pacemaker-controld [389514] (process_lrm_event) notice: Result of monitor operation for dummy on node1: not running | rc=7 call=135 key=dummy_monitor_10000 confirmed=false cib-update=350
...
Jul 21 13:24:41 fastvm-rhel-8-0-23 pacemaker-schedulerd[389513] (determine_op_status) debug: dummy_monitor_10000 on node1: expected 0 (ok), got 7 (not running)

Are you sure this was one of the bad events?

Separately, this is a bit of a red flag:

Jul 21 13:24:41 fastvm-rhel-8-0-23 pacemaker-schedulerd[389513] (unpack_rsc_op) notice: Rescheduling dummy_monitor_10000 after failure expired on node1 | actual=193 expected=0 magic=-1:193;11:50:0:df562f33-ab06-469c-9fa1-4bddfe9f25c4

unpack_rsc_op() is wrongly rescheduling monitors if they have been pending for longer than failure-timeout (193 is the placeholder exit status for pending). That will be an easy fix.
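A schematic of the red flag above, as a Python paraphrase rather than the actual unpack_rsc_op() C code (should_reschedule and its parameters are invented for illustration): the expiration check should skip results that are still pending instead of treating the pending placeholder status as an expirable failure.

```python
PENDING_PLACEHOLDER = 193  # placeholder exit status while an op is pending
                           # (the "actual=193" in the log line above)

def should_reschedule(actual_rc, expected_rc, age_s, failure_timeout_s):
    """Decide whether an expired failure should force a monitor reschedule.
    Invented paraphrase of the buggy logic, with the easy fix applied."""
    if actual_rc == PENDING_PLACEHOLDER:
        # Fix: a pending operation has no result yet, so it cannot be an
        # expired failure; without this check, monitors that stay pending
        # longer than failure-timeout are wrongly rescheduled.
        return False
    return actual_rc != expected_rc and age_s > failure_timeout_s

# The red-flag log line corresponds to actual=193, expected=0:
print(should_reschedule(193, 0, age_s=120, failure_timeout_s=10))  # False
print(should_reschedule(7, 0, age_s=120, failure_timeout_s=10))    # True
```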
Comment 8
Reid Wahl

(In reply to Ken Gaillot from comment #7)
> Are you sure this was one of the bad events?

Hmm, nope. I'm using the same configuration as far as I can tell (compared pe-input files and everything), but now I'm having a hard time reproducing the issue. I'll see if I can sort it out.

Comment 9
Reid Wahl

(In reply to Reid Wahl from comment #8)
> Hmm, nope. I'm using the same configuration as far as I can tell (compared
> pe-input files and everything), but now I'm having a hard time reproducing
> the issue. I'll see if I can sort it out.

Ah, I had a lingering ban constraint that prevented the moves. Sorry.

10-second debug log snippet from a valid test. No cancellation.

http://pastebin.test.redhat.com/981211

Comment 10
Ken Gaillot

(In reply to Reid Wahl from comment #9)
> (In reply to Reid Wahl from comment #8)
> > Hmm, nope. I'm using the same configuration as far as I can tell (compared
> > pe-input files and everything), but now I'm having a hard time reproducing
> > the issue. I'll see if I can sort it out.
>
> Ah, I had a lingering ban constraint that prevented the moves. Sorry.
>
> 10-second debug log snippet from a valid test. No cancellation.
> http://pastebin.test.redhat.com/981211

Comparing the two, the bad case is missing:

Jul 21 13:24:41 fastvm-rhel-8-0-23 pacemaker-schedulerd[389513] (rsc_action_digest_cmp) info: Parameters to 10000ms-interval monitor action for dummy on node1 changed: hash was calculated-failure-timeout vs. now 4811cef7f7f94e3a35a70be7916cb2fd (restart:3.7.1) 0:7;11:50:0:df562f33-ab06-469c-9fa1-4bddfe9f25c4

In the initial transition, because dummy is *moving* while its failure is being cleared, RecurringOp() checks whether any configured monitor is active on the *new* node, and pe_action_reschedule is never set because there isn't one. Instead, it relies on the stop on the old node to cancel the monitor there. However, the first transition is interrupted before the stop can be done, and the new transition neither needs the stop nor has an expired failure (it was already cleared), so the monitor is neither cancelled nor rescheduled.

I think the fix will be to schedule an explicit monitor cancellation before clearing a recurring monitor failure for failure-timeout. We need to stop or reschedule the monitor, and cancelling it will be needed either way. We just need to make sure there's no ordering loop if the failure is behind a remote connection that may be starting, stopping, restarting, or moving.
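To make the proposed fix concrete, a hedged Python sketch of the scheduling step described above. FailureRecord, Transition, and clear_expired_failures are invented names; the real change would live in Pacemaker's C scheduler, and the remote-connection caveat is only noted as a comment:

```python
from dataclasses import dataclass, field

@dataclass
class FailureRecord:
    op_key: str          # e.g. "dummy_monitor_10000"
    node: str
    age_s: float
    failure_timeout_s: float

@dataclass
class Transition:
    actions: list = field(default_factory=list)
    orderings: list = field(default_factory=list)  # (first, then) pairs

def clear_expired_failures(failures, transition):
    """Sketch of the proposed fix: pair every failure-timeout clearing of a
    recurring-monitor failure with an explicit cancellation of that monitor,
    instead of relying on a stop elsewhere in the transition to cancel it."""
    for f in failures:
        if f.age_s <= f.failure_timeout_s:
            continue  # not expired yet
        cancel = f"cancel {f.op_key} on {f.node}"
        clear = f"clear failure of {f.op_key} on {f.node}"
        transition.actions += [cancel, clear]
        # Order the cancellation before the clearing, so the executor's
        # cached result cannot silently outlive the cleared failure.
        transition.orderings.append((cancel, clear))
        # Caveat from the comment above: if the resource sits behind a
        # remote connection that is itself starting/stopping/moving, this
        # ordering must not create a loop with the connection's actions.

t = Transition()
clear_expired_failures(
    [FailureRecord("dummy_monitor_10000", "node1", age_s=15,
                   failure_timeout_s=10)], t)
print(t.actions)
print(t.orderings)
```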