Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1983830

Summary:	Resource monitor results are ignored if a failure expires and the recurring operation gets rescheduled before recovery
Product:	Red Hat Enterprise Linux 8	Reporter:	Reid Wahl <nwahl>
Component:	pacemaker	Assignee:	Ken Gaillot <kgaillot>
Status:	CLOSED MIGRATED	QA Contact:	cluster-qe <cluster-qe>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	8.4	CC:	cluster-maint
Target Milestone:	beta	Keywords:	MigratedToJIRA, Triaged
Target Release:	---	Flags:	pm-rhel: mirror+
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2023-09-22 19:15:47 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Reid Wahl 2021-07-19 23:02:52 UTC

Description of problem:

NOTE: This seems highly similar, perhaps identical, to CLBZ#5454:
  - Bug 5454 - Lrmd does not report repeated failures, which may cause non-scheduling problems, when the first failure is not handled due to failure aging or other reasons (https://bugs.clusterlabs.org/show_bug.cgi?id=5454)


If a monitor failure expires (failure-timeout is set and migration-threshold is hit) and a new transition runs (which reschedules the recurring monitor) before the resource stop operation runs, Pacemaker continues running the recurring monitor but ignores the results. The resource remains in Started state despite any future monitor failures. If the monitor operation continues to fail, no action is taken.

In the below configuration, note the dummy_attr resource and its negative colocation constraint. In a two-node cluster, when dummy is scheduled to move due to migration-threshold, dummy_attr is scheduled to move as well (so that they remain on different nodes). dummy_attr's stop operation updates an attribute, which aborts the transition.

Note also stop_delay, which just simulates a resource after dummy in the group that takes a while to stop (like a third-party application).

 Group: testgrp
  Resource: dummy (class=ocf provider=heartbeat type=Dummy)
   Meta Attrs: failure-timeout=10s migration-threshold=1
  Resource: stop_delay (class=ocf provider=heartbeat type=Delay)
   Attributes: mondelay=0 startdelay=0 stopdelay=20

 Resource: dummy_attr (class=ocf provider=pacemaker type=attribute)
  Attributes: name=test

Colocation Constraints:
  dummy_attr with dummy (score:-5000) (id:colocation-dummy_attr-dummy--5000)


Initial state:

  * Resource Group: testgrp:
    * dummy	(ocf::heartbeat:Dummy):	 Started node1
    * stop_delay	(ocf::heartbeat:Delay):	 Started node1
  * dummy_attr	(ocf::pacemaker:attribute):	 Started node2


Now we update the ocf:heartbeat:Dummy resource agent on node 1 so that it always returns 7 for monitor operations:

#monitor)    dummy_monitor;;
monitor)   exit $OCF_NOT_RUNNING;;


Shortly after making the edit, we see the below. The dummy resource fails as expected. We schedule dummy and stop_delay to move over to node 2, and we schedule dummy_attr to move to node 1. The stop_delay stop and the dummy_attr stop run concurrently. dummy_attr stops immediately, causing a transition abort.

Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Result of monitor operation for dummy on node1: not running
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Transition 11 action 13 (dummy_monitor_10000 on node1): expected 'ok' but got 'not running'
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-attrd[19222]: notice: Setting fail-count-dummy#monitor_10000[node1]: (unset) -> 1
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-attrd[19222]: notice: Setting last-failure-dummy#monitor_10000[node1]: (unset) -> 1626734349
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: warning: Unexpected result (not running) was recorded for monitor of dummy on node1 at Jul 19 15:39:09 2021
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice:  * Recover    dummy                  (          node1 )
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice:  * Restart    stop_delay             (          node1 )   due to required dummy start
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice: Calculated transition 16, saving inputs in /var/lib/pacemaker/pengine/pe-input-3651.bz2
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: warning: Unexpected result (not running) was recorded for monitor of dummy on node1 at Jul 19 15:39:09 2021
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: warning: Forcing dummy away from node1 after 1 failures (max=1)
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice:  * Recover    dummy                  ( node1 -> node2 )
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice:  * Move       stop_delay             ( node1 -> node2 )
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice:  * Move       dummy_attr             ( node2 -> node1 )
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice: Calculated transition 17, saving inputs in /var/lib/pacemaker/pengine/pe-input-3652.bz2
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Initiating stop operation stop_delay_stop_0 locally on node1
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Requesting local execution of stop operation for stop_delay on node1
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Initiating stop operation dummy_attr_stop_0 on node2
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-attrd[19222]: notice: Setting test[node2]: 1 -> 0
Jul 19 15:39:09 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Transition 17 aborted by status-2-test doing modify test=0: Transient attribute change


20 seconds later, stop_delay finishes stopping. The new transition is calculated. The failure has expired, so the dummy recurring monitor is rescheduled. dummy no longer needs to stop, since the failure is gone. So we start the stop_delay and dummy_attr resources in place and go on about our day.

Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Result of stop operation for stop_delay on node1: ok
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Transition 17 (Complete=3, Pending=0, Fired=0, Skipped=2, Incomplete=10, Source=/var/lib/pacemaker/pengine/pe-input-3652.bz2): Stopped
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice: Clearing failure of dummy on node1 because it expired
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice: Clearing failure of dummy on node1 because it expired
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice: Clearing failure of dummy on node1 because it expired
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice: Rescheduling dummy_monitor_10000 after failure expired on node1
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice:  * Start      stop_delay             (          node1 )
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice:  * Start      dummy_attr             (          node2 )
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice: Calculated transition 18, saving inputs in /var/lib/pacemaker/pengine/pe-input-3653.bz2
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Initiating start operation stop_delay_start_0 locally on node1
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Requesting local execution of start operation for stop_delay on node1
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Initiating start operation dummy_attr_start_0 on node2
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-attrd[19222]: notice: Setting last-failure-dummy#monitor_10000[node1]: 1626734349 -> (unset)
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-attrd[19222]: notice: Setting fail-count-dummy#monitor_10000[node1]: 1 -> (unset)
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Transition 18 aborted by deletion of lrm_rsc_op[@id='dummy_last_failure_0']: Resource operation removal
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Result of start operation for stop_delay on node1: ok
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-attrd[19222]: notice: Setting test[node2]: 0 -> 1
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Transition 18 (Complete=5, Pending=0, Fired=0, Skipped=2, Incomplete=2, Source=/var/lib/pacemaker/pengine/pe-input-3653.bz2): Stopped
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice: Calculated transition 19, saving inputs in /var/lib/pacemaker/pengine/pe-input-3654.bz2
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Initiating monitor operation stop_delay_monitor_10000 locally on node1
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Requesting local execution of monitor operation for stop_delay on node1
Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Initiating monitor operation dummy_attr_monitor_10000 on node2
Jul 19 15:39:30 fastvm-rhel-8-0-23 Delay(stop_delay)[20526]: INFO: Delay is running OK
Jul 19 15:39:30 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Result of monitor operation for stop_delay on node1: ok
Jul 19 15:39:30 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: Transition 19 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3654.bz2): Complete
Jul 19 15:39:30 fastvm-rhel-8-0-23 pacemaker-controld[19224]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE


Here's the problem: I haven't reverted the ocf:heartbeat:Dummy agent on node 1. It's **still** returning 7 on every monitor operation. Pacemaker just doesn't care. The monitor result isn't getting recorded.

This means the monitor will continue to fail indefinitely, and Pacemaker will take no action in response.

-----

Version-Release number of selected component (if applicable):

pacemaker-2.0.5-9.el8_4.1

-----

How reproducible:

Always

-----

Steps to Reproduce:
1. Create the dummy, stop_delay, dummy_attr resources and colocation constraints as shown in the BZ description.
2. On the node where testgrp is running, modify /usr/lib/ocf/resource.d/heartbeat/Dummy so that the monitor operation always returns 7.

#monitor)    dummy_monitor;;
monitor)   exit $OCF_NOT_RUNNING;;

-----

Actual results:

The dummy resource's monitor operation fails only once on its initial node. Then the failure expires and the dummy resource goes back to Started state.

The dummy resource's recurring monitor operation continues to fail, but all future monitor failures are ignored/not recorded.

-----

Expected results:

Pacemaker initiates recovery every time the dummy resource's monitor operation fails.

-----

Additional info:

This can affect SAP NetWeaver clusters. In these clusters, the ERS instance resource is negatively colocated with the ASCS instance resource and updates an attribute during start/stop. It was observed with failure-timeout=60s for the ASCS instance resource, when the ASCS instance resource failed and a resource later in the group took 65 seconds to finish stopping.

One can (and likely should) argue that the failure-timeout was too short for that particular configuration. But on the other hand, we encourage customers to use the exact parameter settings given in the documentation for SAP clusters, and this behavior is unexpected. If we show the resource in Started state and continue running the recurring monitor, then we should respond to the result of the recurring monitor.

Comment 1 Reid Wahl 2021-07-19 23:38:40 UTC

When the failure expires and the recurring monitor is rescheduled, the dummy_last_failure_0 object (with rc-code="7") is removed but the dummy_monitor_10000 object (with rc-code="0") remains and never gets updated with a new rc-code, even though the dummy recurring monitor keeps running and failing.

Of course, `pcs resource refresh` cleans out the history and causes Pacemaker to take note of the monitor failures again.

Comment 2 Reid Wahl 2021-07-20 07:42:44 UTC

General FYIs to readers:
  - Even if this is fixed, it may still be necessary to set a higher failure-timeout to prevent a loop of "failure -> expiration -> failure" depending on the transaction details and the operation timings of resources that depend on each other.
  - See also Bug 1658501 (RFE: forced restart of dependant collocated resources), which discusses a related idea of causing a stop sequence to complete after a resource fails and then recovers on its own during stops of dependent resources.

Comment 3 Ken Gaillot 2021-07-20 16:14:50 UTC

As a starting point, I think Pacemaker's current recovery plan when the failure expires (start the other resources in place) is correct. Even if somehow we forced the originally scheduled transition to continue, the failure would then be expired and everything could move back (depending on stickiness etc.).

Also, I agree that the failure-timeout is unreasonably low here. I think we should come up with a formula for a minimum value, and either enforce it or document it (and log a warning if unmet). That should probably be its own bz, or a non-bz upstream improvement.

That leaves the problem that the executor can arrive at a different understanding of a recurring monitor result than the controller/scheduler do. The executor does not report results that are unchanged from the previous run. This relies on the controller/scheduler responding to the initial result with a stop, which will cancel the monitor. If the transition with that stop is aborted, and the stop is not required in the new transition, then we arrive in this condition.

Off the top of my head, a potential fix might be to make monitor cancellations explicit rather than rely on them being implicitly done with stops, and order cancellations before stops. That probably wouldn't eliminate the possibility entirely (the transition might be aborted for external reasons before the cancellations are initiated), but it would make the window very small.

Comment 4 Reid Wahl 2021-07-20 23:54:48 UTC

(In reply to Ken Gaillot from comment #3)

Agreed on the first two points.

I can see cases where it might be useful to force the originally scheduled transition to continue (at least up to the point of recovering the failed resource), but even that might be better handled through a resource agent improvement rather than a Pacemaker parameter or default behavior change. For example, tweak the resource agent so that it cannot return success if it still needs to be restarted for cleanup purposes. It could return a failure or degraded state a degraded state.


> That leaves the problem that the executor can arrive at a different
> understanding of a recurring monitor result than the controller/scheduler
> do. The executor does not report results that are unchanged from the
> previous run. This relies on the controller/scheduler responding to the
> initial result with a stop, which will cancel the monitor. If the transition
> with that stop is aborted, and the stop is not required in the new
> transition, then we arrive in this condition.
> 
> Off the top of my head, a potential fix might be to make monitor
> cancellations explicit rather than rely on them being implicitly done with
> stops, and order cancellations before stops. That probably wouldn't
> eliminate the possibility entirely (the transition might be aborted for
> external reasons before the cancellations are initiated), but it would make
> the window very small.

In this example, the dummy cancellation would have to be ordered before the stop_delay resource's stop. Not sure if that's what you had in mind (order all cancellations before stops), or if you only meant order a dummy cancellation before a dummy stop.

Half-baked thought: For the case of failure expiration on a resource's **current** node, where the migration threshold had been reached -- at the point where we clear the failure, could we request notification from the executor? I.e., tell it, "hey, we need you to report about this operation again"? Something to check on the resource so that we don't keep quietly failing with no report.

That approach might not be general enough even if it's viable. 5454 mentions "or other reasons" besides failure timeout. I'm not sure how well this type of approach would cover the other related scenarios besides failure timeout.

This suggestion is similar to the reporter's suggestion in 5454. Instead of "force_send_notify" being timer-based, it would be based on a request from the controller or scheduler.


Also, depending on how it's done (all the time or only in certain cases), cancelling monitors before stops might change the behavior in BZ 1658501 (for better or for worse). I think it would prevent us from detecting that a resource in the middle of the group has recovered while we're stopping resources farther down the group.

Comment 5 Ken Gaillot 2021-07-21 19:58:51 UTC

(In reply to Reid Wahl from comment #4)
> (In reply to Ken Gaillot from comment #3)
> 
> Agreed on the first two points.
> 
> I can see cases where it might be useful to force the originally scheduled
> transition to continue (at least up to the point of recovering the failed
> resource), but even that might be better handled through a resource agent
> improvement rather than a Pacemaker parameter or default behavior change.
> For example, tweak the resource agent so that it cannot return success if it
> still needs to be restarted for cleanup purposes. It could return a failure
> or degraded state a degraded state.

Agreed, pacemaker shouldn't do it, since the user explicitly asked to ignore failures after a certain time.

 
> > That leaves the problem that the executor can arrive at a different
> > understanding of a recurring monitor result than the controller/scheduler
> > do. The executor does not report results that are unchanged from the
> > previous run. This relies on the controller/scheduler responding to the
> > initial result with a stop, which will cancel the monitor. If the transition
> > with that stop is aborted, and the stop is not required in the new
> > transition, then we arrive in this condition.
> > 
> > Off the top of my head, a potential fix might be to make monitor
> > cancellations explicit rather than rely on them being implicitly done with
> > stops, and order cancellations before stops. That probably wouldn't
> > eliminate the possibility entirely (the transition might be aborted for
> > external reasons before the cancellations are initiated), but it would make
> > the window very small.
> 
> In this example, the dummy cancellation would have to be ordered before the
> stop_delay resource's stop. Not sure if that's what you had in mind (order
> all cancellations before stops), or if you only meant order a dummy
> cancellation before a dummy stop.

I don't think we should order all cancellations before all stops. It wouldn't be difficult; we could create a pseudo-op for "cancellations done", order all cancellations before that, and order all stops after it. But we'd have to deal with corner cases involving remote connections that need to be recovered (the connection restart has to be done before we can ask it to cancel monitors).

I think it would be OK to order cancelling a resource's recurring monitors before stopping that resource. Cancellations wouldn't have any prerequisites (aside from those on remote nodes, which would already be ordered after the connection start), so all cancellations would be first in the transition. In the example here, the cancellation for dummy wouldn't have to wait for the stop of stop_delay (which it currently does since it's part of the dummy stop). I suppose there could be an issue if actions cannot be parallelized, the cancellation for stop_delay gets executed first, and the controller chooses to initiate the stop_delay stop instead of the dummy cancellation next (I'm not sure offhand whether the controller's graph processing could actually do that or not).

I suppose either approach would leave the problem still possible for failed resources on remote nodes whose connection is being recovered, combined with some long-running operation on a cluster node. Hmm ...

> Half-baked thought: For the case of failure expiration on a resource's
> **current** node, where the migration threshold had been reached -- at the
> point where we clear the failure, could we request notification from the
> executor? I.e., tell it, "hey, we need you to report about this operation
> again"? Something to check on the resource so that we don't keep quietly
> failing with no report.

I just noticed this:

    Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice: Rescheduling dummy_monitor_10000 after failure expired on node1

That is intended to prevent issues like this. It breaks the operation digest on the monitor so its gets cancelled and reinitiated (using pe_action_reschedule). However I'm not sure where the cancellation happens, and maybe that's what's going wrong here.

> That approach might not be general enough even if it's viable. 5454 mentions
> "or other reasons" besides failure timeout. I'm not sure how well this type
> of approach would cover the other related scenarios besides failure timeout.
> 
> This suggestion is similar to the reporter's suggestion in 5454. Instead of
> "force_send_notify" being timer-based, it would be based on a request from
> the controller or scheduler.
> 
> 
> Also, depending on how it's done (all the time or only in certain cases),
> cancelling monitors before stops might change the behavior in BZ 1658501
> (for better or for worse). I think it would prevent us from detecting that a
> resource in the middle of the group has recovered while we're stopping
> resources farther down the group.

Comment 6 Reid Wahl 2021-07-21 20:32:00 UTC

(In reply to Ken Gaillot from comment #5)
> I think it would be OK to order cancelling a resource's recurring monitors
> before stopping that resource. Cancellations wouldn't have any prerequisites
> (aside from those on remote nodes, which would already be ordered after the
> connection start), so all cancellations would be first in the transition. In
> the example here, the cancellation for dummy wouldn't have to wait for the
> stop of stop_delay (which it currently does since it's part of the dummy
> stop).

Cool, that addresses my main concern. I thought that if we ordered the dummy cancellation before the dummy stop (and not before the stop_delay stop), we would hit the same issue because the cancellation wouldn't happen before the new transition. If the dummy cancellation runs parallel to the stop_delay stop (subject your next statements), that's not an issue.


> I just noticed this:
> 
>     Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice:
> Rescheduling dummy_monitor_10000 after failure expired on node1
> 
> That is intended to prevent issues like this.

I would have thought so.


> It breaks the operation digest
> on the monitor so its gets cancelled and reinitiated (using
> pe_action_reschedule). However I'm not sure where the cancellation happens,
> and maybe that's what's going wrong here.

I haven't examined closely yet, but here's a debug pastebin of one event:

http://pastebin.test.redhat.com/981195

Comment 7 Ken Gaillot 2021-07-21 21:47:00 UTC

(In reply to Reid Wahl from comment #6)
> > I just noticed this:
> > 
> >     Jul 19 15:39:29 fastvm-rhel-8-0-23 pacemaker-schedulerd[19223]: notice:
> > Rescheduling dummy_monitor_10000 after failure expired on node1
> > 
> > That is intended to prevent issues like this.
> 
> I would have thought so.
> 
> 
> > It breaks the operation digest
> > on the monitor so its gets cancelled and reinitiated (using
> > pe_action_reschedule). However I'm not sure where the cancellation happens,
> > and maybe that's what's going wrong here.
> 
> I haven't examined closely yet, but here's a debug pastebin of one event:
> 
> http://pastebin.test.redhat.com/981195

In those logs, the cancellation does happen:

    Jul 21 13:24:41 fastvm-rhel-8-0-23 pacemaker-execd     [389511] (cancel_recurring_action)       info: Cancelling ocf operation dummy_monitor_10000
    ...
    Jul 21 13:24:41 fastvm-rhel-8-0-23 pacemaker-controld  [389514] (process_lrm_event)     info: Result of monitor operation for dummy on node1: Cancelled | call=131 key=dummy_monitor_10000 confirmed=true

and the monitor is rescheduled:

    Jul 21 13:24:41 fastvm-rhel-8-0-23 pacemaker-controld  [389514] (process_lrm_event)     notice: Result of monitor operation for dummy on node1: not running | rc=7 call=135 key=dummy_monitor_10000 confirmed=false cib-update=350
    ...
    Jul 21 13:24:41 fastvm-rhel-8-0-23 pacemaker-schedulerd[389513] (determine_op_status)   debug: dummy_monitor_10000 on node1: expected 0 (ok), got 7 (not running)

Are you sure this was one of the bad events?

Separately, this is a bit of a red flag:

    Jul 21 13:24:41 fastvm-rhel-8-0-23 pacemaker-schedulerd[389513] (unpack_rsc_op)         notice: Rescheduling dummy_monitor_10000 after failure expired on node1 | actual=193 expected=0 magic=-1:193;11:50:0:df562f33-ab06-469c-9fa1-4bddfe9f25c4

unpack_rsc_op() is wrongly rescheduling monitors if they're pending for longer than failure-timeout (193 is the placeholder exit status for pending). That will be an easy fix.

Comment 8 Reid Wahl 2021-07-21 23:25:29 UTC

(In reply to Ken Gaillot from comment #7)
> Are you sure this was one of the bad events?

Hmm, nope. I'm using the same configuration as far as I can tell (compared pe-input files and everything), but now I'm having a hard time reproducing the issue. I'll see if I can sort it out.

Comment 9 Reid Wahl 2021-07-21 23:44:36 UTC

(In reply to Reid Wahl from comment #8)
> Hmm, nope. I'm using the same configuration as far as I can tell (compared
> pe-input files and everything), but now I'm having a hard time reproducing
> the issue. I'll see if I can sort it out.

Ah, I had a lingering ban constraint that prevented the moves. Sorry.

10-second debug log snippet from a valid test. No cancellation. http://pastebin.test.redhat.com/981211

Comment 10 Ken Gaillot 2021-07-22 00:37:29 UTC

(In reply to Reid Wahl from comment #9)
> (In reply to Reid Wahl from comment #8)
> > Hmm, nope. I'm using the same configuration as far as I can tell (compared
> > pe-input files and everything), but now I'm having a hard time reproducing
> > the issue. I'll see if I can sort it out.
> 
> Ah, I had a lingering ban constraint that prevented the moves. Sorry.
> 
> 10-second debug log snippet from a valid test. No cancellation.
> http://pastebin.test.redhat.com/981211

Comparing the two, the bad case is missing

    Jul 21 13:24:41 fastvm-rhel-8-0-23 pacemaker-schedulerd[389513] (rsc_action_digest_cmp)         info: Parameters to 10000ms-interval monitor action for dummy on node1 changed: hash was calculated-failure-timeout vs. now 4811cef7f7f94e3a35a70be7916cb2fd (restart:3.7.1) 0:7;11:50:0:df562f33-ab06-469c-9fa1-4bddfe9f25c4

In the initial transition, because dummy is *moving* while its failure is being cleared, RecurringOp() checks whether any configured monitor is active on the *new* node, and pe_action_reschedule is never set because there isn't. Instead, it relies on the stop on the old node to cancel the monitor there.

However, the first transition is interrupted before the stop can be done, and the new transition neither needs the stop nor has an expired failure (it was already cleared), so the monitor is neither cancelled nor rescheduled.

I think the fix will be to schedule an explicit monitor cancellation before clearing a recurring monitor failure for failure-timeout. We need to stop or reschedule the monitor, and cancelling it will be needed either way. We just need to make sure there's no ordering loop if the failure is behind a remote connection that may be starting, stopping, restarting, or moving.

Comment 12 RHEL Program Management 2023-09-22 19:15:19 UTC

Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 13 RHEL Program Management 2023-09-22 19:15:47 UTC

This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated.  Be sure to add yourself to Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.