Bug 1856015
Summary: | pcmk_monitor_timeout doesn't apply to stonith device start actions [RHEL 8] | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 8 | Reporter: | Reid Wahl <nwahl> | ||||
Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> | ||||
Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> | ||||
Severity: | medium | Docs Contact: | |||||
Priority: | low | ||||||
Version: | 8.2 | CC: | abeekhof, cluster-maint, cluster-qe, jruemker, kgaillot, msmazova, nwahl, sbradley | ||||
Target Milestone: | rc | Keywords: | Reopened | ||||
Target Release: | 8.4 | ||||||
Hardware: | All | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | pacemaker-2.0.5-1.el8 | Doc Type: | Bug Fix | ||||
Doc Text: |
Cause: Pacemaker's scheduler used the configured start op timeout (if any) for fence device resource starts, but the fencer would use the fence-specific pcmk_monitor_timeout when running a monitor as part of the start.
Consequence: The controller could lose track of the start action (eventually timing it out) even if the action completed within the configured pcmk_monitor_timeout. Also, log messages would be inconsistent about which timeout was used.
Fix: The scheduler now uses pcmk_monitor_timeout as the timeout for fence device start and probe operations.
Result: All of pacemaker's components have a consistent timeout for start and probe operations, so actions time out appropriately and log messages are consistent.
|
Story Points: | --- | ||||
Clone Of: | 1677345 | ||||||
: | 1858462 (view as bug list) | Environment: | |||||
Last Closed: | 2021-05-18 15:26:41 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | 1677345, 1885645 | ||||||
Bug Blocks: | |||||||
Attachments: |
|
The lowest sleep value with which I've been able to reproduce the issue is sleep(165) with pcmk_reboot_retries=1 (so one attempt). A sleep(155) with pcmk_reboot_retries=1 doesn't reproduce it. The start action is lost but the stop action completes in time.

start timeout + cluster delay + stop timeout + cluster delay = 60 + 20 + 60 + 20 = 160

I suspect that when the start operation kicks off a monitor, it's async and doesn't die when the start operation times out. Then the stop operation is initiated, but the fencer (or some component) doesn't respond while the monitor operation is still running. The run with sleep(155) (when stop didn't fail) seems to confirm this. The stop operation didn't begin executing until the monitor operation (spawned by "start") timed out after 155 seconds.

> Jul 11 13:43:28 fastvm-rhel-8-0-23 pacemaker-controld [2223] (te_rsc_command:321) notice: Initiating start operation vmfence_start_0 locally on fastvm-rhel-8-0-23 | action 3
> Jul 11 13:43:28 fastvm-rhel-8-0-23 pacemaker-execd [2220] (log_execute:134) info: executing - rsc:vmfence action:start call_id:76
> Jul 11 13:44:48 fastvm-rhel-8-0-23 pacemaker-controld [2223] (action_timer_callback:700) error: Node fastvm-rhel-8-0-23 did not send start result (via controller) within 20000ms (action timeout plus cluster-delay)
> Jul 11 13:44:48 fastvm-rhel-8-0-23 pacemaker-controld [2223] (log_synapse_action:207) error: [Action 3]: In-flight resource op vmfence_start_0 on fastvm-rhel-8-0-23 (priority: 0, waiting: (null))
> Jul 11 13:44:49 fastvm-rhel-8-0-23 pacemaker-controld [2223] (te_rsc_command:321) notice: Initiating stop operation vmfence_stop_0 locally on fastvm-rhel-8-0-23 | action 1
> Jul 11 13:46:03 fastvm-rhel-8-0-23 pacemaker-fenced [2219] (operation_finished:374) debug: fence_vmware_soap_monitor_1:4811 - exited with rc=1
> Jul 11 13:46:03 fastvm-rhel-8-0-23 pacemaker-execd [2220] (log_execute:134) info: executing - rsc:vmfence action:stop call_id:77
> Jul 11 13:46:03 fastvm-rhel-8-0-23 pacemaker-controld [2223] (process_lrm_event:2783) notice: Result of start operation for vmfence on fastvm-rhel-8-0-23: 1 (error) | call=76 key=vmfence_start_0 confirmed=true cib-update=363
> Jul 11 13:46:03 fastvm-rhel-8-0-23 pacemaker-controld [2223] (process_lrm_event:2783) notice: Result of stop operation for vmfence on fastvm-rhel-8-0-23: 0 (ok) | call=77 key=vmfence_stop_0 confirmed=true cib-update=364

The issue disappears when code is added to Pacemaker to make the action->timeout equal to the pcmk_monitor_timeout (if present) for stonith start operations. I've tried adding that logic to te_rsc_command() and to unpack_action(), but:

- The op->timeout isn't getting updated. So if a timeout correctly occurs after, say, 40 seconds, the "Result of <op>" message still shows "timeout=20000ms".
- It's not clear to me why stonith start operations are subject to the 20-second default timeout, while monitor operations are allowed the full pcmk_monitor_timeout. It seems like once we leave scheduler/controld and hit the fenced library, they're processed the same. `pcmk_monitor_timeout` is processed by fenced... I don't see what's going back and updating the action timer to pcmk_monitor_timeout in the mainloop for monitors (and not for starts), if that's indeed happening.

A fence device start consists of registering the device with the fencer (which is generally very quick) and, if that is successful, a monitor of the device. A fence device stop consists solely of unregistering the device with the fencer (which again is generally very quick).

However, the queueing of commands for a fence device is controlled by the global concurrent-fencing property (which defaults to true since 8.1) and by pcmk_action_limit for each device (which defaults to 1). Thus by default a fencing device can only process one action at a time, and all actions are serialized. So it is expected that a monitor is executed as part of the start action.
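The reproducer's timeout arithmetic above (60 + 20 + 60 + 20 = 160) can be expressed as a toy model. This is illustrative only; the names below are not Pacemaker's API, and the 60 s / 20 s constants come from this bug's logs and comments.

```python
# Toy model of the controller's action-timer arithmetic seen in the logs
# above. Names and constants are taken from this bug report, not from
# Pacemaker's internals.

OP_TIMEOUT = 60     # configured start/stop operation timeout (seconds)
CLUSTER_DELAY = 20  # cluster-delay property

def escalation_timeout(op_timeout, cluster_delay=CLUSTER_DELAY):
    """When the controller declares an in-flight action lost:
    "action timeout plus cluster-delay" (see the log excerpt above)."""
    return op_timeout + cluster_delay

# Window in which a serialized stop can still succeed after the start is
# declared lost: start + delay + stop + delay = 160 seconds.
window = 2 * escalation_timeout(OP_TIMEOUT)
print(window)  # 160

# A monitor injected with sleep(155) finishes inside the window, so the
# stop completes; sleep(165) overruns it and the stop is lost as well.
print(155 < window, 165 < window)  # True False
```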
It is also expected that the stop will wait until the monitor completes before being executed. However, I do think two things might be considered a bug here:

1. Should stops be subject to serialization? Since they only involve unregistering the device, there's no communication with the physical device, which is where serialization matters. On the other hand, if there's some action in flight, it may be a bad idea to unregister the device.
2. The time a serialized action spends waiting for other actions to complete eats into its own timeout. I could see arguments either way on this one, but it makes more sense to me that an action's timeout would only start ticking once it is actually initiated, since it has no control over what it is waiting on.

I have to look more into the start vs. monitor configuration. A start includes a monitor, but that's pretty routine even for regular OCF agents, and the user is expected to configure the start timeout with that in mind. However, fence devices are unusual, so an exception might be warranted in this case. I'm not sure a configured start or stop operation in the CIB has any effect whatsoever on the fencer. If it does, then it might make sense to just leave it so the user has to configure a start timeout that includes time for a monitor. If it doesn't, then defaulting the start timeout to pcmk_monitor_timeout (maybe plus a little bit more for registration) would be better.

(In reply to Ken Gaillot from comment #4)
> I have to look more into the start vs monitor configuration. A start
> includes a monitor, but that's pretty routine even for regular OCF agents,
> and the user is expected to configure the start timeout with that in mind.
> However fence devices are unusual, so an exception might be warranted in
> this case.

I would agree with that.
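Issue 2 above (queue wait eating into an action's timeout) can be sketched as follows. This is an illustrative model, not Pacemaker code; the helper name and numbers are hypothetical.

```python
# Illustrative sketch (not Pacemaker code) of the question above: does a
# serialized action's timeout clock start when the action is queued, or
# only when it is actually dispatched to the device?

def remaining_budget(timeout, queued_at, dispatched_at, clock_starts_at_dispatch):
    """Seconds of timeout budget left when the action finally executes."""
    if clock_starts_at_dispatch:
        return timeout                  # waiting in the queue costs nothing
    wait = dispatched_at - queued_at
    return max(0, timeout - wait)       # queue wait eats into the budget

# A stop with a 60 s timeout queued behind a monitor that runs for 155 s:
print(remaining_budget(60, 0, 155, clock_starts_at_dispatch=False))  # 0 -> guaranteed timeout
print(remaining_budget(60, 0, 155, clock_starts_at_dispatch=True))   # 60
```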
In an OCF agent, the start operation usually performs some tangible action within the agent to start the managed resource, and it runs a monitor before and/or after performing that operation to make sure the managed resource actually gets started. There is also only **one** place to configure a timeout: the operations (i.e., CRM_meta_timeout).

For stonith devices, the start operation doesn't perform any tangible action other than (1) registering the device and (2) running a monitor. (1) doesn't use the fence agent. (2) has the fencer call the fence agent in the same way as a monitor operation would. There are **two** places to configure a timeout: the operation timeouts (usually untouched) and pcmk_monitor_timeout (which is what most users use). pcmk_monitor_timeout is applied to the spawned fence monitor, but **not** to the overall start operation. So the start operation times out while the spawned monitor operation continues to run.

> I'm not sure a configured start or stop operation in the CIB has
> any effect whatsoever on the fencer. If it does, then it might make sense to
> just leave it so the user has to configure a start timeout that includes
> time for a monitor.

IIRC that does work for a start operation.

> If it doesn't, then defaulting start timeout to
> pcmk_monitor_timeout (maybe plus a little bit more for registration) would
> be better.

This is basically what PR#2108 does. IMO, since a start and monitor are basically identical operations from an end-user perspective, the most intuitive and least astonishing approach is to have pcmk_monitor_timeout override the meta timeout for the start and monitor actions.

- It's already overriding the meta timeout for the monitor action (SOMEHOW -- it's not happening in te_rsc_command AFAICT). So the controller's transition action for the monitor already doesn't fail until the fencer's async monitor expires after pcmk_monitor_timeout. This is not (yet) the case for the start action.
- Even for monitor operations, if the operation does time out after pcmk_monitor_timeout, the "Result of <op>" message uses the meta timeout. So for example, if pcmk_monitor_timeout=120s and the monitor op times out after 120s, there will be a message like "Result of monitor ... Timed Out (timeout=20000)". This is because `construct_op` still uses the meta timeout and doesn't consider pcmk_monitor_timeout.

The general questions about serialized stops and serialized action behavior are interesting to consider, and #2 at least sounds like a bug to me. I'd be more concerned about #1 for the reason you mentioned.

OK, so I think we've boiled this down to 2 issues:

* The timeout for fence device start operations should default to the value of pcmk_monitor_timeout (plus 1 or 2 seconds for the device registration)
* When one action on a device is serialized after another due to pcmk_action_limit, the first action's execution time should not eat into the second action's timeout (whether at the controller, executor, or fencer)

Pick one of those to focus this bz on, and open a separate bz for the other.

Fix is merged upstream as of https://github.com/ClusterLabs/pacemaker/pull/2108

The timeout for fence device start operations should default to the value of pcmk_monitor_timeout.

before fix
-----------

[root@vmware-beta-07 ~]# rpm -q pacemaker
pacemaker-2.0.4-6.el8.x86_64

Configure a fence device "vmfence" and allow it to fully start. Set pcmk_monitor_timeout to 165s.
> [root@vmware-beta-07 ~]# pcs status
> Cluster name: STSRHTS23511
> Cluster Summary:
> * Stack: corosync
> * Current DC: vmware-beta-07 (version 2.0.4-6.el8-2deceaa3ae) - partition with quorum
> * Last updated: Wed Dec 2 15:22:12 2020
> * Last change: Wed Dec 2 15:21:16 2020 by root via cibadmin on vmware-beta-07
> * 2 nodes configured
> * 1 resource instance configured
> Node List:
> * Online: [ vmware-beta-07 vmware-beta-08 ]
> Full List of Resources:
> * vmfence (stonith:fence_vmware_soap): Started vmware-beta-07
> Daemon Status:
> corosync: active/disabled
> pacemaker: active/disabled
> pcsd: active/enabled
> [root@vmware-beta-07 ~]# pcs stonith update vmfence pcmk_monitor_timeout=165s
> [root@vmware-beta-07 ~]# pcs stonith config
> Resource: vmfence (class=stonith type=fence_vmware_soap)
> Attributes: ipaddr=vcenter-beta.cluster-qe.lab.eng.brq.redhat.com login=<redacted> passwd=<redacted> pcmk_host_check=static-list pcmk_host_list=vmware-beta-07.cluster-qe.lab.eng.brq.redhat.com,vmware-beta-08.cluster-qe.lab.eng.brq.redhat.com pcmk_host_map=vmware-beta-07.cluster-qe.lab.eng.brq.redhat.com:vmware-beta-07:vmware-beta-08.cluster-qe.lab.eng.brq.redhat.com:vmware-beta-08 pcmk_monitor_timeout=165s ssl=on ssl_insecure=1
> Operations: monitor interval=60s (vmfence-monitor-interval-60s)

Modify /usr/sbin/fence_vmware_soap as it is described in comment 1. This leads to a monitor failure, and then the stonith device "vmfence" fails to start.
> [root@vmware-beta-07 ~]# pcs status
> Cluster name: STSRHTS23511
> Cluster Summary:
> * Stack: corosync
> * Current DC: vmware-beta-07 (version 2.0.4-6.el8-2deceaa3ae) - partition with quorum
> * Last updated: Wed Dec 2 15:30:51 2020
> * Last change: Wed Dec 2 15:24:46 2020 by root via cibadmin on vmware-beta-07
> * 2 nodes configured
> * 1 resource instance configured
> Node List:
> * Online: [ vmware-beta-07 vmware-beta-08 ]
> Full List of Resources:
> * vmfence (stonith:fence_vmware_soap): FAILED vmware-beta-07
> Failed Resource Actions:
> * vmfence_monitor_60000 on vmware-beta-07 'OCF_TIMEOUT' (198): call=16, status='Timed Out', exitreason='', last-rc-change='2020-12-02 15:28:01 +01:00', queued=0ms, exec=166227ms
> Daemon Status:
> corosync: active/disabled
> pacemaker: active/disabled
> pcsd: active/enabled
> [root@vmware-beta-07 ~]# pcs status
> Cluster name: STSRHTS23511
> Cluster Summary:
> * Stack: corosync
> * Current DC: vmware-beta-07 (version 2.0.4-6.el8-2deceaa3ae) - partition with quorum
> * Last updated: Wed Dec 2 15:32:09 2020
> * Last change: Wed Dec 2 15:24:46 2020 by root via cibadmin on vmware-beta-07
> * 2 nodes configured
> * 1 resource instance configured
> Node List:
> * Online: [ vmware-beta-07 vmware-beta-08 ]
> Full List of Resources:
> * vmfence (stonith:fence_vmware_soap): FAILED vmware-beta-07
> Failed Resource Actions:
> * vmfence_start_0 on vmware-beta-07 'error' (1): call=-1, status='Timed Out', exitreason='', last-rc-change='2020-12-02 15:32:08 +01:00', queued=0ms, exec=0ms
> Daemon Status:
> corosync: active/disabled
> pacemaker: active/disabled
> pcsd: active/enabled

Both the monitor failure and the failed "vmfence" start operation are logged with the 20-second default timeout instead of pcmk_monitor_timeout (165s).
Log excerpts:

> Dec 02 15:30:47 vmware-beta-07.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [51816] (process_lrm_event) error: Result of monitor operation for vmfence on vmware-beta-07: Timed Out | call=16 key=vmfence_monitor_60000 timeout=20000ms
> Dec 02 15:33:34 vmware-beta-07.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [51816] (process_lrm_event) error: Result of start operation for vmfence on vmware-beta-07: Timed Out | call=23 key=vmfence_start_0 timeout=20000ms

after fix
----------

[root@vmware-stable-12 ~]# rpm -q pacemaker
pacemaker-2.0.5-2.el8.x86_64

Configure a fence device "vmfence" and allow it to fully start. Set pcmk_monitor_timeout to 165s.

> [root@vmware-stable-12 ~]# pcs status
> Cluster name: vmware-brq-stable-06
> Cluster Summary:
> * Stack: corosync
> * Current DC: vmware-stable-12.cluster-qe.lab.eng.brq.redhat.com (version 2.0.5-2.el8-31aa4f5515) - partition with quorum
> * Last updated: Thu Nov 26 14:11:40 2020
> * Last change: Thu Nov 26 14:11:30 2020 by root via cibadmin on vmware-stable-12.cluster-qe.lab.eng.brq.redhat.com
> * 2 nodes configured
> * 2 resource instances configured
> Node List:
> * Online: [ vmware-stable-11.cluster-qe.lab.eng.brq.redhat.com vmware-stable-12.cluster-qe.lab.eng.brq.redhat.com ]
> Full List of Resources:
> * fence_vmware_rest (stonith:fence_vmware_rest): Started vmware-stable-11.cluster-qe.lab.eng.brq.redhat.com
> * vmfence (stonith:fence_vmware_soap): Started vmware-stable-12.cluster-qe.lab.eng.brq.redhat.com
> Daemon Status:
> corosync: active/disabled
> pacemaker: active/disabled
> pcsd: active/enabled
> [root@vmware-stable-12 ~]# pcs stonith update vmfence pcmk_monitor_timeout=165s
> [root@vmware-stable-12 ~]# pcs stonith config vmfence
> Resource: vmfence (class=stonith type=fence_vmware_soap)
> Attributes: ipaddr=vcenter-stable.cluster-qe.lab.eng.brq.redhat.com login=<redacted> passwd=<redacted> pcmk_host_check=static-list pcmk_host_list=vmware-stable-11.cluster-qe.lab.eng.brq.redhat.com,vmware-stable-12.cluster-qe.lab.eng.brq.redhat.com pcmk_host_map=vmware-stable-11.cluster-qe.lab.eng.brq.redhat.com:vmware-stable-11:vmware-stable-12.cluster-qe.lab.eng.brq.redhat.com:vmware-stable-12 pcmk_monitor_timeout=165s ssl=on ssl_insecure=1
> Operations: monitor interval=60s (vmfence-monitor-interval-60s)

Modify /usr/sbin/fence_vmware_soap as it is described in comment 1. This leads to a monitor failure, and then the stonith device "vmfence" fails to start.

> [root@vmware-stable-12 ~]# pcs status
> Cluster name: vmware-brq-stable-06
> Cluster Summary:
> * Stack: corosync
> * Current DC: vmware-stable-12.cluster-qe.lab.eng.brq.redhat.com (version 2.0.5-2.el8-31aa4f5515) - partition with quorum
> * Last updated: Thu Nov 26 14:27:48 2020
> * Last change: Thu Nov 26 14:18:31 2020 by root via cibadmin on vmware-stable-12.cluster-qe.lab.eng.brq.redhat.com
> * 2 nodes configured
> * 2 resource instances configured
> Node List:
> * Online: [ vmware-stable-11.cluster-qe.lab.eng.brq.redhat.com vmware-stable-12.cluster-qe.lab.eng.brq.redhat.com ]
> Full List of Resources:
> * fence_vmware_rest (stonith:fence_vmware_rest): Started vmware-stable-11.cluster-qe.lab.eng.brq.redhat.com
> * vmfence (stonith:fence_vmware_soap): Started vmware-stable-11.cluster-qe.lab.eng.brq.redhat.com
> Failed Resource Actions:
> * vmfence_start_0 on vmware-stable-12.cluster-qe.lab.eng.brq.redhat.com 'OCF_TIMEOUT' (198): call=28, status='Timed Out', exitreason='', last-rc-change='2020-11-26 14:24:35 +01:00', queued=0ms, exec=166165ms
> Daemon Status:
> corosync: active/disabled
> pacemaker: active/disabled
> pcsd: active/enabled

Both the monitor failure and the failed "vmfence" start operation are logged with pcmk_monitor_timeout (165s).

Log excerpts:

> Nov 26 14:24:35 vmware-stable-12.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [76149] (process_lrm_event) error: Result of monitor operation for vmfence on vmware-stable-12.cluster-qe.lab.eng.brq.redhat.com: Timed Out | call=20 key=vmfence_monitor_60000 timeout=165000ms
> Nov 26 14:27:21 vmware-stable-12.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [76149] (process_lrm_event) error: Result of start operation for vmfence on vmware-stable-12.cluster-qe.lab.eng.brq.redhat.com: Timed Out | call=28 key=vmfence_start_0 timeout=165000ms

marking verified in pacemaker-2.0.5-2.el8

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:1782
|
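For illustration, the timeout-selection rule that the verification above exercises can be sketched as follows. The helper below is hypothetical, not Pacemaker's scheduler code; it only models the observable behavior (per the fix, fence device start and probe/monitor operations use pcmk_monitor_timeout when it is configured).

```python
# Illustrative model (hypothetical helper, not Pacemaker code) of the
# timeout-selection rule: for fence device start and monitor operations,
# a configured pcmk_monitor_timeout overrides the operation meta timeout.

DEFAULT_ACTION_TIMEOUT_MS = 20000  # the 20 s default seen in the "before fix" logs

def op_timeout_ms(action, meta_timeout_ms=None, pcmk_monitor_timeout_ms=None):
    if action in ("start", "monitor") and pcmk_monitor_timeout_ms is not None:
        return pcmk_monitor_timeout_ms       # fence-specific timeout wins
    if meta_timeout_ms is not None:
        return meta_timeout_ms               # configured operation timeout
    return DEFAULT_ACTION_TIMEOUT_MS         # fallback default

# With pcmk_monitor_timeout=165s, start and monitor both get 165000 ms;
# without it, the start falls back to the 20000 ms default:
print(op_timeout_ms("start", pcmk_monitor_timeout_ms=165000))  # 165000
print(op_timeout_ms("start"))                                  # 20000
```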
Created attachment 1700705 [details]
Debug logs from test

PCMK_debug=pacemaker-controld,pacemaker-execd,pacemaker-fenced

Resource: vmfence (class=stonith type=fence_vmware_soap)
 Attributes: ipaddr=vcenter.gsslab.brq.redhat.com login=gssuser passwd=<redacted> pcmk_monitor_timeout=600s ssl_insecure=1
 Operations: monitor interval=60s (vmfence-monitor-interval-60s)

Added to fence_vmware_soap soap_login():
~~~
sleep(100)
fail(EC_LOGIN_DENIED)
~~~
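A self-contained model of this fault injection is sketched below. In the real agent, fail() and EC_LOGIN_DENIED come from the fence-agents library and the delay is 100 seconds; the stand-in definitions and the short default delay here are illustrative only, so the sketch runs on its own.

```python
# Self-contained model of the fault injection above: make the agent's
# login path hang (blowing the operation timeout) and then report a
# login failure. fail()/EC_LOGIN_DENIED are stand-ins for the
# fence-agents library symbols; the delay is shortened for illustration.
import time

EC_LOGIN_DENIED = "EC_LOGIN_DENIED"  # stand-in, not the library's value

class FenceAgentFailure(Exception):
    """Stand-in for the library's fail() exit path."""

def fail(code):
    raise FenceAgentFailure(code)

def soap_login(options, delay=0.1):
    """Injected failure path, as in the modified fence_vmware_soap."""
    time.sleep(delay)        # hold the monitor past its timeout
    fail(EC_LOGIN_DENIED)    # then report a login failure
```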