Bug 1856015
Summary: | pcmk_monitor_timeout doesn't apply to stonith device start actions [RHEL 8] | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 8 | Reporter: | Reid Wahl <nwahl> | ||||
Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> | ||||
Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> | ||||
Severity: | medium | Docs Contact: | |||||
Priority: | low | ||||||
Version: | 8.2 | CC: | abeekhof, cluster-maint, cluster-qe, jruemker, kgaillot, msmazova, nwahl, sbradley | ||||
Target Milestone: | rc | Keywords: | Reopened | ||||
Target Release: | 8.4 | ||||||
Hardware: | All | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | pacemaker-2.0.5-1.el8 | Doc Type: | Bug Fix | ||||
Doc Text: |
Cause: Pacemaker's scheduler used the configured start op timeout (if any) for fence device resource starts, but the fencer would use the fence-specific pcmk_monitor_timeout when running a monitor as part of the start.
Consequence: The controller could lose track of the start action (eventually timing it out) even if the action completed within the configured pcmk_monitor_timeout. Also, log messages would be inconsistent about which timeout was used.
Fix: The scheduler now uses pcmk_monitor_timeout as the timeout for fence device start and probe operations.
Result: All of pacemaker's components have a consistent timeout for start and probe operations, so actions time out appropriately and log messages are consistent.
|
Story Points: | --- | ||||
Clone Of: | 1677345 | ||||||
: | 1858462 (view as bug list) | Environment: | |||||
Last Closed: | 2021-05-18 15:26:41 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | 1677345, 1885645 | ||||||
Bug Blocks: | |||||||
Attachments: |
|
The lowest sleep value with which I've been able to reproduce the issue is sleep(165) with pcmk_reboot_retries=1 (so one attempt). A sleep(155) with pcmk_reboot_retries=1 doesn't reproduce it. The start action is lost but the stop action completes in time.

start timeout + cluster delay + stop timeout + cluster delay = 60 + 20 + 60 + 20 = 160

I suspect that when the start operation kicks off a monitor, it's async and doesn't die when the start operation times out. Then the stop operation is initiated, but the fencer (or some component) doesn't respond while the monitor operation is still running. The run with sleep(155) (when stop didn't fail) seems to confirm this. The stop operation didn't begin executing until the monitor operation (spawned by "start") timed out after 155 seconds.

> Jul 11 13:43:28 fastvm-rhel-8-0-23 pacemaker-controld [2223] (te_rsc_command:321) notice: Initiating start operation vmfence_start_0 locally on fastvm-rhel-8-0-23 | action 3
> Jul 11 13:43:28 fastvm-rhel-8-0-23 pacemaker-execd [2220] (log_execute:134) info: executing - rsc:vmfence action:start call_id:76
> Jul 11 13:44:48 fastvm-rhel-8-0-23 pacemaker-controld [2223] (action_timer_callback:700) error: Node fastvm-rhel-8-0-23 did not send start result (via controller) within 20000ms (action timeout plus cluster-delay)
> Jul 11 13:44:48 fastvm-rhel-8-0-23 pacemaker-controld [2223] (log_synapse_action:207) error: [Action 3]: In-flight resource op vmfence_start_0 on fastvm-rhel-8-0-23 (priority: 0, waiting: (null))
> Jul 11 13:44:49 fastvm-rhel-8-0-23 pacemaker-controld [2223] (te_rsc_command:321) notice: Initiating stop operation vmfence_stop_0 locally on fastvm-rhel-8-0-23 | action 1
> Jul 11 13:46:03 fastvm-rhel-8-0-23 pacemaker-fenced [2219] (operation_finished:374) debug: fence_vmware_soap_monitor_1:4811 - exited with rc=1
> Jul 11 13:46:03 fastvm-rhel-8-0-23 pacemaker-execd [2220] (log_execute:134) info: executing - rsc:vmfence action:stop call_id:77
> Jul 11 13:46:03 fastvm-rhel-8-0-23 pacemaker-controld [2223] (process_lrm_event:2783) notice: Result of start operation for vmfence on fastvm-rhel-8-0-23: 1 (error) | call=76 key=vmfence_start_0 confirmed=true cib-update=363
> Jul 11 13:46:03 fastvm-rhel-8-0-23 pacemaker-controld [2223] (process_lrm_event:2783) notice: Result of stop operation for vmfence on fastvm-rhel-8-0-23: 0 (ok) | call=77 key=vmfence_stop_0 confirmed=true cib-update=364

The issue disappears when code is added to Pacemaker to make the action->timeout equal to the pcmk_monitor_timeout (if present) for stonith start operations. I've tried adding that logic to te_rsc_command() and to unpack_action(), but:

- The op->timeout isn't getting updated. So if a timeout correctly occurs after, say, 40 seconds, the "Result of <op>" message still shows "timeout=20000ms".
- It's not clear to me why stonith start operations are subject to the 20-second default timeout, while monitor operations are allowed the full pcmk_monitor_timeout. It seems like once we leave scheduler/controld and hit the fenced library, they're processed the same. `pcmk_monitor_timeout` is processed by fenced... I don't see what's going back and updating the action timer to pcmk_monitor_timeout in the mainloop for monitors (and not for starts), if that's indeed happening.

A fence device start consists of registering the device with the fencer (which is generally very quick) and, if that is successful, a monitor of the device. A fence device stop consists solely of unregistering the device with the fencer (which again is generally very quick).

However, the queueing of commands for a fence device is controlled by the global concurrent-fencing property (which defaults to true since 8.1) and by pcmk_action_limit for each device (which defaults to 1). Thus by default a fencing device can only process one action at a time, and all actions are serialized. So it is expected that a monitor is executed as part of the start action.
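The reproducer's timeout arithmetic above (60 + 20 + 60 + 20 = 160) can be expressed as a toy model. This is illustrative only; the names below are not Pacemaker's API, and the 60 s / 20 s constants come from this bug's logs and comments.

```python
# Toy model of the controller's action-timer arithmetic seen in the logs
# above. Names and constants are taken from this bug report, not from
# Pacemaker's internals.

OP_TIMEOUT = 60     # configured start/stop operation timeout (seconds)
CLUSTER_DELAY = 20  # cluster-delay property

def escalation_timeout(op_timeout, cluster_delay=CLUSTER_DELAY):
    """When the controller declares an in-flight action lost:
    "action timeout plus cluster-delay" (see the log excerpt above)."""
    return op_timeout + cluster_delay

# Window in which a serialized stop can still succeed after the start is
# declared lost: start + delay + stop + delay = 160 seconds.
window = 2 * escalation_timeout(OP_TIMEOUT)
print(window)  # 160

# A monitor injected with sleep(155) finishes inside the window, so the
# stop completes; sleep(165) overruns it and the stop is lost as well.
print(155 < window, 165 < window)  # True False
```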
It is also expected that the stop will wait until the monitor completes before being executed. However, I do think two things might be considered a bug here:

1. Should stops be subject to serialization? Since they only involve unregistering the device, there's no communication with the physical device, which is where serialization matters. On the other hand, if there's some action in flight, it may be a bad idea to unregister the device.
2. The time a serialized action spends waiting for other actions to complete eats into its own timeout. I could see arguments either way on this one, but it makes more sense to me that an action's timeout would only start ticking once it is actually initiated, since it has no control over what it is waiting on.

I have to look more into the start vs. monitor configuration. A start includes a monitor, but that's pretty routine even for regular OCF agents, and the user is expected to configure the start timeout with that in mind. However, fence devices are unusual, so an exception might be warranted in this case. I'm not sure a configured start or stop operation in the CIB has any effect whatsoever on the fencer. If it does, then it might make sense to just leave it so the user has to configure a start timeout that includes time for a monitor. If it doesn't, then defaulting the start timeout to pcmk_monitor_timeout (maybe plus a little bit more for registration) would be better.

(In reply to Ken Gaillot from comment #4)
> I have to look more into the start vs monitor configuration. A start
> includes a monitor, but that's pretty routine even for regular OCF agents,
> and the user is expected to configure the start timeout with that in mind.
> However fence devices are unusual, so an exception might be warranted in
> this case.

I would agree with that.
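Issue 2 above (queue wait eating into an action's timeout) can be sketched as follows. This is an illustrative model, not Pacemaker code; the helper name and numbers are hypothetical.

```python
# Illustrative sketch (not Pacemaker code) of the question above: does a
# serialized action's timeout clock start when the action is queued, or
# only when it is actually dispatched to the device?

def remaining_budget(timeout, queued_at, dispatched_at, clock_starts_at_dispatch):
    """Seconds of timeout budget left when the action finally executes."""
    if clock_starts_at_dispatch:
        return timeout                  # waiting in the queue costs nothing
    wait = dispatched_at - queued_at
    return max(0, timeout - wait)       # queue wait eats into the budget

# A stop with a 60 s timeout queued behind a monitor that runs for 155 s:
print(remaining_budget(60, 0, 155, clock_starts_at_dispatch=False))  # 0 -> guaranteed timeout
print(remaining_budget(60, 0, 155, clock_starts_at_dispatch=True))   # 60
```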
In an OCF agent, the start operation usually performs some tangible action within the agent to start the managed resource, and it runs a monitor before and/or after performing that operation to make sure the managed resource actually gets started. There is also only **one** place to configure a timeout: the operations (i.e., CRM_meta_timeout).

For stonith devices, the start operation doesn't perform any tangible action other than (1) registering the device and (2) running a monitor. (1) doesn't use the fence agent. (2) has the fencer call the fence agent in the same way as a monitor operation would. There are **two** places to configure a timeout: the operation timeouts (usually untouched) and pcmk_monitor_timeout (which is what most users use). pcmk_monitor_timeout is applied to the spawned fence monitor, but **not** to the overall start operation. So the start operation times out while the spawned monitor operation continues to run.

> I'm not sure a configured start or stop operation in the CIB has
> any effect whatsoever on the fencer. If it does, then it might make sense to
> just leave it so the user has to configure a start timeout that includes
> time for a monitor.

IIRC that does work for a start operation.

> If it doesn't, then defaulting start timeout to
> pcmk_monitor_timeout (maybe plus a little bit more for registration) would
> be better.

This is basically what PR#2108 does. IMO, since a start and monitor are basically identical operations from an end-user perspective, the most intuitive and least astonishing approach is to have pcmk_monitor_timeout override the meta timeout for the start and monitor actions.

- It's already overriding the meta timeout for the monitor action (SOMEHOW -- it's not happening in te_rsc_command AFAICT). So the controller's transition action for the monitor already doesn't fail until the fencer's async monitor expires after pcmk_monitor_timeout. This is not (yet) the case for the start action.
- Even for monitor operations, if the operation does time out after pcmk_monitor_timeout, the "Result of <op>" message uses the meta timeout. So for example, if pcmk_monitor_timeout=120s and the monitor op times out after 120s, there will be a message like "Result of monitor ... Timed Out (timeout=20000)". This is because `construct_op` still uses the meta timeout and doesn't consider pcmk_monitor_timeout.

The general questions about serialized stops and serialized action behavior are interesting to consider, and #2 at least sounds like a bug to me. I'd be more concerned about #1 for the reason you mentioned.

OK, so I think we've boiled this down to 2 issues:

* The timeout for fence device start operations should default to the value of pcmk_monitor_timeout (plus 1 or 2 seconds for the device registration)
* When one action on a device is serialized after another due to pcmk_action_limit, the first action's execution time should not eat into the second action's timeout (whether at the controller, executor, or fencer)

Pick one of those to focus this bz on, and open a separate bz for the other.

Fix is merged upstream as of https://github.com/ClusterLabs/pacemaker/pull/2108

The timeout for fence device start operations should default to the value of pcmk_monitor_timeout.

before fix
-----------

[root@vmware-beta-07 ~]# rpm -q pacemaker
pacemaker-2.0.4-6.el8.x86_64

Configure a fence device "vmfence" and allow it to fully start. Set pcmk_monitor_timeout to 165s.
> [root@vmware-beta-07 ~]# pcs status
> Cluster name: STSRHTS23511
> Cluster Summary:
> * Stack: corosync
> * Current DC: vmware-beta-07 (version 2.0.4-6.el8-2deceaa3ae) - partition with quorum
> * Last updated: Wed Dec 2 15:22:12 2020
> * Last change: Wed Dec 2 15:21:16 2020 by root via cibadmin on vmware-beta-07
> * 2 nodes configured
> * 1 resource instance configured
> Node List:
> * Online: [ vmware-beta-07 vmware-beta-08 ]
> Full List of Resources:
> * vmfence (stonith:fence_vmware_soap): Started vmware-beta-07
> Daemon Status:
> corosync: active/disabled
> pacemaker: active/disabled
> pcsd: active/enabled
> [root@vmware-beta-07 ~]# pcs stonith update vmfence pcmk_monitor_timeout=165s
> [root@vmware-beta-07 ~]# pcs stonith config
> Resource: vmfence (class=stonith type=fence_vmware_soap)
> Attributes: ipaddr=vcenter-beta.cluster-qe.lab.eng.brq.redhat.com login=<redacted> passwd=<redacted> pcmk_host_check=static-list pcmk_host_list=vmware-beta-07.cluster-qe.lab.eng.brq.redhat.com,vmware-beta-08.cluster-qe.lab.eng.brq.redhat.com pcmk_host_map=vmware-beta-07.cluster-qe.lab.eng.brq.redhat.com:vmware-beta-07:vmware-beta-08.cluster-qe.lab.eng.brq.redhat.com:vmware-beta-08 pcmk_monitor_timeout=165s ssl=on ssl_insecure=1
> Operations: monitor interval=60s (vmfence-monitor-interval-60s)

Modify /usr/sbin/fence_vmware_soap as it is described in comment 1. This leads to a monitor failure, and then the stonith device "vmfence" fails to start.
> [root@vmware-beta-07 ~]# pcs status
> Cluster name: STSRHTS23511
> Cluster Summary:
> * Stack: corosync
> * Current DC: vmware-beta-07 (version 2.0.4-6.el8-2deceaa3ae) - partition with quorum
> * Last updated: Wed Dec 2 15:30:51 2020
> * Last change: Wed Dec 2 15:24:46 2020 by root via cibadmin on vmware-beta-07
> * 2 nodes configured
> * 1 resource instance configured
> Node List:
> * Online: [ vmware-beta-07 vmware-beta-08 ]
> Full List of Resources:
> * vmfence (stonith:fence_vmware_soap): FAILED vmware-beta-07
> Failed Resource Actions:
> * vmfence_monitor_60000 on vmware-beta-07 'OCF_TIMEOUT' (198): call=16, status='Timed Out', exitreason='', last-rc-change='2020-12-02 15:28:01 +01:00', queued=0ms, exec=166227ms
> Daemon Status:
> corosync: active/disabled
> pacemaker: active/disabled
> pcsd: active/enabled
> [root@vmware-beta-07 ~]# pcs status
> Cluster name: STSRHTS23511
> Cluster Summary:
> * Stack: corosync
> * Current DC: vmware-beta-07 (version 2.0.4-6.el8-2deceaa3ae) - partition with quorum
> * Last updated: Wed Dec 2 15:32:09 2020
> * Last change: Wed Dec 2 15:24:46 2020 by root via cibadmin on vmware-beta-07
> * 2 nodes configured
> * 1 resource instance configured
> Node List:
> * Online: [ vmware-beta-07 vmware-beta-08 ]
> Full List of Resources:
> * vmfence (stonith:fence_vmware_soap): FAILED vmware-beta-07
> Failed Resource Actions:
> * vmfence_start_0 on vmware-beta-07 'error' (1): call=-1, status='Timed Out', exitreason='', last-rc-change='2020-12-02 15:32:08 +01:00', queued=0ms, exec=0ms
> Daemon Status:
> corosync: active/disabled
> pacemaker: active/disabled
> pcsd: active/enabled

Both the monitor failure and the failed "vmfence" start operation are logged with the 20-second default timeout instead of pcmk_monitor_timeout (165s).
Log excerpts:

> Dec 02 15:30:47 vmware-beta-07.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [51816] (process_lrm_event) error: Result of monitor operation for vmfence on vmware-beta-07: Timed Out | call=16 key=vmfence_monitor_60000 timeout=20000ms
> Dec 02 15:33:34 vmware-beta-07.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [51816] (process_lrm_event) error: Result of start operation for vmfence on vmware-beta-07: Timed Out | call=23 key=vmfence_start_0 timeout=20000ms

after fix
----------

[root@vmware-stable-12 ~]# rpm -q pacemaker
pacemaker-2.0.5-2.el8.x86_64

Configure a fence device "vmfence" and allow it to fully start. Set pcmk_monitor_timeout to 165s.

> [root@vmware-stable-12 ~]# pcs status
> Cluster name: vmware-brq-stable-06
> Cluster Summary:
> * Stack: corosync
> * Current DC: vmware-stable-12.cluster-qe.lab.eng.brq.redhat.com (version 2.0.5-2.el8-31aa4f5515) - partition with quorum
> * Last updated: Thu Nov 26 14:11:40 2020
> * Last change: Thu Nov 26 14:11:30 2020 by root via cibadmin on vmware-stable-12.cluster-qe.lab.eng.brq.redhat.com
> * 2 nodes configured
> * 2 resource instances configured
> Node List:
> * Online: [ vmware-stable-11.cluster-qe.lab.eng.brq.redhat.com vmware-stable-12.cluster-qe.lab.eng.brq.redhat.com ]
> Full List of Resources:
> * fence_vmware_rest (stonith:fence_vmware_rest): Started vmware-stable-11.cluster-qe.lab.eng.brq.redhat.com
> * vmfence (stonith:fence_vmware_soap): Started vmware-stable-12.cluster-qe.lab.eng.brq.redhat.com
> Daemon Status:
> corosync: active/disabled
> pacemaker: active/disabled
> pcsd: active/enabled
> [root@vmware-stable-12 ~]# pcs stonith update vmfence pcmk_monitor_timeout=165s
> [root@vmware-stable-12 ~]# pcs stonith config vmfence
> Resource: vmfence (class=stonith type=fence_vmware_soap)
> Attributes: ipaddr=vcenter-stable.cluster-qe.lab.eng.brq.redhat.com login=<redacted> passwd=<redacted> pcmk_host_check=static-list pcmk_host_list=vmware-stable-11.cluster-qe.lab.eng.brq.redhat.com,vmware-stable-12.cluster-qe.lab.eng.brq.redhat.com pcmk_host_map=vmware-stable-11.cluster-qe.lab.eng.brq.redhat.com:vmware-stable-11:vmware-stable-12.cluster-qe.lab.eng.brq.redhat.com:vmware-stable-12 pcmk_monitor_timeout=165s ssl=on ssl_insecure=1
> Operations: monitor interval=60s (vmfence-monitor-interval-60s)

Modify /usr/sbin/fence_vmware_soap as it is described in comment 1. This leads to a monitor failure, and then the stonith device "vmfence" fails to start.

> [root@vmware-stable-12 ~]# pcs status
> Cluster name: vmware-brq-stable-06
> Cluster Summary:
> * Stack: corosync
> * Current DC: vmware-stable-12.cluster-qe.lab.eng.brq.redhat.com (version 2.0.5-2.el8-31aa4f5515) - partition with quorum
> * Last updated: Thu Nov 26 14:27:48 2020
> * Last change: Thu Nov 26 14:18:31 2020 by root via cibadmin on vmware-stable-12.cluster-qe.lab.eng.brq.redhat.com
> * 2 nodes configured
> * 2 resource instances configured
> Node List:
> * Online: [ vmware-stable-11.cluster-qe.lab.eng.brq.redhat.com vmware-stable-12.cluster-qe.lab.eng.brq.redhat.com ]
> Full List of Resources:
> * fence_vmware_rest (stonith:fence_vmware_rest): Started vmware-stable-11.cluster-qe.lab.eng.brq.redhat.com
> * vmfence (stonith:fence_vmware_soap): Started vmware-stable-11.cluster-qe.lab.eng.brq.redhat.com
> Failed Resource Actions:
> * vmfence_start_0 on vmware-stable-12.cluster-qe.lab.eng.brq.redhat.com 'OCF_TIMEOUT' (198): call=28, status='Timed Out', exitreason='', last-rc-change='2020-11-26 14:24:35 +01:00', queued=0ms, exec=166165ms
> Daemon Status:
> corosync: active/disabled
> pacemaker: active/disabled
> pcsd: active/enabled

Both the monitor failure and the failed "vmfence" start operation are logged with pcmk_monitor_timeout (165s).

Log excerpts:

> Nov 26 14:24:35 vmware-stable-12.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [76149] (process_lrm_event) error: Result of monitor operation for vmfence on vmware-stable-12.cluster-qe.lab.eng.brq.redhat.com: Timed Out | call=20 key=vmfence_monitor_60000 timeout=165000ms
> Nov 26 14:27:21 vmware-stable-12.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [76149] (process_lrm_event) error: Result of start operation for vmfence on vmware-stable-12.cluster-qe.lab.eng.brq.redhat.com: Timed Out | call=28 key=vmfence_start_0 timeout=165000ms

marking verified in pacemaker-2.0.5-2.el8

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:1782
|
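For illustration, the timeout-selection rule that the verification above exercises can be sketched as follows. The helper below is hypothetical, not Pacemaker's scheduler code; it only models the observable behavior (per the fix, fence device start and probe/monitor operations use pcmk_monitor_timeout when it is configured).

```python
# Illustrative model (hypothetical helper, not Pacemaker code) of the
# timeout-selection rule: for fence device start and monitor operations,
# a configured pcmk_monitor_timeout overrides the operation meta timeout.

DEFAULT_ACTION_TIMEOUT_MS = 20000  # the 20 s default seen in the "before fix" logs

def op_timeout_ms(action, meta_timeout_ms=None, pcmk_monitor_timeout_ms=None):
    if action in ("start", "monitor") and pcmk_monitor_timeout_ms is not None:
        return pcmk_monitor_timeout_ms       # fence-specific timeout wins
    if meta_timeout_ms is not None:
        return meta_timeout_ms               # configured operation timeout
    return DEFAULT_ACTION_TIMEOUT_MS         # fallback default

# With pcmk_monitor_timeout=165s, start and monitor both get 165000 ms;
# without it, the start falls back to the 20000 ms default:
print(op_timeout_ms("start", pcmk_monitor_timeout_ms=165000))  # 165000
print(op_timeout_ms("start"))                                  # 20000
```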
Created attachment 1700705 [details]
Debug logs from test

PCMK_debug=pacemaker-controld,pacemaker-execd,pacemaker-fenced

Resource: vmfence (class=stonith type=fence_vmware_soap)
 Attributes: ipaddr=vcenter.gsslab.brq.redhat.com login=gssuser passwd=<redacted> pcmk_monitor_timeout=600s ssl_insecure=1
 Operations: monitor interval=60s (vmfence-monitor-interval-60s)

Added to fence_vmware_soap soap_login():
~~~
sleep(100)
fail(EC_LOGIN_DENIED)
~~~
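A self-contained model of this fault injection is sketched below. In the real agent, fail() and EC_LOGIN_DENIED come from the fence-agents library and the delay is 100 seconds; the stand-in definitions and the short default delay here are illustrative only, so the sketch runs on its own.

```python
# Self-contained model of the fault injection above: make the agent's
# login path hang (blowing the operation timeout) and then report a
# login failure. fail()/EC_LOGIN_DENIED are stand-ins for the
# fence-agents library symbols; the delay is shortened for illustration.
import time

EC_LOGIN_DENIED = "EC_LOGIN_DENIED"  # stand-in, not the library's value

class FenceAgentFailure(Exception):
    """Stand-in for the library's fail() exit path."""

def fail(code):
    raise FenceAgentFailure(code)

def soap_login(options, delay=0.1):
    """Injected failure path, as in the modified fence_vmware_soap."""
    time.sleep(delay)        # hold the monitor past its timeout
    fail(EC_LOGIN_DENIED)    # then report a login failure
```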