Bug 1941108
| Summary: | Stonith action times out if fenced node is still in CPG membership when the stonith action timeout expires | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Reid Wahl <nwahl> |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | CLOSED MIGRATED | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | 8.3 | CC: | cluster-maint, jke, sbradley |
| Target Milestone: | rc | Keywords: | MigratedToJIRA, Triaged |
| Target Release: | 8.5 | Flags: | pm-rhel: mirror+ |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-09-22 19:01:24 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Two general responses to this come to mind:

The first, and obviously simplest, is to change nothing, document the issue in the KB, and recommend that users set pcmk_reboot_timeout to a value greater than (token) + (consensus) + (how long they expect the reboot to take). The only downside I see to this -- besides that the issue would ideally be handled automatically -- is that most users and support engineers currently believe pcmk_reboot_timeout only needs to be set higher than "how long they expect the reboot to take". It's never an exact science anyway and usually gets increased in increments of 60 to 240 seconds at a time, so IMO this wouldn't be an undue hardship.

The second is to avoid initiating a second fence attempt but ensure the reply is received before declaring the action complete. Something like "if we have an operation result locally (i.e., the async command completed) within the stonith action timeout, then keep waiting for the async reply to be delivered but don't mark the action as timed out." There may or may not be a safe way to implement that.

I read a little bit more about CPG and virtual synchrony last night, and it sounds like, for better or for worse, we're stuck waiting for that reply and won't get it until the fenced node leaves the membership.

(In reply to Reid Wahl from comment #1)
> The second is to avoid initiating a second fence attempt but ensure the
> reply is received before declaring the action complete. Something like "if
> we have an operation result locally (i.e., the async command completed)
> within the stonith action timeout, then keep waiting for the async reply to
> be delivered but don't mark the action as timed out." There may or may not
> be a safe way to implement that.

D'oh. The origin may not be the same node that's executing fencing, if the action was delegated. That renders #2 infeasible as described. Two-node test cluster tunnel vision :)

One other idea occurred to me: add the token timeout and consensus timeout to the stonith action timeout (#if SUPPORT_COROSYNC). Maybe we could do this conditionally, only if the node-to-be-fenced is still part of the CPG membership when fencing is initiated. There's no reason to add the extra timeout if the node has already left the membership and is getting fenced due to token loss. This way, only the timeout would change. We wouldn't be adding complexity or introducing bugs elsewhere.

If I follow, the issue is that the originator of fencing must wait for a CPG reply from the executor of fencing (even if it's the same node), and the CPG reply can't arrive until corosync detects the fenced node as having left the CPG. That means the token timeout eats into the fence action timeout, even if the fence action actually completed early in that period. If the fence action timeout is lower than the token timeout, it will always time out.

My first instinct is that the originator should wait the maximum of the fence action timeout and the corosync CPG timeout before declaring the operation timed out. The fencer would need to know the CPG timeout, possibly requiring a CFG API connection, and I'm not completely sure that's a good idea. We could maybe use the cluster-delay property instead, though we'd have to recommend that users set that above the token timeout (which may already be a practical requirement, I'm not sure).

Aside: there are no specific definitions of task, action, and operation in Pacemaker, and they are often used interchangeably. I lean toward using "action" for the generic type of action (start, reboot, etc.) and "operation" for a specific instance of that action, but usage is currently inconsistent.
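For reference, a minimal sketch of the first workaround above (sizing pcmk_reboot_timeout past token + consensus + the expected reboot time), assuming the xvm2 stonith device and the values from this report; the exact padding is a judgment call:

```sh
# Check the runtime token and consensus timeouts (milliseconds)
corosync-cmapctl | grep -E 'runtime\.config\.totem\.(token|consensus) '

# Size pcmk_reboot_timeout above token + consensus + expected reboot time.
# For example, with token=40000 and the default consensus (1.2 * token =
# 48000), something comfortably above ~90s plus the reboot itself:
pcs stonith update xvm2 pcmk_reboot_timeout=180s
```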
(In reply to Ken Gaillot from comment #4)
> If I follow, the issue is that the originator of fencing must wait for a CPG
> reply from the executor of fencing (even if it's the same node), and the CPG
> reply can't arrive until corosync detects the fenced node as having left the
> CPG. That means the token timeout eats into the fence action timeout, even
> if the fence action actually completed early in that period. If the fence
> action timeout is lower than the token timeout, it will always time out.

That sounds accurate. For fencing due to token loss, this is not an issue as long as the failed node doesn't rejoin the CPG membership before fencing completes. It can occur when a node is fenced for other reasons.

> My first instinct is that the originator should wait the maximum of the
> fence action timeout and the corosync CPG timeout before declaring the
> operation timed out. The fencer would need to know the CPG timeout, possibly
> requiring a CFG API connection, and I'm not completely sure that's a good
> idea. We could maybe use the cluster-delay property instead, though we'd
> have to recommend that users set that above the token timeout

Using cluster-delay is worth testing. I can't remember off the top of my head whether the timeout here includes the cluster-delay or not.

Setting pcmk_<action>_timeout is already a valid workaround, although it requires accounting for both the token timeout and the time it takes for the stonith action to complete.

> (which may already be a practical requirement, I'm not sure).

Not that I know of. We have customers setting the token timeout to values as high as 300000 (the max supported value), and I can't recall ever seeing a customer override the default cluster-delay. It hasn't caused any issues that we've identified yet (besides potentially this one).

(In reply to Reid Wahl from comment #5)
> (In reply to Ken Gaillot from comment #4)
> > If I follow, the issue is that the originator of fencing must wait for a CPG
> > reply from the executor of fencing (even if it's the same node), and the CPG
> > reply can't arrive until corosync detects the fenced node as having left the
> > CPG. That means the token timeout eats into the fence action timeout, even
> > if the fence action actually completed early in that period. If the fence
> > action timeout is lower than the token timeout, it will always time out.
>
> That sounds accurate. For fencing due to token loss, this is not an issue as
> long as the failed node doesn't rejoin the CPG membership before fencing
> completes. It can occur when a node is fenced for other reasons.

That makes sense. It sounds like the possibility is fairly small, with the most likely scenario being a node gradually becoming unusable (due to load or network issues, etc.), something like a stop failure being the first fenceable symptom, and the node dropping out completely just after the fence action is initiated.

> > My first instinct is that the originator should wait the maximum of the
> > fence action timeout and the corosync CPG timeout before declaring the
> > operation timed out. The fencer would need to know the CPG timeout, possibly
> > requiring a CFG API connection, and I'm not completely sure that's a good
> > idea. We could maybe use the cluster-delay property instead, though we'd
> > have to recommend that users set that above the token timeout
>
> Using cluster-delay is worth testing. I can't remember off the top of my
> head whether the timeout here includes the cluster-delay or not.
>
> Setting pcmk_<action>_timeout is already a valid workaround, although it
> requires accounting for both the token timeout and the time it takes for the
> stonith action to complete.
>
> > (which may already be a practical requirement, I'm not sure).
>
> Not that I know of. We have customers setting the token timeout to values as
> high as 300000 (the max supported value), and I can't recall ever seeing a

:-o

> customer override the default cluster-delay. It hasn't caused any issues
> that we've identified yet (besides potentially this one).

In that case it's probably best not to overload cluster-delay. A solution that occurs to me would be to track the token timeout (via a CFG connection) in the fencer, and add it to the fence action timeout when the target is an active cluster member.

Given the rarity, the limited impact, and the availability of a workaround, I'm setting this low priority. For now, we can document the issue.
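As a rough illustration of the fencer-side idea above, a hypothetical standalone sketch (not Pacemaker code) that reads corosync's runtime token and consensus timeouts. It uses the CMAP API rather than the CFG API mentioned above, since those runtime values are exposed under runtime.config.totem (as in the corosync-cmapctl output in the description below); the key names and padding arithmetic are assumptions for illustration only.

```c
/* Hypothetical sketch: read corosync's runtime token/consensus timeouts,
 * the values a fencer-side fix would need before padding a fence action
 * timeout for a target that is still an active cluster member.
 * Build (assumption): gcc cmap_timeouts.c -o cmap_timeouts -lcmap
 */
#include <stdio.h>
#include <inttypes.h>
#include <corosync/cmap.h>

int main(void)
{
    cmap_handle_t handle;
    uint32_t token_ms = 0;
    uint32_t consensus_ms = 0;
    cs_error_t rc;

    /* Connect to corosync's CMAP service */
    rc = cmap_initialize(&handle);
    if (rc != CS_OK) {
        fprintf(stderr, "cmap_initialize failed: %d\n", (int) rc);
        return 1;
    }

    /* Runtime keys as shown by corosync-cmapctl in the description below */
    cmap_get_uint32(handle, "runtime.config.totem.token", &token_ms);
    cmap_get_uint32(handle, "runtime.config.totem.consensus", &consensus_ms);

    printf("token=%" PRIu32 "ms consensus=%" PRIu32 "ms\n",
           token_ms, consensus_ms);
    printf("hypothetical fence-timeout padding: %" PRIu32 "ms\n",
           token_ms + consensus_ms);

    cmap_finalize(handle);
    return 0;
}
```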
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated. Be sure to add yourself to the Jira issue's "Watchers" field to continue receiving updates, and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer. You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like: "Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.
Description of problem:

If a fenced node is still in the corosync CPG membership when the stonith action's timeout expires, the node gets rebooted successfully but the stonith action fails due to a timeout. This results in a fenced node getting rebooted twice in a row.

The stonith action timeout seems to default to 72s == TIMEOUT_MULTIPLY_FACTOR * peer (or device?) timeout. So in practice, this issue tends to happen if:

((time for stonith op to complete) + token + consensus) > 72s

A simple way to reproduce is:
1. Set the token timeout to 40000 (a nice round number).
2. Disable cluster services from starting on boot on the node to be fenced.
3. Cause a resource operation with on-fail=fence to fail on one node.

With debug enabled for pacemaker-fenced and trace enabled for fenced_remote.c and fenced_commands.c, you should get logs like the following when node2 gets fenced. Note that the operation returns "OK" but the stonith action still times out. (I could be misusing operation/action terms but you get the idea.)
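A sketch of one way the debug/trace settings mentioned above might be enabled, assuming the stock /etc/sysconfig/pacemaker environment file on RHEL 8; Pacemaker must be restarted for the change to take effect:

```sh
# Additions to /etc/sysconfig/pacemaker (assumed default location)
PCMK_debug=yes                                      # debug logging for all daemons
PCMK_trace_files=fenced_remote.c,fenced_commands.c  # trace logging for the fencer files above
```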
[root@fastvm-rhel-8-0-23 pacemaker]# date; corosync-cmapctl | grep 'totem\.token '; pcs resource config | grep dummy1-monitor-interval; pcs stonith config | grep -A2 xvm2
Fri Mar 19 17:53:15 PDT 2021
runtime.config.totem.token (u32) = 40000
totem.token (u32) = 40000
monitor interval=10s on-fail=fence timeout=20s (dummy1-monitor-interval-10s)
Resource: xvm2 (class=stonith type=fence_xvm)
Attributes: pcmk_host_map=node2:fastvm-rhel-8.0-24
Operations: monitor interval=60s (xvm2-monitor-interval-60s)

# Request and initiate stonith action against node2
Mar 19 17:53:44 fastvm-rhel-8-0-23 pacemaker-schedulerd[333020] (unpack_rsc_op_failure:3027) warning: Unexpected result (error) was recorded for monitor of dummy1 on node2 at Mar 19 17:53:44 2021 | rc=1 id=dummy1_last_failure_0
Mar 19 17:53:44 fastvm-rhel-8-0-23 pacemaker-schedulerd[333020] (pe_fence_node:162) warning: Cluster node node2 will be fenced: dummy1 failed there
Mar 19 17:53:44 fastvm-rhel-8-0-23 pacemaker-schedulerd[333020] (stage6:1652) warning: Scheduling Node node2 for STONITH
Mar 19 17:53:44 fastvm-rhel-8-0-23 pacemaker-controld [333021] (te_fence_node:859) notice: Requesting fencing (reboot) of node node2 | action=6 timeout=60000
Mar 19 17:53:44 fastvm-rhel-8-0-23 pacemaker-fenced [333017] (can_fence_host_with_device:1738) notice: xvm2 is eligible to fence (reboot) node2 (aka. 'fastvm-rhel-8.0-24'): static-list
Mar 19 17:53:44 fastvm-rhel-8-0-23 pacemaker-fenced [333017] (call_remote_stonith:1544) info: Total timeout set to 60 for peer's fencing targeting node2 for pacemaker-controld.333021|id=0db9abc2
Mar 19 17:53:44 fastvm-rhel-8-0-23 pacemaker-fenced [333017] (call_remote_stonith:1586) notice: Requesting that node1 perform 'reboot' action targeting node2 | for client pacemaker-controld.333021 (72s, 0s)
Mar 19 17:53:44 fastvm-rhel-8-0-23 pacemaker-fenced [333017] (schedule_stonith_command:506) debug: Scheduling 'reboot' action targeting node2 using xvm2 for remote peer node1 with op id 0db9abc2 and timeout 60s
Mar 19 17:53:44 fastvm-rhel-8-0-23 pacemaker-fenced [333017] (fork_cb:321) debug: Operation 'reboot' [333087] targeting node2 using xvm2 now running with 60s timeout

# Reboot succeeds
Mar 19 17:53:46 fastvm-rhel-8-0-23 pacemaker-fenced [333017] (log_operation:2076) notice: Operation 'reboot' [333087] (call 6 from pacemaker-controld.333021) targeting node2 using xvm2 returned 0 (OK)
Mar 19 17:53:46 fastvm-rhel-8-0-23 pacemaker-fenced [333017] (stonith_send_async_reply:2128) trace: Directed reply to node1
Mar 19 17:53:46 fastvm-rhel-8-0-23 pacemaker-fenced [333017] (stonith_device_execute:373) trace: No actions using xvm2 are needed

# Token loss occurs (corosync.log)
Mar 19 17:54:00 [1729] fastvm-rhel-8-0-23 corosync info [KNET ] link: host: 2 link: 0 is down
Mar 19 17:54:00 [1729] fastvm-rhel-8-0-23 corosync info [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Mar 19 17:54:00 [1729] fastvm-rhel-8-0-23 corosync warning [KNET ] host: host: 2 has no active links
Mar 19 17:54:14 [1729] fastvm-rhel-8-0-23 corosync notice [TOTEM ] Token has not been received in 30000 ms
Mar 19 17:54:24 [1729] fastvm-rhel-8-0-23 corosync notice [TOTEM ] A processor failed, forming new configuration.

# Action times out 72 seconds after it began
Mar 19 17:54:56 fastvm-rhel-8-0-23 pacemaker-fenced [333017] (remote_op_query_timeout:647) debug: Operation 0db9abc2 targeting node2 already in progress
Mar 19 17:54:56 fastvm-rhel-8-0-23 pacemaker-fenced [333017] (remote_op_timeout:618) debug: Action 'reboot' targeting node2 for client pacemaker-controld.333021 timed out | id=0db9abc2
Mar 19 17:54:56 fastvm-rhel-8-0-23 pacemaker-fenced [333017] (stonith_bcast_result_to_peers:383) trace: Broadcasting result to peers

# node2 leaves the CPG membership and the action finishes failing

# corosync.log
Mar 19 17:55:12 [1729] fastvm-rhel-8-0-23 corosync notice [TOTEM ] A new membership (1.115e2) was formed. Members left: 2
Mar 19 17:55:12 [1729] fastvm-rhel-8-0-23 corosync notice [TOTEM ] Failed to receive the leave message. failed: 2
Mar 19 17:55:12 [1729] fastvm-rhel-8-0-23 corosync warning [CPG ] downlist left_list: 1 received
Mar 19 17:55:12 [1729] fastvm-rhel-8-0-23 corosync notice [QUORUM] Members[1]: 1
Mar 19 17:55:12 [1729] fastvm-rhel-8-0-23 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
# pacemaker.log
Mar 19 17:55:12 fastvm-rhel-8-0-23 pacemaker-attrd [333019] (pcmk_cpg_membership:673) info: Group attrd event 2: node2 (node 2 pid 1690) left via cluster exit
Mar 19 17:55:12 fastvm-rhel-8-0-23 pacemaker-fenced [333017] (remote_op_done:549) error: Operation 'reboot' targeting node2 by node1 for pacemaker-controld.333021@node1: Timer expired | id=0db9abc2
Mar 19 17:55:12 fastvm-rhel-8-0-23 pacemaker-fenced [333017] (remote_op_done:503) error: Already sent notifications for 'reboot' targeting node2 by node1 for client pacemaker-controld.333021@node1: OK | rc=0 state=completed id=0db9abc2
Mar 19 17:55:12 fastvm-rhel-8-0-23 pacemaker-controld [333021] (tengine_stonith_notify:511) notice: Peer node2 was not terminated (reboot) by node1 on behalf of pacemaker-controld.333021: Timer expired | initiator=node1 ref=0db9abc2-2b07-4048-9746-4bceee5528e3

# node2 gets fenced again
Mar 19 17:55:13 fastvm-rhel-8-0-23 pacemaker-schedulerd[333020] (pe_fence_node:162) warning: Cluster node node2 will be fenced: peer is no longer part of the cluster
Mar 19 17:55:13 fastvm-rhel-8-0-23 pacemaker-schedulerd[333020] (unpack_rsc_op_failure:3027) warning: Unexpected result (error) was recorded for monitor of dummy1 on node2 at Mar 19 17:53:44 2021 | rc=1 id=dummy1_last_failure_0
Mar 19 17:55:13 fastvm-rhel-8-0-23 pacemaker-schedulerd[333020] (stage6:1652) warning: Scheduling Node node2 for STONITH

-----

Version-Release number of selected component (if applicable):

Using upstream master. The RHEL release would be pacemaker-2.0.4-6.el8_3.1.

-----

How reproducible:

Always

-----

Steps to Reproduce:
1. Set the Corosync token timeout to 40000.
   - This is a nice round number so testing should be consistent, but I was able to reproduce the issue with the timeout at least as low as 34000.
2. Disable cluster services from starting on boot on the node to be fenced (`pcs cluster disable`).
3. Cause a resource operation with on-fail=fence to fail on one node.
   - This was the first method that came to mind without relying on stonith_admin, which can sometimes behave differently compared to automatic fencing.

(A command sketch for these steps follows the Additional info section below.)

-----

Actual results:

The stonith action times out despite the node getting fenced. The node gets rebooted twice, as Pacemaker tries to fence it a second time.

-----

Expected results:

The stonith action succeeds the first time, provided that the fenced node gets rebooted successfully.

-----

Additional info:

This is also reproducible on RHEL 7 (which is where it was observed in a customer environment), but it doesn't seem worthwhile to backport a fix unless it's fairly trivial to do so. This is a bit of a corner case.
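For completeness, a hedged command sketch of steps 1 and 2 from Steps to Reproduce, assuming the node names from the logs above and standard pcs tooling; step 3 is left to whatever resource failure is convenient in the environment:

```sh
# 1. Set the corosync token timeout: edit the totem section of
#    /etc/corosync/corosync.conf (token: 40000), then push and reload it
pcs cluster sync
pcs cluster reload corosync
corosync-cmapctl | grep 'totem\.token '   # verify the runtime value

# 2. Keep cluster services from starting on boot on the node to be fenced
pcs cluster disable node2

# 3. Cause a resource operation configured with on-fail=fence to fail on a
#    node (e.g., break whatever the dummy1 monitor checks) and watch the
#    fencer logs
```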