This bug has been migrated to another issue tracking site. It has been closed here and may no longer be monitored.

If you would like to get updates for this issue, or to participate in it, you may do so at the Red Hat Issue Tracker.

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets there. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September on pre-agreed dates. Bugs against the components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED".

If you cannot log in to RH Jira, please consult article #7032570. Failing that, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat.

Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of the form "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). The same link is shown in a blue banner at the top of the page informing you that the bug has been migrated.
Bug 1941108 - Stonith action times out if fenced node is still in CPG membership when the stonith action timeout expires
Summary: Stonith action times out if fenced node is still in CPG membership when the stonith action timeout expires
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pacemaker
Version: 8.3
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: 8.5
Assignee: Ken Gaillot
QA Contact: cluster-qe
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-20 01:16 UTC by Reid Wahl
Modified: 2023-09-22 19:01 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-09-22 19:01:24 UTC
Type: Bug
Target Upstream Version:
Embargoed:




Links
Red Hat Issue Tracker: RHEL-7625 (Migrated) - last updated 2023-09-22 19:00:53 UTC
Red Hat Knowledge Base (Solution): 5895361 - last updated 2021-03-20 22:35:49 UTC

Description Reid Wahl 2021-03-20 01:16:02 UTC
Description of problem:

If a fenced node is still in the corosync CPG membership when the stonith action's timeout expires, the node gets rebooted successfully but the stonith action fails due to a timeout. This results in a fenced node getting rebooted twice in a row.

The stonith action timeout seems to default to 72s == TIMEOUT_MULTIPLY_FACTOR * peer (or device?) timeout. So in practice, this issue tends to happen if:

    ((time for stonith op to complete) + token + consensus) > 72s
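
To put numbers on that (assuming I have the defaults right: TIMEOUT_MULTIPLY_FACTOR is 1.2 in fenced_remote.c, and corosync's consensus defaults to 1.2 * token):

    action timeout    = 1.2 * 60s (total peer timeout) = 72s
    token + consensus = 40s + (1.2 * 40s)              = 88s  (> 72s)
    (token=34000:       34s + 40.8s                    = 74.8s, still > 72s)

That 88s also matches the timeline in the logs below: the action is requested at 17:53:44 and node2 doesn't leave the membership until 17:55:12.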


A simple way to reproduce is:
  1. set the token timeout to 40000 (a nice round number)
  2. disable cluster services from starting on boot on the node to be fenced
  3. cause a resource operation with on-fail=fence to fail on one node


With debug enabled for pacemaker-fenced and trace enabled for fenced_remote.c and fenced_commands.c, you should get logs like the following when node 2 gets fenced. Note that the operation returns "OK" but the stonith action still times out. (I could be misusing operation/action terms but you get the idea.)

[root@fastvm-rhel-8-0-23 pacemaker]# date; corosync-cmapctl | grep 'totem\.token '; pcs resource config | grep dummy1-monitor-interval; pcs stonith config | grep -A2 xvm2
Fri Mar 19 17:53:15 PDT 2021
runtime.config.totem.token (u32) = 40000
totem.token (u32) = 40000
              monitor interval=10s on-fail=fence timeout=20s (dummy1-monitor-interval-10s)
 Resource: xvm2 (class=stonith type=fence_xvm)
  Attributes: pcmk_host_map=node2:fastvm-rhel-8.0-24
  Operations: monitor interval=60s (xvm2-monitor-interval-60s)


# # Request and initiate stonith action against node2
Mar 19 17:53:44 fastvm-rhel-8-0-23 pacemaker-schedulerd[333020] (unpack_rsc_op_failure:3027)   warning: Unexpected result (error) was recorded for monitor of dummy1 on node2 at Mar 19 17:53:44 2021 | rc=1 id=dummy1_last_failure_0
Mar 19 17:53:44 fastvm-rhel-8-0-23 pacemaker-schedulerd[333020] (pe_fence_node:162)    warning: Cluster node node2 will be fenced: dummy1 failed there
Mar 19 17:53:44 fastvm-rhel-8-0-23 pacemaker-schedulerd[333020] (stage6:1652)     warning: Scheduling Node node2 for STONITH
Mar 19 17:53:44 fastvm-rhel-8-0-23 pacemaker-controld  [333021] (te_fence_node:859)  notice: Requesting fencing (reboot) of node node2 | action=6 timeout=60000
Mar 19 17:53:44 fastvm-rhel-8-0-23 pacemaker-fenced    [333017] (can_fence_host_with_device:1738)     notice: xvm2 is eligible to fence (reboot) node2 (aka. 'fastvm-rhel-8.0-24'): static-list
Mar 19 17:53:44 fastvm-rhel-8-0-23 pacemaker-fenced    [333017] (call_remote_stonith:1544)  info: Total timeout set to 60 for peer's fencing targeting node2 for pacemaker-controld.333021|id=0db9abc2
Mar 19 17:53:44 fastvm-rhel-8-0-23 pacemaker-fenced    [333017] (call_remote_stonith:1586)  notice: Requesting that node1 perform 'reboot' action targeting node2 | for client pacemaker-controld.333021 (72s, 0s)
Mar 19 17:53:44 fastvm-rhel-8-0-23 pacemaker-fenced    [333017] (schedule_stonith_command:506)    debug: Scheduling 'reboot' action targeting node2 using xvm2 for remote peer node1 with op id 0db9abc2 and timeout 60s
Mar 19 17:53:44 fastvm-rhel-8-0-23 pacemaker-fenced    [333017] (fork_cb:321)     debug: Operation 'reboot' [333087] targeting node2 using xvm2 now running with 60s timeout

# # Reboot succeeds
Mar 19 17:53:46 fastvm-rhel-8-0-23 pacemaker-fenced    [333017] (log_operation:2076)  notice: Operation 'reboot' [333087] (call 6 from pacemaker-controld.333021) targeting node2 using xvm2 returned 0 (OK)
Mar 19 17:53:46 fastvm-rhel-8-0-23 pacemaker-fenced    [333017] (stonith_send_async_reply:2128)   trace: Directed reply to node1
Mar 19 17:53:46 fastvm-rhel-8-0-23 pacemaker-fenced    [333017] (stonith_device_execute:373)  trace: No actions using xvm2 are needed

# # Token loss occurs (corosync.log)
Mar 19 17:54:00 [1729] fastvm-rhel-8-0-23 corosync info    [KNET  ] link: host: 2 link: 0 is down
Mar 19 17:54:00 [1729] fastvm-rhel-8-0-23 corosync info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Mar 19 17:54:00 [1729] fastvm-rhel-8-0-23 corosync warning [KNET  ] host: host: 2 has no active links
Mar 19 17:54:14 [1729] fastvm-rhel-8-0-23 corosync notice  [TOTEM ] Token has not been received in 30000 ms
Mar 19 17:54:24 [1729] fastvm-rhel-8-0-23 corosync notice  [TOTEM ] A processor failed, forming new configuration.

# # Action times out 72 seconds after it began
Mar 19 17:54:56 fastvm-rhel-8-0-23 pacemaker-fenced    [333017] (remote_op_query_timeout:647)   debug: Operation 0db9abc2 targeting node2 already in progress
Mar 19 17:54:56 fastvm-rhel-8-0-23 pacemaker-fenced    [333017] (remote_op_timeout:618)     debug: Action 'reboot' targeting node2 for client pacemaker-controld.333021 timed out | id=0db9abc2
Mar 19 17:54:56 fastvm-rhel-8-0-23 pacemaker-fenced    [333017] (stonith_bcast_result_to_peers:383)     trace: Broadcasting result to peers

# # node2 leaves the CPG membership and the action finishes failing
  # # corosync.log
Mar 19 17:55:12 [1729] fastvm-rhel-8-0-23 corosync notice  [TOTEM ] A new membership (1.115e2) was formed. Members left: 2
Mar 19 17:55:12 [1729] fastvm-rhel-8-0-23 corosync notice  [TOTEM ] Failed to receive the leave message. failed: 2
Mar 19 17:55:12 [1729] fastvm-rhel-8-0-23 corosync warning [CPG   ] downlist left_list: 1 received
Mar 19 17:55:12 [1729] fastvm-rhel-8-0-23 corosync notice  [QUORUM] Members[1]: 1
Mar 19 17:55:12 [1729] fastvm-rhel-8-0-23 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.

  # # pacemaker.log
Mar 19 17:55:12 fastvm-rhel-8-0-23 pacemaker-attrd     [333019] (pcmk_cpg_membership:673)     info: Group attrd event 2: node2 (node 2 pid 1690) left via cluster exit
Mar 19 17:55:12 fastvm-rhel-8-0-23 pacemaker-fenced    [333017] (remote_op_done:549)    error: Operation 'reboot' targeting node2 by node1 for pacemaker-controld.333021@node1: Timer expired | id=0db9abc2
Mar 19 17:55:12 fastvm-rhel-8-0-23 pacemaker-fenced    [333017] (remote_op_done:503)    error: Already sent notifications for 'reboot' targeting node2 by node1 for client pacemaker-controld.333021@node1: OK | rc=0 state=completed id=0db9abc2
Mar 19 17:55:12 fastvm-rhel-8-0-23 pacemaker-controld  [333021] (tengine_stonith_notify:511)     notice: Peer node2 was not terminated (reboot) by node1 on behalf of pacemaker-controld.333021: Timer expired | initiator=node1 ref=0db9abc2-2b07-4048-9746-4bceee5528e3

# # node2 gets fenced again
Mar 19 17:55:13 fastvm-rhel-8-0-23 pacemaker-schedulerd[333020] (pe_fence_node:162)    warning: Cluster node node2 will be fenced: peer is no longer part of the cluster
Mar 19 17:55:13 fastvm-rhel-8-0-23 pacemaker-schedulerd[333020] (unpack_rsc_op_failure:3027)   warning: Unexpected result (error) was recorded for monitor of dummy1 on node2 at Mar 19 17:53:44 2021 | rc=1 id=dummy1_last_failure_0
Mar 19 17:55:13 fastvm-rhel-8-0-23 pacemaker-schedulerd[333020] (stage6:1652)     warning: Scheduling Node node2 for STONITH

-----

Version-Release number of selected component (if applicable):

Using upstream master

RHEL release would be pacemaker-2.0.4-6.el8_3.1.

-----

How reproducible:

Always

-----

Steps to Reproduce:
  1. Set the Corosync token timeout to 40000.
    - This is a nice round number and keeps testing consistent, but I was able to reproduce the issue with values at least as low as 34000.
  2. Disable cluster services from starting on boot on the node to be fenced (`pcs cluster disable`).
  3. Cause a resource operation with on-fail=fence to fail on one node.
    - This was the first method that came to mind that doesn't rely on stonith_admin, which can sometimes behave differently from automatic fencing. (A rough command sketch follows below.)
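
For reference, here's a rough command-level sketch of the above (names and values are illustrative, and any equivalent way of injecting the failure works):

  # 1. Raise the corosync token timeout to 40000 ms on every node
  #    (edit /etc/corosync/corosync.conf: totem { token: 40000 }), then push and restart:
  pcs cluster sync
  pcs cluster stop --all && pcs cluster start --all

  # 2. Keep cluster services from starting at boot on the node to be fenced
  pcs cluster disable node2

  # 3. Create a throwaway resource whose monitor failure triggers fencing,
  #    then inject a failure while it's running on node2
  #    (or break the resource however you prefer; the point is a failure with on-fail=fence)
  pcs resource create dummy1 ocf:pacemaker:Dummy op monitor interval=10s on-fail=fence timeout=20s
  crm_resource --fail --resource dummy1 --node node2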

-----

Actual results:

The stonith action times out despite the node getting fenced. The node gets rebooted twice, as pacemaker tries to fence it a second time.

-----

Expected results:

The stonith action succeeds the first time, provided that the fenced node gets rebooted successfully.

-----

Additional info:

This is also reproducible on RHEL 7 (which is where it was observed in a customer environment), but it doesn't seem worthwhile to backport a fix unless it's fairly trivial to do so. This is a bit of a corner case.

Comment 1 Reid Wahl 2021-03-20 20:47:13 UTC
Two general responses to this come to mind:

The first, and obviously simplest, is to change nothing, document the issue in the KB, and recommend that users set the pcmk_reboot_timeout to a value more than (token) + (consensus) + (how long they expect the reboot to take). The only downside I see for this -- besides that the issue would ideally be handled automatically -- is that most users and support engineers currently believe pcmk_reboot_timeout only needs to be set to a value higher than "how long they expect the reboot to take". It's never an exact science anyway and usually gets increased in increments of 60 to 240 seconds at a time, so IMO this wouldn't be an undue hardship.
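
For completeness, on this test cluster that first option would amount to something like the following (device name from the config above; the value is illustrative, chosen as token 40s + consensus ~48s plus a generous reboot allowance):

  pcs stonith update xvm2 pcmk_reboot_timeout=180s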

The second is to avoid initiating a second fence attempt but ensure the reply is received before declaring the action complete. Something like "if we have an operation result locally (i.e., the async command completed) within the stonith action timeout, then keep waiting for the async reply to be delivered but don't mark the action as timed out." There may or may not be a safe way to implement that.


I read a little bit more about CPG and virtual synchrony last night, and it sounds like for better or for worse we're stuck waiting for that reply and won't get it until the fenced node leaves the membership.

Comment 2 Reid Wahl 2021-03-20 22:10:11 UTC
(In reply to Reid Wahl from comment #1)
> The second is to avoid initiating a second fence attempt but ensure the
> reply is received before declaring the action complete. Something like "if
> we have an operation result locally (i.e., the async command completed)
> within the stonith action timeout, then keep waiting for the async reply to
> be delivered but don't mark the action as timed out." There may or may not
> be a safe way to implement that.

D'oh. The origin may not be the same node that's executing fencing, if the action was delegated. That renders #2 infeasible as described.

Two-node test cluster tunnel vision :)

Comment 3 Reid Wahl 2021-03-21 03:50:33 UTC
One other idea occurred to me: add the token timeout and consensus timeout to the stonith action timeout (#if SUPPORT_COROSYNC). Maybe we could do this conditionally, only if the node-to-be-fenced is still part of the CPG membership when fencing is initiated. There's no reason to add the extra timeout if the node has already left the membership and is getting fenced due to token loss.

This way, only the timeout would change. We wouldn't be adding complexity or introducing bugs elsewhere.
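
If we went this route, the numbers the fencer would need are already exposed at runtime, e.g.:

  corosync-cmapctl -g runtime.config.totem.token
  corosync-cmapctl -g runtime.config.totem.consensus

Inside the fencer that would presumably mean a cmap (or CFG) API connection rather than shelling out, but the values are there without any new configuration.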

Comment 4 Ken Gaillot 2021-04-19 22:25:11 UTC
If I follow, the issue is that the originator of fencing must wait for a CPG reply from the executor of fencing (even if it's the same node), and the CPG reply can't arrive until corosync detects the fenced node as having left the CPG. That means the token timeout eats into the fence action timeout, even if the fence action actually completed early in that period. If the fence action timeout is lower than the token timeout, it will always time out.

My first instinct is that the originator should wait the maximum of the fence action timeout and the corosync CPG timeout before declaring the operation timed out. The fencer would need to know the CPG timeout, possibly requiring a CFG API connection, and I'm not completely sure that's a good idea. We could maybe use the cluster-delay property instead, though we'd have to recommend that users set that above the token timeout (which may already be a practical requirement, I'm not sure).

Aside: there are no specific definitions of task, action, and operation in Pacemaker, and they are often used interchangeably. I lean toward using "action" for the generic type of action (start, reboot, etc.) and "operation" for a specific instance of that action, but it's inconsistent currently.

Comment 5 Reid Wahl 2021-05-04 08:00:23 UTC
(In reply to Ken Gaillot from comment #4)
> If I follow, the issue is that the originator of fencing must wait for a CPG
> reply from the executor of fencing (even if it's the same node), and the CPG
> reply can't arrive until corosync detects the fenced node as having left the
> CPG. That means the token timeout eats into the fence action timeout, even
> if the fence action actually completed early in that period. If the fence
> action timeout is lower than the token timeout, it will always time out.

That sounds accurate. For fencing due to token loss, this is not an issue as long as the failed node doesn't rejoin the CPG membership before fencing completes. It can occur when a node is fenced for other reasons.


> My first instinct is that the originator should wait the maximum of the
> fence action timeout and the corosync CPG timeout before declaring the
> operation timed out. The fencer would need to know the CPG timeout, possibly
> requiring a CFG API connection, and I'm not completely sure that's a good
> idea. We could maybe use the cluster-delay property instead, though we'd
> have to recommend that users set that above the token timeout

Using cluster-delay is worth testing. I can't remember off the top of my head whether the timeout here includes the cluster-delay or not.

Setting pcmk_<action>_timeout is already a valid workaround, although it requires accounting for both the token timeout and the time it takes for the stonith action to complete.


> (which may
> already be a practical requirement, I'm not sure).

Not that I know of. We have customers setting the token timeout to values as high as 300000 (the max supported value), and I can't recall ever seeing a customer override the default cluster-delay. It hasn't caused any issues that we've identified yet (besides potentially this one).

Comment 6 Ken Gaillot 2021-05-13 14:59:13 UTC
(In reply to Reid Wahl from comment #5)
> (In reply to Ken Gaillot from comment #4)
> > If I follow, the issue is that the originator of fencing must wait for a CPG
> > reply from the executor of fencing (even if it's the same node), and the CPG
> > reply can't arrive until corosync detects the fenced node as having left the
> > CPG. That means the token timeout eats into the fence action timeout, even
> > if the fence action actually completed early in that period. If the fence
> > action timeout is lower than the token timeout, it will always time out.
> 
> That sounds accurate. For fencing due to token loss, this is not an issue as
> long as the failed node doesn't rejoin the CPG membership before fencing
> completes. It can occur when a node is fenced for other reasons.

That makes sense. It sounds like the possibility is fairly small, with the most likely scenario being a node gradually becoming unusable (due to load or network issues, etc.), something like a stop failure being the first fenceable symptom, and the node dropping out completely just after the fence action is initiated.

> > My first instinct is that the originator should wait the maximum of the
> > fence action timeout and the corosync CPG timeout before declaring the
> > operation timed out. The fencer would need to know the CPG timeout, possibly
> > requiring a CFG API connection, and I'm not completely sure that's a good
> > idea. We could maybe use the cluster-delay property instead, though we'd
> > have to recommend that users set that above the token timeout
> 
> Using cluster-delay is worth testing. I can't remember off the top of my
> head whether the timeout here includes the cluster-delay or not.
> 
> Setting pcmk_<action>_timeout is already a valid workaround, although it
> requires accounting for both the token timeout and the time it takes for the
> stonith action to complete.
> 
> 
> > (which may
> > already be a practical requirement, I'm not sure).
> 
> Not that I know of. We have customers setting the token timeout to values as
> high as 300000 (the max supported value), and I can't recall ever seeing a

:-o

> customer override the default cluster-delay. It hasn't caused any issues
> that we've identified yet (besides potentially this one).

In that case it's probably best not to overload cluster-delay.

A solution that occurs to me would be to track the token timeout (via a CFG connection) in the fencer, and add it to the fence action timeout when the target is an active cluster member.

Given the rarity, the limited impact, and the availability of a workaround, I'm setting this low priority. For now, we can document the issue.

Comment 10 RHEL Program Management 2023-09-22 18:58:56 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 11 RHEL Program Management 2023-09-22 19:01:24 UTC
This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated. Be sure to add yourself to the Jira issue's "Watchers" field to continue receiving updates, and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues@redhat.com. You can also visit https://access.redhat.com/articles/7032570 for general account information.

