This bug has been migrated to another issue tracking site. It has been closed here and may no longer be monitored.

If you would like to get updates for this issue, or to participate in it, you may do so at the Red Hat Issue Tracker.
Bug 1780515 - fence_aws: fenced node always performs a graceful shutdown
Summary: fence_aws: fenced node always performs a graceful shutdown
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: fence-agents
Version: 8.1
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 8.1
Assignee: Oyvind Albrigtsen
QA Contact: Brandon Perkins
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-12-06 09:09 UTC by Reid Wahl
Modified: 2024-03-25 15:33 UTC (History)
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-09-22 18:36:34 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker   RHEL-7601 0 None Migrated None 2023-09-22 18:36:09 UTC
Red Hat Knowledge Base (Solution) 4642491 0 None None None 2019-12-08 08:51:37 UTC
Red Hat Knowledge Base (Solution) 4644971 0 None None None 2019-12-09 19:38:34 UTC

Description Reid Wahl 2019-12-06 09:09:42 UTC
Description of problem:

Neither a delay attribute nor a pcmk_delay_max attribute properly prevents a fence race in an AWS Pacemaker cluster.

This may be because the EC2 StopInstances endpoint initiates a graceful shutdown rather than a hard reboot. I have yet to find a way to perform a hard power-off of an EC2 instance.

https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_StopInstances.html

-----

Version-Release number of selected component (if applicable):

fence-agents-aws-4.2.1-30.el8_1.1.noarch
pacemaker-2.0.2-3.el8.x86_64

-----

How reproducible:

Always so far. The issue may be avoidable with a long enough delay, but that's not practical in production.

-----

Steps to Reproduce:
1. In a two-node cluster, configure one fence_aws stonith device for each node. The issue can also be reproduced with a shared stonith device and pcmk_delay_max, but configuring a static delay on a single per-node device gives consistent results. (An example configuration follows these steps.)

2. Set delay=60 on one of the stonith devices.

3. Simulate a heartbeat network failure between nodes.
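
For reference, a minimal sketch of such a configuration, with hypothetical device names and instance IDs (fence_aws credentials are assumed to come from an IAM instance role or from access_key/secret_key options, which are omitted here):

~~~
# One fence_aws device per node; node1's device carries the static delay.
pcs stonith create node1_fence fence_aws region=us-east-1 \
    pcmk_host_map="node1:i-0aaaaaaaaaaaaaaaa" pcmk_host_list=node1 delay=60
pcs stonith create node2_fence fence_aws region=us-east-1 \
    pcmk_host_map="node2:i-0bbbbbbbbbbbbbbbb" pcmk_host_list=node2
~~~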

-----

Actual results:

Only the node without the delay gets fenced.

-----

Expected results:

Both nodes get fenced. The node with the delay does not get rebooted until after its delay expires.

-----

Additional info:

This issue was reported by a user working for AWS. They've stated, "I'm more than happy to work side by side with Red Hat development to test and help to provide code for an improved agent version."

Comment 1 Reid Wahl 2019-12-06 09:11:31 UTC
(In reply to Reid Wahl from comment #0)

Switch the Actual results with the Expected results.

Comment 2 Reid Wahl 2019-12-06 09:33:19 UTC
Instance start/stop behavior explanation:
  - Stop and Start Your Instance (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Stop_Start.html)

I have tried setting HandlePowerKey=ignore as described in our Product Docs (see below), but this did not help.
  - 5.11.2. Disabling ACPI Soft-Off in the logind.conf file (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/high_availability_add-on_reference/index#s2-acpi-disable-logind-CA)
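
For completeness, the change I made was along these lines (a sketch; it assumes the stock commented-out HandlePowerKey line is present in logind.conf):

~~~
# Tell logind to ignore the ACPI power-key event, per the procedure above.
sed -i 's/^#\?HandlePowerKey=.*/HandlePowerKey=ignore/' /etc/systemd/logind.conf
systemctl restart systemd-logind
~~~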


I've asked the customer about whether there's a way to initiate a hard power-off.

Comment 3 Reid Wahl 2019-12-06 09:43:12 UTC
Weird thing, now that I look at the node without the delay -- from the OS perspective, it seems to have rebooted about two seconds after token loss. It still shows "stopping" in the AWS console for a while after this.

Node 2:
Dec  6 09:23:33 ip-10-0-0-19 corosync[1180]:  [KNET  ] link: host: 1 link: 0 is down
Dec  6 09:23:33 ip-10-0-0-19 corosync[1180]:  [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec  6 09:23:33 ip-10-0-0-19 corosync[1180]:  [KNET  ] host: host: 1 has no active links
Dec  6 09:23:33 ip-10-0-0-19 corosync[1180]:  [TOTEM ] Token has not been received in 750 ms
Dec  6 09:23:34 ip-10-0-0-19 corosync[1180]:  [TOTEM ] A processor failed, forming new configuration.
Dec  6 09:23:35 ip-10-0-0-19 corosync[1180]:  [TOTEM ] A new membership (2:184) was formed. Members left: 1
Dec  6 09:23:35 ip-10-0-0-19 corosync[1180]:  [TOTEM ] Failed to receive the leave message. failed: 1
Dec  6 09:23:35 ip-10-0-0-19 corosync[1180]:  [CPG   ] downlist left_list: 1 received
Dec  6 09:23:35 ip-10-0-0-19 corosync[1180]:  [QUORUM] Members[1]: 2
Dec  6 09:23:35 ip-10-0-0-19 corosync[1180]:  [MAIN  ] Completed service synchronization, ready to provide service.
...
Dec  6 09:23:35 ip-10-0-0-19 pacemaker-controld[1315]: notice: Requesting fencing (reboot) of node node1
Dec  6 09:23:35 ip-10-0-0-19 pacemaker-controld[1315]: notice: Initiating start operation node1_fence_start_0 locally on node2
Dec  6 09:23:35 ip-10-0-0-19 pacemaker-fenced[1311]: notice: Client pacemaker-controld.1315.e6e254a6 wants to fence (reboot) 'node1' with device '(any)'
Dec  6 09:23:35 ip-10-0-0-19 pacemaker-fenced[1311]: notice: Requesting peer fencing (reboot) of node1
Dec  6 09:23:35 ip-10-0-0-19 pacemaker-fenced[1311]: notice: node1_fence is eligible to fence (reboot) node1 (aka. 'i-0c5a1d7b5fc91764d'): static-list
Dec  6 09:23:35 ip-10-0-0-19 pacemaker-fenced[1311]: notice: node2_fence is not eligible to fence (reboot) node1: static-list
Dec  6 09:23:35 ip-10-0-0-19 pacemaker-fenced[1311]: notice: node1_fence is eligible to fence (reboot) node1 (aka. 'i-0c5a1d7b5fc91764d'): static-list
Dec  6 09:23:35 ip-10-0-0-19 pacemaker-fenced[1311]: notice: node2_fence is not eligible to fence (reboot) node1: static-list
...
Dec  6 09:23:37 ip-10-0-0-19 systemd[1]: Stopped Dynamic System Tuning Daemon.
Dec  6 09:26:12 ip-10-0-0-19 kernel: Linux version 4.18.0-147.0.3.el8_1.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 8.3.1 20190507 (Red Hat 8.3.1-4) (GCC)) #1 SMP Mon Nov 11 12:58:36 UTC 2019


Node 1 was gracefully rebooted around 09:24:38. Maybe the shutdown on node 2 hung because Pacemaker was waiting to fence node 1, and this is just where the logs stopped.


Also, just realized I'm on RHEL 8 and the customer is on RHEL 7. I'll need to test on RHEL 7 and see if the bug presents itself in the same way.

Comment 4 Reid Wahl 2019-12-09 10:55:29 UTC
This is apparently caused by the slow shutdown, as suspected. I printed the output of each get_power_status() call and tuned the delay timer of node 1's stonith device. The delay had to be longer than the time it takes node 2 to go from "stopping" to "stopped". Even though node 2 received an ACPI shutdown signal and killed its sshd process almost immediately, Pacemaker is apparently still in full swing (and able to fence node 1) until node 2 reaches the "stopped" power state.

Node 2 typically takes more than two minutes to stop completely. A delay of 180 on node 1's stonith device prevented it from being powered off by node 2.
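
(That tweak was just a one-liner along these lines, with a hypothetical device name:)

~~~
# Make node 1's fencing delay longer than node 2's ~2-minute stop time.
pcs stonith update node1_fence delay=180
~~~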

The customer said they're not aware of any way to trigger a true hard power-off. I've requested that they reach out to their relevant dev team to determine the feasibility of adding such a feature, and they're inquiring about any existing alternatives that may help. Meanwhile, if either we or AWS can figure out a way to get the node to stop faster, that may take care of the issue. The customer (and the AWS documentation) say that the instance stop is implemented by an ACPI shutdown signal to the OS. I've tried both of these, with no improvement:
  - https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/high_availability_add-on_reference/index#s2-acpi-disable-logind-CA
  - https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/high_availability_add-on_reference/index#s2-acpi-disable-boot-CA


The customer also put forth the following:
~~~
I had some internal discussions with some very savvy HA AWS consultants and the idea is to have a timer started when the instance switches to "Stopping", and if after this timer expires regardless of the instance moving to "Stopped" the fencing agent would return OK and let the cluster move on with the failover. The main reasons are:
1) After the instance transition to "Stopping" it won't ever be back into "Running" state
2) It is guaranteed that the instance receives an ACPI signal and the OS will start the shutdown sequence (as long as it doesn't ignore the ACPI signal), otherwise it will take much longer for the instance to actually stop
~~~

I replied with why that's not a viable option IMO (full response in the case), but it basically comes down to this: we know for sure that the "stopping" state doesn't guarantee a node has relinquished its access to shared resources, because node 2 in "stopping" state is still "alive" enough to fence node 1 after the delay.

Comment 5 Reid Wahl 2019-12-09 11:13:06 UTC
(And of course it doesn't do any good to declare node 2 as fenced, if it turns around and fences node 1.)

Comment 6 Reid Wahl 2019-12-09 19:39:43 UTC
Linked case 02535181, which concerns fencing timing out on bare-metal instances; the customer reports those can take up to 10 minutes to stop. For now I'm tracking that issue via this same BZ, since a true hard power-off should be able to resolve both issues.

Comment 9 Reid Wahl 2019-12-12 00:57:48 UTC
Did a little more investigation. Pacemaker is what's hanging the graceful shutdown process on node 2.

I configured the rsyslog.service unit with `Before=pacemaker.service` so that messages would keep being written to the logs during the shutdown process until pacemaker had stopped. We already knew that pacemaker wasn't stopping completely for a while, since it was still able to fence node 1 after a delay of more than 60 seconds.
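
(The ordering tweak itself was just a systemd drop-in along these lines; the drop-in file name is arbitrary:)

~~~
# rsyslog starts before pacemaker, so at shutdown it stops after pacemaker,
# keeping the log pipeline alive while pacemaker shuts down.
mkdir -p /etc/systemd/system/rsyslog.service.d
cat > /etc/systemd/system/rsyslog.service.d/pacemaker-order.conf <<'EOF'
[Unit]
Before=pacemaker.service
EOF
systemctl daemon-reload
~~~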

Here are the logs that were gathered on node 2 during shutdown:

~~~
Dec 11 22:14:44 ip-10-0-0-19 corosync[1191]:  [TOTEM ] A processor failed, forming new configuration.
Dec 11 22:14:45 ip-10-0-0-19 corosync[1191]:  [TOTEM ] A new membership (2:392) was formed. Members left: 1
Dec 11 22:14:45 ip-10-0-0-19 corosync[1191]:  [TOTEM ] Failed to receive the leave message. failed: 1
Dec 11 22:14:45 ip-10-0-0-19 corosync[1191]:  [CPG   ] downlist left_list: 1 received
Dec 11 22:14:45 ip-10-0-0-19 corosync[1191]:  [QUORUM] Members[1]: 2
Dec 11 22:14:45 ip-10-0-0-19 corosync[1191]:  [MAIN  ] Completed service synchronization, ready to provide service.
Dec 11 22:14:45 ip-10-0-0-19 pacemaker-based[1321]: notice: Node node1 state is now lost

Dec 11 22:14:45 ip-10-0-0-19 pacemaker-controld[1326]: notice: Requesting fencing (reboot) of node node1
[There is a delay configured for node1, so this doesn't do anything yet.]

[I didn't know that a stonith device would migrate before fencing had completed. That may be unexpected.]
Dec 11 22:14:45 ip-10-0-0-19 pacemaker-controld[1326]: notice: Initiating start operation node1_fence_start_0 locally on node2
Dec 11 22:14:46 ip-10-0-0-19 pacemaker-controld[1326]: notice: Result of start operation for node1_fence on node2: 0 (ok)
Dec 11 22:14:46 ip-10-0-0-19 pacemaker-controld[1326]: notice: Initiating monitor operation node1_fence_monitor_60000 locally on node2

[Node2 begins gracefully shutting down.]
Dec 11 22:14:47 ip-10-0-0-19 systemd[1]: Stopped target Host and Network Name Lookups.
...
Dec 11 22:14:47 ip-10-0-0-19 pacemakerd[1318]: notice: Caught 'Terminated' signal
Dec 11 22:14:47 ip-10-0-0-19 pacemakerd[1318]: notice: Shutting down Pacemaker
Dec 11 22:14:47 ip-10-0-0-19 pacemakerd[1318]: notice: Stopping pacemaker-controld
Dec 11 22:14:47 ip-10-0-0-19 systemd[1]: Stopping Pacemaker High Availability Cluster Manager...
Dec 11 22:14:47 ip-10-0-0-19 pacemaker-controld[1326]: notice: Caught 'Terminated' signal
Dec 11 22:14:47 ip-10-0-0-19 pacemaker-controld[1326]: notice: Shutting down cluster resource manager
Dec 11 22:14:47 ip-10-0-0-19 pacemaker-controld[1326]: warning: Input I_SHUTDOWN received in state S_TRANSITION_ENGINE from crm_shutdown
Dec 11 22:14:47 ip-10-0-0-19 pacemaker-controld[1326]: notice: Transition 1 aborted by transient_attributes.2 'create': Transient attribute change
Dec 11 22:14:47 ip-10-0-0-19 systemd[1532]: Reached target Shutdown.
Dec 11 22:14:47 ip-10-0-0-19 systemd[1532]: Starting Exit the Session...

[More than 75 seconds later, pacemaker still has not shut down.]
Dec 11 22:16:06 ip-10-0-0-19 pacemaker-controld[1326]: error: Node node2 did not send monitor result (via controller) within 20000ms (action timeout plus cluster-delay)
Dec 11 22:16:06 ip-10-0-0-19 pacemaker-controld[1326]: error: [Action    5]: In-flight rsc op node1_fence_monitor_60000         on node2 (priority: 0, waiting: none)
Dec 11 22:16:06 ip-10-0-0-19 pacemaker-controld[1326]: warning: rsc_op 5: node1_fence_monitor_60000 on node2 timed out
Dec 11 22:17:33 ip-10-0-0-19 kernel: Command line: BOOT_IMAGE=(hd0,msdos2)/boot/vmlinuz-4.18.0-147.0.3.el8_1.x86_64 root=UUID=a727b695-0c21-404a-b42b-3075c8deb6ab ro console=ttyS0,115200n8 console=tty0 net.ifnames=0 rd.blacklist=nouveau crashkernel=auto acpi=off
~~~


Presumably pacemaker shutdown is hung due to (a) the pending fence event against node1, and/or (b) the monitor operation for node1_fence.


I was able to circumvent this hang by setting the shutdown-escalation cluster property (I chose a value of 15s), forcing pacemaker to give up on a graceful stop after the shutdown-escalation timer expires (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_reference/ch-clusteropts-haar).
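
(Concretely, that was just a cluster property change along these lines:)

~~~
# Give up on a graceful pacemaker stop after 15 seconds.
pcs property set shutdown-escalation=15s
~~~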

Overriding the shutdown-escalation default is not an ideal resolution by any means, but it appears to be a valid workaround for virtual instances. Two notes:
  1) I doubt this would completely resolve the fence timeout issue for bare metal instances, since there are likely other factors besides pacemaker that are causing slow shutdowns of those instances. As mentioned privately, I do not have a bare metal test environment. The customer can help test fixes on bare metal if needed.
  2) I still have reason to believe that if we get a hard-reset option, it would resolve the issue in the most satisfactory way for both virtual and bare metal instances.

Comment 11 Reid Wahl 2020-02-05 03:09:34 UTC
The customer proposed a workaround that does seem to work under at least some circumstances.

1) Install the acpid package.
2) Make a backup of /etc/acpi/actions/power.sh
3) Overwrite the power.sh script with a script that simply does `echo b > /proc/sysrq-trigger`.
4) Start and enable the acpid.service systemd unit.


The idea is that the EC2 StopInstances call sends the EC2 instance an ACPI power-button press. This action handler script forces an immediate reboot (see the sketch below).
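
A rough sketch of those four steps (paths per the customer's description; use yum instead of dnf on RHEL 7):

~~~
dnf install -y acpid
cp /etc/acpi/actions/power.sh /etc/acpi/actions/power.sh.bak
cat > /etc/acpi/actions/power.sh <<'EOF'
#!/bin/sh
# On any ACPI power-button event, reboot immediately via sysrq
# instead of starting a graceful shutdown.
echo b > /proc/sysrq-trigger
EOF
chmod +x /etc/acpi/actions/power.sh
systemctl enable --now acpid
~~~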

My concern is that since this approach relies on the acpid daemon, it may not work under some circumstances, such as a completely hung node.


I've reiterated to the customer that the ideal resolution would be for AWS to add a hard-reset or hard-poweroff API endpoint, similar to what IPMI devices provide for physical nodes and VMware and KVM provide for their virtual guests.

Comment 12 John Ruemker 2020-03-18 14:06:01 UTC
My reading of this bug and the cases leads me to believe the following:

- The AWS endpoint our agent is calling initiates a soft shutdown 
- This is NOT limited to bare-metal, and IS APPLICABLE to VMs.  

Am I correct?  If so, this seems like a big problem with high impact for any customers trying to use AWS.  

Oyvind: Do you know whether the agent as-written was expected to perform a hard power-off?? 

Reid: Have you found any method so far to hard reboot instances?

Comment 13 Oyvind Albrigtsen 2020-03-18 14:27:37 UTC
The agent was written to perform a hard power-off, and we didn't see any sign that it didn't.

It sounds like it's a bigger issue with bare metal (we didn't test on it).

Comment 14 Brandon Perkins 2020-03-18 14:35:41 UTC
Oyvind beat me to it, but as far as the forced power-off goes, I know this works because it had to be heavily tested. Plus, I have to use it in the test automation when deleting cloud clusters. What is being done is found at:

https://github.com/ClusterLabs/fence-agents/blob/master/agents/aws/fence_aws.py#L51

which is:

conn.instances.filter(InstanceIds=[options["--plug"]]).stop(Force=True)

The important part here is the "Force=True".  In the boto3 documentation at:

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ec2.html#EC2.Instance.stop

there is a description of the Force parameter:

Forces the instances to stop. The instances do not have an opportunity to flush file system caches or file system metadata. If you use this option, you must perform file system check and repair procedures. This option is not recommended for Windows instances.

and in testing with VMs, this is exactly the behavior I saw.  I will admit, however, that bare metal testing was not performed (mostly because I didn't know this was an option with AWS).  I'll research how to create a bare metal cluster and add it to the test matrix.
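
For reference, the same forced stop can be exercised outside the cluster with the AWS CLI (hypothetical instance ID; --force corresponds to the boto3 Force=True parameter), which makes it easy to watch what the guest actually does:

~~~
aws ec2 stop-instances --instance-ids i-0aaaaaaaaaaaaaaaa --force
# Poll the reported power state ("stopping" -> "stopped") while watching
# the guest's console/logs to see whether it performs an OS-level shutdown.
aws ec2 describe-instances --instance-ids i-0aaaaaaaaaaaaaaaa \
    --query 'Reservations[].Instances[].State.Name' --output text
~~~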

Comment 15 Reid Wahl 2020-03-18 20:22:23 UTC
(In reply to John Ruemker from comment #12)
> My reading of this bug and the cases leads me to believe the following:
> 
> - The AWS endpoint our agent is calling initiates a soft shutdown 
> - This is NOT limited to bare-metal, and IS APPLICABLE to VMs.  
> 
> Am I correct?  If so, this seems like a big problem with high impact for any
> customers trying to use AWS.  

You are correct.


> Reid: Have you found any method so far to hard reboot instances?

No, and the customer (at AWS) has not provided any either. In December, they said they were unaware of a hard reboot method and would follow up internally. But there has been no concrete statement of an RFE being opened at the AWS end. (Worth follow-up.)


(In reply to Brandon Perkins from comment #14)
> Oyvind beat me to it, but as far as the forced power off, I know this works
> because this had to be heavily tested.  Plus, I have to use it in the test
> automation when deleting cloud clusters.  What is being done is found at:
> 
> https://github.com/ClusterLabs/fence-agents/blob/master/agents/aws/fence_aws.
> py#L51
> 
> which is:
> 
> conn.instances.filter(InstanceIds=[options["--plug"]]).stop(Force=True)
> 
> The important part here is the "Force=True".  In the boto3 documentation at:
> 
> https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/
> ec2.html#EC2.Instance.stop
> 
> there is a description of the Force parameter:
> 
> Forces the instances to stop. The instances do not have an opportunity to
> flush file system caches or file system metadata. If you use this option,
> you must perform file system check and repair procedures. This option is not
> recommended for Windows instances.
> 
> and in testing with VMs, this is exactly the behavior I saw.  I will admit,
> however, that bare metal testing was not performed (mostly because I didn't
> know this was an option with AWS).  I'll research how to create a bare metal
> cluster and add it to the test matrix.

This issue is NOT limited to bare-metal machines.

I understand that the way Force=True is documented would lead one to believe that it would initiate a hard reboot. Additionally, that would be the most sane and least astonishing behavior.

However, in practice, even with the Force=True option, the API initiates an OS-level shutdown within the guest, reportedly via ACPI button press. Setting HandlePowerKey=ignore in /etc/systemd/logind.conf did NOT change the graceful shutdown behavior, oddly.

This shutdown normally would happen pretty quickly, but in the case of a fence event triggered by a heartbeat network failure (NOT `pcs stonith fence`), Pacemaker hangs the graceful shutdown within the guest OS.

Comment 16 Oyvind Albrigtsen 2020-03-19 08:22:08 UTC
Correct. It's probably the equivalent of running "poweroff -nf":
       -f, --force
           Force immediate halt, power-off, or reboot. When specified once, this results in an immediate but clean shutdown by the system manager. When specified twice, this results
           in an immediate shutdown without contacting the system manager. See the description of --force in systemctl(1) for more details.

       -n, --no-sync
           Don't sync hard disks/storage media before halt, power-off, reboot.

That made this hard to detect, since we never got into the graceful-shutdown situation Reid mentions.

Comment 17 Reid Wahl 2020-03-24 18:27:00 UTC
It sounds like there is no hard power-off RFE at the AWS end. The customer has filed a PR against fence-agents upstream, however.

I followed up with the customer as follows:
~~~
...
From December 6:

> At this moment I'm unaware of a method that will allow us to trigger a hard shutdown/stop unless we use a watchdog (HW or SW). I'm following up internally to understand if we have an alternative, but I wouldn't count on that and I would continue working with the resources we have at hand.


Has there been any news internally regarding this? For example, an RFE for a hard shutdown endpoint?
~~~


Response:
~~~
Hi Reid,

Thank you for following up with me.

Still nothing on my end, but I created and submitted a patch for fence_aws which would enhance the logging experience, add new debug options, and fix the fence race condition by using pcmk_delay_max:
- https://github.com/ClusterLabs/fence-agents/pull/323

Please, have a look at the PR and let me know if Red Hat can add these changes to your product for testing and future release.

Thank you,

- Felix, Amazon Web Services
~~~

Comment 18 Oyvind Albrigtsen 2020-03-25 09:22:00 UTC
I think Feist is in contact with them. Chris, you got any info on this?

Comment 20 Reid Wahl 2020-04-16 00:01:56 UTC
Have we made any inroads with AWS in getting an "immediate power-off" API endpoint?

Or do we have any thoughts regarding *safe* workarounds to the lack of such an endpoint? Interpreting "stopping" as "stopped" would probably work but is unsafe in case the fenced node is still accessing resources. I don't see any way around needing to see a "stopped" state before considering a node fenced.

Felix's ACPI workaround makes sense but I don't know that it's reliable in all situations.

Comment 21 Reid Wahl 2020-04-16 00:25:47 UTC
I spent the afternoon on a conference call with another customer for whom this extremely delayed shutdown is a showstopper for the high availability of their SAP deployment. They're using r5.xlarge and r5.2xlarge instances, which took several minutes to shut down.

I spun up a two-node cluster of r5.xlarge instances to test. In my case, the fenced node rejoined the cluster while it was still "stopping" from AWS's perspective.

Comment 22 Guilherme 2020-04-16 10:09:42 UTC
Hello,

I'm sorry for stepping into your BZ, but I would like to share a few things.

Based on all of my tests, it shouldn't take more than 4-5 minutes to stop an instance, and I've configured the cluster with the correct timeouts to wait up to 5 minutes (stonith-timeout, power_timeout, op defaults timeout; see KB#1386353). The default action for fence_aws is to reboot the instance, not stop it, and the AWS EC2 web console doesn't update the instance state every second (or minute), so it may have still appeared as "stopping" when it was probably already "running".
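
(A sketch of that timeout tuning, using the 4-5 minute figures above and a hypothetical device name:)

~~~
# Allow the whole fence operation up to 5 minutes...
pcs property set stonith-timeout=300s
# ...and let the agent itself wait up to 4 minutes for the power-state change.
pcs stonith update node1_fence power_timeout=240
~~~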

Also, the code committed in https://github.com/ClusterLabs/fence-agents/pull/323 will prevent a dual-fence race as long as a pcmk_delay_max setting is used, but it won't fix the time it takes to stop an instance, which is why I proposed an ACPI trap configuration as an interim mitigation, i.e. an "echo o > /proc/sysrq-trigger".

So I guess we may have two options for the moment:

1) Merge the latest upstream fence_aws into RHEL and enforce the use of the pcmk_delay options to ensure no fence race will occur, AND tune the timeouts accordingly (4-5 minutes) -- this alone will resolve a known issue with the agent, namely the dual-fence race
2) Evaluate/test/validate the usage of ACPI daemon + ACPI signal trap to interface with kernel's sysrq for an immediate shutdown

The risks with #2 are the same as with any userspace process: the daemon could be stopped, killed, or not properly configured, and it makes the implementation a bit more complex by adding extra steps.

Thoughts?

Comment 33 RHEL Program Management 2023-09-22 18:35:02 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 34 RHEL Program Management 2023-09-22 18:36:34 UTC
This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated.  Be sure to add yourself to Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.

