Bug 1780515
| Summary: | fence_aws: fenced node always performs a graceful shutdown | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Reid Wahl <nwahl> |
| Component: | fence-agents | Assignee: | Oyvind Albrigtsen <oalbrigt> |
| Status: | CLOSED MIGRATED | QA Contact: | Brandon Perkins <bperkins> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 8.1 | CC: | arahmad, cluster-maint, ctowsley, fguilher, hbiswas, juholmes, oalbrigt, sanyadav, sbradley |
| Target Milestone: | rc | Keywords: | MigratedToJIRA |
| Target Release: | 8.1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-09-22 18:36:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Reid Wahl, 2019-12-06 09:09:42 UTC)
(In reply to Reid Wahl from comment #0)

Switch the Actual results with the Expected results.

Instance start/stop behavior explanation:
- Stop and Start Your Instance (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Stop_Start.html)

I have tried setting HandlePowerKey=ignore as described in our Product Docs (see below, and the sketch further down), but this did not help.
- 5.11.2. Disabling ACPI Soft-Off in the logind.conf file (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/high_availability_add-on_reference/index#s2-acpi-disable-logind-CA)

I've asked the customer about whether there's a way to initiate a hard power-off.

Weird thing, now that I look at the node without the delay -- from the OS perspective, it seems to have rebooted about two seconds after token loss. It still shows "stopping" in the AWS console for a while after this.

Node 2:

~~~
Dec 6 09:23:33 ip-10-0-0-19 corosync[1180]: [KNET ] link: host: 1 link: 0 is down
Dec 6 09:23:33 ip-10-0-0-19 corosync[1180]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec 6 09:23:33 ip-10-0-0-19 corosync[1180]: [KNET ] host: host: 1 has no active links
Dec 6 09:23:33 ip-10-0-0-19 corosync[1180]: [TOTEM ] Token has not been received in 750 ms
Dec 6 09:23:34 ip-10-0-0-19 corosync[1180]: [TOTEM ] A processor failed, forming new configuration.
Dec 6 09:23:35 ip-10-0-0-19 corosync[1180]: [TOTEM ] A new membership (2:184) was formed. Members left: 1
Dec 6 09:23:35 ip-10-0-0-19 corosync[1180]: [TOTEM ] Failed to receive the leave message. failed: 1
Dec 6 09:23:35 ip-10-0-0-19 corosync[1180]: [CPG ] downlist left_list: 1 received
Dec 6 09:23:35 ip-10-0-0-19 corosync[1180]: [QUORUM] Members[1]: 2
Dec 6 09:23:35 ip-10-0-0-19 corosync[1180]: [MAIN ] Completed service synchronization, ready to provide service.
...
Dec 6 09:23:35 ip-10-0-0-19 pacemaker-controld[1315]: notice: Requesting fencing (reboot) of node node1
Dec 6 09:23:35 ip-10-0-0-19 pacemaker-controld[1315]: notice: Initiating start operation node1_fence_start_0 locally on node2
Dec 6 09:23:35 ip-10-0-0-19 pacemaker-fenced[1311]: notice: Client pacemaker-controld.1315.e6e254a6 wants to fence (reboot) 'node1' with device '(any)'
Dec 6 09:23:35 ip-10-0-0-19 pacemaker-fenced[1311]: notice: Requesting peer fencing (reboot) of node1
Dec 6 09:23:35 ip-10-0-0-19 pacemaker-fenced[1311]: notice: node1_fence is eligible to fence (reboot) node1 (aka. 'i-0c5a1d7b5fc91764d'): static-list
Dec 6 09:23:35 ip-10-0-0-19 pacemaker-fenced[1311]: notice: node2_fence is not eligible to fence (reboot) node1: static-list
Dec 6 09:23:35 ip-10-0-0-19 pacemaker-fenced[1311]: notice: node1_fence is eligible to fence (reboot) node1 (aka. 'i-0c5a1d7b5fc91764d'): static-list
Dec 6 09:23:35 ip-10-0-0-19 pacemaker-fenced[1311]: notice: node2_fence is not eligible to fence (reboot) node1: static-list
...
Dec 6 09:23:37 ip-10-0-0-19 systemd[1]: Stopped Dynamic System Tuning Daemon.
Dec 6 09:26:12 ip-10-0-0-19 kernel: Linux version 4.18.0-147.0.3.el8_1.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 8.3.1 20190507 (Red Hat 8.3.1-4) (GCC)) #1 SMP Mon Nov 11 12:58:36 UTC 2019
~~~

Node 1 was gracefully rebooted around 09:24:38. Maybe shutdown hung on node 2 above due to pacemaker waiting to fence, and this is just where the logs stopped.

Also, I just realized I'm on RHEL 8 and the customer is on RHEL 7. I'll need to test on RHEL 7 and see if the bug presents itself in the same way.

This is apparently caused by the slow shutdown, as suspected.
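For reference, a minimal sketch of the HandlePowerKey=ignore change tried above, assuming the stock /etc/systemd/logind.conf from the linked RHEL documentation with no existing HandlePowerKey line (illustrative only; as noted, it did not change the behavior here):

~~~
# Tell systemd-logind to ignore the ACPI power-key press that EC2 StopInstances delivers.
# (Assumes the default logind.conf, which contains only the [Login] section.)
cat >> /etc/systemd/logind.conf <<'EOF'
HandlePowerKey=ignore
EOF
systemctl restart systemd-logind
~~~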
I printed the output of each get_power_status() call and tuned the delay timer of node 1's stonith device. The delay had to be longer than the time it takes node 2 to go from "stopping" to "stopped". Despite the fact that node 2 received an ACPI shutdown signal and killed its sshd process almost immediately, pacemaker is apparently in full swing (and able to fence node 1) until node 2 reaches the "stopped" power state. Node 2 typically takes more than two minutes to stop completely. A delay of 180 on node 1's stonith device prevented it from being powered off by node 2.

The customer said they're not aware of any way to trigger a true hard power-off. I've requested that they reach out to their relevant dev team to determine the feasibility of adding such a feature, and they're inquiring about any existing alternatives that may help. Meanwhile, if either we or AWS can figure out a way to get the node to stop faster, that may take care of the issue.

The customer (and the AWS documentation) say that the instance stop is implemented by an ACPI shutdown signal to the OS. I've tried both of these, with no improvement:
- https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/high_availability_add-on_reference/index#s2-acpi-disable-logind-CA
- https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/high_availability_add-on_reference/index#s2-acpi-disable-boot-CA

The customer also put forth the following:

~~~
I had some internal discussions with some very savvy HA AWS consultants and the idea is to have a timer started when the instance switches to "Stopping", and if after this timer expires, regardless of the instance moving to "Stopped", the fencing agent would return OK and let the cluster move on with the failover. The main reasons are:

1) After the instance transitions to "Stopping" it won't ever go back into the "Running" state
2) It is guaranteed that the instance receives an ACPI signal and the OS will start the shutdown sequence (as long as it doesn't ignore the ACPI signal); otherwise it will take much longer for the instance to actually stop
~~~

I replied with why that's not a viable option IMO (full response in the case), but it basically comes down to this: we know for sure that the "stopping" state doesn't guarantee a node has relinquished its access to shared resources, because node 2 in the "stopping" state is still "alive" enough to fence node 1 after the delay. (And of course it doesn't do any good to declare node 2 as fenced if it turns around and fences node 1.)

Linked case 02535181, which is for fencing timing out on bare metal instances, which the customer reports can take up to 10 minutes to stop. For now I'm tracking that issue via this same BZ, since a true hard power-off should be able to resolve both issues.

Did a little more investigation. Pacemaker is what's hanging the graceful shutdown process on node 2. I configured the rsyslog.service unit with `Before=pacemaker.service` so that messages would keep writing to the logs during the shutdown process until pacemaker stopped successfully. We already knew that pacemaker wasn't stopping completely for a while, since it was still able to fence node 1 after a more-than-60-second delay.
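Below is a sketch of the two tweaks used in this test, assuming pcs is managing the cluster and the stonith resource is named node1_fence as in the logs above (illustrative values, not a recommended configuration):

~~~
# Give node1's fence device a static delay longer than node2's "stopping" -> "stopped"
# window, so a still-shutting-down node2 cannot win the fence race.
pcs stonith update node1_fence delay=180

# Order rsyslog before pacemaker so that, at shutdown, rsyslog stops after pacemaker
# and keeps capturing cluster log messages until pacemaker has actually exited.
mkdir -p /etc/systemd/system/rsyslog.service.d
cat > /etc/systemd/system/rsyslog.service.d/before-pacemaker.conf <<'EOF'
[Unit]
Before=pacemaker.service
EOF
systemctl daemon-reload
~~~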
Here are the logs that were gathered on node 2 during shutdown:

~~~
Dec 11 22:14:44 ip-10-0-0-19 corosync[1191]: [TOTEM ] A processor failed, forming new configuration.
Dec 11 22:14:45 ip-10-0-0-19 corosync[1191]: [TOTEM ] A new membership (2:392) was formed. Members left: 1
Dec 11 22:14:45 ip-10-0-0-19 corosync[1191]: [TOTEM ] Failed to receive the leave message. failed: 1
Dec 11 22:14:45 ip-10-0-0-19 corosync[1191]: [CPG ] downlist left_list: 1 received
Dec 11 22:14:45 ip-10-0-0-19 corosync[1191]: [QUORUM] Members[1]: 2
Dec 11 22:14:45 ip-10-0-0-19 corosync[1191]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 11 22:14:45 ip-10-0-0-19 pacemaker-based[1321]: notice: Node node1 state is now lost
Dec 11 22:14:45 ip-10-0-0-19 pacemaker-controld[1326]: notice: Requesting fencing (reboot) of node node1

[There is a delay configured for node1, so this doesn't do anything yet.]
[I didn't know that a stonith device would migrate before fencing had completed. That may be unexpected.]

Dec 11 22:14:45 ip-10-0-0-19 pacemaker-controld[1326]: notice: Initiating start operation node1_fence_start_0 locally on node2
Dec 11 22:14:46 ip-10-0-0-19 pacemaker-controld[1326]: notice: Result of start operation for node1_fence on node2: 0 (ok)
Dec 11 22:14:46 ip-10-0-0-19 pacemaker-controld[1326]: notice: Initiating monitor operation node1_fence_monitor_60000 locally on node2

[Node2 begins gracefully shutting down.]

Dec 11 22:14:47 ip-10-0-0-19 systemd[1]: Stopped target Host and Network Name Lookups.
...
Dec 11 22:14:47 ip-10-0-0-19 pacemakerd[1318]: notice: Caught 'Terminated' signal
Dec 11 22:14:47 ip-10-0-0-19 pacemakerd[1318]: notice: Shutting down Pacemaker
Dec 11 22:14:47 ip-10-0-0-19 pacemakerd[1318]: notice: Stopping pacemaker-controld
Dec 11 22:14:47 ip-10-0-0-19 systemd[1]: Stopping Pacemaker High Availability Cluster Manager...
Dec 11 22:14:47 ip-10-0-0-19 pacemaker-controld[1326]: notice: Caught 'Terminated' signal
Dec 11 22:14:47 ip-10-0-0-19 pacemaker-controld[1326]: notice: Shutting down cluster resource manager
Dec 11 22:14:47 ip-10-0-0-19 pacemaker-controld[1326]: warning: Input I_SHUTDOWN received in state S_TRANSITION_ENGINE from crm_shutdown
Dec 11 22:14:47 ip-10-0-0-19 pacemaker-controld[1326]: notice: Transition 1 aborted by transient_attributes.2 'create': Transient attribute change
Dec 11 22:14:47 ip-10-0-0-19 systemd[1532]: Reached target Shutdown.
Dec 11 22:14:47 ip-10-0-0-19 systemd[1532]: Starting Exit the Session...

[More than 75 seconds later, pacemaker still has not shut down.]

Dec 11 22:16:06 ip-10-0-0-19 pacemaker-controld[1326]: error: Node node2 did not send monitor result (via controller) within 20000ms (action timeout plus cluster-delay)
Dec 11 22:16:06 ip-10-0-0-19 pacemaker-controld[1326]: error: [Action 5]: In-flight rsc op node1_fence_monitor_60000 on node2 (priority: 0, waiting: none)
Dec 11 22:16:06 ip-10-0-0-19 pacemaker-controld[1326]: warning: rsc_op 5: node1_fence_monitor_60000 on node2 timed out
Dec 11 22:17:33 ip-10-0-0-19 kernel: Command line: BOOT_IMAGE=(hd0,msdos2)/boot/vmlinuz-4.18.0-147.0.3.el8_1.x86_64 root=UUID=a727b695-0c21-404a-b42b-3075c8deb6ab ro console=ttyS0,115200n8 console=tty0 net.ifnames=0 rd.blacklist=nouveau crashkernel=auto acpi=off
~~~

Presumably pacemaker shutdown is hung due to (a) the pending fence event against node1, and/or (b) the monitor operation for node1_fence.

I was able to circumvent this hang by setting the shutdown-escalation cluster property (I chose a value of 15s), forcing pacemaker to give up on a graceful stop after the shutdown-escalation timer expires (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_reference/ch-clusteropts-haar).
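A minimal sketch of that change, assuming pcs is used to manage cluster properties (15s is simply the value tested here, not a recommendation):

~~~
# Tell pacemaker to stop waiting for a graceful component shutdown after 15 seconds,
# so the fenced instance does not sit in "stopping" for minutes while pacemaker hangs.
pcs property set shutdown-escalation=15s
~~~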
Overriding the shutdown-escalation default is not an ideal resolution by any means, but it appears to be a valid workaround for virtual instances. Two notes:

1) I doubt this would completely resolve the fence timeout issue for bare metal instances, since there are likely other factors besides pacemaker that are causing slow shutdowns of those instances. As mentioned privately, I do not have a bare metal test environment. The customer can help test fixes on bare metal if needed.
2) I still have reason to believe that if we get a hard-reset option, it would resolve the issue in the most satisfactory way for both virtual and bare metal instances.

The customer proposed a workaround that does seem to work under at least some circumstances:

1) Install the acpid package.
2) Make a backup of /etc/acpi/actions/power.sh.
3) Overwrite the power.sh script with a script that simply does `echo b > /proc/sysrq-trigger`.
4) Start and enable the acpid.service systemd unit.

The idea is that the EC2 StopInstance call sends the EC2 instance an ACPI button press, and this action handler script forces an immediate reboot. My concern is that since this approach relies on the acpid daemon, it may not work under some circumstances, such as a completely hung node.

I've reiterated to the customer that the ideal resolution would be for AWS to add a hard-reset or hard-poweroff API endpoint, similar to what IPMI devices provide for physical nodes and what VMware and KVM provide for their virtual guests.
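A sketch of that acpid workaround, using the handler path named above; treat it as illustrative, and note it assumes the kernel's sysrq facility permits the reboot trigger:

~~~
# 1) Install the ACPI event daemon
yum install -y acpid

# 2) Back up the stock power-button handler
cp /etc/acpi/actions/power.sh /etc/acpi/actions/power.sh.orig

# 3) Replace it with a handler that reboots immediately via sysrq instead of
#    starting a graceful shutdown when EC2 StopInstances sends the ACPI button press
cat > /etc/acpi/actions/power.sh <<'EOF'
#!/bin/sh
echo b > /proc/sysrq-trigger
EOF
chmod +x /etc/acpi/actions/power.sh

# 4) Enable and start acpid so the handler is active
systemctl enable --now acpid.service
~~~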
My reading of this bug and the cases leads me to believe the following:

- The AWS endpoint our agent is calling initiates a soft shutdown
- This is NOT limited to bare-metal, and IS APPLICABLE to VMs.

Am I correct? If so, this seems like a big problem with high impact for any customers trying to use AWS.

Oyvind: Do you know whether the agent as written was expected to perform a hard power-off?
Reid: Have you found any method so far to hard reboot instances?

The agent was written to perform a hard power-off, and we didn't see any sign that it didn't. It sounds like it's a bigger issue with bare metal (we didn't test on it).

Oyvind beat me to it, but as far as the forced power off, I know this works because this had to be heavily tested. Plus, I have to use it in the test automation when deleting cloud clusters. What is being done is found at:

https://github.com/ClusterLabs/fence-agents/blob/master/agents/aws/fence_aws.py#L51

which is:

conn.instances.filter(InstanceIds=[options["--plug"]]).stop(Force=True)

The important part here is the "Force=True". In the boto3 documentation at:

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ec2.html#EC2.Instance.stop

there is a description of the Force parameter:

    Forces the instances to stop. The instances do not have an opportunity to flush file system caches or file system metadata. If you use this option, you must perform file system check and repair procedures. This option is not recommended for Windows instances.

and in testing with VMs, this is exactly the behavior I saw. I will admit, however, that bare metal testing was not performed (mostly because I didn't know this was an option with AWS). I'll research how to create a bare metal cluster and add it to the test matrix.

(In reply to John Ruemker from comment #12)
> My reading of this bug and the cases leads me to believe the following:
>
> - The AWS endpoint our agent is calling initiates a soft shutdown
> - This is NOT limited to bare-metal, and IS APPLICABLE to VMs.
>
> Am I correct? If so, this seems like a big problem with high impact for any
> customers trying to use AWS.

You are correct.

> Reid: Have you found any method so far to hard reboot instances?

No, and the customer (at AWS) has not provided any either. In December, they said they were unaware of a hard reboot method and would follow up internally. But there has been no concrete statement of an RFE being opened at the AWS end. (Worth follow-up.)

(In reply to Brandon Perkins from comment #14)
> Oyvind beat me to it, but as far as the forced power off, I know this works
> because this had to be heavily tested. Plus, I have to use it in the test
> automation when deleting cloud clusters. What is being done is found at:
>
> https://github.com/ClusterLabs/fence-agents/blob/master/agents/aws/fence_aws.py#L51
>
> which is:
>
> conn.instances.filter(InstanceIds=[options["--plug"]]).stop(Force=True)
>
> The important part here is the "Force=True". In the boto3 documentation at:
>
> https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ec2.html#EC2.Instance.stop
>
> there is a description of the Force parameter:
>
> Forces the instances to stop. The instances do not have an opportunity to
> flush file system caches or file system metadata. If you use this option,
> you must perform file system check and repair procedures. This option is not
> recommended for Windows instances.
>
> and in testing with VMs, this is exactly the behavior I saw. I will admit,
> however, that bare metal testing was not performed (mostly because I didn't
> know this was an option with AWS). I'll research how to create a bare metal
> cluster and add it to the test matrix.

This issue is NOT limited to bare-metal machines. I understand that the way Force=True is documented would lead one to believe that it would initiate a hard reboot. Additionally, that would be the most sane and least astonishing behavior. However, in practice, even with the Force=True option, the API initiates an OS-level shutdown within the guest, reportedly via an ACPI button press. Setting HandlePowerKey=ignore in /etc/systemd/logind.conf did NOT change the graceful shutdown behavior, oddly.

This shutdown normally would happen pretty quickly, but in the case of a fence event triggered by a heartbeat network failure (NOT `pcs stonith fence`), Pacemaker hangs the graceful shutdown within the guest OS.

Correct. It's probably the equivalent of running "poweroff -nf":

    -f, --force
        Force immediate halt, power-off, or reboot. When specified once, this results in an immediate but clean shutdown by the system manager. When specified twice, this results in an immediate shutdown without contacting the system manager. See the description of --force in systemctl(1) for more details.

    -n, --no-sync
        Don't sync hard disks/storage media before halt, power-off, reboot.

Which made it hard to detect when we didn't end up in the graceful-shutdown situation Reid mentions.
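For completeness, a sketch of reproducing the same forced stop outside the agent with the AWS CLI (the instance ID is a placeholder and a configured AWS CLI is assumed); this is one way to observe whether the guest receives a hard stop or an ACPI-initiated graceful shutdown:

~~~
# Roughly what fence_aws does via boto3's stop(Force=True)
aws ec2 stop-instances --instance-ids i-0123456789abcdef0 --force

# Watch how long the instance takes to go from "stopping" to "stopped"
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
    --query 'Reservations[].Instances[].State.Name' --output text
~~~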
It sounds like there is no hard power-off RFE at the AWS end. The customer filed a PR against fence-agents upstream, however. I followed up with the customer as follows:

~~~
...

From December 6:

> At this moment I'm unaware of a method that will allow us to trigger a hard shutdown/stop unless we use a watchdog (HW or SW). I'm following up internally to understand if we have an alternative, but I wouldn't count on that and I would continue working with the resources we have at hand.

Has there been any news internally regarding this? For example, an RFE for a hard shutdown endpoint?
~~~

Response:

~~~
Hi Reid,

Thank you for following up with me. Still nothing on my end, but I created and submitted a patch for fence_aws which would enhance the logging experience, add new debug options, and fix the fence race condition by using pcmk_max_delay:

- https://github.com/ClusterLabs/fence-agents/pull/323

Please have a look at the PR and let me know if Red Hat can add these changes to your product for testing and a future release.

Thank you,
- Felix, Amazon Web Services
~~~

I think Feist is in contact with them. Chris, do you have any info on this?

Have we made any inroads with AWS in getting an "immediate power-off" API endpoint? Or do we have any thoughts regarding *safe* workarounds to the lack of such an endpoint? Interpreting "stopping" as "stopped" would probably work but is unsafe in case the fenced node is still accessing resources. I don't see any way around needing to see a "stopped" state before considering a node fenced. Felix's ACPI workaround makes sense, but I don't know that it's reliable in all situations.

I spent the afternoon on a conference call with another customer for whom this extremely delayed shutdown is a showstopper for the high availability of their SAP deployment. They're using r5.xlarge and r5.2xlarge instances, which took several minutes to shut down. I spun up a two-node cluster of r5.xlarge instances to test. In my case, the fenced node rejoined the cluster while it was still "stopping" from AWS's perspective.

Hello, I'm sorry for stepping into your BZ, but I would like to share a few things. Based on all of my tests, it shouldn't take more than 4-5 minutes to stop an instance, and I've configured the cluster with the correct timeouts to wait up to 5 minutes (stonith-timeout, power_timeout, op defaults timeout; see KB#1386353). The default action for fence_aws is to reboot the instance, not stop it, and the AWS EC2 web console won't update the instance state every second (or even every minute), so it may have still appeared as "stopping" when it was probably already "running".

Also, the code committed in https://github.com/ClusterLabs/fence-agents/pull/323 will prevent a dual-fence race as long as a pcmk delay setting is used, but it won't fix the time it takes to stop an instance, which is why I proposed an ACPI trap configuration (an "echo o > /proc/sysrq-trigger") as an interim mitigation.

So I guess we have two options for the moment:

1) Merge the latest upstream fence_aws into RHEL and enforce the use of pcmk delay options to ensure no fence race will occur, AND tune the timeouts accordingly (4-5 minutes) -- this alone will resolve a known issue with the agent, which is the dual-fence race
2) Evaluate/test/validate the usage of the ACPI daemon + ACPI signal trap to interface with the kernel's sysrq for an immediate shutdown

The risks with #2 are the same as with any userspace process: the daemon could be stopped, killed, or not properly configured, and it may make the implementation a bit more complex by adding extra steps.

Thoughts?
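A sketch of the timeout tuning option 1 implies, assuming pcs and a fence_aws stonith resource named node1_fence as earlier in this bug; the 300-second values mirror the 4-5 minute stop time mentioned above, and the referenced KB remains the authoritative guidance:

~~~
# Allow the whole fencing operation up to 5 minutes before pacemaker considers it failed
pcs property set stonith-timeout=300s

# Let fence_aws itself wait up to 5 minutes for the instance to reach a stopped power state
pcs stonith update node1_fence power_timeout=300
~~~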
https://github.com/ClusterLabs/fence-agents/pull/323 is being fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1816203

Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there. Due to differences in account names between systems, some fields were not replicated. Be sure to add yourself to the Jira issue's "Watchers" field to continue receiving updates, and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it and begin with "RHEL-" followed by an integer. You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.