Bug 1957951 - [aws] destroy can get blocked on instances stuck in shutting-down state
Summary: [aws] destroy can get blocked on instances stuck in shutting-down state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Aditya Narayanaswamy
QA Contact: Yunfei Jiang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-05-06 19:54 UTC by Matthew Staebler
Modified: 2021-07-27 23:07 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Some instances in AWS got stuck in the shutting-down state and were never terminated. To make sure that all instances are removed, the installer now requests a fresh termination after 10 minutes.
Clone Of:
Environment:
Last Closed: 2021-07-27 23:07:08 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift installer pull 4848 | 0 | None | open | Bug 1957951: AWS: Periodically send shut down requests for stuck EC2 instances | 2021-05-06 19:57:00 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 23:07:22 UTC

Description Matthew Staebler 2021-05-06 19:54:49 UTC
We had a dozen or so instances in the CI account that were stuck in shutting-down for days. Eric Paris opened an AWS ticket, but they didn't really investigate. Requesting a fresh termination removed the instances. Word from AWS folks is that being stuck for 15m or more is a sign of trouble, so logging a warning and re-terminating any instances at least that often seems reasonable.

Sometimes re-terminating helps, and sometimes it doesn't (or maybe that didn't quite get as far as re-terminating?), but asking for a fresh termination every 15m or so doesn't seem like it would have negative consequences. Log line should definitely whine about AWS not terminating ("consider filing a ticket with AWS support").
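
The retry policy proposed above could be sketched roughly as follows. This is not the installer's code (the installer is not written in Python): `TerminationTracker`, `poll`, and the injected `terminate` callable are hypothetical names, and the 15-minute interval is taken from the suggestion in this description.

```python
import logging
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, Dict, List

logger = logging.getLogger("destroy")

# Assumed interval, based on the "every 15m or so" suggestion above.
RETERMINATE_AFTER = timedelta(minutes=15)


@dataclass
class TerminationTracker:
    """Remembers when termination was last requested for each instance."""

    # Hypothetical hook that sends one termination request for an instance ID.
    terminate: Callable[[str], None]
    last_request: Dict[str, datetime] = field(default_factory=dict)

    def poll(self, states: Dict[str, str], now: datetime) -> List[str]:
        """Handle one polling pass; return the IDs (re-)terminated in it."""
        requested = []
        for instance_id, state in states.items():
            if state == "terminated":
                # Done: forget the instance.
                self.last_request.pop(instance_id, None)
                continue
            last = self.last_request.get(instance_id)
            if last is not None and now - last < RETERMINATE_AFTER:
                continue  # a recent request is still pending
            if last is not None:
                # The earlier request did not finish in time: complain loudly.
                logger.warning(
                    "instance %s still %s after %s; requesting termination "
                    "again (consider filing a ticket with AWS support)",
                    instance_id, state, now - last,
                )
            self.terminate(instance_id)
            self.last_request[instance_id] = now
            requested.append(instance_id)
        return requested
```

In a real client, `terminate` would wrap the EC2 TerminateInstances call (for example boto3's `terminate_instances(InstanceIds=[...])`), and `states` would come from a describe call made on each pass.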

Clone of https://issues.redhat.com/browse/CORS-1599

Comment 1 Matthew Staebler 2021-05-06 19:56:19 UTC
Recently, there were 6 separate CI clusters in us-west-2 that were all blocked by a shutting-down instance. Each instance had a State Transition Reason of Server.InternalError. Manually terminating the instances resolved the issue.
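
For anyone who wants to look for instances in this condition, here is a minimal sketch that filters an EC2 DescribeInstances-style response for the pattern described in this comment. The function name `stuck_instance_ids` is hypothetical. The response layout (`Reservations`, `Instances`, `State.Name`, `StateTransitionReason`) follows the EC2 API, and the sample data is made up.

```python
from typing import Any, Dict, List


def stuck_instance_ids(response: Dict[str, Any]) -> List[str]:
    """Return IDs of shutting-down instances that report Server.InternalError."""
    stuck = []
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            state = instance.get("State", {}).get("Name")
            reason = instance.get("StateTransitionReason", "")
            if state == "shutting-down" and "Server.InternalError" in reason:
                stuck.append(instance["InstanceId"])
    return stuck


# Made-up sample shaped like a DescribeInstances response.
sample = {
    "Reservations": [
        {"Instances": [
            {"InstanceId": "i-aaa", "State": {"Name": "shutting-down"},
             "StateTransitionReason": "Server.InternalError"},
            {"InstanceId": "i-bbb", "State": {"Name": "terminated"},
             "StateTransitionReason": "User initiated"},
        ]},
    ]
}
print(stuck_instance_ids(sample))
```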

Comment 3 Yunfei Jiang 2021-05-07 08:05:43 UTC
Hello Matthew, is there a way to reproduce this issue? I don't recall us hitting it before. I just checked all instances under the QE account, and they are all `Terminated` or `Running`.

Thanks.

Comment 4 Matthew Staebler 2021-05-07 15:56:14 UTC
(In reply to Yunfei Jiang from comment #3)
> Hello Matthew, is there a way to reproduce this issue? I don't recall us
> hitting it before. I just checked all instances under the QE account, and
> they are all `Terminated` or `Running`.
> 
> Thanks.

I unfortunately do not know of a way to reproduce this issue. It is something that happens very rarely due to AWS issues and not something that we control.

Comment 5 Yunfei Jiang 2021-05-11 01:21:39 UTC
Hello Matthew, since this PR merged, have you hit the issue again on your side? If the fix works well, I am going to set the status to VERIFIED, since the issue is related to the AWS platform and cannot be reproduced on the QE side.

Comment 6 Matthew Staebler 2021-05-11 14:07:57 UTC
(In reply to Yunfei Jiang from comment #5)
> Hello Matthew, since this PR merged, have you hit the issue again on your
> side? If the fix works well, I am going to set the status to VERIFIED, since
> the issue is related to the AWS platform and cannot be reproduced on the QE side.

I only know of 2 cases in the past 6 months where there have been instances stuck shutting down. In both cases, there were multiple instances across multiple clusters, implying a temporary error in AWS itself. I have not seen the issue since the PR merged, but I have no indication whether AWS has had the issue or not since then, unfortunately.

Comment 7 Yunfei Jiang 2021-05-12 01:06:49 UTC
Thanks, Matthew.
Per comment 5 and comment 6, changing status to VERIFIED.

Comment 10 errata-xmlrpc 2021-07-27 23:07:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

