1957951 – [aws] destroy can get blocked on instances stuck in shutting-down state

Summary: [aws] destroy can get blocked on instances stuck in shutting-down state

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	4.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Aditya Narayanaswamy
QA Contact:	Yunfei Jiang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-05-06 19:54 UTC by Matthew Staebler
Modified:	2021-07-27 23:07 UTC (History)
CC List:	0 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Some of the instances in AWS were stuck in shutting-down state and were never terminated. In order to make sure that all the instances are removed, a fresh termination will now be requested after 10 minutes to ensure that they are destroyed.
Clone Of:
Environment:
Last Closed:	2021-07-27 23:07:08 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift installer pull 4848	0	None	open	Bug 1957951: AWS: Periodically send shut down requests for stuck EC2 instances	2021-05-06 19:57:00 UTC
Red Hat Product Errata	RHSA-2021:2438	0	None	None	None	2021-07-27 23:07:22 UTC

Description Matthew Staebler 2021-05-06 19:54:49 UTC

We had a dozen or so instances in the CI account that were stuck in shutting-down for days. Eric Paris opened an AWS ticket, but they didn't really investigate. Requesting a fresh termination removed the instances. Word from AWS folks is that being stuck for 15m or more is a sign of trouble, so logging a warning and re-terminating any instances at least that often seems reasonable.

Sometimes re-terminating helps and sometimes it doesn't (or perhaps those attempts never actually got as far as re-terminating), but asking for a fresh termination every 15m or so doesn't seem like it would have negative consequences. The log line should definitely complain about AWS not terminating ("consider filing a ticket with AWS support").
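
The retry policy proposed above can be sketched roughly as follows. This is only an illustration and not the installer's actual code: `StuckInstanceTracker`, `observe`, and the injected `terminate` callable are hypothetical names, and the 15-minute threshold is taken from this description (the Doc Text field mentions 10 minutes for the shipped fix).

```python
import logging
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, Iterable, List

log = logging.getLogger("destroy")

# Assumption: threshold taken from the bug description, not from the installer.
RETERMINATE_AFTER = timedelta(minutes=15)


@dataclass
class StuckInstanceTracker:
    """Hypothetical helper: re-request termination for stuck instances."""

    terminate: Callable[[str], None]  # e.g. a wrapper around the EC2 terminate call
    threshold: timedelta = RETERMINATE_AFTER
    last_request: Dict[str, datetime] = field(default_factory=dict)

    def observe(self, shutting_down_ids: Iterable[str], now: datetime) -> List[str]:
        """Record instances currently shutting down; re-terminate overdue ones."""
        current = set(shutting_down_ids)
        # Forget instances that are no longer reported as shutting-down.
        for gone in set(self.last_request) - current:
            del self.last_request[gone]
        retried = []
        for instance_id in sorted(current):
            since = self.last_request.setdefault(instance_id, now)
            if now - since >= self.threshold:
                log.warning(
                    "%s has been shutting-down for %s; requesting a fresh "
                    "termination (consider filing a ticket with AWS support)",
                    instance_id, now - since,
                )
                self.terminate(instance_id)
                self.last_request[instance_id] = now  # restart the clock
                retried.append(instance_id)
        return retried
```

A destroy loop would call `observe` on each polling pass with the IDs it currently sees in shutting-down; the clock restarts after every fresh request, so a stuck instance is re-terminated at most once per threshold interval.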

Clone of https://issues.redhat.com/browse/CORS-1599

Comment 1 Matthew Staebler 2021-05-06 19:56:19 UTC

Recently, there were 6 separate CI clusters in us-west-2 that were all blocked by a shutting-down instance. Each instance had a State Transition Reason of Server.InternalError. Manually terminating the instances resolved the issue.
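
For anyone wanting to spot such instances, a minimal sketch follows. It assumes a dictionary shaped like an EC2 DescribeInstances response (`Reservations`, `Instances`, `State.Name`, `StateTransitionReason`, `InstanceId`); the function name `stuck_shutting_down` and the sample data are made up for illustration.

```python
from typing import Any, Dict, List, Tuple


def stuck_shutting_down(response: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (instance ID, state transition reason) for shutting-down instances."""
    stuck = []
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            if instance.get("State", {}).get("Name") == "shutting-down":
                stuck.append(
                    (instance["InstanceId"], instance.get("StateTransitionReason", ""))
                )
    return stuck


# Hypothetical sample data; the instance IDs are invented.
sample = {
    "Reservations": [
        {"Instances": [
            {"InstanceId": "i-0aaa", "State": {"Name": "shutting-down"},
             "StateTransitionReason": "Server.InternalError"},
            {"InstanceId": "i-0bbb", "State": {"Name": "terminated"},
             "StateTransitionReason": "User initiated"},
        ]},
    ]
}
result = stuck_shutting_down(sample)
```

Here `result` contains only the first instance, which could then be passed to a manual or scripted termination request.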

Comment 3 Yunfei Jiang 2021-05-07 08:05:43 UTC

Hello Matthew, is there a way to reproduce this issue? I don't recall us hitting it before. I just checked all instances under the QE account, and they are all `Terminated` or `Running`.

Thanks.

Comment 4 Matthew Staebler 2021-05-07 15:56:14 UTC

(In reply to Yunfei Jiang from comment #3)
> Hello Matthew, is there a way to reproduce this issue? I don't recall us
> hitting it before. I just checked all instances under the QE account, and
> they are all `Terminated` or `Running`.
> 
> Thanks.

I unfortunately do not know of a way to reproduce this issue. It is something that happens very rarely due to AWS issues and not something that we control.

Comment 5 Yunfei Jiang 2021-05-11 01:21:39 UTC

Hello Matthew, since this PR merged, have you seen the issue again on your side? If the fix works well, I'm going to set the status to VERIFIED, since the issue is related to the AWS platform and cannot be reproduced on the QE side.

Comment 6 Matthew Staebler 2021-05-11 14:07:57 UTC

(In reply to Yunfei Jiang from comment #5)
> Hello Matthew, since this PR merged, have you seen the issue again on your
> side? If the fix works well, I'm going to set the status to VERIFIED, since
> the issue is related to the AWS platform and cannot be reproduced on the QE
> side.

I only know of 2 cases in the past 6 months where there have been instances stuck shutting down. In both cases, there were multiple instances across multiple clusters, implying a temporary error in AWS itself. I have not seen the issue since the PR merged, but I have no indication whether AWS has had the issue or not since then, unfortunately.

Comment 7 Yunfei Jiang 2021-05-12 01:06:49 UTC

Thanks, Matthew.
Per comment 5 and comment 6, changing status to VERIFIED.

Comment 10 errata-xmlrpc 2021-07-27 23:07:08 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
