Bug 1957951

Summary:	[aws] destroy can get blocked on instances stuck in shutting-down state
Product:	OpenShift Container Platform	Reporter:	Matthew Staebler <mstaeble>
Component:	Installer	Assignee:	Aditya Narayanaswamy <anarayan>
Installer sub component:	openshift-installer	QA Contact:	Yunfei Jiang <yunjiang>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	medium
Priority:	unspecified
Version:	4.8
Target Milestone:	---
Target Release:	4.8.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Previously, some AWS instances became stuck in the shutting-down state and were never terminated, which could block cluster destroy. The installer now requests a fresh termination after 10 minutes to ensure that all instances are destroyed.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-07-27 23:07:08 UTC	Type:	Bug

Description Matthew Staebler 2021-05-06 19:54:49 UTC

We had a dozen or so instances in the CI account that were stuck in shutting-down for days. Eric Paris opened an AWS ticket, but AWS didn't really investigate. Requesting a fresh termination removed the instances. Word from AWS folks is that being stuck for 15m or more is a sign of trouble, so logging a warning and re-terminating any stuck instances at least that often seems reasonable.

Sometimes re-terminating helps and sometimes it doesn't (or maybe those attempts never quite got as far as re-terminating?), but asking for a fresh termination every 15m or so doesn't seem like it would have negative consequences. The log line should definitely complain about AWS not terminating the instance ("consider filing a ticket with AWS support").
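
A rough sketch of the retry decision described above, in Python for illustration only. This is not the installer's code: the `Instance` type, the `instances_to_reterminate` function, and the `tracker` dict are hypothetical names, and the 15-minute default simply follows the interval suggested in this description. The sketch only decides which instances are due for a fresh termination and logs the warning; the caller would still have to issue the real request (for example through EC2 TerminateInstances).

```python
import logging
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, List

logger = logging.getLogger("destroy")

# Assumption: interval taken from the "15m or so" suggested in the description.
RETERMINATE_INTERVAL = timedelta(minutes=15)


@dataclass
class Instance:
    """Hypothetical minimal view of an EC2 instance."""
    instance_id: str
    state: str  # e.g. "running", "shutting-down", "terminated"


def instances_to_reterminate(
    instances: Iterable[Instance],
    tracker: Dict[str, datetime],
    now: datetime,
    interval: timedelta = RETERMINATE_INTERVAL,
) -> List[str]:
    """Return IDs of instances stuck in shutting-down for at least `interval`.

    `tracker` maps an instance ID to the time it was first seen shutting
    down, or to the time of the most recent re-termination decision. It is
    updated in place so the function can be called on every destroy pass.
    """
    due = []
    for inst in instances:
        if inst.state != "shutting-down":
            # Not stuck (any more): forget whatever was recorded for it.
            tracker.pop(inst.instance_id, None)
            continue
        since = tracker.setdefault(inst.instance_id, now)
        if now - since >= interval:
            logger.warning(
                "instance %s has been shutting down for %s; requesting a fresh "
                "termination (consider filing a ticket with AWS support)",
                inst.instance_id,
                now - since,
            )
            due.append(inst.instance_id)
            tracker[inst.instance_id] = now  # restart the clock for the next retry
    return due
```

Called once per pass of the destroy loop with the same `tracker`, this returns nothing for an instance that has just started shutting down, returns its ID once the interval has elapsed, and then waits a full interval again before asking for another retry.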

Clone of https://issues.redhat.com/browse/CORS-1599

Comment 1 Matthew Staebler 2021-05-06 19:56:19 UTC

Recently, there were 6 separate CI clusters in us-west-2 that were all blocked by a shutting-down instance. Each instance had a State Transition Reason of Server.InternalError. Manually terminating the instances resolved the issue.

Comment 3 Yunfei Jiang 2021-05-07 08:05:43 UTC

Hello Matthew, is there a way to reproduce this issue? I don't remember hitting it before. I just searched all instances under the QE account, and they are all `Terminated` or `Running`.

Thanks.

Comment 4 Matthew Staebler 2021-05-07 15:56:14 UTC

(In reply to Yunfei Jiang from comment #3)
> Hello Matthew, is there a way to reproduce this issue? I don't remember
> hitting it before. I just searched all instances under the QE account, and
> they are all `Terminated` or `Running`.
> 
> Thanks.

I unfortunately do not know of a way to reproduce this issue. It is something that happens very rarely due to AWS issues and not something that we control.

Comment 5 Yunfei Jiang 2021-05-11 01:21:39 UTC

Hello Matthew, since this PR merged, have you seen the issue again on your side? If the fix works well, I'm going to set the status to VERIFIED, since the issue is related to the AWS platform and cannot be reproduced on the QE side.

Comment 6 Matthew Staebler 2021-05-11 14:07:57 UTC

(In reply to Yunfei Jiang from comment #5)
> Hello Matthew, since this PR merged, have you seen the issue again on your
> side? If the fix works well, I'm going to set the status to VERIFIED, since
> the issue is related to the AWS platform and cannot be reproduced on the QE
> side.

I only know of 2 cases in the past 6 months where there have been instances stuck shutting down. In both cases, there were multiple instances across multiple clusters, implying a temporary error in AWS itself. I have not seen the issue since the PR merged, but I have no indication whether AWS has had the issue or not since then, unfortunately.

Comment 7 Yunfei Jiang 2021-05-12 01:06:49 UTC

Thanks, Matthew.
Per comment 5 and comment 6, changing status to VERIFIED.

Comment 10 errata-xmlrpc 2021-07-27 23:07:08 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438