We had a dozen or so instances in the CI account that were stuck in the shutting-down state for days. Eric Paris opened an AWS ticket, but they didn't really investigate; requesting a fresh termination removed the instances. Word from AWS folks is that being stuck for 15 minutes or more is a sign of trouble, so logging a warning and re-terminating any such instances at least that often seems reasonable. Sometimes re-terminating helps and sometimes it doesn't (or maybe those cases never quite got as far as re-terminating?), but asking for a fresh termination every 15 minutes or so doesn't seem like it would have negative consequences. The log line should definitely call out that AWS is not terminating the instance ("consider filing a ticket with AWS support"). Clone of https://issues.redhat.com/browse/CORS-1599
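The loop described above (detect instances stuck shutting down for 15+ minutes, log a warning pointing at AWS support, and re-request termination) could be sketched as follows. This is an illustration only, not the actual controller code: the helper names (`is_stuck`, `reterminate_stuck_instances`) and the injected `terminate`/`log` callables are assumptions made so the logic is testable without AWS credentials.

```python
from datetime import datetime, timedelta, timezone

# Threshold from the report: stuck in shutting-down for 15m or more
# is a sign of trouble.
STUCK_THRESHOLD = timedelta(minutes=15)

def is_stuck(state: str, transition_time: datetime,
             now: datetime, threshold: timedelta = STUCK_THRESHOLD) -> bool:
    """Return True when an instance has been shutting down too long."""
    return state == "shutting-down" and (now - transition_time) >= threshold

def reterminate_stuck_instances(instances, terminate, log, now=None):
    """Re-issue termination for anything stuck in shutting-down.

    `instances` is an iterable of (instance_id, state, transition_time)
    tuples; `terminate` and `log` are injected callables, e.g. `terminate`
    might wrap ec2.terminate_instances(InstanceIds=[instance_id]).
    """
    now = now or datetime.now(timezone.utc)
    stuck = [i for i, state, t in instances if is_stuck(state, t, now)]
    for instance_id in stuck:
        # Whine about AWS not terminating, per the report.
        log(f"instance {instance_id} has been shutting down for over "
            f"{STUCK_THRESHOLD}; re-requesting termination "
            "(consider filing a ticket with AWS support)")
        terminate(instance_id)
    return stuck
```

Running this on every reconcile pass gives the "re-terminate every ~15m" cadence: a still-stuck instance crosses the threshold again on each pass, so it keeps getting a fresh termination request and a fresh warning.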
Recently, there were 6 separate CI clusters in us-west-2 that were all blocked by a shutting-down instance. Each instance had a State Transition Reason of Server.InternalError. Manually terminating the instances resolved the issue.
Hello Matthew, is there a way to reproduce this issue? I don't recall us hitting this issue before; I just searched all instances under the QE account, and they are all `Terminated` or `Running`. Thanks.
(In reply to Yunfei Jiang from comment #3)
> Hello Matthew, is there a way to reproduce this issue? I don't remember that
> we met this issue before, I just searched all instances under QE account,
> they are all `Terminated` or `Running`.
>
> Thanks.

I unfortunately do not know of a way to reproduce this issue. It is something that happens very rarely due to AWS issues and not something that we control.
Hello Matthew, since this PR merged, have you seen the issue again on your side? If the fix works well, I'm going to set the status to VERIFIED, since the issue is specific to the AWS platform and cannot be reproduced on the QE side.
(In reply to Yunfei Jiang from comment #5)
> Hello Matthew, after this PR merged, have you met the issue again in your
> side? If this fix works well, I'm going setting status as VERIFIED, since it
> is related to AWS platform, and can not be reproduced on QE side.

I only know of 2 cases in the past 6 months where there have been instances stuck shutting down. In both cases, there were multiple instances across multiple clusters, implying a temporary error in AWS itself. I have not seen the issue since the PR merged, but I have no indication whether AWS has had the issue or not since then, unfortunately.
Thanks, Matthew. Per comment 5 and comment 6, changing status to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438