Bug 1939054 - machine healthcheck kills aws spot instance before generated
Summary: machine healthcheck kills aws spot instance before generated
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.7
Hardware: All
OS: All
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Joel Speed
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-15 14:02 UTC by Alexander Niebuhr
Modified: 2021-10-20 07:16 UTC
CC List: 1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Instances may take some time to boot when an upgrade is performed by the Machine Config Daemon on first boot.
Consequence: The MHC would remove the Machine too early, and this behaviour could not be opted out of.
Fix: Default MHCs no longer remove nodes just because they have not started within the timeout.
Result: MHCs only remove nodes when explicitly requested.
Clone Of:
Environment:
Last Closed: 2021-07-27 22:53:18 UTC
Target Upstream Version:
Embargoed:




Links
- GitHub: openshift/machine-api-operator pull 830 (open) - Bug 1939054: Disable startup timeout for Spot MHC (last updated 2021-03-18 12:45:11 UTC)
- Red Hat Product Errata: RHSA-2021:2438 (last updated 2021-07-27 22:53:42 UTC)

Description Alexander Niebuhr 2021-03-15 14:02:18 UTC
Description of problem:
The machine-api-termination-handler health check, new in 4.7, has a 10m node startup timeout. This seems to kill AWS spot instances before they finish booting, landing the worker nodes in a restart loop.
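
For context, a rough sketch of the MachineHealthCheck in question, reconstructed from this report and the spec shown in comment 13 (the apiVersion and the openshift-machine-api namespace are the usual ones and are assumptions here; before the fix, nodeStartupTimeout was the 10m default rather than "0"):

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: machine-api-termination-handler
  namespace: openshift-machine-api
spec:
  maxUnhealthy: 100%
  # pre-fix value implicated in this bug; the fix sets it to "0" (disabled)
  nodeStartupTimeout: 10m
  selector:
    matchLabels:
      machine.openshift.io/interruptible-instance: ""
  unhealthyConditions:
  - status: "True"
    timeout: 0s
    type: Terminating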

Version-Release number of selected component (if applicable):
4.7


Additional info:
must-gather

https://drive.google.com/file/d/1lS2Eq9LlKuCBieNk1PVR6fVNDT7xoLhL/view?usp=sharing

Comment 1 Alexander Niebuhr 2021-03-15 14:36:53 UTC
Changed the AWS base AMI image from CoreOS 32 to 33; that seems to resolve it. Still waiting on our disaster tests.

Comment 2 Joel Speed 2021-03-16 11:15:54 UTC
Is this a case of the AMI being older and therefore taking a longer period to go from EC2 create to running node?

Comment 3 Alexander Niebuhr 2021-03-16 20:58:03 UTC
Yes, it seems so. With the newest AMI it works. However, we had to switch the AMI in the middle of the cluster upgrade, so it is not fully tested. For now it looks like it is working, but we would have liked to know this beforehand, or even have the AMI updated automatically as part of the upgrade.

Comment 4 Joel Speed 2021-03-17 10:07:23 UTC
The concerning part about this is that if anyone has a similar issue with an older image, plus an MHC that covers normal machines, this would then affect all of their nodes.
I know we are working on updating boot images as part of the upgrade but I'm not sure when that is planned to ship.

Out of interest, did you try overriding `nodeStartupTimeout` with a longer value? IIRC, we don't actually enforce it.
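
(For anyone trying this, a sketch of how such an override might be attempted; the 30m value is an arbitrary example, and as the next comment shows, the operator reverts the field shortly afterwards:)

$ oc patch machinehealthcheck machine-api-termination-handler \
    -n openshift-machine-api --type merge \
    -p '{"spec":{"nodeStartupTimeout":"30m"}}'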

Comment 5 Alexander Niebuhr 2021-03-17 12:32:18 UTC
It was changed back to 10m (the default value) after a few seconds.

Comment 6 Alexander Niebuhr 2021-03-17 12:34:23 UTC
Yeah, we lost all of our worker nodes during the update; we had to enable the master nodes to act as workers as well.

Comment 7 Joel Speed 2021-03-18 13:49:03 UTC
I've created a PR to disable the node startup timeout for this specific MHC; that should prevent this from happening and blocking people in the future.

Comment 8 Alexander Niebuhr 2021-03-19 08:36:32 UTC
Yeah, very much looking forward to automatic AMI updates, since I think that is expected in IPI infrastructure.

Comment 9 Joel Speed 2021-04-12 16:38:27 UTC
I want to discuss this issue with the upstream community so that we agree on the approach before we go ahead; I have created this issue: https://github.com/kubernetes-sigs/cluster-api/issues/4468

Comment 10 Joel Speed 2021-04-19 15:54:29 UTC
We've made some progress on this upstream and will hopefully agree the solution with them soon

Comment 11 Alexander Niebuhr 2021-04-19 20:14:02 UTC
That's great. Looking forward to this getting implemented upstream.

Comment 13 sunzhaohua 2021-05-18 09:33:14 UTC
Verified
clusterversion: 4.8.0-0.nightly-2021-05-17-121817

nodeStartupTimeout is set to "0"; if it is changed to another value, it is changed back to "0". Did not see any instance deleted because of node startup timeout.
$ oc edit mhc machine-api-termination-handler
spec:
  maxUnhealthy: 100%
  nodeStartupTimeout: "0"
  selector:
    matchLabels:
      machine.openshift.io/interruptible-instance: ""
  unhealthyConditions:
  - status: "True"
    timeout: 0s
    type: Terminating
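
(A quick way to confirm the value without opening an editor, assuming the standard openshift-machine-api namespace:)

$ oc get mhc machine-api-termination-handler -n openshift-machine-api \
    -o jsonpath='{.spec.nodeStartupTimeout}'
0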

Comment 16 errata-xmlrpc 2021-07-27 22:53:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

