Bug 1939054
Summary: | machine healthcheck kills aws spot instance before generated | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Alexander Niebuhr <alexander> |
Component: | Cloud Compute | Assignee: | Joel Speed <jspeed> |
Cloud Compute sub component: | Other Providers | QA Contact: | sunzhaohua <zhsun> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | urgent | ||
Priority: | unspecified | CC: | miyadav |
Version: | 4.7 | ||
Target Milestone: | --- | ||
Target Release: | 4.8.0 | ||
Hardware: | All | ||
OS: | All | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause: Instances may take some time to boot when an upgrade is performed by the Machine Config Daemon on first boot.
Consequence: The MHC would remove the Machine too early, and this behaviour could not be opted out of.
Fix: Default MHCs no longer remove nodes just because they have not started within the node startup timeout.
Result: Default MHCs only remove nodes when explicitly requested.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2021-07-27 22:53:18 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
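The fix described in the Doc Text disables the node startup timeout on the default MHC while still allowing users to opt in on their own MachineHealthChecks. A minimal sketch of such a custom MHC with a longer `nodeStartupTimeout` to tolerate slow first boots (the name and the specific values are illustrative, not taken from this bug):

```
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example-workers-mhc        # hypothetical name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  maxUnhealthy: 40%
  nodeStartupTimeout: 20m          # allow extra time for first-boot upgrades on older AMIs
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: 300s
```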
Description
Alexander Niebuhr
2021-03-15 14:02:18 UTC
Changed the AWS base AMI image from CoreOS 32 to 33; that seems to resolve it. Still waiting for our disaster tests.

Is this a case of the AMI being older and therefore taking a longer period to go from EC2 create to running node?

Yes, it seems so. With the newest AMI it works. However, we had to switch that AMI in the middle of the cluster upgrade, so it is not fully tested. For now it looks to be working, but we wish we had known this beforehand, or that the AMI were updated automatically with the upgrade.

The concerning part about this is that if anyone has a similar issue with an older image, plus an MHC that covers normal machines, this would then affect all of their nodes. I know we are working on updating boot images as part of the upgrade, but I'm not sure when that is planned to ship. Out of interest, did you try overriding `nodeStartupTimeout` with a longer value? IIRC, we don't actually enforce it.

It was changed back to 10m (the default value) after a few seconds. We lost all of our worker nodes during the update and had to make the master nodes act as workers as well.

I've created a PR to disable the node startup timeout for this specific MHC; that should prevent this from happening and blocking people in the future.

We are very much looking forward to automatic AMI updates, since that is expected in IPI infrastructure.

I want to discuss this issue with the upstream community so that we agree on the approach before we go ahead. I have created this issue: https://github.com/kubernetes-sigs/cluster-api/issues/4468

We've made some progress on this upstream and will hopefully agree on the solution with them soon.

That's great. Looking forward to this being implemented upstream.

Verified. clusterversion: 4.8.0-0.nightly-2021-05-17-121817. nodeStartupTimeout is set to "0"; if it is changed to other values, it is reverted back to "0". Did not see the instance being deleted because of node startup timeout.
$ oc edit mhc machine-api-termination-handler
spec:
  maxUnhealthy: 100%
  nodeStartupTimeout: "0"
  selector:
    matchLabels:
      machine.openshift.io/interruptible-instance: ""
  unhealthyConditions:
  - status: "True"
    timeout: 0s
    type: Terminating

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438