Bug 1840581 - New machine is deleted after provisioning
Summary: New machine is deleted after provisioning
Keywords:
Status: CLOSED DUPLICATE of bug 1868104
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Beth White
QA Contact: Raviv Bar-Tal
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-05-27 09:17 UTC by vsibirsk
Modified: 2020-09-03 12:42 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-03 12:42:01 UTC
Target Upstream Version:
Embargoed:


Attachments
machinehealthcheck log (277.11 KB, text/plain)
2020-05-27 09:17 UTC, vsibirsk

Description vsibirsk 2020-05-27 09:17:24 UTC
Created attachment 1692609 [details]
machinehealthcheck log

Description of problem:
A new machine is deleted every 10 minutes because the new node is not yet associated with it (node installation takes more than 10 minutes).

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Configure an MHC object
2. "Kill" one of the nodes (stop the kubelet service)

Actual results:
A new machine is provisioned, but it is deleted after 10 minutes because the new node is not yet installed (and this repeats in a loop).

Expected results:
The machine waits for the node to be installed.

Additional info:

Comment 1 Stephen Benjamin 2020-05-27 11:26:37 UTC
From the e-mail thread:

> I found the place it needs to be configured:
> https://github.com/openshift/machine-api-operator/blob/master/pkg/apis/machine/v1beta1/machinehealthcheck_types.go#L70

Comment 4 Zane Bitter 2020-08-24 22:15:47 UTC
The Machine phases go from Provisioning->Provisioned->Running, and the timeout applies to each phase (*not* the aggregate time since the Machine was created, as intimated in the code comment). So if the host spends more than the timeout time in either Provisioning or Provisioned, then we will attempt to remediate it. The timeout will likely need to be longer than on any other platform, simply because provisioning baremetal tends to take a long time - more than 10 minutes is routine.

However, there is an additional thing that could be addressed: currently the CAPBM transitions the Machine to the Provisioned phase as soon as it has selected a Host to deploy to. So the Provisioning phase will always be very short, while the Provisioned phase will be very very long since it encompasses the time all the way from when a Host is selected to when a kubelet is up and running on it and the Node has been linked to the Host.

So one thing the CAPBM should do is remain in the Provisioning state until the baremetal-operator decides that the Host is provisioned, and only then move to Provisioned. That would mean the timeout would not need to be increased by as much. I am planning to implement this anyway as part of fixes for bug 1868104.

Comment 5 Michael Gugino 2020-08-28 16:00:34 UTC
(In reply to Zane Bitter from comment #4)
> The Machine phases go from Provisioning->Provisioned->Running, and the
> timeout applies to each phase (*not* the aggregate time since the Machine
> was created, as intimated in the code comment). So if the host spends more
> than the timeout time in either Provisioning or Provisioned, then we will
> attempt to remediate it. The timeout will likely need to be longer than on
> any other platform, simply because provisioning baremetal tends to take a
> long time - more than 10 minutes is routine.
> 
> However, there is an additional thing that could be addressed: currently the
> CAPBM transitions the Machine to the Provisioned phase as soon as it has
> selected a Host to deploy to. So the Provisioning phase will always be very
> short, while the Provisioned phase will be very very long since it
> encompasses the time all the way from when a Host is selected to when a
> kubelet is up and running on it and the Node has been linked to the Host.
> 
> So one thing the CAPBM should do is remain in the Provisioning state until
> the baremetal-operator decides that the Host is provisioned, and only then
> move to Provisioned. That would mean the timeout would not need to be
> increased by as much. I am planning to implement this anyway as part of
> fixes for bug 1868104.

Provisioned is set by the machine-controller, not the actuator. That setting is based on having an instance ID and networking information.

I suggest setting the MHC timeout to 55 minutes; 10 minutes is far too low for bare metal.

Comment 6 Zane Bitter 2020-08-28 16:18:55 UTC
(In reply to Michael Gugino from comment #5)
> Provisioned is set by the machine-controller, not the actuator.

Yes, it would have been more accurate to say that the Provisioner induces the machine controller to change the state.

> That
> setting is based on having an instance ID, and networking information.

It's based on Exists() returning true AND either having an instance ID OR networking information. Currently the CAPBM provides all of these things as soon as it has selected a Host to provision.
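
That condition can be sketched as follows (a hypothetical function, for illustration; the real logic lives in the machine controller):

```go
package main

import "fmt"

// isProvisioned sketches the condition described above: the Machine moves
// to Provisioned once the actuator's Exists() returns true AND it has
// either a provider/instance ID or network addresses.
func isProvisioned(exists bool, instanceID string, addresses []string) bool {
	return exists && (instanceID != "" || len(addresses) > 0)
}

func main() {
	// CAPBM supplies all of these as soon as a Host is selected, so the
	// Machine becomes Provisioned almost immediately:
	fmt.Println(isProvisioned(true, "baremetalhost-0", nil)) // true
	// Without an instance ID or addresses, the Machine stays put:
	fmt.Println(isProvisioned(true, "", nil)) // false
}
```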

Comment 8 Beth White 2020-09-03 12:42:01 UTC
Because the workaround fixes the issue, and in light of Zane's comment:

"one thing the CAPBM should do is remain in the Provisioning state until the baremetal-operator decides that the Host is provisioned, and only then move to Provisioned. That would mean the timeout would not need to be increased by as much. I am planning to implement this anyway as part of fixes for bug 1868104."

I am marking this as a duplicate of bug 1868104, since the fixes for that bug will also fix this one.

*** This bug has been marked as a duplicate of bug 1868104 ***
