Bug 1840581

Summary: New machine is deleted after provisioning
Product: OpenShift Container Platform
Component: Cloud Compute
Sub component: BareMetal Provider
Reporter: vsibirsk
Assignee: Beth White <beth.white>
QA Contact: Raviv Bar-Tal <rbartal>
Status: CLOSED DUPLICATE
Severity: high
Priority: medium
CC: beth.white, danken, dhellmann, ipinto, mgugino, rgarcia, stbenjam, zbitter
Version: 4.5
Keywords: TestBlockerForLayeredProduct, Triaged
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-09-03 12:42:01 UTC
Type: Bug
Attachments: machinehealthcheck log

Description vsibirsk 2020-05-27 09:17:24 UTC
Created attachment 1692609 [details]
machinehealthcheck log

Description of problem:
The new Machine is deleted every 10 minutes because the new Node is not yet associated with it (node installation takes more than 10 minutes).

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Configure an MHC (MachineHealthCheck) object.
2. "Kill" one of the nodes (stop the kubelet service).

Actual results:
A new Machine is provisioned but is deleted after 10 minutes because the new Node is not yet installed, and this repeats in a loop.

Expected results:
The Machine waits for the Node to be installed.

Additional info:

Comment 1 Stephen Benjamin 2020-05-27 11:26:37 UTC
From the e-mail thread:

> I found the place it needs to be configured:
> https://github.com/openshift/machine-api-operator/blob/master/pkg/apis/machine/v1beta1/machinehealthcheck_types.go#L70

Comment 4 Zane Bitter 2020-08-24 22:15:47 UTC
The Machine phases go from Provisioning->Provisioned->Running, and the timeout applies to each phase (*not* the aggregate time since the Machine was created, as intimated in the code comment). So if the host spends more than the timeout time in either Provisioning or Provisioned, then we will attempt to remediate it. The timeout will likely need to be longer than on any other platform, simply because provisioning baremetal tends to take a long time - more than 10 minutes is routine.

However, there is an additional thing that could be addressed: currently the CAPBM transitions the Machine to the Provisioned phase as soon as it has selected a Host to deploy to. So the Provisioning phase will always be very short, while the Provisioned phase will be very very long since it encompasses the time all the way from when a Host is selected to when a kubelet is up and running on it and the Node has been linked to the Host.

So one thing the CAPBM should do is remain in the Provisioning state until the baremetal-operator decides that the Host is provisioned, and only then move to Provisioned. That would mean the timeout would not need to be increased by as much. I am planning to implement this anyway as part of fixes for bug 1868104.

Comment 5 Michael Gugino 2020-08-28 16:00:34 UTC
(In reply to Zane Bitter from comment #4)
> The Machine phases go from Provisioning->Provisioned->Running, and the
> timeout applies to each phase (*not* the aggregate time since the Machine
> was created, as intimated in the code comment). So if the host spends more
> than the timeout time in either Provisioning or Provisioned, then we will
> attempt to remediate it. The timeout will likely need to be longer than on
> any other platform, simply because provisioning baremetal tends to take a
> long time - more than 10 minutes is routine.
> 
> However, there is an additional thing that could be addressed: currently the
> CAPBM transitions the Machine to the Provisioned phase as soon as it has
> selected a Host to deploy to. So the Provisioning phase will always be very
> short, while the Provisioned phase will be very very long since it
> encompasses the time all the way from when a Host is selected to when a
> kubelet is up and running on it and the Node has been linked to the Host.
> 
> So one thing the CAPBM should do is remain in the Provisioning state until
> the baremetal-operator decides that the Host is provisioned, and only then
> move to Provisioned. That would mean the timeout would not need to be
> increased by as much. I am planning to implement this anyway as part of
> fixes for bug 1868104.

Provisioned is set by the machine-controller, not the actuator.  That setting is based on having an instance ID, and networking information.

I suggest setting MHC to 55 minutes, 10 minutes is way too low for BM.
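The suggested workaround would look roughly like this in the MachineHealthCheck manifest (a sketch only; the selector labels and unhealthy-condition values are illustrative, with the timeout set via the `nodeStartupTimeout` field of the machine.openshift.io/v1beta1 API):

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example-mhc
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: 300s
  # Allow baremetal provisioning to take up to 55 minutes before the
  # Machine is considered failed and remediated.
  nodeStartupTimeout: 55m
```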

Comment 6 Zane Bitter 2020-08-28 16:18:55 UTC
(In reply to Michael Gugino from comment #5)
> Provisioned is set by the machine-controller, not the actuator.

Yes, it would have been more accurate to say that the Provisioner induces the machine controller to change the state.

> That
> setting is based on having an instance ID, and networking information.

It's based on Exists() returning true AND either having an instance ID OR networking information. Currently the CAPBM provides all of these things as soon as it has selected a Host to provision.
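That condition can be sketched as a simple predicate. This is a simplification with illustrative names, not the actual machine-controller code:

```go
package main

import "fmt"

// machineIsProvisioned mirrors the condition described above: Exists()
// must return true, AND the Machine must have either an instance ID OR
// networking information. Names here are illustrative.
func machineIsProvisioned(exists bool, instanceID *string, addresses []string) bool {
	return exists && (instanceID != nil || len(addresses) > 0)
}

func main() {
	id := "example-host-0" // hypothetical provider ID
	// CAPBM supplies all of these as soon as a Host is selected, so the
	// Machine moves to Provisioned almost immediately.
	fmt.Println(machineIsProvisioned(true, &id, nil)) // true
	// With neither an instance ID nor addresses, it stays in Provisioning.
	fmt.Println(machineIsProvisioned(true, nil, nil)) // false
}
```

Withholding the instance ID and addresses until the baremetal-operator reports the Host as provisioned is what would keep the Machine in the Provisioning phase longer.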

Comment 8 Beth White 2020-09-03 12:42:01 UTC
Because the workaround fixes the issue, and in light of Zane's comment:

"one thing the CAPBM should do is remain in the Provisioning state until the baremetal-operator decides that the Host is provisioned, and only then move to Provisioned. That would mean the timeout would not need to be increased by as much. I am planning to implement this anyway as part of fixes for bug 1868104."

I am closing this as a duplicate of bug 1868104, since the fixes for that bug will also fix this one.

*** This bug has been marked as a duplicate of bug 1868104 ***