Bug 1840581

Summary: New machine is deleted after provisioning
Product: OpenShift Container Platform
Component: Cloud Compute
Sub component: BareMetal Provider
Reporter: vsibirsk
Assignee: Beth White <beth.white>
QA Contact: Raviv Bar-Tal <rbartal>
Status: CLOSED DUPLICATE
Severity: high
Priority: medium
CC: beth.white, danken, dhellmann, ipinto, mgugino, rgarcia, stbenjam, zbitter
Version: 4.5
Keywords: TestBlockerForLayeredProduct, Triaged
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-09-03 12:42:01 UTC
Type: Bug
Attachments: machinehealthcheck log

Description vsibirsk 2020-05-27 09:17:24 UTC
Created attachment 1692609 [details]
machinehealthcheck log

Description of problem:
The new Machine is deleted every 10 minutes because the new Node is not yet associated with it (node installation takes more than 10 minutes).

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Configure an MHC (MachineHealthCheck) object.
2. "Kill" one of the nodes (stop the kubelet service).

Actual results:
A new Machine is provisioned but is deleted after 10 minutes because the new Node is not yet installed, and this repeats in a loop.

Expected results:
The Machine waits for the Node to be installed.

Additional info:

Comment 1 Stephen Benjamin 2020-05-27 11:26:37 UTC
From the e-mail thread:

> I found the place it needs to be configured:
> https://github.com/openshift/machine-api-operator/blob/master/pkg/apis/machine/v1beta1/machinehealthcheck_types.go#L70

Comment 4 Zane Bitter 2020-08-24 22:15:47 UTC
The Machine phases go from Provisioning->Provisioned->Running, and the timeout applies to each phase (*not* the aggregate time since the Machine was created, as intimated in the code comment). So if the host spends more than the timeout time in either Provisioning or Provisioned, then we will attempt to remediate it. The timeout will likely need to be longer than on any other platform, simply because provisioning baremetal tends to take a long time - more than 10 minutes is routine.

However, there is an additional thing that could be addressed: currently the CAPBM transitions the Machine to the Provisioned phase as soon as it has selected a Host to deploy to. So the Provisioning phase will always be very short, while the Provisioned phase will be very very long since it encompasses the time all the way from when a Host is selected to when a kubelet is up and running on it and the Node has been linked to the Host.

So one thing the CAPBM should do is remain in the Provisioning state until the baremetal-operator decides that the Host is provisioned, and only then move to Provisioned. That would mean the timeout would not need to be increased by as much. I am planning to implement this anyway as part of fixes for bug 1868104.

Comment 5 Michael Gugino 2020-08-28 16:00:34 UTC
(In reply to Zane Bitter from comment #4)
> The Machine phases go from Provisioning->Provisioned->Running, and the
> timeout applies to each phase (*not* the aggregate time since the Machine
> was created, as intimated in the code comment). So if the host spends more
> than the timeout time in either Provisioning or Provisioned, then we will
> attempt to remediate it. The timeout will likely need to be longer than on
> any other platform, simply because provisioning baremetal tends to take a
> long time - more than 10 minutes is routine.
> 
> However, there is an additional thing that could be addressed: currently the
> CAPBM transitions the Machine to the Provisioned phase as soon as it has
> selected a Host to deploy to. So the Provisioning phase will always be very
> short, while the Provisioned phase will be very very long since it
> encompasses the time all the way from when a Host is selected to when a
> kubelet is up and running on it and the Node has been linked to the Host.
> 
> So one thing the CAPBM should do is remain in the Provisioning state until
> the baremetal-operator decides that the Host is provisioned, and only then
> move to Provisioned. That would mean the timeout would not need to be
> increased by as much. I am planning to implement this anyway as part of
> fixes for bug 1868104.

Provisioned is set by the machine-controller, not the actuator.  That setting is based on having an instance ID, and networking information.

I suggest setting MHC to 55 minutes, 10 minutes is way too low for BM.
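The suggested workaround would look roughly like this in the MachineHealthCheck manifest (a sketch only; the selector labels and unhealthy-condition values are illustrative, with the timeout set via the `nodeStartupTimeout` field of the machine.openshift.io/v1beta1 API):

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example-mhc
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: 300s
  # Allow baremetal provisioning to take up to 55 minutes before the
  # Machine is considered failed and remediated.
  nodeStartupTimeout: 55m
```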

Comment 6 Zane Bitter 2020-08-28 16:18:55 UTC
(In reply to Michael Gugino from comment #5)
> Provisioned is set by the machine-controller, not the actuator.

Yes, it would have been more accurate to say that the Provisioner induces the machine controller to change the state.

> That
> setting is based on having an instance ID, and networking information.

It's based on Exists() returning true AND either having an instance ID OR networking information. Currently the CAPBM provides all of these things as soon as it has selected a Host to provision.
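That condition can be sketched as a simple predicate. This is a simplification with illustrative names, not the actual machine-controller code:

```go
package main

import "fmt"

// machineIsProvisioned mirrors the condition described above: Exists()
// must return true, AND the Machine must have either an instance ID OR
// networking information. Names here are illustrative.
func machineIsProvisioned(exists bool, instanceID *string, addresses []string) bool {
	return exists && (instanceID != nil || len(addresses) > 0)
}

func main() {
	id := "example-host-0" // hypothetical provider ID
	// CAPBM supplies all of these as soon as a Host is selected, so the
	// Machine moves to Provisioned almost immediately.
	fmt.Println(machineIsProvisioned(true, &id, nil)) // true
	// With neither an instance ID nor addresses, it stays in Provisioning.
	fmt.Println(machineIsProvisioned(true, nil, nil)) // false
}
```

Withholding the instance ID and addresses until the baremetal-operator reports the Host as provisioned is what would keep the Machine in the Provisioning phase longer.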

Comment 8 Beth White 2020-09-03 12:42:01 UTC
Because the workaround fixes the issue, and in light of Zane's comment:

"one thing the CAPBM should do is remain in the Provisioning state until the baremetal-operator decides that the Host is provisioned, and only then move to Provisioned. That would mean the timeout would not need to be increased by as much. I am planning to implement this anyway as part of fixes for bug 1868104."

I am closing this as a duplicate of bug 1868104, since the fixes for that bug will also fix this one.

*** This bug has been marked as a duplicate of bug 1868104 ***