Bug 1840581
Summary: | New machine is deleted after provisioning
---|---
Product: | OpenShift Container Platform
Component: | Cloud Compute
Cloud Compute sub component: | BareMetal Provider
Version: | 4.5
Status: | CLOSED DUPLICATE
Severity: | high
Priority: | medium
Target Release: | 4.6.0
Hardware: | Unspecified
OS: | Unspecified
Type: | Bug
Keywords: | TestBlockerForLayeredProduct, Triaged
Reporter: | vsibirsk
Assignee: | Beth White <beth.white>
QA Contact: | Raviv Bar-Tal <rbartal>
CC: | beth.white, danken, dhellmann, ipinto, mgugino, rgarcia, stbenjam, zbitter
Last Closed: | 2020-09-03 12:42:01 UTC
From the e-mail thread:

> I found the place it needs to be configured: https://github.com/openshift/machine-api-operator/blob/master/pkg/apis/machine/v1beta1/machinehealthcheck_types.go#L70

The Machine phases go from Provisioning->Provisioned->Running, and the timeout applies to each phase (*not* the aggregate time since the Machine was created, as intimated in the code comment). So if the host spends more than the timeout time in either Provisioning or Provisioned, then we will attempt to remediate it. The timeout will likely need to be longer than on any other platform, simply because provisioning baremetal tends to take a long time - more than 10 minutes is routine.

However, there is an additional thing that could be addressed: currently the CAPBM transitions the Machine to the Provisioned phase as soon as it has selected a Host to deploy to. So the Provisioning phase will always be very short, while the Provisioned phase will be very very long since it encompasses the time all the way from when a Host is selected to when a kubelet is up and running on it and the Node has been linked to the Host.

So one thing the CAPBM should do is remain in the Provisioning state until the baremetal-operator decides that the Host is provisioned, and only then move to Provisioned. That would mean the timeout would not need to be increased by as much. I am planning to implement this anyway as part of fixes for bug 1868104.

(In reply to Zane Bitter from comment #4)
> The Machine phases go from Provisioning->Provisioned->Running, and the timeout applies to each phase (*not* the aggregate time since the Machine was created, as intimated in the code comment). So if the host spends more than the timeout time in either Provisioning or Provisioned, then we will attempt to remediate it. The timeout will likely need to be longer than on any other platform, simply because provisioning baremetal tends to take a long time - more than 10 minutes is routine.
>
> However, there is an additional thing that could be addressed: currently the CAPBM transitions the Machine to the Provisioned phase as soon as it has selected a Host to deploy to. So the Provisioning phase will always be very short, while the Provisioned phase will be very very long since it encompasses the time all the way from when a Host is selected to when a kubelet is up and running on it and the Node has been linked to the Host.
>
> So one thing the CAPBM should do is remain in the Provisioning state until the baremetal-operator decides that the Host is provisioned, and only then move to Provisioned. That would mean the timeout would not need to be increased by as much. I am planning to implement this anyway as part of fixes for bug 1868104.

Provisioned is set by the machine-controller, not the actuator. That setting is based on having an instance ID, and networking information. I suggest setting MHC to 55 minutes, 10 minutes is way too low for BM.

(In reply to Michael Gugino from comment #5)
> Provisioned is set by the machine-controller, not the actuator.

Yes, it would have been more accurate to say that the Provisioner induces the machine controller to change the state.

> That setting is based on having an instance ID, and networking information.

It's based on Exists() returning true AND either having an instance ID OR networking information. Currently the CAPBM provides all of these things as soon as it has selected a Host to provision.

Due to the workaround fixing the issue and Zane's comment:

"one thing the CAPBM should do is remain in the Provisioning state until the baremetal-operator decides that the Host is provisioned, and only then move to Provisioned. That would mean the timeout would not need to be increased by as much. I am planning to implement this anyway as part of fixes for bug 1868104."

I am marking this as closed duplicate of 1868104 since the fixes for that bug will also fix this one.
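
For readers following the thread, the phase rule and the per-phase timeout discussed above can be sketched roughly as follows. This is a minimal illustration in Python, not the machine-api-operator code (which is written in Go). `MachineView`, `phase` and `needs_remediation` are hypothetical names, and treating "a Node has been linked" as the only condition for Running is a simplifying assumption.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional, Tuple


@dataclass
class MachineView:
    """Hypothetical, simplified view of a Machine; not the real API type."""
    exists: bool                      # what the actuator's Exists() reported
    instance_id: Optional[str]        # instance ID, if one has been set
    addresses: Tuple[str, ...] = ()   # networking information, if any
    has_node: bool = False            # assumption: a linked Node means Running


def phase(m: MachineView) -> str:
    """Phase rule as described in the thread.

    Provisioned requires Exists() to be true AND either an instance ID
    OR networking information.
    """
    if m.has_node:
        return "Running"
    if m.exists and (m.instance_id or m.addresses):
        return "Provisioned"
    return "Provisioning"


def needs_remediation(current_phase: str,
                      time_in_phase: timedelta,
                      timeout: timedelta) -> bool:
    """Per-phase check: the clock is the time spent in the current phase,
    not the aggregate time since the Machine was created."""
    return current_phase != "Running" and time_in_phase > timeout


if __name__ == "__main__":
    # A Host has just been selected: Exists() is true and an ID is set,
    # so the Machine is already Provisioned long before a Node appears.
    m = MachineView(exists=True, instance_id="host-0")  # hypothetical ID
    p = phase(m)
    print(p, needs_remediation(p, timedelta(minutes=11), timedelta(minutes=10)))
```

Under these assumptions, a Machine whose Host was selected 11 minutes ago is in Provisioned and exceeds a 10 minute timeout, which matches the behaviour reported in this bug.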
*** This bug has been marked as a duplicate of bug 1868104 ***
Created attachment 1692609
machinehealthcheck log

Description of problem:
A new machine is deleted every 10 minutes because the new node is not yet associated with it (node installation takes more than 10 minutes).

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Configure an MHC object
2. "Kill" one of the nodes (stop the kubelet service)

Actual results:
A new machine is provisioned, but it is deleted after 10 minutes because the new node is not yet installed (and this repeats in a loop).

Expected results:
The machine waits for the node to be installed.

Additional info:
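
As a rough illustration of step 1 and of the 55 minute workaround suggested in the comments, the sketch below builds a MachineHealthCheck manifest as a plain Python dict and compares two timeout values against an install time. It assumes the setting referred to in the thread is the `nodeStartupTimeout` field of the MachineHealthCheck spec; the object name, the selector label, the condition timeouts and the 25 minute install time are placeholders, not values from this report.

```python
import json
from datetime import timedelta


def mhc_manifest(name: str, node_startup_timeout: str) -> dict:
    """Build a MachineHealthCheck manifest as a plain dict.

    The selector label and condition timeouts are placeholders.
    """
    return {
        "apiVersion": "machine.openshift.io/v1beta1",
        "kind": "MachineHealthCheck",
        "metadata": {"name": name, "namespace": "openshift-machine-api"},
        "spec": {
            # Hypothetical label; use the labels of the target Machines.
            "selector": {"matchLabels": {"example.com/role": "worker"}},
            "unhealthyConditions": [
                {"type": "Ready", "status": "False", "timeout": "300s"},
                {"type": "Ready", "status": "Unknown", "timeout": "300s"},
            ],
            # Assumed to be the field discussed in the thread.
            "nodeStartupTimeout": node_startup_timeout,
        },
    }


def minutes(value: str) -> timedelta:
    """Parse the simple '<N>m' form used here; not a general duration parser."""
    if not value.endswith("m") or not value[:-1].isdigit():
        raise ValueError(f"expected '<N>m', got {value!r}")
    return timedelta(minutes=int(value[:-1]))


def deleted_before_node(install_time: timedelta, manifest: dict) -> bool:
    """True if the Machine would be remediated before its Node appears."""
    return install_time > minutes(manifest["spec"]["nodeStartupTimeout"])


if __name__ == "__main__":
    install = timedelta(minutes=25)  # hypothetical bare-metal install time
    for timeout in ("10m", "55m"):
        manifest = mhc_manifest("example-mhc", timeout)
        print(timeout, "deleted before Node appears:",
              deleted_before_node(install, manifest))
    print(json.dumps(mhc_manifest("example-mhc", "55m"), indent=2))
```

With the shorter value, any install that takes longer than the timeout is remediated and the replacement hits the same limit, which is the loop described under "Actual results".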