Bug 1845137
| Summary: | Scaling down of machineset fails to delete machine properly | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Lubov <lshilin> |
| Component: | Cloud Compute | Assignee: | Doug Hellmann <dhellmann> |
| Cloud Compute sub component: | BareMetal Provider | QA Contact: | Lubov <lshilin> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | medium | CC: | agurenko, calfonso, mgugino, stbenjam, yprokule |
| Version: | 4.5 | Keywords: | AutomationBlocker, Regression, TestBlocker, Triaged |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-07-27 19:57:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |

Description
Lubov
2020-06-08 14:17:51 UTC
Created attachment 1696125 [details]
machine-api-operator container log
Created attachment 1696126 [details]
monitoring operator log
Created attachment 1696127 [details]
machine controller log
Created attachment 1696128 [details]
machineset controller log

Re-checked with Server Version: 4.5.0-0.nightly-2020-06-09-223121.
Now draining succeeded, but the node is still not deleted.
Attaching updated machine-controller.log.

Created attachment 1696487 [details]
updated machine controller log

From machine-controller: "can't proceed deleting machine while cloud instance is being terminated, requeuing"

What this means is the bare metal provider is indicating that the instance/server still exists according to the bare metal operator. So, either the instance is never being successfully deleted by the bare metal operator, or if the BMO is trying to re-use an instance name, the replacement already exists. I have a feeling this might be the case due to the original case comment: "If we try to scale up the machineset, the machine in Deleting state is still listed, new machine shown in provisioned state".

We can also see from the machine-controller logs that the node for this associated machine is unreachable, which indicates it was stopped at some point or otherwise isolated from the cluster (this does not happen during a normal drain; most likely this was caused by the BMO).

Since the 'new machine' was stuck in provisioned state in the original case comment, it gives me the indication that names are being reused in the infrastructure, and that's going to cause these problems.

One way the BMO could handle this is to keep track of instance/machine ID mappings. For example, when the BM provider queries the BMO for 'machine-1', it should associate that with some actual instance, 'instance-a'. The BMO needs to keep track of the lifecycle of 'instance-a' and ensure it's not associated with any new machine until it's completely deprovisioned. When the BM actuator inquires about 'machine-1' again, it should know that 'instance-a' is no longer associated with 'machine-1' and return 404. This would allow 'machine-2' to be associated with 'instance-a' internally to the BMO, so these types of races/conflicts can't happen.

Anyway, that's what I *think* is possibly happening; there's not really enough information in this BZ to confirm.

I'm reasonably confident that this bug is the same as 1855823. The fix for that has already merged. If I am missing some detail about why they are not the same, please let me know.

*** This bug has been marked as a duplicate of bug 1855823 ***
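
For readers following the instance/machine ID mapping idea proposed earlier in this thread, here is a minimal Python sketch of that bookkeeping. It is only an illustration of the suggestion, not how the bare metal operator or the provider actually works: `InstanceRegistry`, `InstanceState`, `MachineNotFound`, and all method names are hypothetical, and the `MachineNotFound` exception merely stands in for the 404 response mentioned in the comment.

```python
from enum import Enum


class InstanceState(Enum):
    AVAILABLE = "available"
    PROVISIONED = "provisioned"
    DEPROVISIONING = "deprovisioning"


class MachineNotFound(Exception):
    """Hypothetical stand-in for the 404 the comment suggests returning."""


class InstanceRegistry:
    """Hypothetical tracker of which instance backs which machine."""

    def __init__(self, instances):
        # Every known instance starts out free to be claimed.
        self._state = {name: InstanceState.AVAILABLE for name in instances}
        self._owner = {}  # instance name -> machine name

    def claim(self, machine):
        """Bind the first fully available instance to `machine`.

        Returns the instance name, or None if nothing is available yet.
        """
        for instance, state in self._state.items():
            if state is InstanceState.AVAILABLE:
                self._state[instance] = InstanceState.PROVISIONED
                self._owner[instance] = machine
                return instance
        return None

    def lookup(self, machine):
        """Return the instance backing `machine`, or raise MachineNotFound."""
        for instance, owner in self._owner.items():
            if owner == machine:
                return instance
        raise MachineNotFound(machine)

    def release(self, machine):
        """Drop the machine mapping at once and start deprovisioning."""
        instance = self.lookup(machine)
        del self._owner[instance]
        self._state[instance] = InstanceState.DEPROVISIONING
        return instance

    def finish_deprovisioning(self, instance):
        """Mark an instance reusable once it is completely deprovisioned."""
        if self._state[instance] is not InstanceState.DEPROVISIONING:
            raise ValueError(f"{instance} is not deprovisioning")
        self._state[instance] = InstanceState.AVAILABLE
```

In this sketch, releasing 'machine-1' removes its mapping immediately, so a later lookup for 'machine-1' fails with the not-found error and the delete can complete, while 'instance-a' cannot be claimed by 'machine-2' until `finish_deprovisioning` has been called for it.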