Bug 1713061
| Summary: | Missing node prevents machine from being deleted | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Erik M Jacobs <ejacobs> |
| Component: | Cloud Compute | Assignee: | Alberto <agarcial> |
| Cloud Compute sub component: | Other Providers | QA Contact: | Jianwei Hou <jhou> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | unspecified | | |
| Priority: | high | CC: | agarcial, jchaloup, mgugino |
| Version: | 4.1.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.1.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1713105 (view as bug list) | Environment: | |
| Last Closed: | 2020-04-29 15:45:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1713105 | | |
| Bug Blocks: | | | |
| Attachments: | machine-controller log output (attachment 1572147) | | |
I have investigated this. We're failing to retrieve the node from the nodeRef specified on the machine object. This happens either because the machine controller already deleted the node and then failed to update that annotation for some reason, or because an admin removed the node manually before attempting to scale. Either way, this is definitely a bug and is not easily correctable by the end user. I will get a patch out for master and pick it to 4.1.

Added a reference to the 4.1 known-issue tracker: https://github.com/openshift/openshift-docs/issues/12487

Workaround: for a machine stuck in this state, after confirming the node is actually absent from the cluster, add the following annotation to the machine's metadata: `machine.openshift.io/exclude-node-draining`

PR opened against openshift/cluster-api for 4.1: https://github.com/openshift/cluster-api/pull/44. After this merges, we'll need to re-vendor the change into the AWS and libvirt actuators.

PR merged in cluster-api; we still need to vendor the change into the AWS provider.
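A minimal sketch of applying that workaround, assuming the machines live in the standard `openshift-machine-api` namespace; the machine name below is one of the stuck `ssd-1a` machines from the report, and the empty annotation value is an assumption (the controller is only expected to check for the key's presence):

```shell
# Assumed example: adjust the machine name and namespace for your cluster.
# The annotation key is the one named in the workaround above; the value is
# believed to be irrelevant, only the key's presence matters.
oc annotate machine ssd-1a-5l9fh \
  machine.openshift.io/exclude-node-draining="" \
  -n openshift-machine-api
```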
Created attachment 1572147 [details]
machine-controller log output

Description of problem:
A machineset was scaled up and then scaled down. The nodes disappeared, but the machine objects remain.

Version-Release number of selected component (if applicable):
4.1.0-rc.4 U3r2LdrhT-A=

Additional info:

| NAME | INSTANCE | STATE | TYPE | REGION | ZONE | AGE |
|---|---|---|---|---|---|---|
| cluster-4e40-c7df5-master-0 | i-087186746072193f0 | running | m4.xlarge | us-east-2 | us-east-2a | 24h |
| cluster-4e40-c7df5-master-1 | i-0eafe7e9e69f6aaec | running | m4.xlarge | us-east-2 | us-east-2b | 24h |
| cluster-4e40-c7df5-master-2 | i-03c13bba692694646 | running | m4.xlarge | us-east-2 | us-east-2c | 24h |
| infranode-us-east-2a-t7xwt | i-0c6ce0f9d57708d22 | running | m4.large | us-east-2 | us-east-2a | 173m |
| infranode-us-east-2a-z9nfh | i-0c3f83d4c9003f5d0 | running | m4.large | us-east-2 | us-east-2a | 3h39m |
| nossd-1a-dczcf | i-00a207dab2c9e970d | running | m4.large | us-east-2 | us-east-2a | 3h57m |
| ssd-1a-5l9fh | i-090acc4f9598a37f3 | running | m4.large | us-east-2 | us-east-2a | 121m |
| ssd-1a-7cvrr | i-0ccca476b234fc1da | running | m4.large | us-east-2 | us-east-2a | 69m |
| ssd-1a-q52pv | i-0e9e6d01af5ca727a | running | m4.large | us-east-2 | us-east-2a | 121m |
| ssd-1a-q6hr9 | i-08f4a48151276ce90 | running | m4.large | us-east-2 | us-east-2a | 121m |
| ssd-1a-sfhdm | i-03eec775cb1ce8f3c | running | m4.large | us-east-2 | us-east-2a | 121m |
| ssd-1b-rtxxg | i-08d06740a65e88be6 | running | m4.large | us-east-2 | us-east-2b | 3h57m |

The machines that are 121m old in the `ssd-1a` set are the "orphans" without corresponding nodes. Each of them has a deletionTimestamp.
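A minimal sketch of how the orphaned machines could be confirmed, assuming the default `openshift-machine-api` namespace; the commands and column paths are illustrative and not taken from the original report:

```shell
# List machines with their referenced node and deletion timestamp. Machines that
# carry a deletionTimestamp but whose NODE column is empty or names a node that
# no longer exists are the orphans described above.
oc get machines -n openshift-machine-api \
  -o custom-columns=NAME:.metadata.name,NODE:.status.nodeRef.name,DELETED:.metadata.deletionTimestamp

# Cross-check that the referenced nodes are actually gone from the cluster.
oc get nodes
```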