|Summary:||Missing node prevents machine from being deleted|
|Product:||OpenShift Container Platform||Reporter:||Erik M Jacobs <ejacobs>|
|Component:||Cloud Compute||Assignee:||Michael Gugino <mgugino>|
|Status:||POST ---||QA Contact:||Jianwei Hou <jhou>|
|Version:||4.1.0||CC:||agarcial, jchaloup, mgugino|
|Fixed In Version:||Doc Type:||If docs needed, set a value|
|Doc Text:||Story Points:||---|
|:||1713105 (view as bug list)||Environment:|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
|Bug Depends On:||1713105|
Description Erik M Jacobs 2019-05-22 18:52:40 UTC
Created attachment 1572147 [details]
machine-controller log output

Description of problem:
A machineset was scaled up and then scaled down. The nodes disappeared, but the machine objects remain.

Version-Release number of selected component (if applicable):
4.1.0-rc.4

Additional info:
NAME                          INSTANCE              STATE     TYPE       REGION     ZONE         AGE
cluster-4e40-c7df5-master-0   i-087186746072193f0   running   m4.xlarge  us-east-2  us-east-2a   24h
cluster-4e40-c7df5-master-1   i-0eafe7e9e69f6aaec   running   m4.xlarge  us-east-2  us-east-2b   24h
cluster-4e40-c7df5-master-2   i-03c13bba692694646   running   m4.xlarge  us-east-2  us-east-2c   24h
infranode-us-east-2a-t7xwt    i-0c6ce0f9d57708d22   running   m4.large   us-east-2  us-east-2a   173m
infranode-us-east-2a-z9nfh    i-0c3f83d4c9003f5d0   running   m4.large   us-east-2  us-east-2a   3h39m
nossd-1a-dczcf                i-00a207dab2c9e970d   running   m4.large   us-east-2  us-east-2a   3h57m
ssd-1a-5l9fh                  i-090acc4f9598a37f3   running   m4.large   us-east-2  us-east-2a   121m
ssd-1a-7cvrr                  i-0ccca476b234fc1da   running   m4.large   us-east-2  us-east-2a   69m
ssd-1a-q52pv                  i-0e9e6d01af5ca727a   running   m4.large   us-east-2  us-east-2a   121m
ssd-1a-q6hr9                  i-08f4a48151276ce90   running   m4.large   us-east-2  us-east-2a   121m
ssd-1a-sfhdm                  i-03eec775cb1ce8f3c   running   m4.large   us-east-2  us-east-2a   121m
ssd-1b-rtxxg                  i-08d06740a65e88be6   running   m4.large   us-east-2  us-east-2b   3h57m

The machines that are 121m old in the `ssd-1a` set are the "orphans" without corresponding nodes. Each of them has a deletionTimestamp.
Comment 1 Michael Gugino 2019-05-22 20:37:20 UTC
I have investigated this. We're failing to retrieve the node from the nodeRef specified on the machine object. This is either because the machine-controller already deleted the node and then failed to update that annotation for some reason, or because an admin removed the node manually before attempting to scale. Either way, this is definitely a bug and is not easily correctable by the end user. I will get a patch out for master and backport it to 4.1.
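The fix described above amounts to making the drain step tolerant of an already-deleted node. A minimal sketch of that decision logic, with an illustrative stubbed lookup in place of the real controller's nodeRef client call (names here are assumptions, not from the actual patch):

```go
package main

import (
	"errors"
	"fmt"
)

// errNodeNotFound stands in for the apierrors.IsNotFound condition the
// real controller would check; it is a sentinel for this sketch only.
var errNodeNotFound = errors.New("node not found")

// getNode is a stub for looking up the node named by the machine's nodeRef.
func getNode(name string, existing map[string]bool) error {
	if !existing[name] {
		return errNodeNotFound
	}
	return nil
}

// shouldDrain decides whether the controller must drain the node before
// deleting the machine. If the referenced node no longer exists, draining
// is skipped so the machine deletion can proceed instead of getting stuck.
func shouldDrain(nodeName string, existing map[string]bool) (bool, error) {
	err := getNode(nodeName, existing)
	if errors.Is(err, errNodeNotFound) {
		return false, nil // node already gone: skip drain, allow delete
	}
	if err != nil {
		return false, err // any other error: requeue and retry
	}
	return true, nil // node exists: drain it first
}

func main() {
	nodes := map[string]bool{"ssd-1a-5l9fh": true}

	drain, _ := shouldDrain("ssd-1a-5l9fh", nodes)
	fmt.Println(drain) // true: node exists, drain before delete

	drain, _ = shouldDrain("ssd-1a-q52pv", nodes)
	fmt.Println(drain) // false: node missing, deletion proceeds
}
```

The key point is that a NotFound lookup result is treated as "nothing to drain" rather than as a retryable failure, which is what previously left the machines stuck with a deletionTimestamp.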
Comment 2 Michael Gugino 2019-05-22 21:05:59 UTC
Added a reference to 4.1 known-issue tracker: https://github.com/openshift/openshift-docs/issues/12487
Comment 3 Michael Gugino 2019-05-22 22:18:08 UTC
Workaround: for a machine stuck in this state, after confirming the node is actually absent from the cluster, add the following annotation to the machine's metadata: "machine.openshift.io/exclude-node-draining"
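Applied to a machine object, the workaround might look like the fragment below. The machine name is one of the orphans from the listing above, and the empty annotation value is an assumption (the comment specifies only the annotation key):

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  name: ssd-1a-q52pv   # illustrative: one of the stuck machines
  annotations:
    # Tells the machine-controller to skip node draining on delete.
    machine.openshift.io/exclude-node-draining: ""
```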
Comment 4 Michael Gugino 2019-05-24 14:52:07 UTC
PR opened against the 4.1 branch of openshift/cluster-api: https://github.com/openshift/cluster-api/pull/44 After this merges, we'll need to re-vendor this change across the aws and libvirt actuators.
Comment 7 Michael Gugino 2019-08-22 22:01:38 UTC
PR merged in cluster-api; still need to vendor the changes into the AWS provider.