Created attachment 1572147
machine-controller log output
Description of problem:
A MachineSet was scaled up and then scaled down. The nodes disappeared, but the Machine objects remain.
Version-Release number of selected component (if applicable):
NAME                          INSTANCE              STATE     TYPE        REGION      ZONE         AGE
cluster-4e40-c7df5-master-0   i-087186746072193f0   running   m4.xlarge   us-east-2   us-east-2a   24h
cluster-4e40-c7df5-master-1   i-0eafe7e9e69f6aaec   running   m4.xlarge   us-east-2   us-east-2b   24h
cluster-4e40-c7df5-master-2   i-03c13bba692694646   running   m4.xlarge   us-east-2   us-east-2c   24h
infranode-us-east-2a-t7xwt    i-0c6ce0f9d57708d22   running   m4.large    us-east-2   us-east-2a   173m
infranode-us-east-2a-z9nfh    i-0c3f83d4c9003f5d0   running   m4.large    us-east-2   us-east-2a   3h39m
nossd-1a-dczcf                i-00a207dab2c9e970d   running   m4.large    us-east-2   us-east-2a   3h57m
ssd-1a-5l9fh                  i-090acc4f9598a37f3   running   m4.large    us-east-2   us-east-2a   121m
ssd-1a-7cvrr                  i-0ccca476b234fc1da   running   m4.large    us-east-2   us-east-2a   69m
ssd-1a-q52pv                  i-0e9e6d01af5ca727a   running   m4.large    us-east-2   us-east-2a   121m
ssd-1a-q6hr9                  i-08f4a48151276ce90   running   m4.large    us-east-2   us-east-2a   121m
ssd-1a-sfhdm                  i-03eec775cb1ce8f3c   running   m4.large    us-east-2   us-east-2a   121m
ssd-1b-rtxxg                  i-08d06740a65e88be6   running   m4.large    us-east-2   us-east-2b   3h57m
The machines that are 121m old in the `ssd-1a` set are the "orphans" without corresponding nodes. Each of them has a deletionTimestamp set.
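For anyone trying to spot the same condition, here is a rough Python sketch of the check described above: a machine counts as an orphan when it has a deletionTimestamp and its nodeRef is missing or names a node that is no longer in the cluster. It assumes the Machine objects have been loaded as plain dicts following the usual Kubernetes layout (`metadata.deletionTimestamp`, `status.nodeRef.name`). The function name `find_orphaned_machines` is made up for illustration and is not part of any OpenShift tooling.

```python
from typing import Iterable, List, Set


def find_orphaned_machines(machines: Iterable[dict], node_names: Set[str]) -> List[str]:
    """Return names of machines marked for deletion whose node is gone.

    Assumes each machine is a dict shaped like a Kubernetes object
    (metadata.deletionTimestamp, status.nodeRef.name). Hypothetical helper.
    """
    orphans = []
    for machine in machines:
        metadata = machine.get("metadata", {})
        # Only machines that are already being deleted are candidates.
        if not metadata.get("deletionTimestamp"):
            continue
        node_ref = machine.get("status", {}).get("nodeRef") or {}
        # Orphaned: no nodeRef at all, or it names a node that no longer exists.
        if node_ref.get("name") not in node_names:
            orphans.append(metadata.get("name", "<unnamed>"))
    return orphans
```

The machine list and the set of node names would have to come from the cluster (for example, from JSON output of the machine and node listings); that part is left out here.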
I have investigated this. We are failing to retrieve the node from the nodeRef specified on the Machine object. This happens either because the machine-controller already deleted the node and failed to update that reference for some reason, or because an admin removed the node manually before attempting to scale. Either way, this is definitely a bug and is not easily correctable by the end user. I will get a patch out for master and cherry-pick it to 4.1.
Added a reference to the 4.1 known-issue tracker: https://github.com/openshift/openshift-docs/issues/12487
Workaround: For a machine stuck in this state, after confirming the node is actually absent from the cluster, add the following annotation to the machine's metadata: "machine.openshift.io/exclude-node-draining"
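As a rough illustration of the workaround, the Python sketch below builds a JSON merge patch that adds the annotation under the machine's `metadata.annotations`. The annotation key is the one given above. The empty-string value is an assumption, since only the key is specified, and `build_exclude_drain_patch` is a made-up helper name.

```python
import json

# Annotation key taken from the workaround above.
EXCLUDE_DRAIN_ANNOTATION = "machine.openshift.io/exclude-node-draining"


def build_exclude_drain_patch(value: str = "") -> str:
    """Return a JSON merge patch that sets the exclude-node-draining annotation.

    The default empty value is an assumption; only the key is known.
    """
    patch = {"metadata": {"annotations": {EXCLUDE_DRAIN_ANNOTATION: value}}}
    return json.dumps(patch)


if __name__ == "__main__":
    print(build_exclude_drain_patch())
```

The resulting string could then be supplied as the body of a merge patch against the stuck Machine object, for instance via `oc patch` with `--type merge`, once the node has been confirmed absent.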
PR opened in openshift/cluster-api on 4.1. https://github.com/openshift/cluster-api/pull/44
After this merges, we'll need to re-vendor this change across the aws and libvirt actuators.
PR merged in cluster-api; still need to vendor the changes into the AWS provider.