1713061 – Missing node prevents machine from being deleted

Bug 1713061 - Missing node prevents machine from being deleted

Summary: Missing node prevents machine from being deleted

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	unspecified
Target Milestone:	---
Target Release:	4.1.z
Assignee:	Alberto
QA Contact:	Jianwei Hou
Docs Contact:
URL:
Whiteboard:
Depends On:	1713105
Blocks:

Reported:	2019-05-22 18:52 UTC by Erik M Jacobs
Modified:	2020-04-29 15:45 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1713105
Environment:
Last Closed:	2020-04-29 15:45:44 UTC
Target Upstream Version:
Embargoed:

Attachments
machine-controller log output (96.21 KB, application/gzip) 2019-05-22 18:52 UTC, Erik M Jacobs

Description Erik M Jacobs 2019-05-22 18:52:40 UTC

Created attachment 1572147
machine-controller log output

Description of problem:
A machineset was scaled up and then scaled down. The nodes disappeared, but the machine objects remain.

Version-Release number of selected component (if applicable):
4.1.0-rc.4

Additional info:
NAME                          INSTANCE              STATE     TYPE        REGION      ZONE         AGE
cluster-4e40-c7df5-master-0   i-087186746072193f0   running   m4.xlarge   us-east-2   us-east-2a   24h
cluster-4e40-c7df5-master-1   i-0eafe7e9e69f6aaec   running   m4.xlarge   us-east-2   us-east-2b   24h
cluster-4e40-c7df5-master-2   i-03c13bba692694646   running   m4.xlarge   us-east-2   us-east-2c   24h
infranode-us-east-2a-t7xwt    i-0c6ce0f9d57708d22   running   m4.large    us-east-2   us-east-2a   173m
infranode-us-east-2a-z9nfh    i-0c3f83d4c9003f5d0   running   m4.large    us-east-2   us-east-2a   3h39m
nossd-1a-dczcf                i-00a207dab2c9e970d   running   m4.large    us-east-2   us-east-2a   3h57m
ssd-1a-5l9fh                  i-090acc4f9598a37f3   running   m4.large    us-east-2   us-east-2a   121m
ssd-1a-7cvrr                  i-0ccca476b234fc1da   running   m4.large    us-east-2   us-east-2a   69m
ssd-1a-q52pv                  i-0e9e6d01af5ca727a   running   m4.large    us-east-2   us-east-2a   121m
ssd-1a-q6hr9                  i-08f4a48151276ce90   running   m4.large    us-east-2   us-east-2a   121m
ssd-1a-sfhdm                  i-03eec775cb1ce8f3c   running   m4.large    us-east-2   us-east-2a   121m
ssd-1b-rtxxg                  i-08d06740a65e88be6   running   m4.large    us-east-2   us-east-2b   3h57m


The machines in the `ssd-1a` set that are 121m old are the "orphans" without corresponding nodes. Each of them has a deletionTimestamp set.
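To spot machines in this state without reading the table by eye, something like the following sketch may help. It assumes the machine objects have been loaded from JSON (for example, the `items` list of an `oc get machines -o json` dump) and that each machine records its node under `status.nodeRef.name`; the function name `find_orphaned_machines` and the sample data in the usage lines are hypothetical, not part of any OpenShift tooling.

```python
def find_orphaned_machines(machines, node_names):
    """Return names of machines that are marked for deletion but whose
    referenced node no longer exists in the cluster.

    machines:   list of machine objects as plain dicts (assumed layout:
                metadata.name, metadata.deletionTimestamp, status.nodeRef.name)
    node_names: collection of node names currently present in the cluster
    """
    orphans = []
    for machine in machines:
        meta = machine.get("metadata") or {}
        # Only machines that already carry a deletionTimestamp are of interest.
        if not meta.get("deletionTimestamp"):
            continue
        node_ref = (machine.get("status") or {}).get("nodeRef") or {}
        node_name = node_ref.get("name")
        # A nodeRef that points at a node which is gone is the stuck case.
        if node_name and node_name not in node_names:
            orphans.append(meta.get("name"))
    return orphans


# Hypothetical sample data, shaped like the assumed JSON layout.
sample_machines = [
    {"metadata": {"name": "ssd-1a-5l9fh", "deletionTimestamp": "2019-05-22T16:51:00Z"},
     "status": {"nodeRef": {"name": "node-gone"}}},
    {"metadata": {"name": "ssd-1a-7cvrr"},
     "status": {"nodeRef": {"name": "node-present"}}},
]
print(find_orphaned_machines(sample_machines, {"node-present"}))
```

A machine without a nodeRef at all is skipped here on purpose, since this report is specifically about a nodeRef that points at a missing node.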

Comment 1 Michael Gugino 2019-05-22 20:37:20 UTC

I have investigated this. We're failing to retrieve the node from the nodeRef specified on the machine object. This happens either because the machine-controller already deleted the node and failed to update that annotation for some reason, or because an admin removed the node manually before attempting to scale. Either way, this is definitely a bug and is not easily correctable by the end user. I will get a patch out for master and cherry-pick it to 4.1.

Comment 2 Michael Gugino 2019-05-22 21:05:59 UTC

Added a reference to 4.1 known-issue tracker: https://github.com/openshift/openshift-docs/issues/12487

Comment 3 Michael Gugino 2019-05-22 22:18:08 UTC

Workaround: For a machine stuck in this state, first confirm that the node is actually absent from the cluster, then add the following annotation to the machine's metadata: "machine.openshift.io/exclude-node-draining"
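One way to apply the annotation is as a JSON merge patch. The sketch below only builds the patch body; the helper name `exclude_draining_patch` is hypothetical, and the empty-string value is an assumption, since the comment above names only the annotation key.

```python
import json

# Annotation key taken from the workaround above.
ANNOTATION = "machine.openshift.io/exclude-node-draining"


def exclude_draining_patch(value=""):
    """Build a JSON merge patch that adds the annotation to a machine.

    The empty default value is an assumption: the workaround only names
    the key, not a required value.
    """
    return json.dumps({"metadata": {"annotations": {ANNOTATION: value}}})


print(exclude_draining_patch())
```

The resulting string could then be passed to something like `oc patch machine <machine-name> -n openshift-machine-api --type merge -p '<patch>'`, where the `openshift-machine-api` namespace is assumed rather than stated in this report.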

Comment 4 Michael Gugino 2019-05-24 14:52:07 UTC

PR opened in openshift/cluster-api for 4.1: https://github.com/openshift/cluster-api/pull/44

After this merges, we'll need to re-vendor this change across the aws and libvirt actuators.

Comment 7 Michael Gugino 2019-08-22 22:01:38 UTC

PR merged in cluster-api; we still need to vendor the changes into the AWS provider.
