Bug 1800314 - Machine controller must wait for node to be fully terminated before deleting Kube Node object
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: Alexander Demicev
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks: 1800315 1800316 1800317
 
Reported: 2020-02-06 20:54 UTC by Clayton Coleman
Modified: 2020-05-13 21:57 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1800315
Environment:
Last Closed: 2020-05-13 21:56:57 UTC
Target Upstream Version:
Embargoed:




Links
GitHub: openshift machine-api-operator pull 490 (closed) - Bug 1800314: Check if instance exists before proceeding with deletion - last updated 2020-08-28 20:50:10 UTC
Red Hat Product Errata: RHBA-2020:0581 - last updated 2020-05-13 21:57:01 UTC

Description Clayton Coleman 2020-02-06 20:54:24 UTC
Deleting the Node object in Kubernetes signals to the cluster that no pod on that node is still running and that it is safe to release any storage or process locks which ensure two processes can't run on different nodes with the same name or volumes.  The machine controller was deleting the Node before the machine was fully terminated, which means the StatefulSet controller could launch a second pod with the same name while the original was still running on the terminating instance, violating our cluster safety guarantees.

The fix is to wait until the machine is confirmed shut down by the cloud provider before deleting the Node object.
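
A minimal sketch of that gate, using hypothetical, simplified types rather than the real machine-api-operator API: before removing the Node object, ask the cloud provider whether the instance still exists, and requeue if it does.

package machinegate

import (
	"context"
	"errors"
)

// Machine and Actuator are hypothetical stand-ins for the controller's
// real types; only the shape of the check matters here.
type Machine struct{ Name string }

type Actuator interface {
	// Exists reports whether the cloud instance backing the machine
	// still exists, i.e. is not yet fully terminated.
	Exists(ctx context.Context, m *Machine) (bool, error)
	Delete(ctx context.Context, m *Machine) error
}

var errRequeue = errors.New("cloud instance is still terminating, requeue")

// gateNodeDeletion encodes the fix: the Kube Node object is only removed
// once the provider confirms the instance is gone, so the node name and
// its volumes are never released while the old instance can still run pods.
func gateNodeDeletion(ctx context.Context, a Actuator, m *Machine, deleteNode func(*Machine) error) error {
	exists, err := a.Exists(ctx, m)
	if err != nil {
		return err
	}
	if exists {
		return errRequeue
	}
	return deleteNode(m)
}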

Comment 1 Alberto 2020-02-07 08:17:21 UTC
Can you specify how to reproduce this?
The machine controller is expected to drain the node, then ask the provider to delete the instance, then delete the node. https://github.com/openshift/machine-api-operator/blob/master/pkg/controller/machine/controller.go#L209-L229.
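
Building on the hypothetical types sketched in the description above, the expected ordering from this comment could look roughly like this; drainNode and deleteNode stand in for the controller's real helpers.

// reconcileDelete sketches the ordering described here: drain the node,
// ask the provider to delete the instance, and only once the instance is
// confirmed gone delete the Node object.
func reconcileDelete(ctx context.Context, a Actuator, m *Machine,
	drainNode, deleteNode func(*Machine) error) error {

	// 1. Drain the node so workloads are evicted gracefully.
	if err := drainNode(m); err != nil {
		return err
	}
	// 2. Ask the cloud provider to delete the instance.
	if err := a.Delete(ctx, m); err != nil {
		return err
	}
	// 3. Only delete the Node object once the instance is confirmed gone;
	//    otherwise requeue (the gate added by this bug's fix).
	return gateNodeDeletion(ctx, a, m, deleteNode)
}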

Comment 6 Jianwei Hou 2020-02-19 10:28:23 UTC
Tested on Azure; the timestamps show the node is deleted after the instance is deleted.

I0219 10:10:51.591121       1 controller.go:332] Could not find node from noderef, it may have already been deleted: qe-jhou-rqfpj-worker-centralus2-tqdvz
I0219 10:10:51.591141       1 actuator.go:128] Deleting machine qe-jhou-rqfpj-worker-centralus2-tqdvz
I0219 10:10:51.591810       1 virtualmachines.go:225] deleting vm qe-jhou-rqfpj-worker-centralus2-tqdvz
I0219 10:10:51.815731       1 virtualmachines.go:242] successfully deleted vm qe-jhou-rqfpj-worker-centralus2-tqdvz
I0219 10:10:51.815758       1 disks.go:49] deleting disk qe-jhou-rqfpj-worker-centralus2-tqdvz_OSDisk
I0219 10:10:51.868334       1 disks.go:65] successfully deleted disk qe-jhou-rqfpj-worker-centralus2-tqdvz_OSDisk
I0219 10:10:51.868370       1 networkinterfaces.go:215] deleting nic qe-jhou-rqfpj-worker-centralus2-tqdvz-nic
I0219 10:11:02.111140       1 networkinterfaces.go:234] successfully deleted nic qe-jhou-rqfpj-worker-centralus2-tqdvz-nic
I0219 10:11:02.111240       1 machine_scope.go:170] qe-jhou-rqfpj-worker-centralus2-tqdvz: status unchanged
I0219 10:11:02.111284       1 machine_scope.go:170] qe-jhou-rqfpj-worker-centralus2-tqdvz: status unchanged
I0219 10:11:02.111293       1 machine_scope.go:190] qe-jhou-rqfpj-worker-centralus2-tqdvz: patching machine
I0219 10:11:02.125843       1 actuator.go:196] Checking if machine qe-jhou-rqfpj-worker-centralus2-tqdvz exists
I0219 10:11:02.365445       1 controller.go:241] qe-jhou-rqfpj-worker-centralus2-tqdvz: deleting node "qe-jhou-rqfpj-worker-centralus2-tqdvz" for machine
I0219 10:11:02.365493       1 controller.go:388] Node "qe-jhou-rqfpj-worker-centralus2-tqdvz" not found
I0219 10:11:02.395353       1 controller.go:255] qe-jhou-rqfpj-worker-centralus2-tqdvz: machine deletion successful


Will also cover the rest of the providers before moving this to verified.

Comment 7 sunzhaohua 2020-02-20 01:56:12 UTC
Verified on AWS:4.4.0-0.nightly-2020-02-18-005549

The machine controller does not delete the node object until the instance is terminated, as the logs below show:
can't proceed deleting machine while cloud instance is being terminated
Failed to find existing instance by id i-08621c75e167a4f68: instance i-08621c75e167a4f68 state "terminated" is not in running, pending, stopped, stopping, shutting-down
Instance does not exist
deleting node "zhsun4-s2gcg-worker-us-east-2c-ftk9g" for machine
machine deletion successful

I0219 03:09:21.531611       1 actuator.go:538] zhsun4-s2gcg-worker-us-east-2c-ftk9g: Found instance by id: i-08621c75e167a4f68
I0219 03:09:21.531646       1 controller.go:236] zhsun4-s2gcg-worker-us-east-2c-ftk9g: can't proceed deleting machine while cloud instance is being terminated, requeuing
I0219 03:09:21.531712       1 controller.go:163] zhsun4-s2gcg-worker-us-east-2c-jph8g: reconciling Machine
I0219 ...
I0219 03:09:51.284449       1 actuator.go:315] zhsun4-s2gcg-worker-us-east-2c-ftk9g: deleting machine


W0219 03:09:51.374462       1 actuator.go:535] zhsun4-s2gcg-worker-us-east-2c-ftk9g: Failed to find existing instance by id i-08621c75e167a4f68: instance i-08621c75e167a4f68 state "terminated" is not in running, pending, stopped, stopping, shutting-down
E0219 03:09:51.468978       1 utils.go:165] Excluding instance matching zhsun4-s2gcg-worker-us-east-2c-ftk9g: instance i-08621c75e167a4f68 state "terminated" is not in running, pending, stopped, stopping, shutting-down
I0219 03:09:51.469018       1 actuator.go:357] zhsun4-s2gcg-worker-us-east-2c-ftk9g: found 0 existing instances for machine
W0219 03:09:51.469032       1 actuator.go:359] zhsun4-s2gcg-worker-us-east-2c-ftk9g: no instances found to delete for machine
I0219 03:09:51.469042       1 actuator.go:489] zhsun4-s2gcg-worker-us-east-2c-ftk9g: Checking if machine exists
W0219 03:09:51.531060       1 actuator.go:535] zhsun4-s2gcg-worker-us-east-2c-ftk9g: Failed to find existing instance by id i-08621c75e167a4f68: instance i-08621c75e167a4f68 state "terminated" is not in running, pending, stopped, stopping, shutting-down
E0219 03:09:51.591483       1 utils.go:165] Excluding instance matching zhsun4-s2gcg-worker-us-east-2c-ftk9g: instance i-08621c75e167a4f68 state "terminated" is not in running, pending, stopped, stopping, shutting-down
I0219 03:09:51.591514       1 actuator.go:502] zhsun4-s2gcg-worker-us-east-2c-ftk9g: Instance does not exist
I0219 03:09:51.591529       1 controller.go:241] ip-10-0-169-219.us-east-2.compute.internal: deleting node "zhsun4-s2gcg-worker-us-east-2c-ftk9g" for machine
I0219 03:09:51.631704       1 controller.go:255] zhsun4-s2gcg-worker-us-east-2c-ftk9g: machine deletion successful
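
The repeated 'state "terminated" is not in running, pending, stopped, stopping, shutting-down' messages show how the existence check decides when an instance stops counting: only the listed states keep blocking node deletion. A minimal, provider-agnostic sketch of that filter (state names are taken from the log lines above; the variable and function names are hypothetical):

// statesBlockingNodeDeletion mirrors the state list printed in the AWS logs
// above: an instance in any of these states still "exists" for the machine
// controller, so its Node object must not be deleted yet. "terminated" is
// deliberately absent, which is what finally lets node deletion proceed.
var statesBlockingNodeDeletion = map[string]bool{
	"running":       true,
	"pending":       true,
	"stopped":       true,
	"stopping":      true,
	"shutting-down": true,
}

// instanceCountsAsExisting reports whether an instance in the given state
// should still block deletion of its Node object.
func instanceCountsAsExisting(state string) bool {
	return statesBlockingNodeDeletion[state]
}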

Comment 8 sunzhaohua 2020-02-21 02:46:21 UTC
Verified on GCP; judging by the timestamps, the node is deleted after the instance is deleted.
clusterversion: 4.4.0-0.nightly-2020-02-19-173908

I0221 02:34:07.226447       1 controller.go:321] Could not find node from noderef, it may have already been deleted: zhsung-nrjcj-w-a-fksnz.c.openshift-qe.internal
I0221 02:34:07.226487       1 actuator.go:118] zhsung-nrjcj-w-a-fksnz: Deleting machine

I0221 02:34:07.441236       1 reconciler.go:302] zhsung-nrjcj-w-a-fksnz: Machine does not exist
I0221 02:34:07.441442       1 reconciler.go:319] zhsung-nrjcj-w-a-fksnz: Machine not found during delete, skipping
I0221 02:34:07.441547       1 controller.go:230] zhsung-nrjcj-w-a-fksnz.c.openshift-qe.internal: deleting node "zhsung-nrjcj-w-a-fksnz" for machine
I0221 02:34:07.441666       1 controller.go:377] Node "zhsung-nrjcj-w-a-fksnz.c.openshift-qe.internal" not found
I0221 02:34:07.472003       1 controller.go:244] zhsung-nrjcj-w-a-fksnz: machine deletion successful

Comment 10 errata-xmlrpc 2020-05-13 21:56:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

