Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1800314

Summary: Machine controller must wait for node to be fully terminated before deleting Kube Node object
Product: OpenShift Container Platform
Component: Cloud Compute
Sub Component: Other Providers
Version: 4.4
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Clayton Coleman <ccoleman>
Assignee: Alexander Demicev <ademicev>
QA Contact: Jianwei Hou <jhou>
CC: agarcial, mgugino, zhsun
Clones: 1800315 (view as bug list)
Bug Blocks: 1800315, 1800316, 1800317
Type: Bug
Last Closed: 2020-05-13 21:56:57 UTC

Description Clayton Coleman 2020-02-06 20:54:24 UTC
Deleting the Node object in Kubernetes signals to the cluster that no pods on that node are running and that it is safe to release any storage or process locks that prevent two processes with the same name or volumes from running on different nodes at the same time. The machine controller was deleting the node before the machine was fully terminated, which meant a StatefulSet controller could launch two pods with the same name on the cluster at the same time, violating our cluster safety guarantees.

The fix is to wait until the cloud provider confirms the machine has shut down before deleting the Node object.

Comment 1 Alberto 2020-02-07 08:17:21 UTC
Can you specify how to reproduce this?
The machine controller is expected to drain the node, then ask the provider to delete the instance, then delete the node. https://github.com/openshift/machine-api-operator/blob/master/pkg/controller/machine/controller.go#L209-L229.

Comment 6 Jianwei Hou 2020-02-19 10:28:23 UTC
Tested on Azure; the timestamps show the node is deleted after the instance is deleted.

I0219 10:10:51.591121       1 controller.go:332] Could not find node from noderef, it may have already been deleted: qe-jhou-rqfpj-worker-centralus2-tqdvz
I0219 10:10:51.591141       1 actuator.go:128] Deleting machine qe-jhou-rqfpj-worker-centralus2-tqdvz
I0219 10:10:51.591810       1 virtualmachines.go:225] deleting vm qe-jhou-rqfpj-worker-centralus2-tqdvz
I0219 10:10:51.815731       1 virtualmachines.go:242] successfully deleted vm qe-jhou-rqfpj-worker-centralus2-tqdvz
I0219 10:10:51.815758       1 disks.go:49] deleting disk qe-jhou-rqfpj-worker-centralus2-tqdvz_OSDisk
I0219 10:10:51.868334       1 disks.go:65] successfully deleted disk qe-jhou-rqfpj-worker-centralus2-tqdvz_OSDisk
I0219 10:10:51.868370       1 networkinterfaces.go:215] deleting nic qe-jhou-rqfpj-worker-centralus2-tqdvz-nic
I0219 10:11:02.111140       1 networkinterfaces.go:234] successfully deleted nic qe-jhou-rqfpj-worker-centralus2-tqdvz-nic
I0219 10:11:02.111240       1 machine_scope.go:170] qe-jhou-rqfpj-worker-centralus2-tqdvz: status unchanged
I0219 10:11:02.111284       1 machine_scope.go:170] qe-jhou-rqfpj-worker-centralus2-tqdvz: status unchanged
I0219 10:11:02.111293       1 machine_scope.go:190] qe-jhou-rqfpj-worker-centralus2-tqdvz: patching machine
I0219 10:11:02.125843       1 actuator.go:196] Checking if machine qe-jhou-rqfpj-worker-centralus2-tqdvz exists
I0219 10:11:02.365445       1 controller.go:241] qe-jhou-rqfpj-worker-centralus2-tqdvz: deleting node "qe-jhou-rqfpj-worker-centralus2-tqdvz" for machine
I0219 10:11:02.365493       1 controller.go:388] Node "qe-jhou-rqfpj-worker-centralus2-tqdvz" not found
I0219 10:11:02.395353       1 controller.go:255] qe-jhou-rqfpj-worker-centralus2-tqdvz: machine deletion successful


Will also cover the remaining providers before moving this to Verified.

Comment 7 sunzhaohua 2020-02-20 01:56:12 UTC
Verified on AWS: 4.4.0-0.nightly-2020-02-18-005549

The machine controller does not delete the Node object until the instance is terminated, as the logs below show:
can't proceed deleting machine while cloud instance is being terminated
Failed to find existing instance by id i-08621c75e167a4f68: instance i-08621c75e167a4f68 state "terminated" is not in running, pending, stopped, stopping, shutting-down
Instance does not exist
deleting node "zhsun4-s2gcg-worker-us-east-2c-ftk9g" for machine
machine deletion successful

I0219 03:09:21.531611       1 actuator.go:538] zhsun4-s2gcg-worker-us-east-2c-ftk9g: Found instance by id: i-08621c75e167a4f68
I0219 03:09:21.531646       1 controller.go:236] zhsun4-s2gcg-worker-us-east-2c-ftk9g: can't proceed deleting machine while cloud instance is being terminated, requeuing
I0219 03:09:21.531712       1 controller.go:163] zhsun4-s2gcg-worker-us-east-2c-jph8g: reconciling Machine
I0219 ...
I0219 03:09:51.284449       1 actuator.go:315] zhsun4-s2gcg-worker-us-east-2c-ftk9g: deleting machine


W0219 03:09:51.374462       1 actuator.go:535] zhsun4-s2gcg-worker-us-east-2c-ftk9g: Failed to find existing instance by id i-08621c75e167a4f68: instance i-08621c75e167a4f68 state "terminated" is not in running, pending, stopped, stopping, shutting-down
E0219 03:09:51.468978       1 utils.go:165] Excluding instance matching zhsun4-s2gcg-worker-us-east-2c-ftk9g: instance i-08621c75e167a4f68 state "terminated" is not in running, pending, stopped, stopping, shutting-down
I0219 03:09:51.469018       1 actuator.go:357] zhsun4-s2gcg-worker-us-east-2c-ftk9g: found 0 existing instances for machine
W0219 03:09:51.469032       1 actuator.go:359] zhsun4-s2gcg-worker-us-east-2c-ftk9g: no instances found to delete for machine
I0219 03:09:51.469042       1 actuator.go:489] zhsun4-s2gcg-worker-us-east-2c-ftk9g: Checking if machine exists
W0219 03:09:51.531060       1 actuator.go:535] zhsun4-s2gcg-worker-us-east-2c-ftk9g: Failed to find existing instance by id i-08621c75e167a4f68: instance i-08621c75e167a4f68 state "terminated" is not in running, pending, stopped, stopping, shutting-down
E0219 03:09:51.591483       1 utils.go:165] Excluding instance matching zhsun4-s2gcg-worker-us-east-2c-ftk9g: instance i-08621c75e167a4f68 state "terminated" is not in running, pending, stopped, stopping, shutting-down
I0219 03:09:51.591514       1 actuator.go:502] zhsun4-s2gcg-worker-us-east-2c-ftk9g: Instance does not exist
I0219 03:09:51.591529       1 controller.go:241] ip-10-0-169-219.us-east-2.compute.internal: deleting node "zhsun4-s2gcg-worker-us-east-2c-ftk9g" for machine
I0219 03:09:51.631704       1 controller.go:255] zhsun4-s2gcg-worker-us-east-2c-ftk9g: machine deletion successful

Comment 8 sunzhaohua 2020-02-21 02:46:21 UTC
Verified on GCP; the timestamps show the node is deleted after the instance is deleted.
clusterversion: 4.4.0-0.nightly-2020-02-19-173908

I0221 02:34:07.226447       1 controller.go:321] Could not find node from noderef, it may have already been deleted: zhsung-nrjcj-w-a-fksnz.c.openshift-qe.internal
I0221 02:34:07.226487       1 actuator.go:118] zhsung-nrjcj-w-a-fksnz: Deleting machine

I0221 02:34:07.441236       1 reconciler.go:302] zhsung-nrjcj-w-a-fksnz: Machine does not exist
I0221 02:34:07.441442       1 reconciler.go:319] zhsung-nrjcj-w-a-fksnz: Machine not found during delete, skipping
I0221 02:34:07.441547       1 controller.go:230] zhsung-nrjcj-w-a-fksnz.c.openshift-qe.internal: deleting node "zhsung-nrjcj-w-a-fksnz" for machine
I0221 02:34:07.441666       1 controller.go:377] Node "zhsung-nrjcj-w-a-fksnz.c.openshift-qe.internal" not found
I0221 02:34:07.472003       1 controller.go:244] zhsung-nrjcj-w-a-fksnz: machine deletion successful

Comment 10 errata-xmlrpc 2020-05-13 21:56:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581