Bug 1800315

Summary: Machine controller must wait for node to be fully terminated before deleting Kube Node object
Product: OpenShift Container Platform
Component: Cloud Compute
Version: 4.3.z
Target Release: 4.3.z
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: unspecified
Status: CLOSED ERRATA
Reporter: Clayton Coleman <ccoleman>
Assignee: Alexander Demicev <ademicev>
QA Contact: sunzhaohua <zhsun>
CC: agarcial, jhou
Clone Of: 1800314
Clones: 1800316 (view as bug list)
Last Closed: 2020-03-24 14:33:06 UTC
Bug Depends On: 1800314
Bug Blocks: 1800316, 1800317

Description Clayton Coleman 2020-02-06 20:54:58 UTC
+++ This bug was initially created as a clone of Bug #1800314 +++

Deleting the Node object in Kubernetes signals to the cluster that no pods are running on that node and that it is safe to release any storage or process locks that guarantee two processes with the same name or volumes cannot run on different nodes at the same time. The machine controller was deleting the Node before the machine was fully terminated, which meant a StatefulSet controller could launch two pods with the same name running in the cluster at the same time, violating our cluster safety guarantees.

The fix is to wait until the machine is confirmed shut down by the cloud provider before deleting the Node object.
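
For illustration, a minimal Go sketch of the corrected ordering (this is not the actual machine-api code; the Actuator interface, reconcileDelete, and deleteNode names are assumptions made for the example):

// A minimal sketch of the corrected ordering; the Actuator interface,
// reconcileDelete, and deleteNode are illustrative, not the actual
// machine-api code.
package sketch

import (
	"context"
	"fmt"
)

// Actuator abstracts the cloud provider. Exists is assumed to report
// true until the provider confirms the instance is fully terminated.
type Actuator interface {
	Delete(ctx context.Context, machineName string) error
	Exists(ctx context.Context, machineName string) (bool, error)
}

// reconcileDelete terminates the instance and deletes the Kube Node
// object only after the provider confirms the instance is gone.
func reconcileDelete(ctx context.Context, a Actuator, machineName, nodeName string, deleteNode func(context.Context, string) error) error {
	// Ask the provider to terminate the instance (assumed idempotent,
	// so it is safe to call on every reconcile).
	if err := a.Delete(ctx, machineName); err != nil {
		return err
	}

	// While the instance still exists (e.g. state "shutting-down"),
	// return an error so the controller requeues instead of deleting
	// the Node early and releasing its name/volume locks.
	exists, err := a.Exists(ctx, machineName)
	if err != nil {
		return err
	}
	if exists {
		return fmt.Errorf("machine %q still terminating, requeueing", machineName)
	}

	// The provider confirmed termination; now the Node can go.
	return deleteNode(ctx, nodeName)
}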

Comment 3 sunzhaohua 2020-03-09 15:14:39 UTC
Verified.
Tested on Azure, AWS, and GCP; in each case the timestamps show the node is deleted after the instance is deleted.
clusterversion: 4.3.5

Azure
I0309 09:23:58.764338       1 actuator.go:135] Deleting machine zhsunazure-vn5cs-worker-centralus3-t6sqz
I0309 09:23:58.764943       1 virtualmachines.go:225] deleting vm zhsunazure-vn5cs-worker-centralus3-t6sqz 
I0309 09:23:58.998472       1 virtualmachines.go:242] successfully deleted vm zhsunazure-vn5cs-worker-centralus3-t6sqz 
I0309 09:23:58.998498       1 disks.go:49] deleting disk zhsunazure-vn5cs-worker-centralus3-t6sqz_OSDisk
I0309 09:23:59.029511       1 disks.go:65] successfully deleted disk zhsunazure-vn5cs-worker-centralus3-t6sqz_OSDisk
I0309 09:23:59.029540       1 networkinterfaces.go:215] deleting nic zhsunazure-vn5cs-worker-centralus3-t6sqz-nic
I0309 09:24:09.226004       1 networkinterfaces.go:234] successfully deleted nic zhsunazure-vn5cs-worker-centralus3-t6sqz-nic
I0309 09:24:09.226070       1 machine_scope.go:180] zhsunazure-vn5cs-worker-centralus3-t6sqz: status unchanged
I0309 09:24:09.239968       1 controller.go:254] Deleting node "zhsunazure-vn5cs-worker-centralus3-t6sqz" for machine "zhsunazure-vn5cs-worker-centralus3-t6sqz"
I0309 09:24:09.240456       1 reflector.go:120] Starting reflector *v1.Node (10h0m0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204
I0309 09:24:09.240491       1 reflector.go:158] Listing and watching *v1.Node from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204
I0309 09:24:09.398866       1 controller.go:268] Machine "zhsunazure-vn5cs-worker-centralus3-t6sqz" deletion successful

AWS
I0309 09:40:36.447595       1 actuator.go:338] zhsunaws-9mm2f-worker-us-east-2b-jc9d2: deleting machine
I0309 09:40:36.526295       1 actuator.go:551] zhsunaws-9mm2f-worker-us-east-2b-jc9d2: Found instance by id: i-0d202b31e33653a48
I0309 09:40:36.526321       1 actuator.go:380] zhsunaws-9mm2f-worker-us-east-2b-jc9d2: found 1 existing instances for machine
I0309 09:40:36.526332       1 utils.go:196] Cleaning up extraneous instance for machine: i-0d202b31e33653a48, state: shutting-down, launchTime: 2020-03-09 07:50:51 +0000 UTC
I0309 09:40:36.526349       1 utils.go:200] Terminating i-0d202b31e33653a48 instance
I0309 09:40:36.636576       1 controller.go:254] Deleting node "ip-10-0-151-64.us-east-2.compute.internal" for machine "zhsunaws-9mm2f-worker-us-east-2b-jc9d2"
I0309 09:40:36.636617       1 controller.go:397] Node "ip-10-0-151-64.us-east-2.compute.internal" not found
I0309 09:40:36.651471       1 controller.go:268] Machine "zhsunaws-9mm2f-worker-us-east-2b-jc9d2" deletion successful

GCP
I0309 15:08:46.989893       1 info.go:20] drained node "zhsun8-2j9sv-w-b-r2lj5.c.openshift-qe.internal"
I0309 15:08:46.989906       1 controller.go:342] drain successful for machine "zhsun8-2j9sv-w-b-r2lj5"
I0309 15:08:46.989939       1 actuator.go:118] zhsun8-2j9sv-w-b-r2lj5: Deleting machine
I0309 15:08:47.223762       1 reconciler.go:302] zhsun8-2j9sv-w-b-r2lj5: Machine does not exist
I0309 15:08:47.223788       1 reconciler.go:319] zhsun8-2j9sv-w-b-r2lj5: Machine not found during delete, skipping
I0309 15:08:47.223829       1 controller.go:228] Deleting node "zhsun8-2j9sv-w-b-r2lj5.c.openshift-qe.internal" for machine "zhsun8-2j9sv-w-b-r2lj5"
I0309 15:08:47.224343       1 reflector.go:120] Starting reflector *v1.Node (10h0m0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204
I0309 15:08:47.224370       1 reflector.go:158] Listing and watching *v1.Node from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204
I0309 15:08:47.359139       1 controller.go:242] Machine "zhsun8-2j9sv-w-b-r2lj5" deletion successful

Comment 5 errata-xmlrpc 2020-03-24 14:33:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0858