Bug 1800315 - Machine controller must wait for node to be fully terminated before deleting Kube Node object
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.3.z
Assignee: Alexander Demicev
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On: 1800314
Blocks: 1800316 1800317
 
Reported: 2020-02-06 20:54 UTC by Clayton Coleman
Modified: 2020-03-24 14:33 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1800314
Cloned As: 1800316
Environment:
Last Closed: 2020-03-24 14:33:06 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-api-provider-aws pull 298 0 None closed Bug 1800315: [release-4.3] Drop openshift/cluster-api in favour of openshift/machine-api-operator to get controller's ch... 2020-08-24 11:08:53 UTC
Github openshift cluster-api-provider-gcp pull 75 0 None closed Bug 1800315: [release-4.3] Drop openshift/cluster-api in favour of openshift/machine-api-operator to get controller's ch... 2020-08-24 11:08:54 UTC
Red Hat Product Errata RHBA-2020:0858 0 None None None 2020-03-24 14:33:30 UTC

Description Clayton Coleman 2020-02-06 20:54:58 UTC
+++ This bug was initially created as a clone of Bug #1800314 +++

Deleting a Node object in Kubernetes signals to the cluster that no pods are running on that node and that it is safe to release any storage or process locks that ensure two pods with the same name or the same volumes cannot run on different nodes at the same time. The machine controller was deleting the Node before the machine was fully terminated, which meant a StatefulSet controller could launch two pods with the same name in the cluster simultaneously, violating our cluster safety guarantees.

The fix is to wait until the cloud provider confirms the machine is shut down before deleting the Node object.

Comment 3 sunzhaohua 2020-03-09 15:14:39 UTC
Verified
Tested on Azure, AWS, and GCP; the timestamps show that the Node is deleted only after the instance is deleted.
clusterversion: 4.3.5

Azure
I0309 09:23:58.764338       1 actuator.go:135] Deleting machine zhsunazure-vn5cs-worker-centralus3-t6sqz
I0309 09:23:58.764943       1 virtualmachines.go:225] deleting vm zhsunazure-vn5cs-worker-centralus3-t6sqz 
I0309 09:23:58.998472       1 virtualmachines.go:242] successfully deleted vm zhsunazure-vn5cs-worker-centralus3-t6sqz 
I0309 09:23:58.998498       1 disks.go:49] deleting disk zhsunazure-vn5cs-worker-centralus3-t6sqz_OSDisk
I0309 09:23:59.029511       1 disks.go:65] successfully deleted disk zhsunazure-vn5cs-worker-centralus3-t6sqz_OSDisk
I0309 09:23:59.029540       1 networkinterfaces.go:215] deleting nic zhsunazure-vn5cs-worker-centralus3-t6sqz-nic
I0309 09:24:09.226004       1 networkinterfaces.go:234] successfully deleted nic zhsunazure-vn5cs-worker-centralus3-t6sqz-nic
I0309 09:24:09.226070       1 machine_scope.go:180] zhsunazure-vn5cs-worker-centralus3-t6sqz: status unchanged
I0309 09:24:09.239968       1 controller.go:254] Deleting node "zhsunazure-vn5cs-worker-centralus3-t6sqz" for machine "zhsunazure-vn5cs-worker-centralus3-t6sqz"
I0309 09:24:09.240456       1 reflector.go:120] Starting reflector *v1.Node (10h0m0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204
I0309 09:24:09.240491       1 reflector.go:158] Listing and watching *v1.Node from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204
I0309 09:24:09.398866       1 controller.go:268] Machine "zhsunazure-vn5cs-worker-centralus3-t6sqz" deletion successful

AWS
I0309 09:40:36.447595       1 actuator.go:338] zhsunaws-9mm2f-worker-us-east-2b-jc9d2: deleting machine
I0309 09:40:36.526295       1 actuator.go:551] zhsunaws-9mm2f-worker-us-east-2b-jc9d2: Found instance by id: i-0d202b31e33653a48
I0309 09:40:36.526321       1 actuator.go:380] zhsunaws-9mm2f-worker-us-east-2b-jc9d2: found 1 existing instances for machine
I0309 09:40:36.526332       1 utils.go:196] Cleaning up extraneous instance for machine: i-0d202b31e33653a48, state: shutting-down, launchTime: 2020-03-09 07:50:51 +0000 UTC
I0309 09:40:36.526349       1 utils.go:200] Terminating i-0d202b31e33653a48 instance
I0309 09:40:36.636576       1 controller.go:254] Deleting node "ip-10-0-151-64.us-east-2.compute.internal" for machine "zhsunaws-9mm2f-worker-us-east-2b-jc9d2"
I0309 09:40:36.636617       1 controller.go:397] Node "ip-10-0-151-64.us-east-2.compute.internal" not found
I0309 09:40:36.651471       1 controller.go:268] Machine "zhsunaws-9mm2f-worker-us-east-2b-jc9d2" deletion successful

gcp
I0309 15:08:46.989893       1 info.go:20] drained node "zhsun8-2j9sv-w-b-r2lj5.c.openshift-qe.internal"
I0309 15:08:46.989906       1 controller.go:342] drain successful for machine "zhsun8-2j9sv-w-b-r2lj5"
I0309 15:08:46.989939       1 actuator.go:118] zhsun8-2j9sv-w-b-r2lj5: Deleting machine
I0309 15:08:47.223762       1 reconciler.go:302] zhsun8-2j9sv-w-b-r2lj5: Machine does not exist
I0309 15:08:47.223788       1 reconciler.go:319] zhsun8-2j9sv-w-b-r2lj5: Machine not found during delete, skipping
I0309 15:08:47.223829       1 controller.go:228] Deleting node "zhsun8-2j9sv-w-b-r2lj5.c.openshift-qe.internal" for machine "zhsun8-2j9sv-w-b-r2lj5"
I0309 15:08:47.224343       1 reflector.go:120] Starting reflector *v1.Node (10h0m0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204
I0309 15:08:47.224370       1 reflector.go:158] Listing and watching *v1.Node from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204
I0309 15:08:47.359139       1 controller.go:242] Machine "zhsun8-2j9sv-w-b-r2lj5" deletion successful

Comment 5 errata-xmlrpc 2020-03-24 14:33:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0858

