Deleting the Node object in Kubernetes signals to the cluster that no pod on that node is running and that it is safe to release any storage or process locks that ensure two processes can't be running on different nodes with the same name or volumes. The machine controller was deleting the node before the machine was fully terminated, which means that a stateful set controller would be able to launch two pods with the same name running on the cluster at the same time, which violates our cluster safety guarantees. Fix is to wait for machine is confirmed shut down by cloud provider before deleting the Node object.
Can you specify how to reproduce this? The machine controller is expected to drain the node, then ask the provider to delete the instance, then delete the node. https://github.com/openshift/machine-api-operator/blob/master/pkg/controller/machine/controller.go#L209-L229.
Moving back to as this needs to be verified only after: https://github.com/openshift/cluster-api-provider-azure/pull/105 https://github.com/openshift/cluster-api-provider-aws/pull/296 https://github.com/openshift/cluster-api-provider-gcp/pull/74 are merged which also fix https://bugzilla.redhat.com/show_bug.cgi?id=1733474
Tested on Azure, the timestamp shows node is deleted after the instance is deleted. I0219 10:10:51.591121 1 controller.go:332] Could not find node from noderef, it may have already been deleted: qe-jhou-rqfpj-worker-centralus2-tqdvz I0219 10:10:51.591141 1 actuator.go:128] Deleting machine qe-jhou-rqfpj-worker-centralus2-tqdvz I0219 10:10:51.591810 1 virtualmachines.go:225] deleting vm qe-jhou-rqfpj-worker-centralus2-tqdvz I0219 10:10:51.815731 1 virtualmachines.go:242] successfully deleted vm qe-jhou-rqfpj-worker-centralus2-tqdvz I0219 10:10:51.815758 1 disks.go:49] deleting disk qe-jhou-rqfpj-worker-centralus2-tqdvz_OSDisk I0219 10:10:51.868334 1 disks.go:65] successfully deleted disk qe-jhou-rqfpj-worker-centralus2-tqdvz_OSDisk I0219 10:10:51.868370 1 networkinterfaces.go:215] deleting nic qe-jhou-rqfpj-worker-centralus2-tqdvz-nic I0219 10:11:02.111140 1 networkinterfaces.go:234] successfully deleted nic qe-jhou-rqfpj-worker-centralus2-tqdvz-nic I0219 10:11:02.111240 1 machine_scope.go:170] qe-jhou-rqfpj-worker-centralus2-tqdvz: status unchanged I0219 10:11:02.111284 1 machine_scope.go:170] qe-jhou-rqfpj-worker-centralus2-tqdvz: status unchanged I0219 10:11:02.111293 1 machine_scope.go:190] qe-jhou-rqfpj-worker-centralus2-tqdvz: patching machine I0219 10:11:02.125843 1 actuator.go:196] Checking if machine qe-jhou-rqfpj-worker-centralus2-tqdvz exists I0219 10:11:02.365445 1 controller.go:241] qe-jhou-rqfpj-worker-centralus2-tqdvz: deleting node "qe-jhou-rqfpj-worker-centralus2-tqdvz" for machine I0219 10:11:02.365493 1 controller.go:388] Node "qe-jhou-rqfpj-worker-centralus2-tqdvz" not found I0219 10:11:02.395353 1 controller.go:255] qe-jhou-rqfpj-worker-centralus2-tqdvz: machine deletion successful Will also cover the rest provider before moving this to verified.
Verified on AWS:4.4.0-0.nightly-2020-02-18-005549 Machine controller deletes node object until instance is terminated, we can see logs bellow: can't proceed deleting machine while cloud instance is being terminated Failed to find existing instance by id i-08621c75e167a4f68: instance i-08621c75e167a4f68 state "terminated" is not in running, pending, stopped, stopping, shutting-down Instance does not exist deleting node "zhsun4-s2gcg-worker-us-east-2c-ftk9g" for machine machine deletion successful I0219 03:09:21.531611 1 actuator.go:538] zhsun4-s2gcg-worker-us-east-2c-ftk9g: Found instance by id: i-08621c75e167a4f68 I0219 03:09:21.531646 1 controller.go:236] zhsun4-s2gcg-worker-us-east-2c-ftk9g: can't proceed deleting machine while cloud instance is being terminated, requeuing I0219 03:09:21.531712 1 controller.go:163] zhsun4-s2gcg-worker-us-east-2c-jph8g: reconciling Machine I0219 ... I0219 03:09:51.284449 1 actuator.go:315] zhsun4-s2gcg-worker-us-east-2c-ftk9g: deleting machine W0219 03:09:51.374462 1 actuator.go:535] zhsun4-s2gcg-worker-us-east-2c-ftk9g: Failed to find existing instance by id i-08621c75e167a4f68: instance i-08621c75e167a4f68 state "terminated" is not in running, pending, stopped, stopping, shutting-down E0219 03:09:51.468978 1 utils.go:165] Excluding instance matching zhsun4-s2gcg-worker-us-east-2c-ftk9g: instance i-08621c75e167a4f68 state "terminated" is not in running, pending, stopped, stopping, shutting-down I0219 03:09:51.469018 1 actuator.go:357] zhsun4-s2gcg-worker-us-east-2c-ftk9g: found 0 existing instances for machine W0219 03:09:51.469032 1 actuator.go:359] zhsun4-s2gcg-worker-us-east-2c-ftk9g: no instances found to delete for machine I0219 03:09:51.469042 1 actuator.go:489] zhsun4-s2gcg-worker-us-east-2c-ftk9g: Checking if machine exists W0219 03:09:51.531060 1 actuator.go:535] zhsun4-s2gcg-worker-us-east-2c-ftk9g: Failed to find existing instance by id i-08621c75e167a4f68: instance i-08621c75e167a4f68 state "terminated" is not in running, pending, stopped, stopping, shutting-down E0219 03:09:51.591483 1 utils.go:165] Excluding instance matching zhsun4-s2gcg-worker-us-east-2c-ftk9g: instance i-08621c75e167a4f68 state "terminated" is not in running, pending, stopped, stopping, shutting-down I0219 03:09:51.591514 1 actuator.go:502] zhsun4-s2gcg-worker-us-east-2c-ftk9g: Instance does not exist I0219 03:09:51.591529 1 controller.go:241] ip-10-0-169-219.us-east-2.compute.internal: deleting node "zhsun4-s2gcg-worker-us-east-2c-ftk9g" for machine I0219 03:09:51.631704 1 controller.go:255] zhsun4-s2gcg-worker-us-east-2c-ftk9g: machine deletion successful
Verified on gcp, by looking the timestamp node is deleted after the instance is deleted. clusterversion: 4.4.0-0.nightly-2020-02-19-173908 I0221 02:34:07.226447 1 controller.go:321] Could not find node from noderef, it may have already been deleted: zhsung-nrjcj-w-a-fksnz.c.openshift-qe.internal I0221 02:34:07.226487 1 actuator.go:118] zhsung-nrjcj-w-a-fksnz: Deleting machine I0221 02:34:07.441236 1 reconciler.go:302] zhsung-nrjcj-w-a-fksnz: Machine does not exist I0221 02:34:07.441442 1 reconciler.go:319] zhsung-nrjcj-w-a-fksnz: Machine not found during delete, skipping I0221 02:34:07.441547 1 controller.go:230] zhsung-nrjcj-w-a-fksnz.c.openshift-qe.internal: deleting node "zhsung-nrjcj-w-a-fksnz" for machine I0221 02:34:07.441666 1 controller.go:377] Node "zhsung-nrjcj-w-a-fksnz.c.openshift-qe.internal" not found I0221 02:34:07.472003 1 controller.go:244] zhsung-nrjcj-w-a-fksnz: machine deletion successful
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581