1700931 – `oc delete machine` gets stuck when attempting to remove a machine, which runs the controller

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	low
Target Milestone:	---
Target Release:	4.1.0
Assignee:	Michael Gugino
QA Contact:	Jianwei Hou
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-04-17 17:02 UTC by Vadim Rutkovsky
Modified:	2019-06-04 10:47 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-06-04 10:47:47 UTC
Target Upstream Version:
Embargoed:

Attachments
Logs of machine-api-controllers pods (16.03 KB, application/x-xz) 2019-04-17 17:02 UTC, Vadim Rutkovsky

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:0758	0	None	None	None	2019-06-04 10:47:55 UTC

Description Vadim Rutkovsky 2019-04-17 17:02:48 UTC

Created attachment 1555979
Logs of machine-api-controllers pods

Description of problem:


In Disaster Recovery CI jobs we need to remove two masters and restore etcd quorum. One of the first tasks in doing so is to destroy two master instances.

When this is done via the Machine API, the second master on AWS doesn't get destroyed.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Run AWS cluster
2. Remove second master via `oc delete machine/..`
3. Attempt to remove third master

Result: the last removal doesn't finish; the instance for this machine is still running

Info:
# oc get machines -n openshift-machine-api
NAME                                     INSTANCE              STATE     TYPE        REGION      ZONE         AGE
vrutkovs-zd5lp-master-0                  i-00bbc37ded7aedf6e   running   m4.xlarge   us-east-2   us-east-2a   15m
vrutkovs-zd5lp-master-1                  i-0f0279159c17a7e18   running   m4.xlarge   us-east-2   us-east-2b   15m
vrutkovs-zd5lp-master-2                  i-08b8c48c6798fabcd   running   m4.xlarge   us-east-2   us-east-2c   15m
vrutkovs-zd5lp-worker-us-east-2a-rvp6t   i-0baf2379ac79c92db   running   m4.large    us-east-2   us-east-2a   14m
vrutkovs-zd5lp-worker-us-east-2b-7c7v8   i-005ede30a5ed0cccb   running   m4.large    us-east-2   us-east-2b   14m
vrutkovs-zd5lp-worker-us-east-2c-6k9sj   i-0ba605762c95f1fd4   running   m4.large    us-east-2   us-east-2c   14m
# oc delete machine vrutkovs-zd5lp-master-1
machine.machine.openshift.io "vrutkovs-zd5lp-master-1" deleted
# oc delete machine vrutkovs-zd5lp-master-2
machine.machine.openshift.io "vrutkovs-zd5lp-master-2" deleted
<the call hangs here>
^C

Expected results:
master-2 gets removed too

Additional info:
The Machine gets `deletionTimestamp` set as expected, and the object can be removed once its finalizer is removed from `metadata.finalizers` (that, of course, won't remove the AWS instance).
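
The finalizer behaviour noted above can be illustrated with a small model. This is a minimal Python sketch of how Kubernetes generally treats deletion of an object that carries finalizers; it is not code from the Machine API. The `ApiObject` class, the `request_delete` and `remove_finalizer` helpers, and the finalizer string are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class ApiObject:
    """Hypothetical stand-in for a stored Kubernetes object such as a Machine."""
    name: str
    finalizers: List[str] = field(default_factory=list)
    deletion_timestamp: Optional[datetime] = None


def request_delete(store: Dict[str, ApiObject], name: str) -> None:
    """Model a delete request: mark the object, drop it only if no finalizers remain."""
    obj = store[name]
    if obj.deletion_timestamp is None:
        obj.deletion_timestamp = datetime.now(timezone.utc)
    if not obj.finalizers:
        del store[name]


def remove_finalizer(store: Dict[str, ApiObject], name: str, finalizer: str) -> None:
    """Model a controller (or a manual edit) clearing one finalizer."""
    obj = store[name]
    if finalizer in obj.finalizers:
        obj.finalizers.remove(finalizer)
    # An object already marked for deletion goes away once its last finalizer is gone.
    if obj.deletion_timestamp is not None and not obj.finalizers:
        del store[name]


# Illustrative values only; "example.io/machine" is not a real finalizer name.
store = {"master-2": ApiObject("master-2", finalizers=["example.io/machine"])}
request_delete(store, "master-2")
print("master-2" in store)  # True: marked for deletion but still present
remove_finalizer(store, "master-2", "example.io/machine")
print("master-2" in store)  # False: object is gone, nothing else was cleaned up
```

In this model nothing outside the store is touched when the finalizer is cleared by hand, which matches the observation that the AWS instance is left running.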

Comment 1 Michael Gugino 2019-04-17 18:46:37 UTC

Logic in the (upstream) cluster-api prevents the machine controller from deleting the machine backing the node that the controller itself is running on.

https://github.com/openshift/cluster-api/blob/master/pkg/controller/machine/controller.go#L308

This is probably a good check to have.  Logs indicate "Skipping reconciling of machine object" which is only printed for that one condition.
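
The kind of guard described in this comment can be sketched roughly as follows. This is a simplified Python illustration, not the Go code in `controller.go`; the `Machine` class, `is_delete_allowed`, `reconcile`, and the idea that the controller is told its own node name are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Machine:
    """Hypothetical, reduced view of a Machine object."""
    name: str
    node_name: Optional[str] = None   # node backed by this machine, if known
    deletion_requested: bool = False  # stands in for a set deletionTimestamp


def is_delete_allowed(machine: Machine, controller_node: str) -> bool:
    """Return False only when the machine backs the node hosting the controller."""
    if not controller_node or machine.node_name is None:
        # Assumption: with missing information this sketch does not block deletion.
        return True
    return machine.node_name != controller_node


def reconcile(machine: Machine, controller_node: str) -> str:
    """Return the action this sketch would take for one reconcile pass."""
    if not machine.deletion_requested:
        return "reconcile"
    if not is_delete_allowed(machine, controller_node):
        # The object keeps its finalizer, so a waiting delete call never returns.
        return "skip"
    return "delete"


controller_node = "node-b"  # illustrative node name
for m in [Machine("master-1", "node-a", True), Machine("master-2", "node-b", True)]:
    print(m.name, reconcile(m, controller_node))
```

Under these assumptions, a delete request for the machine hosting the controller is skipped on every pass rather than rejected, which is consistent with the client call hanging instead of failing.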

Comment 2 Vadim Rutkovsky 2019-04-17 20:43:52 UTC

Tested that removing another machine works, so the machine-controller should at least show a better error message in the logs. Ideally it would also return an error when the machine is being removed.

Comment 3 Michael Gugino 2019-04-18 13:47:08 UTC

PR to change message upstream: https://github.com/kubernetes-sigs/cluster-api/pull/905

Comment 4 Michael Gugino 2019-04-22 13:31:48 UTC

Downstream pick commit created: https://github.com/openshift/cluster-api/pull/30

Comment 5 Michael Gugino 2019-04-22 14:29:06 UTC

Cherry-pick to 4.1 branch: https://github.com/openshift/cluster-api/pull/31

Comment 6 Jan Chaloupka 2019-04-25 10:51:33 UTC

Already merged in aws-actuator as part of https://github.com/openshift/cluster-api-provider-aws/pull/203

Comment 8 Jianwei Hou 2019-05-05 05:28:07 UTC

Verified in 4.1.0-0.nightly-2019-05-04-054221

Steps:
1. Set up a cluster
2. Delete master-1, then master-2

On deleting master-2, the `oc delete machine/...` hangs. Controller logged:
```
I0505 05:24:08.364167       1 controller.go:226] Machine "jhou1-j4fzc-master-1" deletion successful
I0505 05:24:18.686320       1 controller.go:129] Reconciling Machine "jhou1-j4fzc-master-2"
I0505 05:24:18.686346       1 controller.go:292] Machine "jhou1-j4fzc-master-2" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0505 05:24:18.686385       1 controller.go:189] Deleting machine hosting this controller is not allowed. Skipping reconciliation of machine "jhou1-j4fzc-master-2"
```

After cancelling the delete command, the master-2 machine is still not deleted.
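
A check like this one could be automated by scanning the controller log for the new message. The following is a minimal Python sketch under that assumption; `SKIP_PATTERN` and `machines_skipped` are hypothetical names, and the pattern is derived only from the log lines quoted above.

```python
import re
from typing import List

# Pattern built from the message text shown in the controller log above.
SKIP_PATTERN = re.compile(
    r'Deleting machine hosting this controller is not allowed\. '
    r'Skipping reconciliation of machine "(?P<machine>[^"]+)"'
)


def machines_skipped(log_text: str) -> List[str]:
    """Return machine names whose deletion was skipped, in first-seen order."""
    names: List[str] = []
    for line in log_text.splitlines():
        match = SKIP_PATTERN.search(line)
        if match and match.group("machine") not in names:
            names.append(match.group("machine"))
    return names


sample = '''I0505 05:24:08.364167       1 controller.go:226] Machine "jhou1-j4fzc-master-1" deletion successful
I0505 05:24:18.686385       1 controller.go:189] Deleting machine hosting this controller is not allowed. Skipping reconciliation of machine "jhou1-j4fzc-master-2"
'''
print(machines_skipped(sample))  # ['jhou1-j4fzc-master-2']
```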

Comment 10 errata-xmlrpc 2019-06-04 10:47:47 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
