Bug 1700931

Summary: `oc delete machine` gets stuck when attempting to remove the machine that runs the controller
Product: OpenShift Container Platform Reporter: Vadim Rutkovsky <vrutkovs>
Component: Cloud Compute Assignee: Michael Gugino <mgugino>
Status: CLOSED ERRATA QA Contact: Jianwei Hou <jhou>
Severity: low Docs Contact:
Priority: unspecified    
Version: 4.1.0 CC: agarcial, jchaloup, mgugino, sdodson, wking
Target Milestone: ---   
Target Release: 4.1.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-06-04 10:47:47 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Logs of machine-api-controllers pods none

Description Vadim Rutkovsky 2019-04-17 17:02:48 UTC
Created attachment 1555979
Logs of machine-api-controllers pods

Description of problem:


In Disaster Recovery CI jobs we need to remove two masters and then restore etcd quorum. One of the first steps is to destroy two master instances.

When this is done via the Machine API, the second master on AWS does not get destroyed.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Run a cluster on AWS
2. Remove the second master via `oc delete machine/..`
3. Attempt to remove the third master

Result: the last removal doesn't finish, and the instance for this machine is still running

Info:
# oc get machines -n openshift-machine-api
NAME                                     INSTANCE              STATE     TYPE        REGION      ZONE         AGE
vrutkovs-zd5lp-master-0                  i-00bbc37ded7aedf6e   running   m4.xlarge   us-east-2   us-east-2a   15m
vrutkovs-zd5lp-master-1                  i-0f0279159c17a7e18   running   m4.xlarge   us-east-2   us-east-2b   15m
vrutkovs-zd5lp-master-2                  i-08b8c48c6798fabcd   running   m4.xlarge   us-east-2   us-east-2c   15m
vrutkovs-zd5lp-worker-us-east-2a-rvp6t   i-0baf2379ac79c92db   running   m4.large    us-east-2   us-east-2a   14m
vrutkovs-zd5lp-worker-us-east-2b-7c7v8   i-005ede30a5ed0cccb   running   m4.large    us-east-2   us-east-2b   14m
vrutkovs-zd5lp-worker-us-east-2c-6k9sj   i-0ba605762c95f1fd4   running   m4.large    us-east-2   us-east-2c   14m
# oc delete machine vrutkovs-zd5lp-master-1
machine.machine.openshift.io "vrutkovs-zd5lp-master-1" deleted
# oc delete machine vrutkovs-zd5lp-master-2
machine.machine.openshift.io "vrutkovs-zd5lp-master-2" deleted
<the call hangs here>
^C

Expected results:
master-2 gets removed too

Additional info:
The machine gets `deletionTimestamp` set as expected, and the object can be removed once its finalizer is removed from `metadata.finalizers` (that, of course, won't remove the AWS instance).
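
For readers unfamiliar with finalizers, here is a minimal Go sketch of why the object lingers. A delete request only marks the object for deletion, and the object disappears once no finalizers remain. The `object` type, the helper functions and the finalizer name `example.com/machine-cleanup` are all made up for illustration and are not taken from the Machine API.

```go
package main

import "fmt"

// object is a hypothetical, stripped-down stand-in for Kubernetes object metadata.
type object struct {
	Name              string
	DeletionRequested bool     // stands in for a set metadata.deletionTimestamp
	Finalizers        []string // stands in for metadata.finalizers
}

// gone reports whether the object would actually be dropped:
// deletion must have been requested and no finalizers may remain.
func gone(o object) bool {
	return o.DeletionRequested && len(o.Finalizers) == 0
}

// withoutFinalizer returns a copy of o with the named finalizer removed.
func withoutFinalizer(o object, name string) object {
	kept := make([]string, 0, len(o.Finalizers))
	for _, f := range o.Finalizers {
		if f != name {
			kept = append(kept, f)
		}
	}
	o.Finalizers = kept
	return o
}

func main() {
	// "example.com/machine-cleanup" is a made-up finalizer name.
	m := object{
		Name:              "master-2",
		DeletionRequested: true,
		Finalizers:        []string{"example.com/machine-cleanup"},
	}
	fmt.Println(gone(m)) // deletion requested, but a finalizer is still present

	m = withoutFinalizer(m, "example.com/machine-cleanup")
	fmt.Println(gone(m)) // no finalizers left, so the object can go away
}
```

In this sketch nothing cleans up the backing instance when the finalizer is stripped by hand, which matches the observation that the AWS instance keeps running.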

Comment 1 Michael Gugino 2019-04-17 18:46:37 UTC
Logic in the (upstream) cluster-api prevents the machine controller from deleting the machine that backs the node the controller itself is running on.

https://github.com/openshift/cluster-api/blob/master/pkg/controller/machine/controller.go#L308

This is probably a good check to have. The logs indicate "Skipping reconciling of machine object", which is only printed for that one condition.
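
A rough Go sketch of the kind of guard described above, to make the behaviour concrete. This is not the upstream implementation: the `Machine` type, the `deleteAllowed` and `reconcile` functions, and the node names are hypothetical, and the real controller works against API objects rather than plain structs. The assumption is that the controller knows the name of the node it runs on and compares it with the node linked to the machine being deleted.

```go
package main

import "fmt"

// Machine is a hypothetical, stripped-down stand-in for a Machine API object.
type Machine struct {
	Name     string
	NodeName string // node backing this machine; empty if not linked yet
	Deleting bool   // true once deletionTimestamp is set
}

// deleteAllowed reports whether a controller running on controllerNode
// may delete machine m. Deletion is refused only when m backs that node.
func deleteAllowed(controllerNode string, m Machine) bool {
	if controllerNode == "" || m.NodeName == "" {
		return true // nothing to compare, so do not block
	}
	return m.NodeName != controllerNode
}

// reconcile returns a message describing what the controller would do.
func reconcile(controllerNode string, m Machine) string {
	if !m.Deleting {
		return fmt.Sprintf("machine %q is not being deleted, nothing to do", m.Name)
	}
	if !deleteAllowed(controllerNode, m) {
		// The finalizer stays in place, so the delete call keeps waiting.
		return fmt.Sprintf("skipping deletion of machine %q: it hosts this controller", m.Name)
	}
	return fmt.Sprintf("deleting machine %q", m.Name)
}

func main() {
	// Node names are made up; the controller is assumed to run on "node-master-2".
	const controllerNode = "node-master-2"
	machines := []Machine{
		{Name: "master-1", NodeName: "node-master-1", Deleting: true},
		{Name: "master-2", NodeName: "node-master-2", Deleting: true},
	}
	for _, m := range machines {
		fmt.Println(reconcile(controllerNode, m))
	}
}
```

Under these assumptions the first machine is deleted and the second is skipped on every reconcile, which would explain why the `oc delete` call never returns.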

Comment 2 Vadim Rutkovsky 2019-04-17 20:43:52 UTC
I tested that removing other machines works, so the machine controller should at least show a better error message in the logs. Ideally it would also return an error when the machine is being removed.

Comment 3 Michael Gugino 2019-04-18 13:47:08 UTC
PR to change message upstream: https://github.com/kubernetes-sigs/cluster-api/pull/905

Comment 4 Michael Gugino 2019-04-22 13:31:48 UTC
Downstream pick commit created: https://github.com/openshift/cluster-api/pull/30

Comment 5 Michael Gugino 2019-04-22 14:29:06 UTC
Cherry-pick to 4.1 branch: https://github.com/openshift/cluster-api/pull/31

Comment 6 Jan Chaloupka 2019-04-25 10:51:33 UTC
Already merged in aws-actuator as part of https://github.com/openshift/cluster-api-provider-aws/pull/203

Comment 8 Jianwei Hou 2019-05-05 05:28:07 UTC
Verified in 4.1.0-0.nightly-2019-05-04-054221

Steps:
1. Set up a cluster
2. Delete master-1, then master-2

On deleting master-2, `oc delete machine/...` hangs. The controller logged:
```
I0505 05:24:08.364167       1 controller.go:226] Machine "jhou1-j4fzc-master-1" deletion successful
I0505 05:24:18.686320       1 controller.go:129] Reconciling Machine "jhou1-j4fzc-master-2"
I0505 05:24:18.686346       1 controller.go:292] Machine "jhou1-j4fzc-master-2" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0505 05:24:18.686385       1 controller.go:189] Deleting machine hosting this controller is not allowed. Skipping reconciliation of machine "jhou1-j4fzc-master-2"
```

After cancelling the delete command, the master-2 machine is still not deleted.

Comment 10 errata-xmlrpc 2019-06-04 10:47:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758