Created attachment 1555979
Logs of machine-api-controllers pods
Description of problem:
In Disaster Recovery CI jobs we need to remove two masters and then restore etcd quorum. One of the first steps is to destroy two master instances.
When this is done via the Machine API, the second master on AWS does not get destroyed.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Run AWS cluster
2. Remove second master via `oc delete machine/..`
3. Attempt to remove third master
Result: the last removal never finishes, and the instance for this machine is still running
Info:
# oc get machines -n openshift-machine-api
NAME                                     INSTANCE              STATE     TYPE        REGION      ZONE         AGE
vrutkovs-zd5lp-master-0                  i-00bbc37ded7aedf6e   running   m4.xlarge   us-east-2   us-east-2a   15m
vrutkovs-zd5lp-master-1                  i-0f0279159c17a7e18   running   m4.xlarge   us-east-2   us-east-2b   15m
vrutkovs-zd5lp-master-2                  i-08b8c48c6798fabcd   running   m4.xlarge   us-east-2   us-east-2c   15m
vrutkovs-zd5lp-worker-us-east-2a-rvp6t   i-0baf2379ac79c92db   running   m4.large    us-east-2   us-east-2a   14m
vrutkovs-zd5lp-worker-us-east-2b-7c7v8   i-005ede30a5ed0cccb   running   m4.large    us-east-2   us-east-2b   14m
vrutkovs-zd5lp-worker-us-east-2c-6k9sj   i-0ba605762c95f1fd4   running   m4.large    us-east-2   us-east-2c   14m
# oc delete machine vrutkovs-zd5lp-master-1
machine.machine.openshift.io "vrutkovs-zd5lp-master-1" deleted
# oc delete machine vrutkovs-zd5lp-master-2
machine.machine.openshift.io "vrutkovs-zd5lp-master-2" deleted
<the call hangs here>
master-2 gets removed too
The machine gets `deletionTimestamp` set as expected, and the object can be removed once its finalizer is removed (that, of course, won't remove the AWS instance)
Logic in the (upstream) cluster-api prevents the machine controller from deleting the machine that backs the node the controller itself is running on.
This is probably a good check to have. Logs indicate "Skipping reconciling of machine object" which is only printed for that one condition.
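For anyone reading along, the guard described above can be sketched roughly as follows. This is an illustration in Python, not the upstream code (which is Go and may differ in detail, for example in how it identifies the node). The `Machine` and `NodeRef` classes and the `is_delete_allowed` name are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeRef:
    """Reference from a machine to the node it backs (hypothetical model)."""
    name: str


@dataclass
class Machine:
    """Minimal stand-in for a Machine object (hypothetical model)."""
    name: str
    node_ref: Optional[NodeRef] = None


def is_delete_allowed(machine: Machine, controller_node_name: str) -> bool:
    """Return False only when the machine backs the node running the controller."""
    # If the controller's node is unknown, or the machine has no node yet,
    # there is nothing to protect, so deletion may proceed.
    if not controller_node_name or machine.node_ref is None:
        return True
    # Refuse to delete the machine hosting this controller.
    return machine.node_ref.name != controller_node_name


# Example: the controller runs on the node backed by master-2 (names are made up).
hosting = Machine("master-2", NodeRef("node-c"))
other = Machine("master-1", NodeRef("node-b"))
print(is_delete_allowed(hosting, "node-c"))  # False
print(is_delete_allowed(other, "node-c"))    # True
```

With a check like this, which master ends up stuck depends on where the controller pod happens to be scheduled, not on the order of the delete commands.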
Tested that removing another machine works, so the machine controller should at least show a better error message in the logs. Ideally it would also return an error when the machine is being removed
PR to change message upstream: https://github.com/kubernetes-sigs/cluster-api/pull/905
Downstream pick commit created: https://github.com/openshift/cluster-api/pull/30
Cherry-pick to 4.1 branch: https://github.com/openshift/cluster-api/pull/31
Already merged in aws-actuator as part of https://github.com/openshift/cluster-api-provider-aws/pull/203
Verified in 4.1.0-0.nightly-2019-05-04-054221
1. Setup cluster
2. Delete master-1, then master-2
On deleting master-2, the `oc delete machine/...` hangs. Controller logged:
I0505 05:24:08.364167 1 controller.go:226] Machine "jhou1-j4fzc-master-1" deletion successful
I0505 05:24:18.686320 1 controller.go:129] Reconciling Machine "jhou1-j4fzc-master-2"
I0505 05:24:18.686346 1 controller.go:292] Machine "jhou1-j4fzc-master-2" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0505 05:24:18.686385 1 controller.go:189] Deleting machine hosting this controller is not allowed. Skipping reconciliation of machine "jhou1-j4fzc-master-2"
After cancelling the delete command, the master-2 machine is still not deleted.
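To spot this condition when a delete hangs, the controller log can be scanned for the new message. Below is a small sketch that assumes the message wording matches the log lines quoted above; the `find_blocked_machines` helper is hypothetical and the pattern would need updating if the wording changes.

```python
import re
from typing import List

# Pattern taken from the controller log line quoted in this report.
SKIP_RE = re.compile(
    r"Deleting machine hosting this controller is not allowed\. "
    r'Skipping reconciliation of machine "([^"]+)"'
)


def find_blocked_machines(log_text: str) -> List[str]:
    """Return machine names whose deletion was skipped, in first-seen order."""
    blocked: List[str] = []
    for match in SKIP_RE.finditer(log_text):
        name = match.group(1)
        # The controller re-logs on every reconcile, so de-duplicate.
        if name not in blocked:
            blocked.append(name)
    return blocked


sample = (
    'I0505 05:24:08.364167 1 controller.go:226] Machine "jhou1-j4fzc-master-1" deletion successful\n'
    'I0505 05:24:18.686385 1 controller.go:189] Deleting machine hosting this controller is not allowed. '
    'Skipping reconciliation of machine "jhou1-j4fzc-master-2"\n'
)
print(find_blocked_machines(sample))  # ['jhou1-j4fzc-master-2']
```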
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.