Created attachment 1555979: Logs of machine-api-controllers pods

Description of problem:
In Disaster Recovery CI jobs we need to remove two masters and restore etcd quorum. One of the first steps is to destroy two master instances. When this is done via the Machine API, the second master on AWS does not get destroyed.

Version-Release number of selected component (if applicable):

How reproducible: Always

Steps to Reproduce:
1. Run AWS cluster
2. Remove second master via `oc delete machine/..`
3. Attempt to remove third master

Result: the last removal doesn't finish, and the instance for this machine is still running.

Info:
# oc get machines -n openshift-machine-api
NAME                                     INSTANCE              STATE     TYPE        REGION      ZONE         AGE
vrutkovs-zd5lp-master-0                  i-00bbc37ded7aedf6e   running   m4.xlarge   us-east-2   us-east-2a   15m
vrutkovs-zd5lp-master-1                  i-0f0279159c17a7e18   running   m4.xlarge   us-east-2   us-east-2b   15m
vrutkovs-zd5lp-master-2                  i-08b8c48c6798fabcd   running   m4.xlarge   us-east-2   us-east-2c   15m
vrutkovs-zd5lp-worker-us-east-2a-rvp6t   i-0baf2379ac79c92db   running   m4.large    us-east-2   us-east-2a   14m
vrutkovs-zd5lp-worker-us-east-2b-7c7v8   i-005ede30a5ed0cccb   running   m4.large    us-east-2   us-east-2b   14m
vrutkovs-zd5lp-worker-us-east-2c-6k9sj   i-0ba605762c95f1fd4   running   m4.large    us-east-2   us-east-2c   14m
# oc delete machine vrutkovs-zd5lp-master-1
machine.machine.openshift.io "vrutkovs-zd5lp-master-1" deleted
# oc delete machine vrutkovs-zd5lp-master-2
machine.machine.openshift.io "vrutkovs-zd5lp-master-2" deleted
<the call hangs here>
^C

Expected results: master-2 gets removed too.

Additional info:
The machine gets `deletionTimestamp` set as expected, and the object can be removed once the finalizer is removed from it (that, of course, won't remove the AWS instance).
Logic in the (upstream) cluster-api prevents the machine controller from deleting the machine whose node the controller itself is running on: https://github.com/openshift/cluster-api/blob/master/pkg/controller/machine/controller.go#L308

This is probably a good check to have. The logs show "Skipping reconciling of machine object", which is only printed for that one condition.
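For readers unfamiliar with that check, here is a minimal sketch of the idea, not the actual cluster-api code. The `Machine` and `NodeRef` types, the `deleteAllowed` function, and the node names and UIDs are all hypothetical stand-ins; the real controller works against the Kubernetes API and may differ in detail. The sketch assumes the controller knows the name and UID of the node it runs on and refuses deletion only when the machine's node reference matches both.

```go
package main

import "fmt"

// NodeRef is a hypothetical, trimmed-down stand-in for the node reference
// recorded in a machine's status.
type NodeRef struct {
	Name string
	UID  string
}

// Machine is a hypothetical stand-in for the Machine API object.
type Machine struct {
	Name    string
	NodeRef *NodeRef // nil until the machine has been linked to a node
}

// deleteAllowed reports whether a controller running on the node identified
// by controllerNode/controllerNodeUID may delete machine m. Deletion is
// refused only when m is backed by that very node.
func deleteAllowed(m Machine, controllerNode, controllerNodeUID string) bool {
	// Without a known controller node or a node reference there is
	// nothing to compare, so deletion proceeds.
	if controllerNode == "" || m.NodeRef == nil {
		return true
	}
	// A different node name means the machine does not host the controller.
	if m.NodeRef.Name != controllerNode {
		return true
	}
	// Same name: refuse only if the UID also matches.
	return m.NodeRef.UID != controllerNodeUID
}

func main() {
	// Illustrative values only; the controller is assumed to run on node-c.
	machines := []Machine{
		{Name: "master-1", NodeRef: &NodeRef{Name: "node-b", UID: "uid-b"}},
		{Name: "master-2", NodeRef: &NodeRef{Name: "node-c", UID: "uid-c"}},
	}
	for _, m := range machines {
		if deleteAllowed(m, "node-c", "uid-c") {
			fmt.Printf("machine %q: deletion may proceed\n", m.Name)
		} else {
			fmt.Printf("machine %q: hosts this controller, skipping\n", m.Name)
		}
	}
}
```

Under these assumptions, which master gets stuck depends on where the controller pod is scheduled at the time of the delete, which would be consistent with the first deletion succeeding and a later one hanging.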
I tested that removing another machine works, so at the very least machine-controller should log a better error message. Ideally it would also return an error when removal of such a machine is requested.
PR to change message upstream: https://github.com/kubernetes-sigs/cluster-api/pull/905
Downstream cherry-pick created: https://github.com/openshift/cluster-api/pull/30
Cherry-pick to 4.1 branch: https://github.com/openshift/cluster-api/pull/31
Already merged in aws-actuator as part of https://github.com/openshift/cluster-api-provider-aws/pull/203
Verified in 4.1.0-0.nightly-2019-05-04-054221

Steps:
1. Setup cluster
2. Delete master-1, then master-2

On deleting master-2, the `oc delete machine/...` command hangs. The controller logged:
```
I0505 05:24:08.364167 1 controller.go:226] Machine "jhou1-j4fzc-master-1" deletion successful
I0505 05:24:18.686320 1 controller.go:129] Reconciling Machine "jhou1-j4fzc-master-2"
I0505 05:24:18.686346 1 controller.go:292] Machine "jhou1-j4fzc-master-2" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0505 05:24:18.686385 1 controller.go:189] Deleting machine hosting this controller is not allowed. Skipping reconciliation of machine "jhou1-j4fzc-master-2"
```
After cancelling the delete command, the master-2 machine is still not deleted.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758