I am currently trying to get the disruptive tests to always run. The naive implementation in the test uses the Machine API to select and delete machines (in 4.3 or 4.4 the etcd operator will help them recover automatically). However, the test sometimes hangs because of this code in the cloud machine controller:

I0913 16:51:22.363617 1 controller.go:203] Deleting machine hosting this controller is not allowed. Skipping reconciliation of machine "ci-ln-wbhfl5t-d5d6b-2jqb5-master-0"

Mike indicated that this is just unnecessary code inherited from upstream, and he has a PR to fix it. If we can merge that safely for 4.2, it unblocks implementing the recovery test using the simple path (and is probably safer in the long run, since quorum guard already protects masters). I would like to see the PR https://github.com/openshift/cluster-api/pull/49 merged in 4.2 so we can unblock.
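For readers unfamiliar with the behavior, the log message above suggests a guard of roughly the following shape. This is only an illustrative sketch, not the actual controller code: the `Machine` struct, `isDeleteAllowed`, `reconcileDelete`, and the `guardEnabled` switch are all hypothetical names introduced here, and the real check may compare different fields.

```go
package main

import "fmt"

// Machine is a hypothetical, trimmed-down stand-in for the Machine API object.
type Machine struct {
	Name     string
	NodeName string // node backing this machine; empty if no node is linked yet
}

// isDeleteAllowed sketches the guard implied by the log message: deletion is
// refused only when the machine backs the node the controller itself runs on.
// controllerNode is assumed to be the name of the node hosting the controller.
func isDeleteAllowed(controllerNode string, m Machine) bool {
	if controllerNode == "" || m.NodeName == "" {
		// Without both names there is nothing to compare, so allow deletion.
		return true
	}
	return m.NodeName != controllerNode
}

// reconcileDelete returns a description of what the controller would do.
// guardEnabled=true models the behavior reported here; false models the
// behavior once the guard is dropped.
func reconcileDelete(controllerNode string, m Machine, guardEnabled bool) string {
	if guardEnabled && !isDeleteAllowed(controllerNode, m) {
		return fmt.Sprintf("skipping reconciliation of machine %q", m.Name)
	}
	return fmt.Sprintf("deleting machine %q", m.Name)
}

func main() {
	m := Machine{Name: "master-0", NodeName: "node-a"}
	fmt.Println(reconcileDelete("node-a", m, true))  // controller runs on node-a, guard on
	fmt.Println(reconcileDelete("node-a", m, false)) // same machine, guard off
}
```

With the guard on, a test that happens to pick the machine hosting the controller would wait indefinitely for a deletion that is never reconciled, which would be consistent with the hang described above.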
PR to address this: https://github.com/openshift/cluster-api/pull/49
It would need to be vendored into all actuators.
https://github.com/openshift/cluster-api/pull/72
https://github.com/openshift/cluster-api-provider-aws/pull/260
https://github.com/openshift/cluster-api-provider-gcp/pull/62
https://github.com/openshift/cluster-api-provider-azure/pull/83
https://github.com/openshift/cluster-api-provider-openstack/pull/67
Merged on AWS and GCP. Waiting for tests to go green on Azure. The OpenStack PR is pending approval, so assigning to mfedosin now for awareness.
All PRs merged.
Verified on 4.2.0-0.nightly-2019-09-18-114152. Deleting a machine that is hosting the machine-controller is now allowed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922