Bug 1752088

Summary: Recovery e2e test hangs when cloud-controller pod is on machine that is chosen for deletion
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Cloud Compute
Assignee: Mike Fedosin <mfedosin>
Status: CLOSED ERRATA
QA Contact: Jianwei Hou <jhou>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.2.0
CC: agarcial, mgugino, wking
Target Milestone: ---
Target Release: 4.2.0
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Last Closed: 2019-10-16 06:41:11 UTC
Type: Bug

Description Clayton Coleman 2019-09-13 17:02:27 UTC
I am currently trying to get the disruptive tests to always run, but the naive implementation in the test uses the machine API to select and delete machines (which, in 4.3 or 4.4, the etcd operator will help recover automatically).

However, the test sometimes hangs because of code in the cloud machine controller:

I0913 16:51:22.363617       1 controller.go:203] Deleting machine hosting this controller is not allowed. Skipping reconciliation of machine "ci-ln-wbhfl5t-d5d6b-2jqb5-master-0"

Mike indicated that this is just unnecessary code carried over from upstream, and that there is a PR to fix it.  If we can merge that safely for 4.2, it unblocks implementing the recovery test using the simple path (and is probably safer in the long run, since quorum guard already protects masters).

Would like to see the PR https://github.com/openshift/cluster-api/pull/49 merged in 4.2 so we can unblock.
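
For readers unfamiliar with the check, here is a minimal Go sketch of the kind of guard that produces the log line above. It is an illustration only, not the actual cluster-api source: the `Machine` type, its fields, `shouldSkipDelete`, `reconcile`, and the `guardEnabled` switch are all hypothetical names invented for this example. Only the log message text is taken from this report.

```go
package main

import "fmt"

// Machine is a hypothetical, stripped-down stand-in for the machine API object.
type Machine struct {
	Name            string
	NodeName        string // node backing this machine (assumed field)
	DeletionPending bool   // true once deletion has been requested
}

// shouldSkipDelete reports whether the machine marked for deletion is the one
// backing the node that the controller itself runs on.
func shouldSkipDelete(m Machine, controllerNode string) bool {
	return m.DeletionPending && controllerNode != "" && m.NodeName == controllerNode
}

// reconcile returns a description of what the controller would do.
// With guardEnabled set, deletion of the controller's own host is skipped
// indefinitely, which is what leaves the test hanging. Without the guard,
// the deletion simply proceeds.
func reconcile(m Machine, controllerNode string, guardEnabled bool) string {
	if !m.DeletionPending {
		return "nothing to do"
	}
	if guardEnabled && shouldSkipDelete(m, controllerNode) {
		return fmt.Sprintf("Deleting machine hosting this controller is not allowed. Skipping reconciliation of machine %q", m.Name)
	}
	return fmt.Sprintf("deleting machine %q", m.Name)
}

func main() {
	m := Machine{
		Name:            "ci-ln-wbhfl5t-d5d6b-2jqb5-master-0",
		NodeName:        "master-0",
		DeletionPending: true,
	}
	fmt.Println(reconcile(m, "master-0", true))  // guard present: deletion is skipped
	fmt.Println(reconcile(m, "master-0", false)) // guard removed: deletion proceeds
}
```

Because the guarded branch returns without error and without requeueing any other action, nothing ever deletes the machine while the controller pod stays on that node, so a test waiting for the machine to disappear waits forever.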

Comment 1 Michael Gugino 2019-09-13 17:05:51 UTC
PR to address: https://github.com/openshift/cluster-api/pull/49

Would need to be vendored into all actuators.

Comment 3 Alberto 2019-09-17 12:49:59 UTC
Merged on AWS/GCP. Waiting for tests to go green on Azure.
PR pending approval on OpenStack, so assigning to mfedosin now for awareness.

Comment 4 Alberto 2019-09-17 13:35:35 UTC
All PRs merged

Comment 6 Jianwei Hou 2019-09-19 06:55:41 UTC
Verified on 4.2.0-0.nightly-2019-09-18-114152

Deleting a machine that is hosting the machine-controller is now allowed.

Comment 7 errata-xmlrpc 2019-10-16 06:41:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922