
Bug 1841328

Summary:	Inability to drain node should not block machine-controller
Product:	OpenShift Container Platform
Reporter:	Christoph Blecker <cblecker>
Component:	Cloud Compute
Assignee:	Alberto <agarcial>
Cloud Compute sub component:	Other Providers
QA Contact:	Jianwei Hou <jhou>
Status:	CLOSED DUPLICATE
Severity:	unspecified
Priority:	unspecified
CC:	mgugino, wking
Version:	4.3.z
Keywords:	ServiceDeliveryImpact
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2020-05-28 22:56:37 UTC
Type:	Bug

Description Christoph Blecker 2020-05-28 21:06:02 UTC

Description of problem:
If a node is unhealthy and being replaced (either by a MachineHealthCheck or by an operator manually deleting the Machine object), the machine-controller will attempt to drain the node.

However, if the node is unhealthy in such a way that a drain cannot succeed (e.g. the node is completely offline or stopped), the drain will fail.

When the drain fails, the whole loop is requeued, so the machine is never deleted and therefore never properly replaced.
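
The behaviour described above can be illustrated with a small simulation. This is only a sketch, not the machine-controller's code: `Node`, `drain`, `reconcile_delete` and `run` are hypothetical names, and the pass limit exists only so the example terminates.

```python
from dataclasses import dataclass


class DrainError(Exception):
    """Raised when a node cannot be drained."""


@dataclass
class Node:
    name: str
    reachable: bool


def drain(node: Node) -> None:
    # Hypothetical stand-in for a real drain: it can only succeed
    # when the node is reachable.
    if not node.reachable:
        raise DrainError(f"cannot drain {node.name}: node is unreachable")


def reconcile_delete(node: Node) -> bool:
    """One simplified delete pass.

    Returns True if the machine was deleted, False if the pass was requeued.
    """
    try:
        drain(node)
    except DrainError:
        # Drain failed: requeue and try again later. Nothing after
        # this point runs, so the instance is never deleted.
        return False
    return True


def run(node: Node, max_passes: int = 5) -> int:
    """Return the pass on which the machine was deleted, or -1 if never."""
    for attempt in range(1, max_passes + 1):
        if reconcile_delete(node):
            return attempt
    return -1
```

With a reachable node, `run` returns on the first pass. With a stopped node, every pass is requeued and `run` returns -1 however large `max_passes` is.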

Version-Release number of selected component (if applicable):
4.3.18


How reproducible:
Consistently

Steps to Reproduce:
1. Provision a cluster (e.g. an IPI AWS cluster)
2. Stop one of the machines at the cloud provider (e.g. stop the AWS instance directly from the console)
3. When the node goes unready, delete the Machine object (this can also be accomplished automatically with a MachineHealthCheck)

Actual results:
The drain fails and the machine is never deleted

Expected results:
If the drain fails, the machine-controller should move on and delete the machine anyway
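
A minimal sketch of the requested behaviour follows. It is an illustration under assumptions, not the fix that shipped: `delete_machine` and its two callbacks are hypothetical, and a real controller would probably bypass the drain only under narrower conditions (for example, once the node is known to be unreachable).

```python
from typing import Callable


def delete_machine(
    drain: Callable[[], None],
    delete_instance: Callable[[], None],
) -> bool:
    """Attempt a drain, then delete the instance regardless of the outcome.

    Returns True if the drain succeeded, False if it failed and was skipped.
    """
    drained = True
    try:
        drain()
    except Exception as err:  # broad on purpose: any drain failure is tolerated
        drained = False
        print(f"drain failed, continuing with deletion: {err}")
    # Deletion is not gated on a successful drain.
    delete_instance()
    return drained
```

The return value lets the caller record whether the node was drained cleanly, while the deletion itself always proceeds.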


Additional info:

Comment 2 Michael Gugino 2020-05-28 22:56:12 UTC

This is fixed in all current versions of 4.4.  It's unlikely to get backported to 4.3.

Comment 3 W. Trevor King 2020-06-27 03:17:58 UTC

I think bug 1733474 is the one that landed the fix for deleting unreachable nodes.  It was cloned back to 4.3.z as bug 1803762, which was closed WONTFIX with "we won't be backporting".  Marking this one as a dup of bug 1803762.

*** This bug has been marked as a duplicate of bug 1803762 ***