Bug 1829664

Summary:	Guard against 4.3 -> 4.4 -> 4.5 with compute nodes stuck on drain
Product:	OpenShift Container Platform	Reporter:	W. Trevor King <wking>
Component:	Machine Config Operator	Assignee:	Erica von Buelow <evb>
Status:	CLOSED DEFERRED	QA Contact:	Michael Nguyen <mnguyen>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4.3.z	CC:	amurdaca, scuppett, smilner
Target Milestone:	---	Keywords:	Upgrades
Target Release:	4.6.0
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2020-09-01 11:25:31 UTC	Type:	Bug

Description W. Trevor King 2020-04-30 02:43:48 UTC

Bug 1822269 (and possibly others in that space) can result in MachineSets that are stuck on old versions because stuck-in-terminating pods block the drain.  The current resolution involves manually killing the stuck pods.  There is also an RFE in the queue about setting Upgradeable=False if there are lagging compute pools like that, to make sure everything is good to go before taking a minor-bumping leap from 4.y -> 4.(y+1).  We should land that Upgradeable=False guard in 4.4 (and possibly 4.3) to prevent:

1. Compute stuck on 4.3 with stuck-in-terminating pods blocking the drain.
2. User doesn't notice, updates to 4.4.
3. User still doesn't notice, updates to 4.5.
4. 4.3 nodes vs. 4.5 control plane fireworks (not actually sure this would happen, but even if it doesn't, 4.3 nodes vs. a 4.5 control plane is definitely going to be a situation that doesn't get a lot of soak time in CI).

Doesn't have to be the Upgradeable=False guard; as long as we ensure everything made it to 4.4 before the 4.4 -> 4.5 jump, we shouldn't have a problem.
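
The check described above can be sketched roughly as follows. This is a minimal illustration, not the Machine Config Operator's code: the `Pool` type, its fields, and the `upgradeable` function are hypothetical names, and versions are assumed to be simple `major.minor[.patch]` strings.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """Hypothetical stand-in for a compute pool; not a real MCO type."""
    name: str
    node_versions: list  # one version string per node, e.g. ["4.3", "4.4"]


def minor(version: str) -> tuple:
    """Parse '4.3' or '4.3.18' into the tuple (4, 3)."""
    major, minor_part = version.split(".")[:2]
    return int(major), int(minor_part)


def upgradeable(cluster_version: str, pools: list) -> tuple:
    """Return (ok, reason); ok is False if any node lags the cluster's minor version."""
    current = minor(cluster_version)
    for pool in pools:
        lagging = [v for v in pool.node_versions if minor(v) < current]
        if lagging:
            reason = f"pool {pool.name} has {len(lagging)} node(s) behind {cluster_version}"
            return False, reason
    return True, ""
```

With a cluster at 4.4 and a worker pool that still has a 4.3 node, `upgradeable("4.4", [Pool("worker", ["4.4", "4.3"])])` reports not-upgradeable, which would block the 4.4 -> 4.5 jump until the lagging node catches up.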

Comment 1 W. Trevor King 2020-04-30 17:22:20 UTC

This is similar to bug 1829999, but they're slightly different.  This one is about blocking folks from making multiple minor bumps while they have stuck machines that lag behind by a minor version.  That one is about unsticking stuck machines so folks can avoid that block.

Comment 3 Ben Parees 2020-05-11 17:29:34 UTC

Will this be backported to 4.4 or earlier?  I see no clone.

Comment 5 Antonio Murdaca 2020-06-16 12:58:00 UTC

Adding UpcomingSprint as this won't make the current sprint. We'll try to work on this bug in the next sprint.

Comment 6 Antonio Murdaca 2020-07-10 09:40:40 UTC

Adding UpcomingSprint as I won't be able to finish this by the current sprint. I'll revisit from next week.

Comment 7 Antonio Murdaca 2020-09-01 11:25:31 UTC

This is captured in https://issues.redhat.com/browse/GRPA-2693

Comment 8 W. Trevor King 2020-12-04 03:47:19 UTC

https://github.com/openshift/machine-config-operator/pull/2231 should give us this guard in 4.7 and later, by setting Upgradeable=False if any machine-config pools are degraded :)
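
As a rough sketch of the behavior described above (not the code from that pull request): the function below sets Upgradeable=False whenever any pool reports a Degraded condition with status "True". The dict layout, the function name, and the "DegradedPool" reason string are assumptions made for illustration.

```python
def upgradeable_condition(pools: list) -> dict:
    """Build an Upgradeable condition from a list of pool dicts.

    Each pool dict is assumed to look like:
    {"name": "worker", "conditions": [{"type": "Degraded", "status": "True"}]}
    """
    degraded = sorted(
        pool["name"]
        for pool in pools
        if any(
            c.get("type") == "Degraded" and c.get("status") == "True"
            for c in pool.get("conditions", [])
        )
    )
    if degraded:
        return {
            "type": "Upgradeable",
            "status": "False",
            "reason": "DegradedPool",  # hypothetical reason string
            "message": "degraded pools: " + ", ".join(degraded),
        }
    return {"type": "Upgradeable", "status": "True"}
```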