1829664 – Guard against 4.3 -> 4.4 -> 4.5 with compute nodes stuck on drain


Keywords:
Status:	CLOSED DEFERRED
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Machine Config Operator
Sub Component:
Version:	4.3.z
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Erica von Buelow
QA Contact:	Michael Nguyen
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-04-30 02:43 UTC by W. Trevor King
Modified:	2020-12-04 03:47 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-09-01 11:25:31 UTC
Target Upstream Version:
Embargoed:


Description W. Trevor King 2020-04-30 02:43:48 UTC

Bug 1822269 (and possibly others in that space) can leave MachineSets stuck on old versions because stuck-in-terminating pods block the drain.  The current resolution is to manually kill the stuck pods.  There is also an RFE in the queue about setting Upgradeable=False when compute pools lag like that, to make sure everything is good to go before taking a minor-bumping leap from 4.y -> 4.(y+1).  We should land that Upgradeable=False guard in 4.4 (and possibly 4.3) to prevent:

1. Compute stuck on 4.3 with stuck-in-terminating pods.
2. User doesn't notice, updates to 4.4.
3. Still doesn't notice, updates to 4.5.
4. 4.3 nodes vs. 4.5 control plane fireworks (not actually sure this would happen, but even if it doesn't, 4.3 nodes vs. a 4.5 control plane is definitely going to be a situation that doesn't get a lot of soak time in CI).

Doesn't have to be the Upgradeable=False guard; as long as we ensure everything made it to 4.4 before the 4.4 -> 4.5 jump, we shouldn't have a problem.
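For illustration only, the check described above could be sketched as follows.  This is a minimal sketch, not the operator's implementation: the `Pool` type, the `upgradeable` helper, and the idea of reading per-node versions from a plain list are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Pool:
    """Hypothetical stand-in for a compute/control-plane pool and the versions its nodes run."""
    name: str
    node_versions: List[str]  # e.g. ["4.4.3", "4.3.18"]


def minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) parsed from an 'x.y.z' version string."""
    major_part, minor_part = version.split(".")[:2]
    return int(major_part), int(minor_part)


def upgradeable(cluster_version: str, pools: List[Pool]) -> Tuple[bool, str]:
    """Return (False, reason) if any node lags behind the cluster's current minor version."""
    target = minor(cluster_version)
    lagging = sorted(
        pool.name
        for pool in pools
        if any(minor(v) < target for v in pool.node_versions)
    )
    if lagging:
        reason = "pools with nodes behind %d.%d: %s" % (
            target[0], target[1], ", ".join(lagging))
        return False, reason
    return True, ""
```

With a control plane on 4.4.3 and one worker node still on 4.3.18, this sketch reports the worker pool as lagging, so the 4.4 -> 4.5 jump would be blocked until that node catches up.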

Comment 1 W. Trevor King 2020-04-30 17:22:20 UTC

This is similar to bug 1829999, but they're slightly different.  This one is about blocking folks from making multiple minor bumps while they have stuck machines that lag behind by a minor version.  That one is about unsticking stuck machines so folks can avoid that block.

Comment 3 Ben Parees 2020-05-11 17:29:34 UTC

Will this be backported to 4.4 or earlier?  I see no clone.

Comment 5 Antonio Murdaca 2020-06-16 12:58:00 UTC

Adding UpcomingSprint as this won't make the current sprint. We'll try to work on this bug in the next sprint.

Comment 6 Antonio Murdaca 2020-07-10 09:40:40 UTC

Adding UpcomingSprint as I won't be able to finish this by the current sprint. I'll revisit from next week.

Comment 7 Antonio Murdaca 2020-09-01 11:25:31 UTC

This is captured in https://issues.redhat.com/browse/GRPA-2693

Comment 8 W. Trevor King 2020-12-04 03:47:19 UTC

https://github.com/openshift/machine-config-operator/pull/2231 should give us this guard in 4.7 and later, by setting Upgradeable=False if any machine-config pools are degraded :)
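A rough sketch of the behavior described above, for readers who want the logic spelled out.  This is not the code from that pull request: the dict shapes only approximate Kubernetes-style `status.conditions` lists, and the function names and the `DegradedPool` / `AsExpected` reason strings are hypothetical.

```python
from typing import Dict, List


def pool_is_degraded(pool: Dict) -> bool:
    """True if the pool carries a Degraded condition whose status is "True"."""
    for cond in pool.get("status", {}).get("conditions", []):
        if cond.get("type") == "Degraded" and cond.get("status") == "True":
            return True
    return False


def upgradeable_condition(pools: List[Dict]) -> Dict[str, str]:
    """Build an Upgradeable condition: False if any pool is degraded, else True."""
    degraded = sorted(p["metadata"]["name"] for p in pools if pool_is_degraded(p))
    if degraded:
        return {
            "type": "Upgradeable",
            "status": "False",
            "reason": "DegradedPool",  # hypothetical reason string
            "message": "degraded pools: " + ", ".join(degraded),
        }
    return {
        "type": "Upgradeable",
        "status": "True",
        "reason": "AsExpected",  # hypothetical reason string
        "message": "",
    }
```

Note that this keys on degraded pools rather than on version skew directly, so it only catches the stuck-drain case if the stuck pool actually reports itself as degraded.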
