Bug 1829664 - Guard against 4.3 -> 4.4 -> 4.5 with compute nodes stuck on drain
Summary: Guard against 4.3 -> 4.4 -> 4.5 with compute nodes stuck on drain
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Erica von Buelow
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-04-30 02:43 UTC by W. Trevor King
Modified: 2020-12-04 03:47 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-01 11:25:31 UTC
Target Upstream Version:
Embargoed:


Description W. Trevor King 2020-04-30 02:43:48 UTC
Bug 1822269 (and possibly others in that space) can result in MachineSets that are stuck on old versions because stuck-in-terminating pods block the drain.  The current resolution involves manually killing the stuck pods.  There is also an RFE in the queue about setting Upgradeable=False if there are lagging compute pools like that, to make sure everything is good to go before taking a minor-bumping leap from 4.y -> 4.(y+1).  We should land that Upgradeable=False guard in 4.4 (and possibly 4.3) to prevent:

1. Compute stuck on 4.3 with stuck-in-terminating pods.
2. User doesn't notice, updates to 4.4.
3. Still doesn't notice, updates to 4.5.
4. 4.3 nodes vs. 4.5 control plane fireworks (not actually sure this would happen, but even if it doesn't, 4.3 nodes vs. a 4.5 control plane is definitely going to be a situation that doesn't get a lot of soak time in CI).

Doesn't have to be the Upgradeable=False guard; as long as we ensure everything made it to 4.4 before the 4.4 -> 4.5 jump, we shouldn't have a problem.
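
The check described above can be sketched roughly as follows. This is only an illustration of the idea, not the Machine Config Operator's code: the Pool type, the minor() helper, and the upgradeable() function are hypothetical names, and the sketch assumes each pool can report a single x.y.z version for its nodes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Pool:
    # Hypothetical, simplified stand-in for a compute or control-plane pool.
    name: str
    version: str  # release the pool's nodes are currently on, e.g. "4.3.18"


def minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) parsed from an x.y.z version string."""
    major, minor_part = version.split(".")[:2]
    return int(major), int(minor_part)


def upgradeable(cluster_version: str, pools: List[Pool]) -> Tuple[bool, str]:
    """Return (upgradeable, message).

    The result is False when any pool is still on an older minor version
    than the cluster, which is the situation the guard should block
    before the next 4.y -> 4.(y+1) update.
    """
    target = minor(cluster_version)
    lagging = [p.name for p in pools if minor(p.version) < target]
    if lagging:
        return False, "pools lagging behind " + cluster_version + ": " + ", ".join(lagging)
    return True, ""
```

With a control plane on 4.4 and a worker pool still on 4.3, upgradeable() returns False and names the worker pool, so the 4.4 -> 4.5 jump would be blocked until the pool catches up.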

Comment 1 W. Trevor King 2020-04-30 17:22:20 UTC
This is similar to bug 1829999, but they're slightly different.  This one is about blocking folks from making multiple minor bumps while they have stuck machines that lag behind by a minor version.  That one is about unsticking stuck machines so folks can avoid that block.

Comment 3 Ben Parees 2020-05-11 17:29:34 UTC
Will this be backported to 4.4 or earlier?  I see no clone.

Comment 5 Antonio Murdaca 2020-06-16 12:58:00 UTC
Adding UpcomingSprint as this won't make the current sprint. We'll try to work on this bug in the next sprint.

Comment 6 Antonio Murdaca 2020-07-10 09:40:40 UTC
Adding UpcomingSprint as I won't be able to finish this by the current sprint. I'll revisit from next week.

Comment 7 Antonio Murdaca 2020-09-01 11:25:31 UTC
This is captured in https://issues.redhat.com/browse/GRPA-2693

Comment 8 W. Trevor King 2020-12-04 03:47:19 UTC
https://github.com/openshift/machine-config-operator/pull/2231 should give us this guard in 4.7 and later, by setting Upgradeable=False if any machine-config pools are degraded :)
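
For illustration, the degraded-pool rule mentioned above might look something like the sketch below. It is not taken from the pull request: the function names and the "DegradedPool" reason string are hypothetical, and the input is assumed to be a list of MachineConfigPool objects as plain dicts with conditions under status.conditions.

```python
from typing import Dict, List


def degraded_pools(pools: List[Dict]) -> List[str]:
    """Return the names of pools that carry a Degraded=True condition."""
    names = []
    for pool in pools:
        for cond in pool.get("status", {}).get("conditions", []):
            if cond.get("type") == "Degraded" and cond.get("status") == "True":
                names.append(pool["metadata"]["name"])
                break  # one Degraded=True condition is enough for this pool
    return names


def upgradeable_condition(pools: List[Dict]) -> Dict[str, str]:
    """Build an Upgradeable condition: False if any pool is degraded."""
    bad = degraded_pools(pools)
    if bad:
        return {
            "type": "Upgradeable",
            "status": "False",
            "reason": "DegradedPool",  # hypothetical reason string
            "message": "degraded pools: " + ", ".join(bad),
        }
    return {"type": "Upgradeable", "status": "True"}
```

The dicts could come from the items in the output of `oc get machineconfigpool -o json`, assuming that output is parsed into Python objects first.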

