Bug 1822269 (and possibly others in that space) can leave MachineSets stuck on old versions, because stuck-in-terminating pods block the drain. The current resolution is to kill the stuck pods manually. There is also an RFE in the queue about setting Upgradeable=False when compute pools are lagging like that, to make sure everything is good to go before taking a minor-bumping leap from 4.y -> 4.(y+1). We should land that Upgradeable=False guard, or something equivalent, in 4.4 (and possibly 4.3) to prevent:

1. Compute gets stuck on 4.3 with stuck-in-terminating nodes.
2. User doesn't notice, updates to 4.4.
3. User still doesn't notice, updates to 4.5.
4. 4.3 nodes vs. 4.5 control plane fireworks. I'm not actually sure this would happen, but even if it doesn't, 4.3 nodes vs. a 4.5 control plane is definitely a situation that doesn't get a lot of soak time in CI.

It doesn't have to be the Upgradeable=False guard; as long as we ensure everything made it to 4.4 before the 4.4 -> 4.5 jump, we shouldn't have a problem.
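To make the "everything made it to 4.4 before the 4.4 -> 4.5 jump" condition concrete, here is a minimal sketch in Python. It is only an illustration of the check, not how any operator implements it: the function names and the plain dict of pool name to version string are hypothetical, and a real guard would read this state from the cluster.

```python
from typing import Dict, List, Tuple


def parse_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string such as '4.3.18'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def lagging_pools(cluster_version: str, pool_versions: Dict[str, str]) -> List[str]:
    """Names of pools whose major.minor is behind the cluster's major.minor.

    Hypothetical input: pool_versions maps a pool name to the version its
    nodes are currently running.
    """
    current = parse_minor(cluster_version)
    return sorted(
        name
        for name, version in pool_versions.items()
        if parse_minor(version) < current
    )


def minor_bump_allowed(cluster_version: str, pool_versions: Dict[str, str]) -> bool:
    """Allow the next minor bump only if no pool lags by a minor version."""
    return not lagging_pools(cluster_version, pool_versions)
```

With a control plane on 4.4 and a worker pool still on 4.3, `lagging_pools` reports the worker pool and `minor_bump_allowed` returns `False`, which is the point at which step 3 in the list above would be blocked.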
This is similar to bug 1829999, but they're slightly different. This one is about blocking folks from making multiple minor bumps while folks have stuck machines that lag behind by a minor version. That one is about unsticking stuck machines so folks can avoid that block.
Will this be backported to 4.4 or earlier? I see no clone.
Adding UpcomingSprint as this won't make the current sprint. We'll try to work on this bug in the next sprint.
Adding UpcomingSprint as I won't be able to finish this in the current sprint. I'll revisit it next week.
This is captured in https://issues.redhat.com/browse/GRPA-2693
https://github.com/openshift/machine-config-operator/pull/2231 should give us this guard in 4.7 and later, by setting Upgradeable=False if any machine-config pools are degraded :)
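For readers who want the shape of that guard, here is a rough Python sketch of the rule "Upgradeable=False if any machine-config pool is degraded". It is not the code from the pull request: the function name, the dict-based input, and the `reason` strings are assumptions made for illustration.

```python
from typing import Dict


def upgradeable_condition(pool_degraded: Dict[str, bool]) -> Dict[str, str]:
    """Build an Upgradeable condition from per-pool degraded flags.

    Hypothetical input: pool_degraded maps a pool name to True when that
    pool is degraded. The reason strings below are illustrative only.
    """
    degraded = sorted(name for name, flag in pool_degraded.items() if flag)
    if degraded:
        return {
            "type": "Upgradeable",
            "status": "False",
            "reason": "DegradedPool",  # assumed reason name
            "message": "Degraded pools: " + ", ".join(degraded),
        }
    return {
        "type": "Upgradeable",
        "status": "True",
        "reason": "AsExpected",  # assumed reason name
        "message": "",
    }
```

Calling `upgradeable_condition({"master": False, "worker": True})` yields a condition with status `"False"` that names the `worker` pool, so the stuck pool shows up before the next minor update is attempted.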