Bug 1707877 - cannot safely upgrade past Infrastructure API change
Summary: cannot safely upgrade past Infrastructure API change
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.1.0
Assignee: Abhinav Dahiya
QA Contact: Johnny Liu
Depends On:
Reported: 2019-05-08 15:29 UTC by David Eads
Modified: 2019-06-04 10:48 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2019-06-04 10:48:39 UTC
Target Upstream Version:

System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift installer pull 1727 | 0 | None | None | None | 2019-05-08 16:15:59 UTC
Red Hat Product Errata | RHBA-2019:0758 | 0 | None | None | None | 2019-06-04 10:48:47 UTC

Description David Eads 2019-05-08 15:29:20 UTC
https://github.com/openshift/installer/pull/1718 changed the meaning of an existing field, but did not include logic for an upgrade to fix the in-cluster values of the Infrastructure config resource.  Even if it had, because it changed the meaning of an existing field, clients would be stuck with conditional logic that treats the meaning of existing fields differently.  Without a fix:

1. the new field is missing from the infrastructure resource
2. so the kas-o can't pull it and panics
3. which kills the cert regeneration for that one cert
4. which is used by the kubelets
5. which causes kubelets to stop trusting masters
6. which causes cluster destruction
7. so you run a recovery tool,
8. which promptly fails on the missing information
9. with "normal" timing, the clusters will self destruct in 30 days
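
Steps 1 and 2 can be illustrated with a minimal sketch. It is not the operator's code: the resource is modelled as a plain dict, and `newField`, `existingField`, and both function names are hypothetical stand-ins, since this report does not name the field. The sketch only shows how a consumer could report a missing field as a degraded condition instead of crashing.

```python
# Hypothetical field name; the bug report does not say which field is involved.
NEW_FIELD = "newField"


class MissingInfrastructureField(Exception):
    """Raised when a required field is absent from the Infrastructure status."""


def read_required_status_field(infrastructure: dict, field: str = NEW_FIELD) -> str:
    """Return status[field], or raise a descriptive error if it is missing or empty."""
    status = infrastructure.get("status") or {}
    value = status.get(field)
    if not value:
        raise MissingInfrastructureField(
            f"Infrastructure status has no value for {field!r}; "
            "the cluster may predate the API change"
        )
    return value


def reconcile_cert(infrastructure: dict) -> tuple[bool, str]:
    """Report (ok, message) rather than letting a missing field crash the loop."""
    try:
        endpoint = read_required_status_field(infrastructure)
    except MissingInfrastructureField as err:
        # Surface the problem as a degraded condition; do not panic.
        return False, f"degraded: {err}"
    return True, f"would regenerate the cert for {endpoint}"
```

A resource written before the change carries only the old field, so `reconcile_cert({"status": {"existingField": "..."}})` returns a degraded result rather than raising.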

This doesn't even get into the MCO and it rippling out destructive changes to every node.

Pulls are open to land this change

Comment 1 W. Trevor King 2019-05-08 16:15:26 UTC
Of the above PRs:

* https://github.com/openshift/machine-config-operator/pull/715 (replacing mco#714) is in the merge queue, and probably isn't a blocker for forward migration anyway.
* https://github.com/openshift/cluster-kube-apiserver-operator/pull/464 is in the merge queue, I just kicked its tests now that installer#1718 has landed.
* https://github.com/openshift/installer/pull/1727 is still open, and is what David is concerned about upgrading around in the kube API-server operator.  I'm not as concerned about upgrading around this as he is, but I don't mind holding beta5 for this either.

Comment 2 W. Trevor King 2019-05-08 18:38:20 UTC
mco#715 landed.  installer#1727 is still in the queue, but a prerequisite, installer#1730, has landed.  kao#464 is still possibly blocked on installer#1727, although folks are banging away on retests there in case it isn't.

Comment 3 W. Trevor King 2019-05-08 18:58:08 UTC
I dunno why I cleared metadata before, I don't remember touching those fields.  Mike has already recovered the target, and now I'm recovering POSTness.  Sorry for the confusion :/

Comment 4 Eric Paris 2019-05-08 21:03:45 UTC
I believe this is the only PR left:

Comment 5 David Eads 2019-05-09 00:33:57 UTC
Now merged.  Updating to modified.

Comment 7 Scott Dodson 2019-05-09 18:34:52 UTC
QE: ultimately, this was fixed by making sure that everything was updated before cutting Beta 5, so that any upgrade from Beta 5 should be expected to work. We didn't actually implement an upgrade migration strategy.
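
For context only, here is a rough sketch of what an upgrade migration of the kind mentioned above might look like. No such migration was implemented. The field names `existingField` and `newField`, the function `backfill_new_field`, and the `derive` callback are all hypothetical, and the sketch assumes the new value can be derived from the old one, which this report does not establish.

```python
from typing import Callable


def backfill_new_field(
    infrastructure: dict, derive: Callable[[str], str]
) -> tuple[dict, bool]:
    """Return (resource, changed); fill the hypothetical newField if it is absent.

    The input is not mutated, and a second run is a no-op, so the step is
    idempotent and safe to repeat on every upgrade.
    """
    status = dict(infrastructure.get("status") or {})
    if status.get("newField"):
        # Already migrated (or installed after the API change): nothing to do.
        return infrastructure, False
    old_value = status.get("existingField")
    if not old_value:
        raise ValueError("cannot migrate: existingField is missing or empty")
    status["newField"] = derive(old_value)
    updated = dict(infrastructure)
    updated["status"] = status
    return updated, True
```

Running the function twice with the same `derive` leaves the resource unchanged on the second pass, which is the property an upgrade step of this kind would need.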

Comment 8 Peter Ruan 2019-05-09 23:17:01 UTC
I was able to do an upgrade from `4.1.0-0.nightly-2019-05-09-182710` to `4.1.0-0.nightly-2019-05-09-204138`, but I'll leave it to Johnny to decide if this can be moved to VERIFIED.

Comment 9 Johnny Liu 2019-05-10 08:59:06 UTC
Based on comment 7 and comment 8, moving this bug to 'VERIFIED'. Thanks to Peter for testing.

Comment 11 errata-xmlrpc 2019-06-04 10:48:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

