Bug 1707877

Summary: cannot safely upgrade past Infrastructure API change
Product: OpenShift Container Platform Reporter: David Eads <deads>
Component: Installer Assignee: Abhinav Dahiya <adahiya>
Installer sub component: openshift-installer QA Contact: Johnny Liu <jialiu>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: unspecified CC: eparis, mifiedle, pruan, wking
Version: 4.1.0 Keywords: BetaBlocker
Target Milestone: ---   
Target Release: 4.1.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2019-06-04 10:48:39 UTC Type: Bug

Description David Eads 2019-05-08 15:29:20 UTC
https://github.com/openshift/installer/pull/1718 changed the meaning of an existing field, but did not include logic for an upgrade to fix the in-cluster values of the Infrastructure config resource.  Even if it had, because it changed the meaning of an existing field, clients would be stuck with conditional logic that treats the same field as having different meanings.  Without this:


1. the new field is missing from the infrastructure resource
2. so the kube-apiserver operator (kas-o) can't pull it and panics
3. which kills the cert regeneration for that one cert
4. which is used by the kubelets
5. which causes kubelets to stop trusting masters
6. which causes cluster destruction
7. so you run a recovery tool,
8. which promptly fails on the missing information
9. with "normal" timing, the clusters will self destruct in 30 days

This doesn't even get into the MCO rippling out destructive changes to every node.
 

Pulls are open to land this change:
https://github.com/openshift/api/pull/308
https://github.com/openshift/installer/pull/1718
https://github.com/openshift/cluster-kube-apiserver-operator/pull/464
https://github.com/openshift/machine-config-operator/pull/714
https://github.com/openshift/installer/pull/1727
https://github.com/openshift/cluster-kube-apiserver-operator/pull/465

Comment 1 W. Trevor King 2019-05-08 16:15:26 UTC
Of the above PRs:

* https://github.com/openshift/machine-config-operator/pull/715 (replacing mco#714) is in the merge queue, and probably isn't a blocker for forward migration anyway.
* https://github.com/openshift/cluster-kube-apiserver-operator/pull/464 is in the merge queue; I just kicked its tests now that installer#1718 has landed.
* https://github.com/openshift/installer/pull/1727 is still open, and is what David is concerned about upgrading around in the kube API-server operator.  I'm not as concerned about upgrading around this as he is, but I don't mind holding beta5 for this either.

Comment 2 W. Trevor King 2019-05-08 18:38:20 UTC
mco#715 landed.  installer#1727 is still in the queue, but a prereq installer#1730 has landed.  kao#464 still possibly blocked on installer#1727, although folks are banging away on retests there in case it isn't.

Comment 3 W. Trevor King 2019-05-08 18:58:08 UTC
I dunno why I cleared metadata before, I don't remember touching those fields.  Mike has already recovered the target, and now I'm recovering POSTness.  Sorry for the confusion :/

Comment 4 Eric Paris 2019-05-08 21:03:45 UTC
I believe this is the only PR left:
https://github.com/openshift/cluster-kube-apiserver-operator/pull/464

Comment 5 David Eads 2019-05-09 00:33:57 UTC
Now merged.  Updating to modified.

Comment 7 Scott Dodson 2019-05-09 18:34:52 UTC
QE, ultimately the way this was fixed was to make sure that everything was updated prior to cutting Beta 5, so that any upgrade from Beta 5 should be expected to work. We didn't actually implement an upgrade migration strategy.
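
For contrast with the approach taken here, the following is a rough Go sketch of what a one-time backfill migration could look like in principle. It is not something that was implemented for this bug. `infraStatus`, `migrate`, and the caller-supplied `derive` function are hypothetical, and the bug does not say whether the new value can be derived from the old one at all.

```go
package main

import "fmt"

// infraStatus is a hypothetical stand-in for the status of the
// Infrastructure config resource; the field names are illustrative only.
type infraStatus struct {
	ExistingField string
	NewField      string
}

// migrate fills NewField on a resource written before the API change.
// derive is a caller-supplied, hypothetical function that computes the new
// value from the old one and reports whether it could do so.
// The second return value says whether the resource was changed.
func migrate(s infraStatus, derive func(existing string) (string, bool)) (infraStatus, bool) {
	if s.NewField != "" {
		return s, false // already populated; running again is a no-op
	}
	v, ok := derive(s.ExistingField)
	if !ok {
		return s, false // nothing to derive from; leave the resource untouched
	}
	s.NewField = v
	return s, true
}

func main() {
	// Placeholder derivation, for illustration only.
	derive := func(existing string) (string, bool) {
		if existing == "" {
			return "", false
		}
		return "derived-from-" + existing, true
	}
	out, changed := migrate(infraStatus{ExistingField: "old"}, derive)
	fmt.Println(out.NewField, changed)
}
```

The early return on a populated `NewField` makes the migration safe to run repeatedly, which matters if an operator applies it on every sync.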

Comment 8 Peter Ruan 2019-05-09 23:17:01 UTC
I was able to do an upgrade from `4.1.0-0.nightly-2019-05-09-182710` to `4.1.0-0.nightly-2019-05-09-204138`, but I'll leave it to Johnny to decide if this can be moved to VERIFIED.

Comment 9 Johnny Liu 2019-05-10 08:59:06 UTC
Based on comment 7 and comment 8, moving this bug to 'VERIFIED'. Thanks to Peter for testing.

Comment 11 errata-xmlrpc 2019-06-04 10:48:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758