https://github.com/openshift/installer/pull/1718 changed the meaning of an existing field, but did not include logic for an upgrade to fix the in-cluster values of the Infrastructure config resource. Even if it had, because it changed the meaning of an existing field, clients would be stuck with conditional logic that treats the meaning of existing fields differently. Without this:
1. the new field is missing from the infrastructure resource
2. so the kas-o can't pull it and panics
3. which kills the cert regeneration for that one cert
4. which is used by the kubelets
5. which causes kubelets to stop trusting masters
6. which causes cluster destruction
7. so you run a recovery tool,
8. which promptly fails on the missing information
9. with "normal" timing, the clusters will self-destruct in 30 days
This doesn't even get into the MCO and it rippling out destructive changes to every node.
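Steps 1 and 2 of the chain above can be sketched as follows. This is a minimal Python illustration, not the operator's actual Go code: the dictionary layout and the field names `existingField` and `newField` are hypothetical stand-ins for the Infrastructure resource and the field that older clusters lack. It contrasts a lookup that raises on a missing field with one that reports the gap so the caller can fall back or surface a clear error.

```python
from typing import Tuple

# Hypothetical stand-in for an Infrastructure resource created before the
# new field existed; "newField" is an illustrative name, not the real one.
old_infrastructure = {"status": {"existingField": "value-with-old-meaning"}}


def read_field_strict(infra: dict) -> str:
    # Mirrors steps 1-2: the field is assumed to be present, so an older
    # resource raises KeyError (the analogue of the operator panicking).
    return infra["status"]["newField"]


def read_field_checked(infra: dict) -> Tuple[str, bool]:
    # Returns (value, found) so the caller can decide how to handle a
    # resource that was never migrated.
    value = infra.get("status", {}).get("newField", "")
    return value, bool(value)


try:
    read_field_strict(old_infrastructure)
    strict_failed = False
except KeyError:
    strict_failed = True

value, found = read_field_checked(old_infrastructure)
```

The checked variant only avoids the crash; it does not supply the missing value, which is why either a migration or a release cut with everything already updated is still needed.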
Pulls are open to land this change
Of the above PRs:
* https://github.com/openshift/machine-config-operator/pull/715 (replacing mco#714) is in the merge queue, and probably isn't a blocker for forward migration anyway.
* https://github.com/openshift/cluster-kube-apiserver-operator/pull/464 is in the merge queue; I just kicked its tests now that installer#1718 has landed.
* https://github.com/openshift/installer/pull/1727 is still open, and is what David is concerned about upgrading around in the kube API-server operator. I'm not as concerned about upgrading around this as he is, but I don't mind holding beta5 for this either.
mco#715 landed. installer#1727 is still in the queue, but a prereq, installer#1730, has landed. kao#464 is still possibly blocked on installer#1727, although folks are banging away on retests there in case it isn't.
I dunno why I cleared metadata before, I don't remember touching those fields. Mike has already recovered the target, and now I'm recovering POSTness. Sorry for the confusion :/
I believe this is the only PR left:
Now merged. Updating to modified.
QE: ultimately, the way this was fixed was to make sure that everything was updated prior to cutting Beta 5, so that any upgrade from Beta 5 should be expected to work. We didn't actually implement an upgrade migration strategy.
I was able to do an upgrade from `4.1.0-0.nightly-2019-05-09-182710` to `4.1.0-0.nightly-2019-05-09-204138`, but I'll leave it to Johnny to decide if this can be moved to VERIFIED.
Based on comment 7 and comment 8, moving this bug to VERIFIED. Thanks to Peter for testing.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.