1707877 – cannot safely upgrade past Infrastructure API change

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	4.1.0
Assignee:	Abhinav Dahiya
QA Contact:	Johnny Liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-05-08 15:29 UTC by David Eads
Modified:	2019-06-04 10:48 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-06-04 10:48:39 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift installer pull 1727	0	None	None	None	2019-05-08 16:15:59 UTC
Red Hat Product Errata	RHBA-2019:0758	0	None	None	None	2019-06-04 10:48:47 UTC

Description David Eads 2019-05-08 15:29:20 UTC

https://github.com/openshift/installer/pull/1718 changed the meaning of an existing field, but did not include upgrade logic to fix the in-cluster values of the Infrastructure config resource.  Even if it had, because it changed the meaning of an existing field, clients would be stuck with conditional logic that treats the meaning of existing fields differently.  Without that upgrade logic:


1. the new field is missing from the infrastructure resource
2. so the kas-o can't pull it and panics
3. which kills the cert regeneration for that one cert
4. which is used by the kubelets
5. which causes kubelets to stop trusting masters
6. which causes cluster destruction
7. so you run a recovery tool,
8. which promptly fails on the missing information
9. with "normal" timing, the clusters will self destruct in 30 days

This doesn't even get into the MCO and it rippling out destructive changes to every node.
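
Steps 1 and 2 above describe a client that assumes a field is present and panics when it is not. As a rough illustration only, here is a minimal Python sketch (the real operators are written in Go) of reading a status field so that a missing value becomes an explicit, recoverable error. The names `existingField`, `newField`, `read_status_field`, and the example URL are hypothetical placeholders, not the actual Infrastructure API or operator code.

```python
class MissingFieldError(LookupError):
    """Raised when an expected field is absent from a config resource."""


def read_status_field(resource: dict, field: str) -> str:
    """Return status[field]; raise MissingFieldError if it is absent or empty."""
    status = resource.get("status") or {}
    value = status.get(field)
    if not value:
        name = resource.get("metadata", {}).get("name", "<unknown>")
        raise MissingFieldError(f"resource {name!r} has no status.{field}")
    return value


# Hypothetical resource as it might look on a cluster installed before the
# change: only the pre-existing field is populated.
old_cluster = {
    "metadata": {"name": "cluster"},
    "status": {"existingField": "https://api.example.invalid:6443"},
}

try:
    read_status_field(old_cluster, "newField")
except MissingFieldError as err:
    # The caller can report a degraded condition and retry later
    # instead of crashing the whole control loop.
    print(f"cannot proceed yet: {err}")
```

Handling the missing value this way would not repair the cluster on its own, but it keeps one absent field from taking down unrelated work such as certificate regeneration.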
 

Pulls are open to land this change:
https://github.com/openshift/api/pull/308
https://github.com/openshift/installer/pull/1718
https://github.com/openshift/cluster-kube-apiserver-operator/pull/464
https://github.com/openshift/machine-config-operator/pull/714
https://github.com/openshift/installer/pull/1727
https://github.com/openshift/cluster-kube-apiserver-operator/pull/465

Comment 1 W. Trevor King 2019-05-08 16:15:26 UTC

Of the above PRs:

* https://github.com/openshift/machine-config-operator/pull/715 (replacing mco#714) is in the merge queue, and probably isn't a blocker for forward migration anyway.
* https://github.com/openshift/cluster-kube-apiserver-operator/pull/464 is in the merge queue, I just kicked its tests now that installer#1718 has landed.
* https://github.com/openshift/installer/pull/1727 is still open, and is what David is concerned about upgrading around in the kube API-server operator.  I'm not as concerned about upgrading around this as he is, but I don't mind holding beta5 for this either.

Comment 2 W. Trevor King 2019-05-08 18:38:20 UTC

mco#715 landed.  installer#1727 is still in the queue, but a prereq installer#1730 has landed.  kao#464 is still possibly blocked on installer#1727, although folks are banging away on retests there in case it isn't.

Comment 3 W. Trevor King 2019-05-08 18:58:08 UTC

I dunno why I cleared metadata before, I don't remember touching those fields.  Mike has already recovered the target, and now I'm recovering POSTness.  Sorry for the confusion :/

Comment 4 Eric Paris 2019-05-08 21:03:45 UTC

I believe this is the only PR left:
https://github.com/openshift/cluster-kube-apiserver-operator/pull/464

Comment 5 David Eads 2019-05-09 00:33:57 UTC

Now merged.  Updating to modified.

Comment 7 Scott Dodson 2019-05-09 18:34:52 UTC

QE, ultimately the way this was fixed was to make sure that everything was updated prior to cutting Beta 5, so that any potential upgrade from Beta 5 should be expected to work. We didn't actually implement an upgrade migration strategy.
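
For context, an upgrade migration of the kind that was not implemented here could, in principle, backfill the new field on existing clusters before any client depends on it. The following Python sketch is purely illustrative: `backfill_new_field`, the field names, and the `derive_new_value` rule are all hypothetical, and nothing in this bug says how the real value would be derived.

```python
import copy
from typing import Callable, Tuple


def backfill_new_field(
    resource: dict,
    old_field: str,
    new_field: str,
    derive: Callable[[str], str],
) -> Tuple[dict, bool]:
    """Return (resource, changed), filling status[new_field] if it is missing.

    The input is never mutated; an updated deep copy is returned when a
    change is needed. Running the function again on the result is a no-op.
    """
    status = resource.get("status") or {}
    if status.get(new_field):
        return resource, False  # already migrated
    old_value = status.get(old_field)
    if not old_value:
        raise ValueError(f"cannot derive {new_field}: {old_field} is missing")
    updated = copy.deepcopy(resource)
    updated["status"][new_field] = derive(old_value)
    return updated, True


def derive_new_value(old_value: str) -> str:
    # Hypothetical derivation rule, used only to make the sketch runnable.
    return old_value + "-derived"


before = {"status": {"existingField": "old-value"}}
after, changed = backfill_new_field(
    before, "existingField", "newField", derive_new_value
)
print(changed, after["status"])
```

Even with such a backfill, the concern raised in the description still applies: clients would need to know whether a given cluster had been migrated before trusting the meaning of the existing field.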

Comment 8 Peter Ruan 2019-05-09 23:17:01 UTC

I was able to do an upgrade from `4.1.0-0.nightly-2019-05-09-182710` to `4.1.0-0.nightly-2019-05-09-204138`, but I'll leave it to Johnny to decide if this can be moved to VERIFIED.

Comment 9 Johnny Liu 2019-05-10 08:59:06 UTC

Based on comment 7 and comment 8, moving this bug to 'VERIFIED'. Thanks to Peter for the testing.

Comment 11 errata-xmlrpc 2019-06-04 10:48:39 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
