Bug 1817039

Summary: upgrade from 4.2.25 to 4.3.8 failed due to PodIP vs PodIPs handling in apiserver
Product: OpenShift Container Platform
Reporter: Scott Dodson <sdodson>
Component: kube-apiserver
Assignee: Dan Winship <danw>
Status: CLOSED ERRATA
QA Contact: Ke Wang <kewang>
Severity: urgent
Priority: urgent
Version: 4.4
CC: aos-bugs, ccoleman, danw, lmohanty, mfojtik, pruan, rh-container, sdodson, vlaad, vrutkovs, wking, xxia
Target Milestone: ---
Keywords: Upgrades
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: An upstream bug prevented running clusters with a mix of Kubernetes 1.14 and Kubernetes 1.16 components. Consequence: Users could not upgrade from 4.2 to 4.3. Fix: Merged the upstream bugfix. Result: OCP 4.3 is now compatible with OCP 4.2 nodes; upgrades can proceed. (The fix to OCP 4.4 was mostly irrelevant since we don't support direct 4.2 -> 4.4 upgrades, but 4.4 got fixed in the process of backporting the fix to 4.3.)
Story Points: ---
Clone Of: 1816302
Environment:
Last Closed: 2020-05-04 11:47:20 UTC
Bug Depends On: 1816302    
Bug Blocks: 1817040    

Description Scott Dodson 2020-03-25 13:11:47 UTC
+++ This bug was initially created as a clone of Bug #1816302 +++

Description of problem:
 I have a 4.2.25 cluster that I'm trying to upgrade to 4.3.8. It's pretty much out of the box, with the exception that I've enabled autoscaling. The upgrade failed because the DNS operator is degraded.



Version-Release number of selected component (if applicable):

4.2.25
How reproducible:


Steps to Reproduce:
1. spin up a 4.2.25 cluster
2. enable autoscaling
3. upgrade to 4.3.8
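
For reference, a rough sketch of step 3 with the oc CLI (assuming the 4.3.8 payload is visible in the cluster's update channel; the autoscaler setup from step 2 is omitted):

  # Confirm the current version and the available updates
  oc get clusterversion
  oc adm upgrade

  # Request the update to 4.3.8
  oc adm upgrade --to 4.3.8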

Actual results:
pruan@MacBook-Pro ~/junk/upgrade $ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.25    True        True          145m    Working towards 4.3.8: 12% complete
pruan@MacBook-Pro ~/junk/upgrade $ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.8     True        False         False      4h10m
cloud-credential                           4.3.8     True        False         False      4h25m
cluster-autoscaler                         4.3.8     True        False         False      4h16m
console                                    4.3.8     True        False         False      134m
dns                                        4.2.25    True        True          False      4h25m
image-registry                             4.3.8     True        False         False      4h16m
ingress                                    4.3.8     True        False         False      126m
insights                                   4.3.8     True        False         False      4h25m
kube-apiserver                             4.3.8     True        False         False      4h23m
kube-controller-manager                    4.3.8     True        False         False      4h24m
kube-scheduler                             4.3.8     True        False         False      4h23m
machine-api                                4.3.8     True        False         False      4h25m
machine-config                             4.2.25    True        False         False      4h25m
marketplace                                4.3.8     True        False         False      136m
monitoring                                 4.3.8     True        False         False      133m
network                                    4.3.8     True        False         False      4h24m
node-tuning                                4.3.8     True        False         False      138m
openshift-apiserver                        4.3.8     True        False         False      124m
openshift-controller-manager               4.3.8     True        False         False      4h24m
openshift-samples                          4.3.8     True        False         False      128m
operator-lifecycle-manager                 4.3.8     True        False         False      4h24m
operator-lifecycle-manager-catalog         4.3.8     True        False         False      4h24m
operator-lifecycle-manager-packageserver   4.3.8     True        False         False      136m
service-ca                                 4.3.8     True        False         False      4h25m
service-catalog-apiserver                  4.3.8     True        False         False      4h22m
service-catalog-controller-manager         4.3.8     True        False         False      4h22m
storage                                    4.3.8     True        False         False      138m

Expected results:


Additional info:
The must-gather log is too big for attachment; posted here instead: http://file.rdu.redhat.com/pruan/bugs/upgrade.tgz

--- Additional comment from Dan Winship on 2020-03-23 15:09:47 EDT ---

Vadim Rutkovsky  1 hour ago
so CVO is waiting for kube-apiserver-operator deployment to proceed - and kube-apiserver-operator-5d7c58bbb4-27mww is stuck in Error

Dan Winship  1 hour ago
um... kube-apiserver-operator-5d7c58bbb4-27mww is in state Error, but it has logs suggesting that it is running

Dan Winship  1 hour ago
and the pod status shows that it's in Error because the _previous_ instance of the pod was killed

Dan Winship  1 hour ago
oh, weird

Dan Winship  1 hour ago
Mar 23 18:04:42 ip-10-0-163-3 hyperkube[1938]: W0323 18:04:42.769863    1938 status_manager.go:519] Failed to update status for pod "kube-apiserver-operator-5d7c58bbb4-27mww_openshift-kube-apiserver-operator(93729130-6d20-11ea-a89d-0a193a986132)": failed to patch status "..." for pod "openshift-kube-apiserver-operator"/"kube-apiserver-operator-5d7c58bbb4-27mww": conversion Error: v1.PodIP(10.129.0.65) != v1.PodIPs[0](10.129.0.41)

Dan Winship  32 minutes ago
We have a 4.2 (1.14) kubelet and a 4.3 (1.16) apiserver. Kubelet is trying to update PodIP without touching PodIPs. Apiserver is trying to reconcile that and doing it incorrectly, I guess.

Dan Winship  15 minutes ago
oh yay, Jordan already fixed it
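
To make the mismatch above concrete, a hedged diagnostic sketch (not part of the original thread) for comparing the two status fields involved, using the pod named in the log line:

  # On a 1.16 apiserver, status.podIP and status.podIPs[0].ip should agree;
  # the 1.14 kubelet only patches status.podIP, which is what trips the check here.
  oc -n openshift-kube-apiserver-operator get pod kube-apiserver-operator-5d7c58bbb4-27mww \
    -o jsonpath='{.status.podIP}{"\n"}{.status.podIPs[*].ip}{"\n"}'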

--- Additional comment from W. Trevor King on 2020-03-23 15:36:19 EDT ---

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?
  Customers upgrading from 4.2.99 to 4.3.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.2.z to 4.3.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression?
  No, it’s always been like this; we just never noticed
  Yes, from 4.2.z and 4.3.1

--- Additional comment from Vadim Rutkovsky on 2020-03-23 16:47:46 EDT ---

(In reply to W. Trevor King from comment #2)

Partial answers:

> Who is impacted?

Rare chance that this affects a cluster during major (4.2 -> 4.3) upgrade, when master kubelet is still on 1.14 and apiserver is already on 1.16

> What is the impact?

Upgrade is stuck at 12% and won't proceed

> How involved is remediation?

Not sure if it can be resolved manually; it seems to fall under:

>   Admin must SSH to hosts, restore from backups, or other non standard admin
> activities


> Is this a regression?

Upstream issue
>   No, it’s always been like this we just never noticed

--- Additional comment from Dan Winship on 2020-03-23 17:29:43 EDT ---

(In reply to Vadim Rutkovsky from comment #3)
> (In reply to W. Trevor King from comment #2)
> 
> Partial answers:
> 
> > Who is impacted?
> 
> Rare chance that this affects a cluster during major (4.2 -> 4.3) upgrade,
> when master kubelet is still on 1.14 and apiserver is already on 1.16

I'm not sure we know it's rare. It hasn't been seen (by us) before, but there's no obvious reason why we shouldn't have, so it could be that something used to be protecting us from it but then it recently changed. Alternatively, we might have been erroneously counting it as some other failure mode.

> > How involved is remediation?
> 
> Not sure if it can be resolved manually, seems

It seems that deleting the stuck pod (eg, in this case "openshift-kube-apiserver-operator"/"kube-apiserver-operator-5d7c58bbb4-27mww") gets the update going again.
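
With the names from this cluster, that would be something like:

  oc -n openshift-kube-apiserver-operator delete pod kube-apiserver-operator-5d7c58bbb4-27mww

The kube-apiserver-operator deployment recreates the pod; per the follow-up comments this helped but was not a complete fix.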

--- Additional comment from W. Trevor King on 2020-03-23 18:05:05 EDT ---

[1] is in flight to pull 4.2 -> 4.3 edges while we sort out whether this is a regression, or has existed for all previous 4.2 -> 4.3 update endpoints.  Also not clear to me is whether the apparently straightforward recovery process (comment 4) means that we are ok helping impacted folks fix their clusters and want to drop UpgradeBlocker here.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/136

--- Additional comment from W. Trevor King on 2020-03-23 19:49:50 EDT ---

> It seems that deleting the stuck pod (eg, in this case "openshift-kube-apiserver-operator"/"kube-apiserver-operator-5d7c58bbb4-27mww") gets the update going again.

This seems to have helped a bit, but then later on the cluster stuck with a Pending etcd-quorum-guard Pod:

  Type     Reason             Age                From                Message
  ----     ------             ----               ----                -------
  Warning  FailedScheduling   <unknown>          default-scheduler   0/6 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 3 node(s) didn't match node selector.

So we're back to "no clear docs around a manual recovery process".
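
For anyone triaging a similar hang, a hedged sketch of the obvious checks (the namespace and pod name below are placeholders and will vary):

  # Find the Pending etcd-quorum-guard pod and see why it won't schedule
  oc get pods --all-namespaces | grep etcd-quorum-guard
  oc -n <namespace> describe pod <etcd-quorum-guard-pod>

  # Look for cordoned masters (SchedulingDisabled) colliding with the anti-affinity rules
  oc get nodes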

--- Additional comment from Scott Dodson on 2020-03-23 20:09:16 EDT ---

Additional pods stuck in Terminating had to be deleted via `oc delete pod --force --grace-period 0`, after which the upgrade ran to completion.
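
Roughly (a sketch; the pod and namespace names will differ per cluster):

  # List pods stuck in Terminating, then force-delete each one
  oc get pods --all-namespaces | grep Terminating
  oc -n <namespace> delete pod <pod> --force --grace-period 0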

--- Additional comment from Clayton Coleman on 2020-03-23 20:12:22 EDT ---

Any scenario in which we have to force delete pods points to a serious regression and should block upgrades.

Comment 9 errata-xmlrpc 2020-05-04 11:47:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581