Bug 1816302
| Summary: | upgrade from 4.2.25 to 4.3.8 failed due to PodIP vs PodIPs handling in apiserver | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Peter Ruan <pruan> |
| Component: | kube-apiserver | Assignee: | Dan Winship <danw> |
| Status: | CLOSED ERRATA | QA Contact: | Ke Wang <kewang> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 4.3.0 | CC: | aos-bugs, ccoleman, danw, lmohanty, mfojtik, mfuruta, ohiroaki, rh-container, sdodson, vlaad, vrutkovs, wking, xxia |
| Target Milestone: | --- | Keywords: | Upgrades |
| Target Release: | 4.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: An upstream bug prevented running clusters with a mix of kubernetes 1.14 and kubernetes 1.16 components. Consequence: Users could not upgrade from 4.2 to 4.3. Fix: Merged upstream bugfix. Result: OCP 4.3 is now compatible with OCP 4.2 nodes; upgrades can proceed. | Story Points: | --- |
| Clone Of: | | Environment: | |
| | 1817039 (view as bug list) | | |
| Last Closed: | 2020-08-04 18:06:25 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1817039 | | |
Description
Peter Ruan
2020-03-23 18:47:22 UTC
Vadim Rutkovsky (1 hour ago): so CVO is waiting for kube-apiserver-operator deployment to proceed - and kube-apiserver-operator-5d7c58bbb4-27mww is stuck in Error

Dan Winship (1 hour ago): um... kube-apiserver-operator-5d7c58bbb4-27mww is in state Error, but it has logs suggesting that it is running

Dan Winship (1 hour ago): and the pod status shows that it's in Error because the _previous_ instance of the pod was killed

Dan Winship (1 hour ago): oh, weird

Dan Winship (1 hour ago): Mar 23 18:04:42 ip-10-0-163-3 hyperkube[1938]: W0323 18:04:42.769863 1938 status_manager.go:519] Failed to update status for pod "kube-apiserver-operator-5d7c58bbb4-27mww_openshift-kube-apiserver-operator(93729130-6d20-11ea-a89d-0a193a986132)": failed to patch status "..." for pod "openshift-kube-apiserver-operator"/"kube-apiserver-operator-5d7c58bbb4-27mww": conversion Error: v1.PodIP(10.129.0.65) != v1.PodIPs[0](10.129.0.41)

Dan Winship (32 minutes ago): We have a 4.2 (1.14) kubelet and a 4.3 (1.16) apiserver. Kubelet is trying to update PodIP without touching PodIPs. Apiserver is trying to reconcile that and doing it incorrectly, I guess.

Dan Winship (15 minutes ago): oh yay, Jordan already fixed it
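As background for the error above: Kubernetes 1.16 added the dual-stack status.podIPs field alongside the legacy status.podIP, and a 1.14 kubelet patches only the latter, so a strict apiserver-side comparison of the two fails exactly as in the quoted log line. Below is a minimal, hypothetical Go sketch of the tolerant reconciliation needed; it is not the actual kube-apiserver code or the upstream fix, and all names in it are invented for the example.

```go
// Hypothetical sketch of the skew described above; NOT the actual
// kube-apiserver code, just an illustration of keeping status.podIPs
// consistent when an older (1.14) kubelet only writes status.podIP.
package main

import "fmt"

// PodIP and PodStatus mirror the relevant fields of v1.PodStatus.
type PodIP struct {
	IP string
}

type PodStatus struct {
	PodIP  string  // legacy single-IP field; the only one a 1.14 kubelet patches
	PodIPs []PodIP // dual-stack field that a 1.16 apiserver also tracks
}

// reconcilePodIPs shows the tolerant behaviour: if PodIP moved on while
// PodIPs was left stale, resync PodIPs from PodIP instead of rejecting the
// status patch with "v1.PodIP(x) != v1.PodIPs[0](y)".
func reconcilePodIPs(status *PodStatus) {
	if status.PodIP == "" && len(status.PodIPs) > 0 {
		status.PodIP = status.PodIPs[0].IP
		return
	}
	if len(status.PodIPs) == 0 || status.PodIPs[0].IP != status.PodIP {
		status.PodIPs = []PodIP{{IP: status.PodIP}}
	}
}

func main() {
	// Status as patched by the old kubelet: PodIP updated, PodIPs left stale.
	st := PodStatus{PodIP: "10.129.0.65", PodIPs: []PodIP{{IP: "10.129.0.41"}}}
	reconcilePodIPs(&st)
	fmt.Println(st.PodIP, st.PodIPs[0].IP) // 10.129.0.65 10.129.0.65
}
```

Per the thread, Jordan's upstream fix handles this on the apiserver side so that mixed 1.14/1.16 clusters can keep patching pod status during the upgrade.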
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?
* Customers upgrading from 4.2.99 to 4.3.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
* All customers upgrading from 4.2.z to 4.3.z fail approximately 10% of the time

What is the impact?
* Up to 2 minute disruption in edge routing
* Up to 90 seconds of API downtime
* etcd loses quorum and you have to restore from backup

How involved is remediation?
* Issue resolves itself after five minutes
* Admin uses oc to fix things
* Admin must SSH to hosts, restore from backups, or other non standard admin activities

Is this a regression?
* No, it's always been like this we just never noticed
* Yes, from 4.2.z and 4.3.1

(In reply to W. Trevor King from comment #2)
Partial answers:

> Who is impacted?
Rare chance that this affects a cluster during major (4.2 -> 4.3) upgrade, when master kubelet is still on 1.14 and apiserver is already on 1.16

> What is the impact?
Upgrade is stuck at 12% and won't proceed

> How involved is remediation?
Not sure if it can be resolved manually, seems
> Admin must SSH to hosts, restore from backups, or other non standard admin
> activities

> Is this a regression?
Upstream issue
> No, it's always been like this we just never noticed

(In reply to Vadim Rutkovsky from comment #3)
> (In reply to W. Trevor King from comment #2)
> Partial answers:
>
> > Who is impacted?
>
> Rare chance that this affects a cluster during major (4.2 -> 4.3) upgrade,
> when master kubelet is still on 1.14 and apiserver is already on 1.16

I'm not sure we know it's rare. It hasn't been seen (by us) before, but there's no obvious reason why we shouldn't have, so it could be that something used to be protecting us from it but then it recently changed. Alternatively, we might have been erroneously counting it as some other failure mode.

> > How involved is remediation?
>
> Not sure if it can be resolved manually, seems

It seems that deleting the stuck pod (eg, in this case "openshift-kube-apiserver-operator"/"kube-apiserver-operator-5d7c58bbb4-27mww") gets the update going again.

[1] is in flight to pull 4.2 -> 4.3 edges while we sort out whether this is a regression, or has existed for all previous 4.2 -> 4.3 update endpoints. Also not clear to me is whether the apparently straightforward recovery process (comment 4) means that we are ok helping impacted folks fix their clusters and want to drop UpgradeBlocker here.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/136

> It seems that deleting the stuck pod (eg, in this case "openshift-kube-apiserver-operator"/"kube-apiserver-operator-5d7c58bbb4-27mww") gets the update going again.
This seems to have helped a bit, but later on the cluster got stuck with a Pending etcd-quorum-guard Pod:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <unknown> default-scheduler 0/6 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 3 node(s) didn't match node selector.
So we're back to "no clear docs around a manual recovery process".
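For context on that scheduling failure, here is a toy feasibility check. The replica count, master node selector, and hostname anti-affinity below are assumed from the usual etcd-quorum-guard setup, not stated in this bug: with one master cordoned for the upgrade and the other two already hosting replicas, every node is excluded for some reason, which is how you end up at 0/6. The real scheduler reports overlapping reasons per node, so the counts in the event above differ from this simplification.

```go
// Toy simulation of the 0/6 FailedScheduling event above; all values are
// illustrative assumptions, not data taken from the affected cluster.
package main

import "fmt"

type node struct {
	name          string
	master        bool
	unschedulable bool
	hasGuardPod   bool // an etcd-quorum-guard replica already runs here
}

// schedulable applies the assumed constraints: master-only node selector,
// node must be schedulable, and hostname anti-affinity against existing
// etcd-quorum-guard replicas.
func schedulable(n node) (bool, string) {
	switch {
	case !n.master:
		return false, "didn't match node selector"
	case n.unschedulable:
		return false, "unschedulable"
	case n.hasGuardPod:
		return false, "didn't satisfy pod anti-affinity"
	}
	return true, ""
}

func main() {
	nodes := []node{
		{"master-0", true, true, false}, // being drained for the upgrade
		{"master-1", true, false, true},
		{"master-2", true, false, true},
		{"worker-0", false, false, false},
		{"worker-1", false, false, false},
		{"worker-2", false, false, false},
	}
	fit := 0
	for _, n := range nodes {
		if ok, why := schedulable(n); ok {
			fit++
		} else {
			fmt.Printf("%s: %s\n", n.name, why)
		}
	}
	fmt.Printf("%d/%d nodes are available\n", fit, len(nodes))
}
```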
Additional pods stuck in Terminating had to be deleted via `oc delete pod --force --grace-period 0`, after which the upgrade ran to completion.

Any scenario in which we have to force delete pods points to a serious regression and should block upgrades.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel like this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
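As a closing footnote on the force deletion mentioned a few comments up, here is a rough programmatic equivalent of `oc delete pod --force --grace-period 0` using client-go. This is a hedged sketch for recent client-go releases only; the namespace and pod name are the examples from this bug, and the CLI remains the normal recovery path.

```go
// Sketch: force delete a stuck pod (grace period 0) with client-go,
// roughly what `oc delete pod --force --grace-period 0` does.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	grace := int64(0)
	err = client.CoreV1().Pods("openshift-kube-apiserver-operator").Delete(
		context.TODO(),
		"kube-apiserver-operator-5d7c58bbb4-27mww",
		metav1.DeleteOptions{GracePeriodSeconds: &grace},
	)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("pod force-deleted")
}
```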