We see the Kube API become unavailable during upgrades on GCP. This is not supposed to happen if graceful termination and LB endpoint reconciliation by the cloud provider work correctly. Note: openshift-apiserver APIs are unavailable too if the kube-apiserver is not serving correctly. This is an umbrella bug, cloned into releases and closed when we are happy with the upgrade stability.
Search [1]. Example 4.6 CI job [2]. The JUnit output includes the following informational "failures":

Kubernetes APIs remain available
API was unreachable during disruption for at least 8s of 30m42s (0%), this is currently sufficient to pass the test/job but not considered completely correct:

Jun 09 10:44:38.405 E kube-apiserver Kube API started failing: etcdserver: leader changed
Jun 09 10:44:39.401 E kube-apiserver Kube API is not responding to GET requests
Jun 09 10:44:39.407 I kube-apiserver Kube API started responding to GET requests
Jun 09 11:02:06.111 E kube-apiserver Kube API started failing: etcdserver: leader changed
Jun 09 11:02:06.401 - 5s E kube-apiserver Kube API is not responding to GET requests
Jun 09 11:02:12.561 I kube-apiserver Kube API started responding to GET requests

OpenShift APIs remain available
API was unreachable during disruption for at least 1s of 30m42s (0%), this is currently sufficient to pass the test/job but not considered completely correct:

Jun 09 11:04:10.126 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-7q68ilb3-2f611.origin-ci-int-gce.dev.openshift.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: unexpected EOF
Jun 09 11:04:10.743 E openshift-apiserver OpenShift API is not responding to GET requests
Jun 09 11:04:11.020 I openshift-apiserver OpenShift API started responding to GET requests

It would be good to turn up an example where we actually failed CI on this, but I haven't found one in the handful I've spot-checked.
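For context on why 8s of unavailability still passes, the quoted percentages follow from simple arithmetic over the monitoring window; here is a quick sketch (the durations come from the JUnit output above, while rounding to a whole percent is an assumption about how the report formats small fractions):

```python
# Sketch: how 8s of unavailability over a 30m42s window reads as "(0%)".
# Durations are from the JUnit output above; rounding to a whole percent is
# an assumed formatting choice, not verified against the report's source.

def disruption_percent(unavailable_s: float, window_s: float) -> float:
    """Fraction of the window the API was unreachable, as a percentage."""
    return 100.0 * unavailable_s / window_s

window = 30 * 60 + 42                       # 30m42s = 1842s
kube = disruption_percent(8, window)        # kube-apiserver: ~0.43%
openshift = disruption_percent(1, window)   # openshift-apiserver: ~0.05%

print(f"kube-apiserver: {kube:.2f}% (rounds to {round(kube)}%)")
print(f"openshift-apiserver: {openshift:.2f}% (rounds to {round(openshift)}%)")
```

So both disruptions round down to 0% of the window, which is why the job still passes even though the blips are real.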
If it did fail the test container, the failing test-case would be:

[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [Suite:openshift]

[1]: https://search.svc.ci.openshift.org/?name=release-openshift-.*gcp.*upgrade&search=API%20was%20unreachable%20during%20upgrade%20for%20at%20least
[2]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.6/304
Oops, the search in [1] should have been: https://search.svc.ci.openshift.org/?name=release-openshift-.*gcp.*upgrade&search=API%20was%20unreachable%20during%20disruption%20for%20at%20least
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?
- Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
- All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact?
- Up to 2 minutes of disruption in edge routing
- Up to 90 seconds of API downtime
- etcd loses quorum and you have to restore from backup

How involved is remediation?
- Issue resolves itself after five minutes
- Admin uses oc to fix things
- Admin must SSH to hosts, restore from backups, or perform other non-standard admin activities

Is this a regression?
- No, it's always been like this, we just never noticed
- Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1
Adding UpgradeBlocker, as the forked bug https://bugzilla.redhat.com/show_bug.cgi?id=1845416 has the label.
Removing UpgradeBlocker per https://bugzilla.redhat.com/show_bug.cgi?id=1845416#c5.
*** Bug 1779938 has been marked as a duplicate of this bug. ***
Work in progress.
The "1-4 second failures" seem to be rooted in `etcdserver: leader changed`, which is absolutely expected to happen during upgrades. Are we missing something really simple, like having the apiserver retry the request when it encounters that error?
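To make the suggestion concrete, here is a minimal sketch of the kind of fast retry being proposed. This is not the actual apiserver etcd client code; the retry budget, backoff values, and the `flaky_get` helper are illustrative assumptions, with only the error string taken from the CI logs above:

```python
# Illustrative only: retry a storage call when etcd reports a leader change.
# Error string is from the CI logs; retry count and backoff are assumptions.
import time

LEADER_CHANGED = "etcdserver: leader changed"

def with_leader_change_retry(call, retries=3, backoff_s=0.05):
    """Run `call()`, retrying (with brief, bounded backoff) on leader changes.

    Any other error, or exhausting the retry budget, propagates to the caller.
    """
    for attempt in range(retries + 1):
        try:
            return call()
        except RuntimeError as err:
            if LEADER_CHANGED not in str(err) or attempt == retries:
                raise
            time.sleep(backoff_s * (attempt + 1))

# Hypothetical call that fails once mid leader change, then succeeds:
state = {"failed": False}
def flaky_get():
    if not state["failed"]:
        state["failed"] = True
        raise RuntimeError("etcdserver: leader changed")
    return "ok"

print(with_leader_change_retry(flaky_get))  # the retry absorbs the blip
```

The point of the sketch: a leader change resolves within milliseconds, so one cheap retry inside the apiserver could hide it entirely from clients and from the LB health checks.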
Related to, and possibly caused by, bug 1870274, with the theory being something like:

1. etcd is in the middle of a leader change (there may be some etcd downtime even for graceful, moveLeader [1] handoffs).
2. The API server hits the leader-election error and does not perform the expected fast retry (bug 1870274).
3. GCP's twitchy load balancer gets confused by the API server errors, since all API servers would be seeing etcd leader-election issues at the same time.
4. A bit of chaos, taking a few LB-health-check cycles to recover.

[1]: https://github.com/etcd-io/etcd/blob/facd0c946025f07ed8c1ba7d2bb2d80baa17c194/etcdserver/api/v3rpc/maintenance.go#L238
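Step 4 is where most of the visible outage would come from, and it can be bounded with back-of-the-envelope arithmetic. The interval and threshold values below are illustrative placeholders, not the actual GCP health-check configuration for these clusters:

```python
# Rough bound for step 4: how long a few LB health-check cycles cost.
# Interval/threshold values are assumed for illustration, not GCP's actual
# configuration for this cluster.

def worst_case_outage_s(interval_s, unhealthy_threshold, healthy_threshold):
    """Seconds to mark a backend unhealthy and then healthy again."""
    mark_down = interval_s * unhealthy_threshold  # consecutive failed probes
    mark_up = interval_s * healthy_threshold      # consecutive passing probes
    return mark_down + mark_up

# e.g. probing every 2s, 3 failures to eject, 2 successes to restore:
print(worst_case_outage_s(2, 3, 2))  # -> 10 seconds
```

With numbers in that range, the recovery window lands on the same order as the 1-8s blips in the JUnit output, which is consistent with the theory that the LB is amplifying a much shorter etcd hiccup.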
This is an umbrella bug for gcp API disruption. Labelling with UpcomingSprint.
*** Bug 1868741 has been marked as a duplicate of this bug. ***
This is an umbrella bug for API disruption. Labelling with UpcomingSprint.
This bug hasn't had any activity in the last 30 days. Maybe the problem was resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it; otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen to Keywords if you think this bug should never be marked as stale. Please consult with the bug assignee before you do that.
The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.