+++ This bug was initially created as a clone of Bug #1845410 +++

We see the Kube API become unavailable during upgrades on GCP. This is not supposed to happen if graceful termination and LB endpoint reconciliation by the cloud provider work correctly. Note: openshift-apiserver APIs are unavailable too if the kube-apiserver is not serving correctly.

This is an umbrella bug, cloned into releases and closed when we are happy with the upgrade stability.

--- Additional comment from W. Trevor King on 2020-06-09 19:32:16 UTC ---

Search [1]. Example 4.6 CI job [2]. JUnit includes the following informer "failures":

Kubernetes APIs remain available

API was unreachable during disruption for at least 8s of 30m42s (0%), this is currently sufficient to pass the test/job but not considered completely correct:

Jun 09 10:44:38.405 E kube-apiserver Kube API started failing: etcdserver: leader changed
Jun 09 10:44:39.401 E kube-apiserver Kube API is not responding to GET requests
Jun 09 10:44:39.407 I kube-apiserver Kube API started responding to GET requests
Jun 09 11:02:06.111 E kube-apiserver Kube API started failing: etcdserver: leader changed
Jun 09 11:02:06.401 - 5s E kube-apiserver Kube API is not responding to GET requests
Jun 09 11:02:12.561 I kube-apiserver Kube API started responding to GET requests

OpenShift APIs remain available

API was unreachable during disruption for at least 1s of 30m42s (0%), this is currently sufficient to pass the test/job but not considered completely correct:

Jun 09 11:04:10.126 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-7q68ilb3-2f611.origin-ci-int-gce.dev.openshift.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: unexpected EOF
Jun 09 11:04:10.743 E openshift-apiserver OpenShift API is not responding to GET requests
Jun 09 11:04:11.020 I openshift-apiserver OpenShift API started responding to GET requests

Would be good to turn up an example
where we actually failed CI on this, but I haven't found one in the handful I've spot-checked. If it did fail the test container, the failing test-case would be:

[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [Suite:openshift]

[1]: https://search.svc.ci.openshift.org/?name=release-openshift-.*gcp.*upgrade&search=API%20was%20unreachable%20during%20upgrade%20for%20at%20least
[2]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.6/304

--- Additional comment from W. Trevor King on 2020-06-09 19:36:52 UTC ---

Oops, search should have been:

https://search.svc.ci.openshift.org/?name=release-openshift-.*gcp.*upgrade&search=API%20was%20unreachable%20during%20disruption%20for%20at%20least

--- Additional comment from Lalatendu Mohanty on 2020-06-17 14:37:03 UTC ---

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?
  Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup

How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression?
  No, it's always been like this, we just never noticed
  Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1

--- Additional comment from Lalatendu Mohanty on 2020-06-17 14:37:57 UTC ---

Adding UpgradeBlocker as the forked bug https://bugzilla.redhat.com/show_bug.cgi?id=1845416 has the label.

--- Additional comment from Lalatendu Mohanty on 2020-06-17 14:39:19 UTC ---

Removing UpgradeBlocker as per https://bugzilla.redhat.com/show_bug.cgi?id=1845416#c5

--- Additional comment from Stefan Schimanski on 2020-06-18 10:03:16 UTC ---

--- Additional comment from Stefan Schimanski on 2020-06-18 11:38:58 UTC ---

Work in progress.

--- Additional comment from Lalatendu Mohanty on 2020-06-23 12:23:13 UTC ---

FYI, lots of 4.5 jobs also have the same symptom:

https://search.apps.build01.ci.devcluster.openshift.com/?maxAge=168h&context=1&type=junit&maxMatches=5&maxBytes=20971520&groupBy=job&name=4.5.*azure|azure.*4.5&search=OpenShift+APIs+remain+available

--- Additional comment from W. Trevor King on 2020-06-24 05:35:45 UTC ---

(In reply to Lalatendu Mohanty from comment #8)
> FYI, lots of 4.5 jobs also have the same symptom
> https://search.apps.build01.ci.devcluster.openshift.com/?maxAge=168h&context=1&type=junit&maxMatches=5&maxBytes=20971520&groupBy=job&name=4.5.*azure|azure.*4.5&search=OpenShift+APIs+remain+available

For what it's worth, a number of those include the "this is not actually fatal" suffix. Like [1]:

API was unreachable during disruption for at least 4s of 51m13s (0%), this is currently sufficient to pass the test/job but not considered completely correct:

which passed.
We'll want to tighten that down eventually, but the only things we need to worry about for CI throughput are actual failures like [2]:

Jun 23 18:19:00.354: API was unreachable during disruption for at least 6m21s of 56m35s (11%):

In that case we also actually fail the job, with "Cluster should remain functional during upgrade" failing with:

fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:237]: Jun 23 18:19:00.354: API was unreachable during disruption for at least 6m18s of 56m34s (11%):

So excluding the passing tests drops us from:

$ w3m -dump -cols 200 'https://search.apps.build01.ci.devcluster.openshift.com/?search=OpenShift+APIs+remain+available&maxAge=168h&context=1&type=junit&name=4.5.*azure%7Cazure.*4.5&maxMatches=5&maxBytes=20971520&groupBy=job' | grep 'failures match'
release-openshift-origin-installer-e2e-azure-upgrade-4.5-stable-to-4.6-ci - 13 runs, 85% failed, 200% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.5 - 12 runs, 75% failed, 178% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci - 13 runs, 69% failed, 233% of failures match

to:

$ w3m -dump -cols 200 'https://search.apps.build01.ci.devcluster.openshift.com/?search=fail.*API+was+unreachable+during+disruption&maxAge=168h&context=1&type=junit&name=4.5.*azure%7Cazure.*4.5&maxMatches=5&maxBytes=20971520&groupBy=job' | grep 'failures match'
release-openshift-origin-installer-e2e-azure-upgrade-4.5-stable-to-4.6-ci - 31 runs, 84% failed, 31% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci - 31 runs, 71% failed, 50% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.5 - 31 runs, 65% failed, 40% of failures match

Also, this ticket is about GCP, per the subject and comment 0. These searches are just looking at Azure. Going up to the cross-platform bug 1845411 and dropping down into Azure gets us to bug 1845414.
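(Aside on reading those summary lines: "N% of failures match" appears to divide matching runs by failing runs, which is presumably how the first search reports values over 100% — the non-fatal "API was unreachable ..." suffix also shows up in runs that passed. A small sketch of that assumed arithmetic; this is an interpretation of the output, not CI-search's actual source:)

```go
package main

import "fmt"

// failuresMatchPercent sketches the assumed ratio behind a CI-search
// summary line: matching runs divided by failing runs. Because a search
// string can also match runs that passed, the result can exceed 100%.
func failuresMatchPercent(matchingRuns, failedRuns int) int {
	return 100 * matchingRuns / failedRuns
}

func main() {
	// 13 runs at 85% failed is roughly 11 failures; if the string
	// matched 22 runs (including passing ones), that reports as 200%.
	fmt.Println(failuresMatchPercent(22, 11)) // → 200
}
```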
But there's already some CI-search poking over in bug 1845414, so probably not worth copying this stuff over.

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.5-stable-to-4.6-ci/1275282888492847104
[2]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.5-stable-to-4.6-ci/1275464620366106624

--- Additional comment from Michal Fojtik on 2020-07-09 12:46:13 UTC ---

Stefan is on PTO, adding UpcomingSprint to his bugs to fulfill the duty.

--- Additional comment from Stefan Schimanski on 2020-08-03 11:23:36 UTC ---

WIP.
Seeing impact to the Kube API during 4.4 -> 4.5 upgrades:

API was unreachable during disruption for at least 4s of 42m24s (0%):
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci/1293629892138635264

API was unreachable during disruption for at least 6s of 55m29s (0%):
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci/1293539093396852736

API was unreachable during disruption for at least 2s of 40m30s (0%):
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci/1293720758484209664
Umbrella bugs are used to collect different issues under one root. Don't clone them into releases; we have the umbrellas for a reason. N copies make it even harder to keep an overview of the different root causes behind the same symptoms. We backport fixes into older releases if they are feasible, and we have clones for that.
*** This bug has been marked as a duplicate of bug 1845410 ***