Bug 1868741 - [4.5] gcp: upgrade kube API disruption in CI
Summary: [4.5] gcp: upgrade kube API disruption in CI
Keywords:
Status: CLOSED DUPLICATE of bug 1845410
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.z
Assignee: Stefan Schimanski
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks: 1845411 1845416 1869788 1869790
 
Reported: 2020-08-13 17:03 UTC by Micah Abbott
Modified: 2020-09-11 15:13 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1845410
Environment:
Kubernetes APIs remain available
Last Closed: 2020-09-11 14:59:26 UTC
Target Upstream Version:
Embargoed:



Description Micah Abbott 2020-08-13 17:03:36 UTC
+++ This bug was initially created as a clone of Bug #1845410 +++

We see the Kube API becoming unavailable during upgrades on GCP.

This is not supposed to happen if graceful termination and LB endpoint reconciliation by the cloud provider work correctly.

Note: openshift-apiserver APIs are unavailable too if the kube-apiserver is not serving correctly.

This is an umbrella bug, cloned into releases and closed when we are happy with the upgrade stability.
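
For context, the expected sequence behind "graceful termination and LB endpoint reconciliation" looks roughly like the following. This is a minimal illustrative Go sketch, not the actual kube-apiserver termination code; the /readyz path, port, and drain durations are assumptions chosen for the example. On SIGTERM the server starts failing its readiness probe while continuing to serve, waits long enough for the cloud load balancer to reconcile its endpoints and stop sending new connections, and only then shuts the listener down:

  // Illustrative sketch only; not the real kube-apiserver shutdown code.
  package main

  import (
      "context"
      "net/http"
      "os"
      "os/signal"
      "sync/atomic"
      "syscall"
      "time"
  )

  func main() {
      var terminating atomic.Bool

      mux := http.NewServeMux()
      // Readiness endpoint: starts failing as soon as termination begins so
      // the cloud LB can drop this backend before the listener goes away.
      mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
          if terminating.Load() {
              http.Error(w, "shutting down", http.StatusInternalServerError)
              return
          }
          w.WriteHeader(http.StatusOK)
      })
      mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
          w.WriteHeader(http.StatusOK)
      })

      srv := &http.Server{Addr: ":6443", Handler: mux}

      go func() {
          sig := make(chan os.Signal, 1)
          signal.Notify(sig, syscall.SIGTERM)
          <-sig

          terminating.Store(true)      // step 1: fail readiness, keep serving
          time.Sleep(70 * time.Second) // step 2: assumed drain window for LB reconciliation

          // Step 3: stop accepting new connections, finish in-flight requests.
          ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
          defer cancel()
          _ = srv.Shutdown(ctx)
      }()

      _ = srv.ListenAndServe()
  }

If either half of that contract breaks (the server stops listening before the LB has reconciled, or the LB keeps routing to a dead backend), clients see exactly the kind of GET disruption quoted in the CI output below.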

--- Additional comment from W. Trevor King on 2020-06-09 19:32:16 UTC ---

Search [1].  Example 4.6 CI job [2].  JUnit includes the following informer "failures":

Kubernetes APIs remain available

API was unreachable during disruption for at least 8s of 30m42s (0%), this is currently sufficient to pass the test/job but not considered completely correct:

Jun 09 10:44:38.405 E kube-apiserver Kube API started failing: etcdserver: leader changed
Jun 09 10:44:39.401 E kube-apiserver Kube API is not responding to GET requests
Jun 09 10:44:39.407 I kube-apiserver Kube API started responding to GET requests
Jun 09 11:02:06.111 E kube-apiserver Kube API started failing: etcdserver: leader changed
Jun 09 11:02:06.401 - 5s E kube-apiserver Kube API is not responding to GET requests
Jun 09 11:02:12.561 I kube-apiserver Kube API started responding to GET requests

OpenShift APIs remain available

API was unreachable during disruption for at least 1s of 30m42s (0%), this is currently sufficient to pass the test/job but not considered completely correct:

Jun 09 11:04:10.126 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-7q68ilb3-2f611.origin-ci-int-gce.dev.openshift.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: unexpected EOF
Jun 09 11:04:10.743 E openshift-apiserver OpenShift API is not responding to GET requests
Jun 09 11:04:11.020 I openshift-apiserver OpenShift API started responding to GET requests

Would be good to turn up an example where we actually failed CI on this, but I haven't found one in the handful I've spot-checked.  If it did fail the test container, the failing test-case would be:

[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [Suite:openshift]

[1]: https://search.svc.ci.openshift.org/?name=release-openshift-.*gcp.*upgrade&search=API%20was%20unreachable%20during%20upgrade%20for%20at%20least
[2]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.6/304
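
For reference, the percentages in those messages are just unavailable time divided by total run time. Below is a minimal Go sketch of that arithmetic; the 1% tolerance and the helper name are assumptions for illustration only, while the real check lives in openshift/origin's test/extended/util/disruption package (quoted further down as disruption.go:237):

  package main

  import (
      "fmt"
      "time"
  )

  // disruptionFraction returns the share of the run during which the API was
  // unreachable, and whether that share exceeds the given tolerance.
  // The tolerance value is an assumption for this example, not origin's.
  func disruptionFraction(unavailable, total time.Duration, tolerance float64) (float64, bool) {
      frac := unavailable.Seconds() / total.Seconds()
      return frac, frac > tolerance
  }

  func main() {
      // The "8s of 30m42s (0%)" case from the JUnit output quoted above.
      frac, fails := disruptionFraction(8*time.Second, 30*time.Minute+42*time.Second, 0.01)
      fmt.Printf("unreachable for %.2f%% of the run, exceeds tolerance: %v\n", 100*frac, fails)

      // A hypothetical sustained outage, for contrast.
      frac, fails = disruptionFraction(6*time.Minute, 55*time.Minute, 0.01)
      fmt.Printf("unreachable for %.2f%% of the run, exceeds tolerance: %v\n", 100*frac, fails)
  }

The 8s of 30m42s case works out to roughly 0.4%, which is why it is reported as (0%) and still passes, while a sustained multi-minute outage, like the 11% Azure example quoted later in this bug, trips the failure path.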

--- Additional comment from W. Trevor King on 2020-06-09 19:36:52 UTC ---

Oops, search should have been: https://search.svc.ci.openshift.org/?name=release-openshift-.*gcp.*upgrade&search=API%20was%20unreachable%20during%20disruption%20for%20at%20least

--- Additional comment from Lalatendu Mohanty on 2020-06-17 14:37:03 UTC ---

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?
  Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression?
  No, it’s always been like this we just never noticed
  Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

--- Additional comment from Lalatendu Mohanty on 2020-06-17 14:37:57 UTC ---

Adding Upgradeblocker as the forked bug https://bugzilla.redhat.com/show_bug.cgi?id=1845416 has the label.

--- Additional comment from Lalatendu Mohanty on 2020-06-17 14:39:19 UTC ---

Removing Upgradeblocker as per https://bugzilla.redhat.com/show_bug.cgi?id=1845416#c5

--- Additional comment from Stefan Schimanski on 2020-06-18 10:03:16 UTC ---



--- Additional comment from Stefan Schimanski on 2020-06-18 11:38:58 UTC ---

Work in progress.

--- Additional comment from Lalatendu Mohanty on 2020-06-23 12:23:13 UTC ---

FYI, Lots of 4.5 jobs also have the same symptom https://search.apps.build01.ci.devcluster.openshift.com/?maxAge=168h&context=1&type=junit&maxMatches=5&maxBytes=20971520&groupBy=job&name=4.5.*azure|azure.*4.5&search=OpenShift+APIs+remain+available

--- Additional comment from W. Trevor King on 2020-06-24 05:35:45 UTC ---

(In reply to Lalatendu Mohanty from comment #8)
> FYI, Lots of 4.5 jobs also have the same symptom
> https://search.apps.build01.ci.devcluster.openshift.com/?maxAge=168h&context=1&type=junit&maxMatches=5&maxBytes=20971520&groupBy=job&name=4.5.*azure|azure.*4.5&search=OpenShift+APIs+remain+available

For what it's worth, a number of those include the "this is not actually fatal" suffix.  Like [1]:

  API was unreachable during disruption for at least 4s of 51m13s (0%), this is currently sufficient to pass the test/job but not considered completely correct:

which passed.  We'll want to tighten that down eventually, but the only things we need to be worried about for CI throughput are actual failures like [2]:

  Jun 23 18:19:00.354: API was unreachable during disruption for at least 6m21s of 56m35s (11%):

In that case we also actually fail the job, with "Cluster should remain functional during upgrade" failing with:

  fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:237]: Jun 23 18:19:00.354: API was unreachable during disruption for at least 6m18s of 56m34s (11%):

So excluding the passing tests drops us from:

  $ w3m -dump -cols 200 'https://search.apps.build01.ci.devcluster.openshift.com/?search=OpenShift+APIs+remain+available&maxAge=168h&context=1&type=junit&name=4.5.*azure%7Cazure.*4.5&maxMatches=5&maxBytes=20971520&groupBy=job' | grep 'failures match'
  release-openshift-origin-installer-e2e-azure-upgrade-4.5-stable-to-4.6-ci - 13 runs, 85% failed, 200% of failures match
  release-openshift-origin-installer-e2e-azure-upgrade-4.5 - 12 runs, 75% failed, 178% of failures match
  release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci - 13 runs, 69% failed, 233% of failures match

to:

  $ w3m -dump -cols 200 'https://search.apps.build01.ci.devcluster.openshift.com/?search=fail.*API+was+unreachable+during+disruption&maxAge=168h&context=1&type=junit&name=4.5.*azure%7Cazure.*4.5&maxMatches=5&maxBytes=20971520&groupBy=job' | grep 'failures match'
  release-openshift-origin-installer-e2e-azure-upgrade-4.5-stable-to-4.6-ci - 31 runs, 84% failed, 31% of failures match
  release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci - 31 runs, 71% failed, 50% of failures match
  release-openshift-origin-installer-e2e-azure-upgrade-4.5 - 31 runs, 65% failed, 40% of failures match

Also, this ticket is about GCP, per the subject and comment 0.  These searches are just looking at Azure.  Going up to the cross-platform bug 1845411 and dropping down into Azure gets us to bug 1845414.  But there's already some CI-search poking over in that bug, so probably not worth copying this stuff over.

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.5-stable-to-4.6-ci/1275282888492847104
[2]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.5-stable-to-4.6-ci/1275464620366106624

--- Additional comment from Michal Fojtik on 2020-07-09 12:46:13 UTC ---

Stefan is on PTO; adding UpcomingSprint to his bugs to fulfill the duty.

--- Additional comment from Stefan Schimanski on 2020-08-03 11:23:36 UTC ---

WIP.

Comment 1 Micah Abbott 2020-08-13 17:31:01 UTC
Seeing impact to Kube API during 4.4 -> 4.5 upgrade:


API was unreachable during disruption for at least 4s of 42m24s (0%):

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci/1293629892138635264


API was unreachable during disruption for at least 6s of 55m29s (0%):

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci/1293539093396852736


API was unreachable during disruption for at least 2s of 40m30s (0%):

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci/1293720758484209664

Comment 3 Stefan Schimanski 2020-09-11 14:59:26 UTC
Umbrella bugs are used to collect different issues under one root. Don't clone them into releases; we have the umbrellas for a reason. N copies make it even harder to keep an overview of the different root causes behind the same symptoms.

We backport fixes into older releases if they are feasible. We have clones for that.

Comment 4 Stefan Schimanski 2020-09-11 15:13:47 UTC

*** This bug has been marked as a duplicate of bug 1845410 ***

