We see the Kube API become unavailable during upgrades on GCP. This is not supposed to happen if graceful termination and LB endpoint reconciliation by the cloud provider work correctly. Note: openshift-apiserver APIs are unavailable too if the kube-apiserver is not serving correctly. This is an umbrella bug, cloned into releases and closed when we are happy with the upgrade stability.
Search [1]. Example 4.6 CI job [2]. The JUnit output includes the following informational "failures":

Kubernetes APIs remain available
API was unreachable during disruption for at least 8s of 30m42s (0%), this is currently sufficient to pass the test/job but not considered completely correct:

Jun 09 10:44:38.405 E kube-apiserver Kube API started failing: etcdserver: leader changed
Jun 09 10:44:39.401 E kube-apiserver Kube API is not responding to GET requests
Jun 09 10:44:39.407 I kube-apiserver Kube API started responding to GET requests
Jun 09 11:02:06.111 E kube-apiserver Kube API started failing: etcdserver: leader changed
Jun 09 11:02:06.401 - 5s E kube-apiserver Kube API is not responding to GET requests
Jun 09 11:02:12.561 I kube-apiserver Kube API started responding to GET requests

OpenShift APIs remain available
API was unreachable during disruption for at least 1s of 30m42s (0%), this is currently sufficient to pass the test/job but not considered completely correct:

Jun 09 11:04:10.126 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-7q68ilb3-2f611.origin-ci-int-gce.dev.openshift.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: unexpected EOF
Jun 09 11:04:10.743 E openshift-apiserver OpenShift API is not responding to GET requests
Jun 09 11:04:11.020 I openshift-apiserver OpenShift API started responding to GET requests

It would be good to turn up an example where we actually failed CI on this, but I haven't found one in the handful I've spot-checked.
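For context on why 8s of unavailability still passes, the quoted percentages follow from simple arithmetic over the monitoring window; here is a quick sketch (the durations come from the JUnit output above, while rounding to a whole percent is an assumption about how the report formats small fractions):

```python
# Sketch: how 8s of unavailability over a 30m42s window reads as "(0%)".
# Durations are from the JUnit output above; rounding to a whole percent is
# an assumed formatting choice, not verified against the report's source.

def disruption_percent(unavailable_s: float, window_s: float) -> float:
    """Fraction of the window the API was unreachable, as a percentage."""
    return 100.0 * unavailable_s / window_s

window = 30 * 60 + 42                       # 30m42s = 1842s
kube = disruption_percent(8, window)        # kube-apiserver: ~0.43%
openshift = disruption_percent(1, window)   # openshift-apiserver: ~0.05%

print(f"kube-apiserver: {kube:.2f}% (rounds to {round(kube)}%)")
print(f"openshift-apiserver: {openshift:.2f}% (rounds to {round(openshift)}%)")
```

So both disruptions round down to 0% of the window, which is why the job still passes even though the blips are real.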
If it did fail the test container, the failing test-case would be:

[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [Suite:openshift]

[1]: https://search.svc.ci.openshift.org/?name=release-openshift-.*gcp.*upgrade&search=API%20was%20unreachable%20during%20upgrade%20for%20at%20least
[2]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.6/304
Oops, the search in [1] should have been: https://search.svc.ci.openshift.org/?name=release-openshift-.*gcp.*upgrade&search=API%20was%20unreachable%20during%20disruption%20for%20at%20least
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?
- Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
- All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact?
- Up to 2 minutes of disruption in edge routing
- Up to 90 seconds of API downtime
- etcd loses quorum and you have to restore from backup

How involved is remediation?
- Issue resolves itself after five minutes
- Admin uses oc to fix things
- Admin must SSH to hosts, restore from backups, or perform other non-standard admin activities

Is this a regression?
- No, it's always been like this, we just never noticed
- Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1
Adding UpgradeBlocker, as the forked bug https://bugzilla.redhat.com/show_bug.cgi?id=1845416 has the label.
Removing UpgradeBlocker per https://bugzilla.redhat.com/show_bug.cgi?id=1845416#c5.
*** Bug 1779938 has been marked as a duplicate of this bug. ***
Work in progress.
The "1-4 second failures" seem to be rooted in `etcdserver: leader changed`, which is absolutely expected to happen during upgrades. Are we missing something really simple, like having the apiserver retry the request when it encounters that error?
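To make the suggestion concrete, here is a minimal sketch of the kind of fast retry being proposed. This is not the actual apiserver etcd client code; the retry budget, backoff values, and the `flaky_get` helper are illustrative assumptions, with only the error string taken from the CI logs above:

```python
# Illustrative only: retry a storage call when etcd reports a leader change.
# Error string is from the CI logs; retry count and backoff are assumptions.
import time

LEADER_CHANGED = "etcdserver: leader changed"

def with_leader_change_retry(call, retries=3, backoff_s=0.05):
    """Run `call()`, retrying (with brief, bounded backoff) on leader changes.

    Any other error, or exhausting the retry budget, propagates to the caller.
    """
    for attempt in range(retries + 1):
        try:
            return call()
        except RuntimeError as err:
            if LEADER_CHANGED not in str(err) or attempt == retries:
                raise
            time.sleep(backoff_s * (attempt + 1))

# Hypothetical call that fails once mid leader change, then succeeds:
state = {"failed": False}
def flaky_get():
    if not state["failed"]:
        state["failed"] = True
        raise RuntimeError("etcdserver: leader changed")
    return "ok"

print(with_leader_change_retry(flaky_get))  # the retry absorbs the blip
```

The point of the sketch: a leader change resolves within milliseconds, so one cheap retry inside the apiserver could hide it entirely from clients and from the LB health checks.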
Related to, and possibly caused by, bug 1870274, with the theory being something like:

1. etcd is in the middle of a leader change (there may be some etcd downtime even for graceful, moveLeader [1] handoffs).
2. The API server hits the leader-election error and does not perform the expected fast retry (bug 1870274).
3. GCP's twitchy load balancer gets confused by the API server errors, since all API servers would be seeing etcd leader-election issues at the same time.
4. A bit of chaos, taking a few LB-health-check cycles to recover.

[1]: https://github.com/etcd-io/etcd/blob/facd0c946025f07ed8c1ba7d2bb2d80baa17c194/etcdserver/api/v3rpc/maintenance.go#L238
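Step 4 is where most of the visible outage would come from, and it can be bounded with back-of-the-envelope arithmetic. The interval and threshold values below are illustrative placeholders, not the actual GCP health-check configuration for these clusters:

```python
# Rough bound for step 4: how long a few LB health-check cycles cost.
# Interval/threshold values are assumed for illustration, not GCP's actual
# configuration for this cluster.

def worst_case_outage_s(interval_s, unhealthy_threshold, healthy_threshold):
    """Seconds to mark a backend unhealthy and then healthy again."""
    mark_down = interval_s * unhealthy_threshold  # consecutive failed probes
    mark_up = interval_s * healthy_threshold      # consecutive passing probes
    return mark_down + mark_up

# e.g. probing every 2s, 3 failures to eject, 2 successes to restore:
print(worst_case_outage_s(2, 3, 2))  # -> 10 seconds
```

With numbers in that range, the recovery window lands on the same order as the 1-8s blips in the JUnit output, which is consistent with the theory that the LB is amplifying a much shorter etcd hiccup.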
This is an umbrella bug for gcp API disruption. Labelling with UpcomingSprint.
*** Bug 1868741 has been marked as a duplicate of this bug. ***
This is an umbrella bug for API disruption. Labelling with UpcomingSprint.
This bug hasn't had any activity in the last 30 days. Maybe the problem was resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it; otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen to Keywords if you think this bug should never be marked as stale. Please consult with the bug assignee before you do that.
The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.