Bug 1845412 - aws: upgrade kube API disruption [NEEDINFO]
Summary: aws: upgrade kube API disruption
Keywords:
Status: CLOSED DUPLICATE of bug 1943804
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Stefan Schimanski
QA Contact: Xingxing Xia
URL:
Whiteboard: LifecycleReset
Duplicates: 1801885 1868496 (view as bug list)
Depends On:
Blocks: 1791162 1801885 1845411 1869788 1869790
 
Reported: 2020-06-09 07:43 UTC by Stefan Schimanski
Modified: 2021-06-08 14:56 UTC (History)
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1845410
Clones: 1845414 (view as bug list)
Environment:
Last Closed: 2021-06-08 14:56:19 UTC
Target Upstream Version:
Embargoed:
mfojtik: needinfo?



Description Stefan Schimanski 2020-06-09 07:43:03 UTC
We see the Kube API become unavailable during upgrades on AWS.

This is not supposed to happen if graceful termination and LB endpoint reconciliation by the cloud provider work correctly.

Note: openshift-apiserver APIs are unavailable too if the kube-apiserver is not serving correctly.

This is an umbrella bug; it is cloned into releases and will be closed when we are happy with the upgrade stability.
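
For reference, one rough way to observe this kind of disruption from outside the cluster is to poll the API in a loop while the upgrade runs. The sketch below is an illustration only (the /readyz endpoint, the one-second interval, and the use of `oc whoami --show-server` are assumptions of mine), not the check the origin disruption test actually performs:

#!/usr/bin/env bash
# Probe the kube-apiserver once a second and log every failed request;
# long runs of failures correspond to the disruption windows the test reports.
APISERVER="$(oc whoami --show-server)"   # e.g. https://api.<cluster-domain>:6443
while true; do
  if ! curl -ks --max-time 2 "${APISERVER}/readyz" >/dev/null; then
    echo "$(date -u +%FT%TZ) kube-apiserver unreachable"
  fi
  sleep 1
done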

Comment 1 W. Trevor King 2020-06-09 20:04:15 UTC
Bug 1801885 gave [1] as a 4.3.1 -> 4.4.0-0.ci-2020-02-11-153441 example that failed on:

[Disruptive] Cluster upgrade [Top Level] [Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Serial] [Suite:openshift]

with:

fail [github.com/openshift/origin/test/extended/util/disruption/controlplane/controlplane.go:56]: Feb 11 17:07:56.230: API was unreachable during upgrade for at least 2m3s:

The error message has evolved since, with "during upgrade" -> "during disruption".  CI search turns up a number of recent hits [2], but the bulk are in release-openshift-origin-installer-e2e-aws-upgrade, which is used by cluster-bot for update tests launched with all sorts of source and target versions.  I don't see anything serious that's obviously 4.6-specific:

$ w3m -dump -cols 200 'https://search.svc.ci.openshift.org/?name=release-openshift-.*aws.*upgrade&search=API%20was%20unreachable%20during%20disruption%20for%20at%20least' | grep upgrade
release-openshift-origin-installer-e2e-aws-upgrade - 384 runs, 41% failed, 66% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.3 - 2 runs, 0% failed, 50% of runs match
release-openshift-origin-installer-e2e-aws-upgrade-4.4-stable-to-4.4-ci - 2 runs, 0% failed, 50% of runs match
release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2-to-4.3 - 3 runs, 100% failed, 67% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3 - 4 runs, 0% failed, 75% of runs match
release-openshift-origin-installer-e2e-aws-upgrade-4.2-nightly-to-4.3-nightly - 4 runs, 0% failed, 75% of runs match
release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci - 12 runs, 42% failed, 80% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-nightly - 3 runs, 100% failed, 67% of failures match
release-openshift-okd-installer-e2e-aws-upgrade - 32 runs, 28% failed, 189% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.4-to-4.5 - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.4-to-4.4 - 1 runs, 0% failed, 100% of runs match
release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-to-4.4-nightly - 3 runs, 100% failed, 33% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3-to-4.4-to-4.5-ci - 3 runs, 100% failed, 33% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.4 - 3 runs, 0% failed, 33% of runs match

And even then, a fair number of those hits are the non-fatal informer flavor, with:

  ...this is currently sufficient to pass the test/job but not considered completely correct...

It would be good to have folks link a 4.6 job that failed on this, if this is in fact still happening.
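
For example, the same query can presumably be narrowed to 4.6-specific jobs by tightening the name regex, since the job names above carry the version; this is just an untested variant of the search above, with no results attached:

$ w3m -dump -cols 200 'https://search.svc.ci.openshift.org/?name=release-openshift-.*4.6.*aws.*upgrade&search=API%20was%20unreachable%20during%20disruption%20for%20at%20least' | grep upgrade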

[1] https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/17130
[2] https://search.svc.ci.openshift.org/?name=release-openshift-.*aws.*upgrade&search=API%20was%20unreachable%20during%20disruption%20for%20at%20least

Comment 2 Lukasz Szaszkiewicz 2020-06-18 10:23:22 UTC
*** Bug 1801885 has been marked as a duplicate of this bug. ***

Comment 3 Stefan Schimanski 2020-06-18 11:39:37 UTC
Work in progress.

Comment 6 Stefan Schimanski 2020-08-21 11:48:26 UTC
*** Bug 1865857 has been marked as a duplicate of this bug. ***

Comment 7 Stefan Schimanski 2020-08-28 08:15:31 UTC
*** Bug 1868496 has been marked as a duplicate of this bug. ***

Comment 8 Stefan Schimanski 2020-09-11 14:47:06 UTC
This is an umbrella bug for aws API disruption. Labelling with UpcomingSprint.

Comment 9 Michal Fojtik 2020-09-27 09:02:56 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 10 Michal Fojtik 2021-02-26 14:07:09 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 11 W. Trevor King 2021-06-03 14:30:43 UTC
Should this be closed as a dup of bug 1943804?

Comment 12 Michal Fojtik 2021-06-03 15:29:18 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 13 Michal Fojtik 2021-06-08 14:56:19 UTC

*** This bug has been marked as a duplicate of bug 1943804 ***

