Bug 1868735 - [4.5] azure: upgrade kube API disruption
Summary: [4.5] azure: upgrade kube API disruption
Keywords:
Status: CLOSED DUPLICATE of bug 1845414
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.z
Assignee: Stefan Schimanski
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks: 1845411 1869788 1869790
Reported: 2020-08-13 16:05 UTC by Micah Abbott
Modified: 2020-09-11 15:15 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1845414
Environment:
OpenShift APIs remain available Kubernetes APIs remain available
Last Closed: 2020-09-11 15:15:55 UTC
Target Upstream Version:


Description Micah Abbott 2020-08-13 16:05:49 UTC
+++ This bug was initially created as a clone of Bug #1845414 +++

+++ This bug was initially created as a clone of Bug #1845412 +++

We see the Kube API become unavailable during upgrades on Azure.

This is not supposed to happen if graceful termination and LB endpoint reconciliation by the cloud provider work correctly.

Note: openshift-apiserver APIs are unavailable too if the kube-apiserver is not serving correctly.

This is an umbrella bug, cloned into releases and closed when we are happy with the upgrade stability.

--- Additional comment from W. Trevor King on 2020-06-09 19:41:16 UTC ---

Search [1].  Example 4.6 CI job [2] failed on:

[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [Suite:openshift]
fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:237]: Jun  8 11:25:56.402: API was unreachable during disruption for at least 5m52s of 35m41s (16%):...

Update JUnit [3] failed the underlying checks:

Kubernetes APIs remain available

Jun 8 11:25:56.402: API was unreachable during disruption for at least 5m52s of 35m41s (16%):
Jun 08 10:57:49.398 E kube-apiserver Kube API started failing: Get https://api.ci-op-jkk4yywi-d89b2.ci.azure.devcluster.openshift.com:6443/api/v1/namespaces/kube-system?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Jun 08 10:57:50.398 - 102s E kube-apiserver Kube API is not responding to GET requests
Jun 08 10:59:33.380 I kube-apiserver Kube API started responding to GET requests
Jun 08 11:03:35.398 E kube-apiserver Kube API started failing: Get https://api.ci-op-jkk4yywi-d89b2.ci.azure.devcluster.openshift.com:6443/api/v1/namespaces/kube-system?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Jun 08 11:03:36.135 I kube-apiserver Kube API started responding to GET requests
Jun 08 11:17:25.398 E kube-apiserver Kube API started failing: Get https://api.ci-op-jkk4yywi-d89b2.ci.azure.devcluster.openshift.com:6443/api/v1/namespaces/kube-system?timeout=15s: context deadline exceeded
Jun 08 11:17:26.398 - 247s E kube-apiserver Kube API is not responding to GET requests
Jun 08 11:21:33.472 I kube-apiserver Kube API started responding to GET requests
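
As a sanity check on the Kube API figures above, the sketch below sums the gaps between each "started failing" event and the next "started responding" event. The timestamps are copied from the log; the pairing rule and the helper names (parse_clock, total_outage) are assumptions made for illustration, not necessarily how the origin disruption test computes its total.

```python
from datetime import datetime, timedelta

# (start, end) wall-clock times copied from the Kube API events above:
# each "started failing" is paired with the next "started responding".
OUTAGES = [
    ("10:57:49.398", "10:59:33.380"),
    ("11:03:35.398", "11:03:36.135"),
    ("11:17:25.398", "11:21:33.472"),
]


def parse_clock(text: str) -> datetime:
    """Parse an HH:MM:SS.mmm time of day (the date part is irrelevant here)."""
    return datetime.strptime(text, "%H:%M:%S.%f")


def total_outage(outages) -> timedelta:
    """Sum the length of every (start, end) outage interval."""
    total = timedelta()
    for start, end in outages:
        total += parse_clock(end) - parse_clock(start)
    return total


if __name__ == "__main__":
    print(f"{total_outage(OUTAGES).total_seconds():.3f}s")
```

Under this pairing assumption the total comes to about 352.8 seconds, which is in line with the reported "at least 5m52s".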

OpenShift APIs remain available

Jun 8 11:25:56.402: API was unreachable during disruption for at least 5m48s of 35m40s (16%):
Jun 08 10:57:49.582 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-jkk4yywi-d89b2.ci.azure.devcluster.openshift.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: context deadline exceeded
Jun 08 10:57:50.581 - 102s E openshift-apiserver OpenShift API is not responding to GET requests
Jun 08 10:59:33.408 I openshift-apiserver OpenShift API started responding to GET requests
Jun 08 11:03:35.582 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-jkk4yywi-d89b2.ci.azure.devcluster.openshift.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Jun 08 11:03:36.139 I openshift-apiserver OpenShift API started responding to GET requests
Jun 08 11:17:25.582 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-jkk4yywi-d89b2.ci.azure.devcluster.openshift.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Jun 08 11:17:26.581 - 246s E openshift-apiserver OpenShift API is not responding to GET requests
Jun 08 11:21:33.481 I openshift-apiserver OpenShift API started responding to GET requests

[1]: https://search.svc.ci.openshift.org/?name=release-openshift-.*azure.*upgrade&search=API%20was%20unreachable%20during%20disruption%20for%20at%20least
[2]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.6/300
[3]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.6/300/artifacts/e2e-azure-upgrade/junit/junit_upgrade_1591615873.xml
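
For readers unfamiliar with this kind of availability log, here is a minimal sketch of how periodic probe results could be collapsed into outage intervals like the "started failing" / "started responding" pairs above. The function name outage_intervals and the (timestamp, ok) sample format are hypothetical; this is not the origin test code.

```python
from typing import Iterable, List, Tuple


def outage_intervals(samples: Iterable[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Collapse (timestamp_seconds, ok) probe samples into (start, end) outages.

    An outage opens at the first failed probe and closes at the next
    successful one. An outage still open when the samples run out is
    dropped, because its end time is unknown.
    """
    intervals: List[Tuple[float, float]] = []
    outage_start = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp  # first failed probe opens an outage
        elif ok and outage_start is not None:
            intervals.append((outage_start, timestamp))  # first good probe closes it
            outage_start = None
    return intervals


if __name__ == "__main__":
    probes = [(0.0, True), (1.0, False), (2.0, False), (3.0, True), (4.0, False), (5.0, True)]
    print(outage_intervals(probes))
```

Keeping the probe results as plain data makes the interval logic easy to check without a live cluster.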

--- Additional comment from Stefan Schimanski on 2020-06-18 11:39:49 UTC ---

Work in progress.

--- Additional comment from Michal Fojtik on 2020-07-09 12:46:07 UTC ---

Stefan is on PTO; adding UpcomingSprint to his bugs to fulfill the duty.

--- Additional comment from Stefan Schimanski on 2020-08-03 11:24:19 UTC ---

WIP.

Comment 1 Micah Abbott 2020-08-13 16:10:25 UTC
We are seeing the Kubernetes and OpenShift APIs impacted in 4.4 -> 4.5 upgrade tests on Azure:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/1293901964882481152

# Kubernetes APIs remain available
API was unreachable during disruption for at least 22s of 51m44s (1%):

# OpenShift APIs remain available
API was unreachable during disruption for at least 8s of 51m44s (0%):

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/1293720254274342912

# Kubernetes APIs remain available
API was unreachable during disruption for at least 28s of 53m37s (1%):

# OpenShift APIs remain available
API was unreachable during disruption for at least 17s of 53m37s (1%):

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/1293629390785089536

# Kubernetes APIs remain available
API was unreachable during disruption for at least 3m49s of 55m59s (7%):

# OpenShift APIs remain available
API was unreachable during disruption for at least 3m50s of 55m59s (7%):
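
The percentages in these results can be reproduced from the two durations on each line. The sketch below parses durations such as "3m49s" and computes the unreachable share; rounding to the nearest whole percent is an assumption that happens to match the reported figures, and the helper names (parse_duration, unreachable_percent) are hypothetical.

```python
import re

# Matches durations such as "22s", "3m49s" or "1h2m3s".
_DURATION = re.compile(r"^(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?$")


def parse_duration(text: str) -> int:
    """Convert a duration like '51m44s' into whole seconds."""
    match = _DURATION.match(text)
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def unreachable_percent(down: str, total: str) -> int:
    """Share of the run during which the API was unreachable, in whole percent."""
    # Assumption: round to the nearest integer, as the reported numbers suggest.
    return round(100 * parse_duration(down) / parse_duration(total))


if __name__ == "__main__":
    for down, total in [("22s", "51m44s"), ("17s", "53m37s"), ("3m49s", "55m59s")]:
        print(f"{down} of {total}: {unreachable_percent(down, total)}%")
```

Note that "8s of 51m44s (0%)" is a rounding result, not an absence of disruption.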

Comment 3 Stefan Schimanski 2020-09-11 15:15:55 UTC
Umbrella bugs are used to collect different issues under one root. Don't clone them into releases. We have the umbrellas for a reason: N copies make it even harder to keep an overview of the different root causes behind the same symptoms.

We backport fixes into older releases if they are feasible. We have clones for that.

*** This bug has been marked as a duplicate of bug 1845414 ***

