Bug 1868735

Summary: [4.5] azure: upgrade kube API disruption
Product: OpenShift Container Platform Reporter: Micah Abbott <miabbott>
Component: kube-apiserver Assignee: Stefan Schimanski <sttts>
Status: CLOSED DUPLICATE QA Contact: Ke Wang <kewang>
Severity: high Docs Contact:
Priority: high    
Version: 4.5 CC: aos-bugs, dosmith, mfojtik, sttts, wking, xxia
Target Milestone: --- Keywords: Upgrades
Target Release: 4.5.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1845414 Environment:
OpenShift APIs remain available Kubernetes APIs remain available
Last Closed: 2020-09-11 15:15:55 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1845411, 1869788, 1869790    

Description Micah Abbott 2020-08-13 16:05:49 UTC
+++ This bug was initially created as a clone of Bug #1845414 +++

+++ This bug was initially created as a clone of Bug #1845412 +++

We see the Kube API become unavailable during upgrades on Azure.

This is not supposed to happen if graceful termination and LB endpoint reconciliation by the cloud provider work correctly.

Note: openshift-apiserver APIs are unavailable too if the kube-apiserver is not serving correctly.

This is an umbrella bug, cloned into releases and closed when we are happy with the upgrade stability.

--- Additional comment from W. Trevor King on 2020-06-09 19:41:16 UTC ---

Search [1].  Example 4.6 CI job [2] failed on:

[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [Suite:openshift]
fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:237]: Jun  8 11:25:56.402: API was unreachable during disruption for at least 5m52s of 35m41s (16%):...

Update JUnit [3] failed the underlying checks:

Kubernetes APIs remain available

Jun 8 11:25:56.402: API was unreachable during disruption for at least 5m52s of 35m41s (16%):
Jun 08 10:57:49.398 E kube-apiserver Kube API started failing: Get https://api.ci-op-jkk4yywi-d89b2.ci.azure.devcluster.openshift.com:6443/api/v1/namespaces/kube-system?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Jun 08 10:57:50.398 - 102s E kube-apiserver Kube API is not responding to GET requests
Jun 08 10:59:33.380 I kube-apiserver Kube API started responding to GET requests
Jun 08 11:03:35.398 E kube-apiserver Kube API started failing: Get https://api.ci-op-jkk4yywi-d89b2.ci.azure.devcluster.openshift.com:6443/api/v1/namespaces/kube-system?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Jun 08 11:03:36.135 I kube-apiserver Kube API started responding to GET requests
Jun 08 11:17:25.398 E kube-apiserver Kube API started failing: Get https://api.ci-op-jkk4yywi-d89b2.ci.azure.devcluster.openshift.com:6443/api/v1/namespaces/kube-system?timeout=15s: context deadline exceeded
Jun 08 11:17:26.398 - 247s E kube-apiserver Kube API is not responding to GET requests
Jun 08 11:21:33.472 I kube-apiserver Kube API started responding to GET requests

OpenShift APIs remain available

Jun 8 11:25:56.402: API was unreachable during disruption for at least 5m48s of 35m40s (16%):
Jun 08 10:57:49.582 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-jkk4yywi-d89b2.ci.azure.devcluster.openshift.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: context deadline exceeded
Jun 08 10:57:50.581 - 102s E openshift-apiserver OpenShift API is not responding to GET requests
Jun 08 10:59:33.408 I openshift-apiserver OpenShift API started responding to GET requests
Jun 08 11:03:35.582 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-jkk4yywi-d89b2.ci.azure.devcluster.openshift.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Jun 08 11:03:36.139 I openshift-apiserver OpenShift API started responding to GET requests
Jun 08 11:17:25.582 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-jkk4yywi-d89b2.ci.azure.devcluster.openshift.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Jun 08 11:17:26.581 - 246s E openshift-apiserver OpenShift API is not responding to GET requests
Jun 08 11:21:33.481 I openshift-apiserver OpenShift API started responding to GET requests

[1]: https://search.svc.ci.openshift.org/?name=release-openshift-.*azure.*upgrade&search=API%20was%20unreachable%20during%20disruption%20for%20at%20least
[2]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.6/300
[3]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.6/300/artifacts/e2e-azure-upgrade/junit/junit_upgrade_1591615873.xml
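
For triage, the explicit outage intervals in the JUnit output above can be totalled per component. The following is a minimal sketch, not part of the origin test suite: the function name `outage_seconds` and the regular expression are assumptions based only on the log format quoted in this comment. It counts only the "- Ns" interval entries and does not try to reproduce how the test arrives at its own total.

```python
import re

# Matches interval entries such as
# "Jun 08 10:57:50.398 - 102s E kube-apiserver Kube API is not responding to GET requests".
# The pattern is an assumption derived from the log lines quoted in this bug.
INTERVAL = re.compile(r"\d{2}:\d{2}:\d{2}\.\d{3} - (\d+)s E (\S+) ")


def outage_seconds(log: str) -> dict:
    """Sum the explicit '- Ns' outage intervals per component (hypothetical helper)."""
    totals = {}
    for seconds, component in INTERVAL.findall(log):
        totals[component] = totals.get(component, 0) + int(seconds)
    return totals


# Excerpt of the kube-apiserver events from the comment above.
log = (
    "Jun 08 10:57:50.398 - 102s E kube-apiserver Kube API is not responding to GET requests "
    "Jun 08 10:59:33.380 I kube-apiserver Kube API started responding to GET requests "
    "Jun 08 11:17:26.398 - 247s E kube-apiserver Kube API is not responding to GET requests"
)
print(outage_seconds(log))  # {'kube-apiserver': 349}
```

For the kube-apiserver the two intervals add up to 349s against a reported 5m52s (352s). For the openshift-apiserver the intervals of 102s and 246s add up to 348s, which equals the reported 5m48s.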

--- Additional comment from Stefan Schimanski on 2020-06-18 11:39:49 UTC ---

Work in progress.

--- Additional comment from Michal Fojtik on 2020-07-09 12:46:07 UTC ---

Stefan is on PTO; adding UpcomingSprint to his bugs to fulfill the duty.

--- Additional comment from Stefan Schimanski on 2020-08-03 11:24:19 UTC ---

WIP.

Comment 1 Micah Abbott 2020-08-13 16:10:25 UTC
We are seeing the Kubernetes and OpenShift APIs being impacted in 4.4 -> 4.5 upgrade tests on Azure:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/1293901964882481152

# Kubernetes APIs remain available
API was unreachable during disruption for at least 22s of 51m44s (1%):

# OpenShift APIs remain available
API was unreachable during disruption for at least 8s of 51m44s (0%):

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/1293720254274342912

# Kubernetes APIs remain available
API was unreachable during disruption for at least 28s of 53m37s (1%):

# OpenShift APIs remain available
API was unreachable during disruption for at least 17s of 53m37s (1%):

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/1293629390785089536

# Kubernetes APIs remain available
API was unreachable during disruption for at least 3m49s of 55m59s (7%):

# OpenShift APIs remain available
API was unreachable during disruption for at least 3m50s of 55m59s (7%):
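
To compare runs like the three above, the summary lines can be parsed and the percentage recomputed from the two durations. This is a rough sketch under stated assumptions: the names `to_seconds` and `disruption` are hypothetical, the duration parser handles only whole `h`, `m`, and `s` components as they appear in this bug, and rounding to the nearest percent is merely consistent with the figures quoted here, not a confirmed description of how the test computes it.

```python
import re

# Summary format as quoted in this bug:
# "API was unreachable during disruption for at least 3m49s of 55m59s (7%):"
SUMMARY = re.compile(r"for at least (\S+) of (\S+) \((\d+)%\)")
DURATION_PART = re.compile(r"(\d+)([hms])")
UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def to_seconds(duration: str) -> int:
    """Convert a duration like '55m59s' to seconds (whole h/m/s parts only)."""
    return sum(int(n) * UNIT_SECONDS[u] for n, u in DURATION_PART.findall(duration))


def disruption(line: str):
    """Return (down_seconds, total_seconds, recomputed_percent, reported_percent)."""
    match = SUMMARY.search(line)
    if match is None:
        raise ValueError("not a disruption summary line")
    down = to_seconds(match.group(1))
    total = to_seconds(match.group(2))
    # Rounding to nearest is an assumption that happens to match the quoted values.
    return down, total, round(100 * down / total), int(match.group(3))


print(disruption("API was unreachable during disruption for at least 3m49s of 55m59s (7%):"))
# (229, 3359, 7, 7)
```

Working in seconds rather than percentages makes short outages visible: 8s of 51m44s is reported as 0%, yet it is still a measurable disruption.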

Comment 3 Stefan Schimanski 2020-09-11 15:15:55 UTC
Umbrella bugs are used to collect different issues under one root. Don't clone them into releases. We have the umbrellas for a reason. N copies make it even harder to keep an overview of the different root causes behind the same symptoms.

We backport fixes into older releases if they are feasible. We have clones for that.

*** This bug has been marked as a duplicate of bug 1845414 ***