Bug 1845414

Summary: azure: upgrade kube API disruption
Product: OpenShift Container Platform
Reporter: Stefan Schimanski <sttts>
Component: kube-apiserver
Assignee: Fabiano Franz <ffranz>
Status: CLOSED NOTABUG
QA Contact: Xingxing Xia <xxia>
Severity: high
Docs Contact:
Priority: medium
Version: 4.6
CC: aos-bugs, dhellmann, fgiudici, jiazha, jlanford, mfojtik, miabbott, rvanderp, skumari, wking, wlewis, xxia
Target Milestone: ---
Keywords: Upgrades
Target Release: ---
Flags: mfojtik: needinfo?
Hardware: Unspecified
OS: Unspecified
Whiteboard: tag-ci LifecycleStale
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1845412
: 1868735
Environment: [sig-api-machinery] OpenShift APIs remain available job=periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-ovn-upgrade=all
Last Closed: 2022-02-25 18:58:54 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1868158, 1881143
Bug Blocks: 1845411, 1869788, 1869790, 1996881

Description Stefan Schimanski 2020-06-09 07:45:00 UTC
+++ This bug was initially created as a clone of Bug #1845412 +++

We see the Kube API become unavailable during upgrades on Azure.

This is not supposed to happen if graceful termination and LB endpoint reconciliation by the cloud provider work correctly.

Note: openshift-apiserver APIs are unavailable too if the kube-apiserver is not serving correctly.

This is an umbrella bug, cloned into releases and closed when we are happy with the upgrade stability.

Comment 1 W. Trevor King 2020-06-09 19:41:16 UTC
Search [1].  Example 4.6 CI job [2] failed on:

[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [Suite:openshift]
fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:237]: Jun  8 11:25:56.402: API was unreachable during disruption for at least 5m52s of 35m41s (16%):...

Update JUnit [3] failed the underlying checks:

Kubernetes APIs remain available

Jun 8 11:25:56.402: API was unreachable during disruption for at least 5m52s of 35m41s (16%):
Jun 08 10:57:49.398 E kube-apiserver Kube API started failing: Get https://api.ci-op-jkk4yywi-d89b2.ci.azure.devcluster.openshift.com:6443/api/v1/namespaces/kube-system?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Jun 08 10:57:50.398 - 102s E kube-apiserver Kube API is not responding to GET requests
Jun 08 10:59:33.380 I kube-apiserver Kube API started responding to GET requests
Jun 08 11:03:35.398 E kube-apiserver Kube API started failing: Get https://api.ci-op-jkk4yywi-d89b2.ci.azure.devcluster.openshift.com:6443/api/v1/namespaces/kube-system?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Jun 08 11:03:36.135 I kube-apiserver Kube API started responding to GET requests
Jun 08 11:17:25.398 E kube-apiserver Kube API started failing: Get https://api.ci-op-jkk4yywi-d89b2.ci.azure.devcluster.openshift.com:6443/api/v1/namespaces/kube-system?timeout=15s: context deadline exceeded
Jun 08 11:17:26.398 - 247s E kube-apiserver Kube API is not responding to GET requests
Jun 08 11:21:33.472 I kube-apiserver Kube API started responding to GET requests

OpenShift APIs remain available

Jun 8 11:25:56.402: API was unreachable during disruption for at least 5m48s of 35m40s (16%):
Jun 08 10:57:49.582 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-jkk4yywi-d89b2.ci.azure.devcluster.openshift.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: context deadline exceeded
Jun 08 10:57:50.581 - 102s E openshift-apiserver OpenShift API is not responding to GET requests
Jun 08 10:59:33.408 I openshift-apiserver OpenShift API started responding to GET requests
Jun 08 11:03:35.582 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-jkk4yywi-d89b2.ci.azure.devcluster.openshift.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Jun 08 11:03:36.139 I openshift-apiserver OpenShift API started responding to GET requests
Jun 08 11:17:25.582 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-jkk4yywi-d89b2.ci.azure.devcluster.openshift.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Jun 08 11:17:26.581 - 246s E openshift-apiserver OpenShift API is not responding to GET requests
Jun 08 11:21:33.481 I openshift-apiserver OpenShift API started responding to GET requests

[1]: https://search.svc.ci.openshift.org/?name=release-openshift-.*azure.*upgrade&search=API%20was%20unreachable%20during%20disruption%20for%20at%20least
[2]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.6/300
[3]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.6/300/artifacts/e2e-azure-upgrade/junit/junit_upgrade_1591615873.xml
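
For anyone triaging similar jobs, the percentage in the summary lines quoted in this comment can be rechecked by hand. The following is a minimal sketch, not part of the origin test suite: the helper names (to_seconds, parse_summary) are hypothetical, and it assumes the durations are whole-second values in the h/m/s form seen above (for example 5m52s); fractional seconds are not handled.

```python
import re

# Matches the summary wording quoted in this comment, e.g.
# "... unreachable during disruption for at least 5m52s of 35m41s (16%): ..."
SUMMARY_RE = re.compile(
    r"unreachable during disruption for at least "
    r"(?P<down>\S+) of (?P<total>\S+) \((?P<pct>\d+)%\)"
)

# Assumption: durations look like "1h2m3s", "35m41s" or "102s" (whole seconds only).
DURATION_RE = re.compile(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?")


def to_seconds(text):
    """Convert a duration such as '5m52s' into a number of seconds."""
    match = DURATION_RE.fullmatch(text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def parse_summary(line):
    """Return (downtime_seconds, total_seconds, reported_percent) for a summary line."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError("no disruption summary found in line")
    return (
        to_seconds(match["down"]),
        to_seconds(match["total"]),
        int(match["pct"]),
    )


if __name__ == "__main__":
    line = (
        "Jun  8 11:25:56.402: API was unreachable during disruption "
        "for at least 5m52s of 35m41s (16%):..."
    )
    down, total, reported = parse_summary(line)
    print(f"{down}s of {total}s = {100 * down / total:.1f}% (reported {reported}%)")
```

For the Kubernetes API line this gives 352s of 2141s, roughly 16.4%, which is consistent with the reported 16%. The two long "not responding" intervals alone (102s and 247s) add up to 349s, so the rest of the reported downtime presumably comes from the short failure around 11:03:35.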

Comment 2 Stefan Schimanski 2020-06-18 11:39:49 UTC
Work in progress.

Comment 5 Stefan Schimanski 2020-08-21 13:58:37 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1868158 will hopefully bring improvements for this umbrella bug.

Comment 6 Stefan Schimanski 2020-09-11 15:15:56 UTC
*** Bug 1868735 has been marked as a duplicate of this bug. ***

Comment 7 Stefan Schimanski 2020-09-11 15:19:53 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1868158 is on QA. We are watching how Azure improves (hopefully).

Comment 8 Stefan Schimanski 2020-10-02 09:22:51 UTC
WIP

Comment 9 Michal Fojtik 2020-11-01 10:12:06 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 10 Stefan Schimanski 2021-01-18 12:43:59 UTC
*** Bug 1916902 has been marked as a duplicate of this bug. ***

Comment 11 Michal Fojtik 2021-01-18 12:58:30 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 12 Neelesh Agrawal 2021-01-22 19:11:37 UTC
*** Bug 1915235 has been marked as a duplicate of this bug. ***

Comment 13 Xingxing Xia 2021-01-25 06:51:13 UTC
Bug 1915235 was closed as DUP of this bug, but it had the keywords, so adding the keywords to this bug.

(PS: I'm not sure if 1915235 is same issue, though. And I found http://virt-openshift-05.lab.eng.nay.redhat.com/buildcorp/upgrade_CI/9637/console reproduced 1915235 but when I rebuilt it with same matrix via https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/upgrade_CI/9930/console, can't reproduce 1915235, only hit bug 1919778)

Comment 14 Xingxing Xia 2021-01-26 03:02:46 UTC
Status update: bug 1915235 was reopened, comment bug 1915235#c15 was posted, and an MCO fix was added to it. Thus removing the keywords of bug 1915235 that were added to this bug in comment 13.

Comment 15 Francesco Giudici 2021-01-27 08:48:03 UTC
Added environment tracking for CI from bug #1916902, as it has been closed as a duplicate of this bug.

Comment 16 Michal Fojtik 2021-02-26 09:07:07 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 17 Doug Hellmann 2021-07-02 14:01:09 UTC
This problem is causing periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-ovn-upgrade to fail reliably.

Comment 18 Michal Fojtik 2021-07-02 14:13:30 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 19 Stefan Schimanski 2021-07-02 14:33:03 UTC
@Doug see my description:

  This is not supposed to happen if graceful termination and LB endpoint reconciliation by the cloud provider work correctly.

If this is a problem only on Azure, and we had an Azure ticket open some time ago, the Splat team needs to engage there.

Comment 20 Michal Fojtik 2021-08-01 14:47:09 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 21 Stefan Schimanski 2021-08-25 10:37:54 UTC
*** Bug 1997057 has been marked as a duplicate of this bug. ***

Comment 22 Michal Fojtik 2021-08-25 11:15:25 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 23 Joe Lanford 2021-08-30 14:07:07 UTC
It appears that this bug is causing all PRs in `openshift/oc` to fail the `ci/prow/e2e-agnostic-cmd` test.

Will this be resolved in time for blocked BZs to merge before code freeze? Otherwise, it seems like we may need to selectively /override any blocked PRs that are unrelated to this test (e.g. my sqlite catalog deprecation warning: https://github.com/openshift/oc/pull/908)

Comment 24 Michal Fojtik 2021-09-29 14:30:07 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.