Bug 1845416

Summary: gcp: upgrade kube API disruption in CI – 4.5
Product: OpenShift Container Platform Reporter: Stefan Schimanski <sttts>
Component: kube-apiserver Assignee: Stefan Schimanski <sttts>
Status: CLOSED CURRENTRELEASE QA Contact: Xingxing Xia <xxia>
Severity: high Docs Contact:
Priority: high    
Version: 4.5 CC: aos-bugs, lmohanty, mfojtik, timoran, wking, xxia
Target Milestone: --- Keywords: UpcomingSprint, Upgrades
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1845410 Environment:
Last Closed: 2020-06-19 06:44:01 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1845410, 1845903, 1847876, 1868741    
Bug Blocks: 1845411, 1869788, 1869790    

Description Stefan Schimanski 2020-06-09 07:46:36 UTC
We see unstable upgrades from 4.4 to 4.5, possibly because 4.4 has no gcp-routes fix.

+++ This bug was initially created as a clone of Bug #1845410 +++

We see the Kube API become unavailable during upgrades on GCP.

This is not supposed to happen if graceful termination and LB endpoint reconciliation by the cloud provider work correctly.
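
To illustrate why both halves have to line up, here is a minimal sketch (not taken from any OpenShift component) of the timing argument. It assumes a simplified model: once termination starts, the apiserver fails its health check but keeps serving for `shutdown_delay_s` seconds, and the load balancer drops the endpoint only after `unhealthy_threshold` consecutive failed probes spaced `probe_interval_s` apart. All names and numbers are hypothetical, and probe timeouts and cloud-side propagation delays are ignored.

```python
def dropped_window_seconds(shutdown_delay_s: float,
                           probe_interval_s: float,
                           unhealthy_threshold: int) -> float:
    """Worst-case seconds during which the LB may still send traffic
    to an apiserver that has already stopped serving.

    Simplified, hypothetical model; not the real GCP or kube-apiserver logic.
    """
    # Worst case: a probe succeeded just before termination began, so the
    # first failing probe arrives one full interval later, and the endpoint
    # is removed only after `unhealthy_threshold` consecutive failures.
    lb_removal_s = probe_interval_s * unhealthy_threshold
    # If the server keeps serving at least that long, nothing is dropped.
    return max(0.0, lb_removal_s - shutdown_delay_s)


if __name__ == "__main__":
    # Hypothetical numbers: 10s probes, 3 failures needed.
    print(dropped_window_seconds(70, 10, 3))  # server outlives LB removal
    print(dropped_window_seconds(5, 10, 3))   # server stops too early
```

In this model the API stays reachable only when the graceful termination period is at least as long as the time the load balancer needs to notice the failing endpoint. Any positive result is a window of potential disruption.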

Note: openshift-apiserver APIs are unavailable too if the kube-apiserver is not serving correctly.

This is an umbrella bug, cloned into releases and closed when we are happy with the upgrade stability.

Comment 1 W. Trevor King 2020-06-09 19:35:01 UTC
Bug 1843928 is about backporting a gcp-routes fix to 4.4.  Are you suggesting this bug might be a dup of that one?

Comment 2 Stefan Schimanski 2020-06-10 13:58:07 UTC
@Trevor no. This is a follow-up of Bug 1843928.

Comment 3 Stefan Schimanski 2020-06-10 14:00:06 UTC
Wrong tab. Disregard my previous comment.

Bug 1843928 blocks this one.

Comment 4 Stefan Schimanski 2020-06-16 15:43:15 UTC
There is hope:

- 4.5->4.5 upgrades look good: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/job-history/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci
- 4.6->4.6 upgrades look good: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-origin-installer-e2e-gcp-upgrade-4.6
- 4.5->4.6 upgrades look ok'ish: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-origin-installer-e2e-gcp-upgrade-4.5-stable-to-4.6-ci
- 4.4->4.4 has had green runs since Jun 11: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.4-informing#release-openshift-origin-installer-e2e-gcp-upgrade-4.4, but had been red ever since April 24, when `[4.4] Bug 1822603: GCP-routes: switch to using conntrack instead of route tables` https://gitlab.cee.redhat.com/coreos/redhat-coreos/-/merge_requests/888 merged into RHCOS
- 4.4->4.5 is deeply red: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-informing#release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci

Conclusions:

- the revert of the revert plus fix of ^^ went into 4.4 with https://github.com/openshift/machine-config-operator/pull/1780 on Jun 11. So this matches with the observation ^^.
- the last stable 4.4 release, 4.4.8, is from Jun 11 as well and very probably missed https://github.com/openshift/machine-config-operator/pull/1780, which could explain why the 4.4->4.5 upgrade test above, which uses 4.4-stable as the base version, is deeply red.

Hence, we are waiting for 4.4.9 to be tagged to see whether we get an improvement through https://github.com/openshift/machine-config-operator/pull/1780.

Comment 5 Stefan Schimanski 2020-06-17 10:08:53 UTC
tl;dr: as far as we can see, this is about upgrades from 4.4 only, i.e. an upgrade blocker "only".

Comment 6 Lalatendu Mohanty 2020-06-17 14:40:58 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?
  Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression?
  No, it’s always been like this, we just never noticed
  Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

Comment 7 Scott Dodson 2020-06-17 15:00:11 UTC
Nevermind, this seems to be a tracking bug, unlinking the pr.

Comment 10 W. Trevor King 2020-06-17 17:15:38 UTC
> Nevermind, this seems to be a tracking bug, unlinking the pr.

And moving back to ASSIGNED...

Comment 14 Stefan Schimanski 2020-06-19 06:44:01 UTC
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/job-history/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci recovered to normal error rate. Closing this here.

Note that this is an umbrella bug linked to the actual fixes. No QE necessary.

Comment 15 W. Trevor King 2021-04-05 17:47:47 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel like this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475