Description of problem:
OpenShift Console becomes unavailable for a long time (~30 minutes) during a z-stream upgrade of an OpenShift Dedicated cluster with 44 nodes (6 master, 2 infra, 36 compute).

Version-Release number of selected component (if applicable):
OCP 4.2.12

How reproducible:

Steps to Reproduce:
1. Set up an OCP cluster with 44 nodes.
2. Start a z-stream upgrade.
3. Try accessing the cluster console during the upgrade process.

Actual results:
The cluster console is unavailable for ~30 minutes.

Expected results:
The cluster console should be accessible during the cluster upgrade.

Additional info:
Without cluster console access, we cannot get CLI tokens for OpenShift.
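For anyone trying to reproduce step 3, a rough sketch of how the console outage can be observed from outside the cluster. The target version and route hostname below are placeholders, not values taken from the affected cluster:

  # Kick off the z-stream upgrade (target version is illustrative)
  $ oc adm upgrade --to 4.2.12

  # Watch the console and authentication operators' reported availability during the upgrade
  $ watch -n 10 "oc get clusteroperator console authentication"

  # Poll the console route directly (hostname is a placeholder for the cluster's console route)
  $ curl -sk -o /dev/null -w '%{http_code}\n' https://console-openshift-console.apps.example.com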
Can you provide further information on this? For example, what are the Console Operator, Authentication Operator, and Ingress Operator reporting? Is there any information in the Console Pod log?
Are you able to reproduce this latency for the 'oc' command line as well? I'm wondering if it is actually the API server itself. Also, is 6 masters the usual configuration for a cluster of this size?
Sorry, it is 3 master nodes and not 6. I am not able to provide any logs from the cluster because tenants of OpenShift Dedicated do not have access to them. However, the OpenShift Dedicated SRE team may be able to provide any logs you need.
Ok great, if you can connect with them about the cluster you were working on and get us some more info, that will help us know what to do. Thanks!
Both operators have hundreds of lines of this in the logs:

W0128 20:03:44.115364 1 reflector.go:289] github.com/openshift/client-go/route/informers/externalversions/factory.go:101: watch of *v1.Route ended with: The resourceVersion for the provided watch is too old.
W0128 20:03:47.445763 1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 40979893 (40984697)
W0128 20:04:17.985485 1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 40981817 (40984964)
W0128 20:06:09.424946 1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.Deployment ended with: too old resource version: 40978581 (40981175)
W0128 20:06:24.388878 1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 40983221 (40986137)
W0128 20:06:55.448181 1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 40983722 (40986413)

This is a normal message to have occasionally; it is not normal to have a constant stream. Is the API happy?
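To help answer the "is the API happy?" question on the next upgrade, a couple of stock checks that could be run alongside the operators (nothing here is specific to this cluster):

  # Overall operator health, including kube-apiserver
  $ oc get clusteroperators

  # Direct health probe against the API server
  $ oc get --raw /healthz

  # Repeat the probe during the upgrade window to spot gaps in API availability
  $ while true; do
      date -u +%T
      oc get --raw /healthz --request-timeout=5s || echo "API unavailable"
      sleep 5
    done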
Marking "needs info" again as we don't have anything definitive at this point, other than it seems like the operators are not able to reach the API for an extended amount of time. tparikh gathering more information on the next upgrade (tues) seems like the best path forward.
bpeterse, what specific info would you collect as part of the next upgrade? We can grab a must-gather after the upgrade; is there anything else that would be helpful?
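A rough sketch of what we could capture right after the upgrade completes; the destination directory and deployment names below are examples and may need adjusting for this cluster:

  # Full must-gather once the upgrade completes
  $ oc adm must-gather --dest-dir=./must-gather-post-upgrade

  # Console and ingress operator logs covering the upgrade window
  $ oc logs -n openshift-console-operator deployment/console-operator > console-operator.log
  $ oc logs -n openshift-ingress-operator deployment/ingress-operator > ingress-operator.log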
Hi All,

I am seeing a slightly different but related issue: 100% downtime for all routes during an OpenShift upgrade, during what looks to be the router redeployment and image change.

What have I tested? I have been running external performance tests for both 3scale [1] and AMQ Online [2] against an RHMI OSD staging cluster. The performance tests do not apply load; they send requests at a consistent rate to measure downtime during an upgrade.

How did I upgrade the cluster? A normal upgrade from 4.2.8 to 4.2.9 using "oc adm upgrade --to 4.2.9".

What would I like you to test? Naveen asked in the last comment what info would be good to gather during an upgrade. I would like to see a similar test run against a cluster that is being upgraded, to see if there is downtime caused by the router when the cluster is upgraded.

Note: See the AMQ Online and 3scale attachments for the measured downtime.
Note: Events from the two timeframes have been added as screenshots.

[1] https://github.com/3scale/perftest-toolkit
[2] https://github.com/EnMasseProject/external-test-clients
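For context, a minimal sketch of the kind of probe described above (the route URL and log file are placeholders; the linked test clients do considerably more than this):

  # Probe a route once a second and record every failed or non-200 response.
  # The URL is a placeholder; the real tests hit the 3scale and AMQ Online routes.
  $ while true; do
      ts=$(date -u '+%Y-%m-%dT%H:%M:%SZ')
      code=$(curl -sk -o /dev/null -m 5 -w '%{http_code}' https://my-app.apps.example.com/) || code=000
      [ "$code" = "200" ] || echo "$ts HTTP $code" >> route-downtime.log
      sleep 1
    done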
Created attachment 1662156 [details] Cluster Upgrade 3scale downtime
Created attachment 1662157 [details] Cluster Upgrade AMQ Online downtime
Created attachment 1662158 [details] openshift-ingress namespace events 12:04
Created attachment 1662159 [details] openshift-ingress namespace events 12:11
This is no longer a console issue, passing it over to Routing.
dmace, this is a critical issue: it means downtime during every OSD 4.x upgrade, which happens weekly. Could you tell me when it will be looked into? I am also free to talk about the issue over BJ at any time.
Ben and Clayton, is this a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1785457?
Spoke with Clayton, and we decided to be conservative and keep this bug open for now, just in case the SDN fixes coming for 1785457 don't account for all the reports here.

It's possible that once the SDN dust has settled we'll be left more clearly with some known potential ingress disruptions, which we have known to be theoretically possible since 4.1 but which we haven't fully addressed yet (although we've made significant progress). The ingress disruptions I'm referring to are scoped very narrowly to disruptions during ingress-controller rollout [1][2][3], while the SDN issues discovered in https://bugzilla.redhat.com/show_bug.cgi?id=1785457 impact ingress only indirectly.

[1] https://issues.redhat.com/browse/NE-203
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1709958
[3] https://docs.google.com/document/d/1GP17EBWb2bj4fz7dr3QUxK8leZTPD7oCNr9npLDF1wI/edit#heading=h.exa2qjxyht92
*** Bug 1805690 has been marked as a duplicate of this bug. ***
The fix for https://bugzilla.redhat.com/show_bug.cgi?id=1809667 addresses this.
*** This bug has been marked as a duplicate of bug 1809667 ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days