Description of problem:
OpenShift Console becomes unavailable for a long time (~30 minutes) during a z-stream upgrade of an OpenShift Dedicated cluster with 44 nodes (6 master, 2 infra, 36 compute).

Version-Release number of selected component (if applicable):
OCP 4.2.12

How reproducible:

Steps to Reproduce:
1. Set up an OCP cluster with 44 nodes.
2. Start a z-stream upgrade.
3. Try accessing the cluster console during the upgrade process.

Actual results:
The cluster console is unavailable for ~30 minutes.

Expected results:
The cluster console should be accessible during the cluster upgrade.

Additional info:
Without cluster console access, we cannot get CLI tokens for OpenShift.
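For anyone trying to reproduce step 3, a rough sketch of how the console outage can be observed from outside the cluster. The target version and route hostname below are placeholders, not values taken from the affected cluster:

  # Kick off the z-stream upgrade (target version is illustrative)
  $ oc adm upgrade --to 4.2.12

  # Watch the console and authentication operators' reported availability during the upgrade
  $ watch -n 10 "oc get clusteroperator console authentication"

  # Poll the console route directly (hostname is a placeholder for the cluster's console route)
  $ curl -sk -o /dev/null -w '%{http_code}\n' https://console-openshift-console.apps.example.com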
Can you provide further information on this? For example, what are the Console Operator, Authentication Operator, and Ingress Operator reporting? Is there any information in the Console Pod log?
Are you able to reproduce this latency for the 'oc' command line as well? I'm wondering if it is actually the API server itself. Also, is 6 masters the usual configuration for a cluster of this size?
Sorry, it is 3 master nodes and not 6. I am not able to provide any logs from the cluster because tenants of OpenShift Dedicated do not have access to them. However, the OpenShift Dedicated SRE team may be able to provide any logs you need.
Ok great, if you can connect with them about the cluster you were working on and get us some more info, that will help us know what to do. Thanks!
Both operators have hundreds of lines of this in the logs:

W0128 20:03:44.115364 1 reflector.go:289] github.com/openshift/client-go/route/informers/externalversions/factory.go:101: watch of *v1.Route ended with: The resourceVersion for the provided watch is too old.
W0128 20:03:47.445763 1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 40979893 (40984697)
W0128 20:04:17.985485 1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 40981817 (40984964)
W0128 20:06:09.424946 1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.Deployment ended with: too old resource version: 40978581 (40981175)
W0128 20:06:24.388878 1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 40983221 (40986137)
W0128 20:06:55.448181 1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 40983722 (40986413)

This is a normal message to have occasionally; it is not normal to have a constant stream. Is the API happy?
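To help answer the "is the API happy?" question on the next upgrade, a couple of stock checks that could be run alongside the operators (nothing here is specific to this cluster):

  # Overall operator health, including kube-apiserver
  $ oc get clusteroperators

  # Direct health probe against the API server
  $ oc get --raw /healthz

  # Repeat the probe during the upgrade window to spot gaps in API availability
  $ while true; do
      date -u +%T
      oc get --raw /healthz --request-timeout=5s || echo "API unavailable"
      sleep 5
    done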
Marking "needs info" again as we don't have anything definitive at this point, other than it seems like the operators are not able to reach the API for an extended amount of time. tparikh gathering more information on the next upgrade (tues) seems like the best path forward.
bpeterse, what specific info would you collect as part of the next upgrade? We can grab a must-gather after the upgrade; is there anything else that would be helpful?
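A rough sketch of what we could capture right after the upgrade completes; the destination directory and deployment names below are examples and may need adjusting for this cluster:

  # Full must-gather once the upgrade completes
  $ oc adm must-gather --dest-dir=./must-gather-post-upgrade

  # Console and ingress operator logs covering the upgrade window
  $ oc logs -n openshift-console-operator deployment/console-operator > console-operator.log
  $ oc logs -n openshift-ingress-operator deployment/ingress-operator > ingress-operator.log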
Hi All,

I am seeing a slightly different but related issue: 100% downtime for all routes during an OpenShift upgrade, during what looks to be the router redeployment and image change.

What have I tested? I have been running external performance tests for both 3scale [1] and AMQ Online [2] against an RHMI OSD staging cluster. The performance tests do not apply load; they send requests at a consistent rate to measure downtime during an upgrade.

How did I upgrade the cluster? A normal upgrade from 4.2.8 to 4.2.9 using "oc adm upgrade --to 4.2.9".

What would I like you to test? Naveen asked in the last comment what info would be good to gather during an upgrade. I would like to see a similar test run against a cluster that is being upgraded, to see if there is downtime caused by the router when the cluster is upgraded.

Note: See the AMQ Online and 3scale attachments for the measured downtime.
Note: Events from the two timeframes have been added as screenshots.

[1] https://github.com/3scale/perftest-toolkit
[2] https://github.com/EnMasseProject/external-test-clients
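For context, a minimal sketch of the kind of probe described above (the route URL and log file are placeholders; the linked test clients do considerably more than this):

  # Probe a route once a second and record every failed or non-200 response.
  # The URL is a placeholder; the real tests hit the 3scale and AMQ Online routes.
  $ while true; do
      ts=$(date -u '+%Y-%m-%dT%H:%M:%SZ')
      code=$(curl -sk -o /dev/null -m 5 -w '%{http_code}' https://my-app.apps.example.com/) || code=000
      [ "$code" = "200" ] || echo "$ts HTTP $code" >> route-downtime.log
      sleep 1
    done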
Created attachment 1662156 [details] Cluster Upgrade 3scale downtime
Created attachment 1662157 [details] Cluster Upgrade AMQ Online downtime
Created attachment 1662158 [details] openshift-ingress namespace events 12:04
Created attachment 1662159 [details] openshift-ingress namespace events 12:11
This is no longer a console issue, passing it over to Routing.
dmace, this is a critical issue: it means downtime during every OSD 4.x upgrade, which happens weekly. Could you tell me when it will be looked into? I am also free to talk about the issue over BJ at any time.
Ben and Clayton, is this a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1785457?
Spoke with Clayton, and we decided to be conservative and keep this bug open for now, just in case the SDN fixes coming for 1785457 don't account for all the reports here.

It's possible that once the SDN dust has settled we'll be left more clearly with some known potential ingress disruptions, which we have known to be theoretically possible since 4.1 but which we haven't fully addressed yet (although we've made significant progress). The ingress disruptions I'm referring to are scoped very narrowly to disruptions during ingress-controller rollout [1][2][3], while the SDN issues discovered in https://bugzilla.redhat.com/show_bug.cgi?id=1785457 impact ingress only indirectly.

[1] https://issues.redhat.com/browse/NE-203
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1709958
[3] https://docs.google.com/document/d/1GP17EBWb2bj4fz7dr3QUxK8leZTPD7oCNr9npLDF1wI/edit#heading=h.exa2qjxyht92
*** Bug 1805690 has been marked as a duplicate of this bug. ***
The fix for https://bugzilla.redhat.com/show_bug.cgi?id=1809667 addresses this.
*** This bug has been marked as a duplicate of bug 1809667 ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days