Description of problem:

Upgrading an OSP 13 cluster from 4.4.11 -> 4.5.0.rc7 stalled/failed on console operator upgrade after 26/32 operators upgraded successfully. The error in the console operator was:

  RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.ugdci08204808.qe.devcluster.openshift.com/health): Get https://console-openshift-console.apps.ugdci08204808.qe.devcluster.openshift.com/health: dial tcp 192.168.0.7:443: connect: no route to host

The authentication operator was also degraded with:

  RouteHealthDegraded: failed to GET route: dial tcp 192.168.0.7:443: connect: no route to host

Version-Release number of selected component (if applicable):
4.4.11 to 4.5.rc7

How reproducible:
Unknown. Will start a new run. Will link must-gather in a private comment.

NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.11       True        True          True       4h44m
cloud-credential                           4.5.0-rc.7   True        False         False      5h13m
cluster-autoscaler                         4.5.0-rc.7   True        False         False      4h58m
config-operator                            4.5.0-rc.7   True        False         False      115m
console                                    4.5.0-rc.7   False       True          True       102m
csi-snapshot-controller                    4.5.0-rc.7   True        False         False      146m
dns                                        4.4.11       True        False         False      5h2m
etcd                                       4.5.0-rc.7   True        False         False      5h2m
image-registry                             4.5.0-rc.7   True        False         False      4h54m
ingress                                    4.5.0-rc.7   True        False         False      4h53m
insights                                   4.5.0-rc.7   True        False         False      4h59m
kube-apiserver                             4.5.0-rc.7   True        False         False      5h2m
kube-controller-manager                    4.5.0-rc.7   True        False         False      5h1m
kube-scheduler                             4.5.0-rc.7   True        False         False      5h1m
kube-storage-version-migrator              4.5.0-rc.7   True        False         False      146m
machine-api                                4.5.0-rc.7   True        False         False      4h59m
machine-approver                           4.5.0-rc.7   True        False         False      104m
machine-config                             4.4.11       True        False         False      155m
marketplace                                4.5.0-rc.7   True        False         False      103m
monitoring                                 4.5.0-rc.7   True        False         False      97m
network                                    4.4.11       True        False         False      5h4m
node-tuning                                4.5.0-rc.7   True        False         False      104m
openshift-apiserver                        4.5.0-rc.7   True        False         False      151m
openshift-controller-manager               4.5.0-rc.7   True        False         False      102m
openshift-samples                          4.5.0-rc.7   True        False         False      101m
operator-lifecycle-manager                 4.5.0-rc.7   True        False         False      5h3m
operator-lifecycle-manager-catalog         4.5.0-rc.7   True        False         False      5h3m
operator-lifecycle-manager-packageserver   4.5.0-rc.7   True        False         False      18m
service-ca                                 4.5.0-rc.7   True        False         False      5h5m
service-catalog-apiserver                  4.4.11       True        False         False      5h5m
service-catalog-controller-manager         4.4.11       True        False         False      5h5m
storage                                    4.5.0-rc.7   True        False         False      105m
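For anyone re-running this, the status above and the operators' reported errors can be gathered with standard oc commands. This is only a sketch; the namespace and deployment names (openshift-console-operator/console-operator, openshift-console/console) are the usual defaults and were not taken from this must-gather:

  # cluster-wide operator status (source of the table above)
  oc get clusteroperators

  # full status conditions of the two degraded operators
  oc get clusteroperator console -o yaml
  oc get clusteroperator authentication -o yaml

  # console operator and console pod logs (deployment names assumed)
  oc -n openshift-console-operator logs deployment/console-operator
  oc -n openshift-console logs deployment/console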
Also seen upgrading 4.3.27-> 4.4.11-> 4.5.0.rc7 for the profile "UPI on Azure with RHEL7.8 (FIPS off) & Etcd Encryption on"
Also reproduced in 4.5.11->4.5.0.rc7 for profile "Disconnected UPI on OSP13 with RHCOS & RHEL7.8(FIPS off)"
This is not simply a console issue; I think there is a networking issue. The console pod log shows that requests to oauth failed, and the oauth and dns pod logs contain timeout and connection-refused errors:

dns pod log:

2020-07-08T14:15:37.218659014-04:00 [ERROR] plugin/errors: 2 quay.io. A: read udp 10.130.2.3:40336->192.168.2.126:53: i/o timeout
2020-07-08T14:15:53.247144697-04:00 [ERROR] plugin/errors: 2 quay.io. AAAA: read udp 10.130.2.3:42965->192.168.2.126:53: i/o timeout
2020-07-08T14:16:21.315802564-04:00 [ERROR] plugin/errors: 2 quay.io. A: read udp 10.130.2.3:41473->192.168.2.126:53: i/o timeout
2020-07-08T14:17:27.512081613-04:00 [ERROR] plugin/errors: 2 quay.io. A: read udp 10.130.2.3:56983->192.168.2.126:53: i/o timeout
2020-07-08T14:17:32.512434661-04:00 [ERROR] plugin/errors: 2 quay.io. AAAA: read udp 10.130.2.3:44750->192.168.2.126:53: i/o timeout
2020-07-08T14:17:38.531933356-04:00 [ERROR] plugin/errors: 2 quay.io. A: read udp 10.130.2.3:57556->192.168.2.126:53: i/o timeout
2020-07-08T14:17:49.555756251-04:00 [ERROR] plugin/errors: 2 quay.io. AAAA: read udp 10.130.2.3:50139->192.168.2.126:53: i/o timeout
2020-07-08T14:18:27.652333915-04:00 [ERROR] plugin/errors: 2 quay.io. A: read udp 10.130.2.3:38392->192.168.2.126:53: i/o timeout
2020-07-08T14:18:27.652333915-04:00 [ERROR] plugin/errors: 2 quay.io. AAAA: read udp 10.130.2.3:51525->192.168.2.126:53: i/o timeout
2020-07-08T14:19:06.82095886-04:00 [ERROR] plugin/errors: 2 quay.io. A: read udp 10.130.2.3:33067->192.168.2.126:53: i/o timeout
2020-07-08T14:19:17.97797534-04:00 [ERROR] plugin/errors: 2 quay.io. AAAA: read udp 10.130.2.3:59760->192.168.2.126:53: i/o timeout
2020-07-08T14:19:22.979214064-04:00 [ERROR] plugin/errors: 2 quay.io. AAAA: read udp 10.130.2.3:38789->192.168.2.126:53: i/o timeout
2020-07-08T14:19:51.065620448-04:00 [ERROR] plugin/errors: 2 quay.io. AAAA: read udp 10.130.2.3:46334->192.168.2.126:53: i/o timeout
2020-07-08T14:19:51.065771132-04:00 [ERROR] plugin/errors: 2 quay.io. A: read udp 10.130.2.3:58715->192.168.2.126:53: i/o timeout

===========================================

oauth pod log:

2020-07-08T18:00:05.833632973Z E0708 18:00:05.833558 1 reflector.go:382] k8s.io/client-go.2/tools/cache/reflector.go:125: Failed to watch *v1.ConfigMap: Get https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dextension-apiserver-authentication&resourceVersion=142622&timeout=8m25s&timeoutSeconds=505&watch=true: dial tcp 172.30.0.1:443: connect: connection refused
2020-07-08T18:00:05.845414673Z E0708 18:00:05.845358 1 reflector.go:382] k8s.io/client-go.2/tools/cache/reflector.go:125: Failed to watch *v1.ConfigMap: Get https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dextension-apiserver-authentication&resourceVersion=141655&timeout=9m1s&timeoutSeconds=541&watch=true: dial tcp 172.30.0.1:443: connect: connection refused
2020-07-08T18:00:06.069692896Z E0708 18:00:06.069578 1 webhook.go:111] Failed to make webhook authenticator request: Post https://172.30.0.1:443/apis/authentication.k8s.io/v1/tokenreviews: dial tcp 172.30.0.1:443: connect: connection refused
2020-07-08T18:00:06.069759844Z E0708 18:00:06.069707 1 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, Post https://172.30.0.1:443/apis/authentication.k8s.io/v1/tokenreviews: dial tcp 172.30.0.1:443: connect: connection refused]

==========================

console pod log:

2020-07-08T18:19:53.699110874Z 2020-07-08T18:19:53Z auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ugdci08204808.qe.devcluster.openshift.com/oauth/token failed: Head https://oauth-openshift.apps.ugdci08204808.qe.devcluster.openshift.com: dial tcp 192.168.0.7:443: connect: no route to host
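To confirm this is a routing/DNS problem rather than a console problem, the failing paths in the logs above can be probed directly from a node. A rough sketch only, reusing the hostnames and the 192.168.2.126 resolver from the logs above and assuming curl and dig are available on the host:

  # probe the route that the console and authentication operators health-check
  oc debug node/<any-worker> -- chroot /host curl -kIv https://console-openshift-console.apps.ugdci08204808.qe.devcluster.openshift.com/health

  # probe the upstream resolver that CoreDNS is timing out against
  oc debug node/<any-worker> -- chroot /host dig @192.168.2.126 quay.io

  # see which nodes the dns and router pods are running on
  oc -n openshift-dns get pods -o wide
  oc -n openshift-ingress get pods -o wide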
(In reply to Mike Fiedler from comment #2)
> Also seen upgrading 4.3.27-> 4.4.11-> 4.5.0.rc7 for the profile "UPI on
> Azure with RHEL7.8 (FIPS off) & Etcd Encryption on"

Hi @Mike, I saw you said this issue was also reproduced on the Azure platform; do you have the must-gather logs for that run? I suspect it may be a different bug, the one we met yesterday on Azure: https://bugzilla.redhat.com/show_bug.cgi?id=1854383#c3
From must-gather, we can find that one of the router pods was scheduled to a RHEL worker:

  name: router-default-789d8bf48-v29qg
  nodeName: ugdci08204808-xtxvb-rhel-0

This should be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1848945
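The same placement can be checked on a live cluster without must-gather; a minimal sketch (the OS-IMAGE column of oc get nodes distinguishes RHEL from RHCOS workers):

  # node each router pod landed on
  oc -n openshift-ingress get pods -o wide

  # which workers are RHEL vs RHCOS
  oc get nodes -o wide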
(In reply to zhaozhanqi from comment #5)
> (In reply to Mike Fiedler from comment #2)
> > Also seen upgrading 4.3.27-> 4.4.11-> 4.5.0.rc7 for the profile "UPI on
> > Azure with RHEL7.8 (FIPS off) & Etcd Encryption on"
>
> Hi @Mike, I saw you said this issue was also reproduced on the Azure
> platform; do you have the must-gather logs for that run? I suspect it may
> be a different bug, the one we met yesterday on Azure:
> https://bugzilla.redhat.com/show_bug.cgi?id=1854383#c3

If this had hit https://bugzilla.redhat.com/show_bug.cgi?id=1854383#c3, the ingress operator would be Degraded.
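A quick way to distinguish the two bugs is to look at the ingress operator's Degraded condition directly; a minimal sketch:

  oc get clusteroperator ingress
  oc get clusteroperator ingress -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}'

In the operator table in the description above, ingress shows DEGRADED=False.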
*** This bug has been marked as a duplicate of bug 1848945 ***