Bug 1855055
| Summary: | 4.4.11->4.5.rc7 upgrade fails with console route not reachable for health check | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Mike Fiedler <mifiedle> |
| Component: | Networking | Assignee: | Ben Bennett <bbennett> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | hongli, xtian, yanpzhan |
| Version: | 4.5 | Keywords: | Upgrades |
| Target Milestone: | --- | | |
| Target Release: | 4.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-07-09 07:08:17 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Mike Fiedler, 2020-07-08 19:16:11 UTC)
Also seen upgrading 4.3.27 -> 4.4.11 -> 4.5.0.rc7 for the profile "UPI on Azure with RHEL7.8 (FIPS off) & Etcd Encryption on".

Also reproduced in 4.5.11 -> 4.5.0.rc7 for the profile "Disconnected UPI on OSP13 with RHCOS & RHEL7.8 (FIPS off)".

This is not simply a console issue; I think there is a networking issue. From the console pod log, the request to oauth failed, and in the oauth and dns pod logs there are errors about timeouts and connection refused:

dns pod log:

    2020-07-08T14:15:37.218659014-04:00 [ERROR] plugin/errors: 2 quay.io. A: read udp 10.130.2.3:40336->192.168.2.126:53: i/o timeout
    2020-07-08T14:15:53.247144697-04:00 [ERROR] plugin/errors: 2 quay.io. AAAA: read udp 10.130.2.3:42965->192.168.2.126:53: i/o timeout
    2020-07-08T14:16:21.315802564-04:00 [ERROR] plugin/errors: 2 quay.io. A: read udp 10.130.2.3:41473->192.168.2.126:53: i/o timeout
    2020-07-08T14:17:27.512081613-04:00 [ERROR] plugin/errors: 2 quay.io. A: read udp 10.130.2.3:56983->192.168.2.126:53: i/o timeout
    2020-07-08T14:17:32.512434661-04:00 [ERROR] plugin/errors: 2 quay.io. AAAA: read udp 10.130.2.3:44750->192.168.2.126:53: i/o timeout
    2020-07-08T14:17:38.531933356-04:00 [ERROR] plugin/errors: 2 quay.io. A: read udp 10.130.2.3:57556->192.168.2.126:53: i/o timeout
    2020-07-08T14:17:49.555756251-04:00 [ERROR] plugin/errors: 2 quay.io. AAAA: read udp 10.130.2.3:50139->192.168.2.126:53: i/o timeout
    2020-07-08T14:18:27.652333915-04:00 [ERROR] plugin/errors: 2 quay.io. A: read udp 10.130.2.3:38392->192.168.2.126:53: i/o timeout
    2020-07-08T14:18:27.652333915-04:00 [ERROR] plugin/errors: 2 quay.io. AAAA: read udp 10.130.2.3:51525->192.168.2.126:53: i/o timeout
    2020-07-08T14:19:06.82095886-04:00 [ERROR] plugin/errors: 2 quay.io. A: read udp 10.130.2.3:33067->192.168.2.126:53: i/o timeout
    2020-07-08T14:19:17.97797534-04:00 [ERROR] plugin/errors: 2 quay.io. AAAA: read udp 10.130.2.3:59760->192.168.2.126:53: i/o timeout
    2020-07-08T14:19:22.979214064-04:00 [ERROR] plugin/errors: 2 quay.io. AAAA: read udp 10.130.2.3:38789->192.168.2.126:53: i/o timeout
    2020-07-08T14:19:51.065620448-04:00 [ERROR] plugin/errors: 2 quay.io. AAAA: read udp 10.130.2.3:46334->192.168.2.126:53: i/o timeout
    2020-07-08T14:19:51.065771132-04:00 [ERROR] plugin/errors: 2 quay.io. A: read udp 10.130.2.3:58715->192.168.2.126:53: i/o timeout

===========================================

oauth pod log:

    2020-07-08T18:00:05.833632973Z E0708 18:00:05.833558 1 reflector.go:382] k8s.io/client-go.2/tools/cache/reflector.go:125: Failed to watch *v1.ConfigMap: Get https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dextension-apiserver-authentication&resourceVersion=142622&timeout=8m25s&timeoutSeconds=505&watch=true: dial tcp 172.30.0.1:443: connect: connection refused
    2020-07-08T18:00:05.845414673Z E0708 18:00:05.845358 1 reflector.go:382] k8s.io/client-go.2/tools/cache/reflector.go:125: Failed to watch *v1.ConfigMap: Get https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dextension-apiserver-authentication&resourceVersion=141655&timeout=9m1s&timeoutSeconds=541&watch=true: dial tcp 172.30.0.1:443: connect: connection refused
    2020-07-08T18:00:06.069692896Z E0708 18:00:06.069578 1 webhook.go:111] Failed to make webhook authenticator request: Post https://172.30.0.1:443/apis/authentication.k8s.io/v1/tokenreviews: dial tcp 172.30.0.1:443: connect: connection refused
    2020-07-08T18:00:06.069759844Z E0708 18:00:06.069707 1 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, Post https://172.30.0.1:443/apis/authentication.k8s.io/v1/tokenreviews: dial tcp 172.30.0.1:443: connect: connection refused]

==========================

console pod log:

    2020-07-08T18:19:53.699110874Z 2020-07-08T18:19:53Z auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ugdci08204808.qe.devcluster.openshift.com/oauth/token failed: Head https://oauth-openshift.apps.ugdci08204808.qe.devcluster.openshift.com: dial tcp 192.168.0.7:443: connect: no route to host
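The errors above point at pod-level networking rather than the console itself: the CoreDNS pod (10.130.2.3) cannot reach its upstream resolver (192.168.2.126), and the oauth pod cannot reach the kubernetes service VIP (172.30.0.1). A minimal sketch of how one might start narrowing that down with a standard `oc` client; the pod IP comes from the dns log above, while `<node-name>` is a placeholder, not a value from this bug:

```shell
# Map the CoreDNS pod that is timing out (pod IP 10.130.2.3 in the log) to its node
oc -n openshift-dns get pods -o wide | grep 10.130.2.3

# Check the openshift-sdn pods on that node; both the DNS upstream timeouts and the
# "connection refused" to 172.30.0.1:443 suggest broken pod networking on specific nodes
oc -n openshift-sdn get pods -o wide --field-selector spec.nodeName=<node-name>

# Confirm the service VIP the oauth pod is failing to reach
oc -n default get svc kubernetes
```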
(In reply to Mike Fiedler from comment #2)
> Also seen upgrading 4.3.27 -> 4.4.11 -> 4.5.0.rc7 for the profile "UPI on Azure with RHEL7.8 (FIPS off) & Etcd Encryption on"

Hi @Mike, I saw you said this issue was also reproduced on the Azure platform; not sure if you have the must-gather logs. I suspect it may be another bug we hit yesterday on Azure: https://bugzilla.redhat.com/show_bug.cgi?id=1854383#c3

From the must-gather we can see that one of the router pods was scheduled to a RHEL worker:

    name: router-default-789d8bf48-v29qg
    nodeName: ugdci08204808-xtxvb-rhel-0

It should be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1848945

(In reply to zhaozhanqi from comment #5)
> (In reply to Mike Fiedler from comment #2)
> > Also seen upgrading 4.3.27 -> 4.4.11 -> 4.5.0.rc7 for the profile "UPI on Azure with RHEL7.8 (FIPS off) & Etcd Encryption on"
>
> Hi @Mike, I saw you said this issue was also reproduced on the Azure platform; not sure if you have the must-gather logs. I suspect it may be another bug we hit yesterday on Azure: https://bugzilla.redhat.com/show_bug.cgi?id=1854383#c3

If we had hit https://bugzilla.redhat.com/show_bug.cgi?id=1854383#c3, the ingress operator should be Degraded.

*** This bug has been marked as a duplicate of bug 1848945 ***
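For reference, a hedged sketch of how the duplicate diagnosis could be verified on a cluster in this state, again assuming a standard `oc` client; the pod and node names on another cluster will differ from those quoted above:

```shell
# Where did the default router pods get scheduled?
oc -n openshift-ingress get pods -o wide

# Which workers are RHEL vs RHCOS (the osImage column shows the node OS)
oc get nodes -o custom-columns=NAME:.metadata.name,OS:.status.nodeInfo.osImage

# Per the comment above, bug 1854383 would show up as a Degraded ingress operator
oc get clusteroperator ingress
```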