Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1790704

Summary: [backport 4.2] RouteHealthDegraded: failed to GET route: dial tcp <ip>:443: connect: connection refused because Load IP missing from node iptables rules
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: Networking
Assignee: Aniket Bhat <anbhat>
Networking sub component: openshift-sdn
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA
Docs Contact:
Severity: urgent
Priority: urgent
CC: aconstan, adam.kaplan, anbhat, aos-bugs, bbennett, bleanhar, bpeterse, ccoleman, cdc, deads, dmace, dmoessne, jchaloup, jiajliu, jlebon, kgarriso, lsm5, mfojtik, obulatov, pmuller, rbrattai, sdodson, spadgett, weliang, wking, wsun, yinzhou, zzhao
Version: 4.2.z
Target Milestone: ---
Target Release: 4.2.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1781763
Environment:
Last Closed: 2020-02-12 12:16:16 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1781763
Bug Blocks:

Description W. Trevor King 2020-01-14 00:17:20 UTC
+++ This bug was initially created as a clone of Bug #1781763 +++

+++ This bug was initially created as a clone of Bug #1765280 +++

Description of problem:

The authentication operator will sometimes report the following degraded condition:

    RouteHealthDegraded: failed to GET route: dial tcp <ip>:443: connect: connection refused

Observed on the following platforms in CI over the past 14 days: gcp

The nature of the error (the address looks like an external IP) and the fact that it has only been observed on GCP seem like clues.
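Per the bug summary, the suspected cause is the load-balancer IP being absent from the node's iptables rules. A minimal way to check this on an affected node would be `iptables-save | grep <lb-ip>`; the sketch below is a self-contained stand-in for that check, using the IP from the error above and a hypothetical KUBE-SERVICES rule line (the `KUBE-FW-EXAMPLE` chain name is invented for illustration):

```shell
# On an affected node, a healthy setup would show KUBE-SERVICES/KUBE-FW-*
# entries for the route's load-balancer IP, e.g.:
#
#   iptables-save | grep 34.74.190.39
#
# Self-contained stand-in, with a hypothetical rule line:
lb_ip='34.74.190.39'
sample_rules='-A KUBE-SERVICES -d 34.74.190.39/32 -p tcp -m tcp --dport 443 -j KUBE-FW-EXAMPLE'

# grep needs `--` so the pattern starting with "-d" is not parsed as an option
if printf '%s\n' "$sample_rules" | grep -q -- "-d ${lb_ip}/32"; then
  echo "LB IP present in iptables rules"
else
  echo "LB IP missing from iptables rules (symptom of this bug)"
fi
```

An empty result from the real `iptables-save | grep` on a node would match the "connection refused" symptom, since traffic to the load-balancer IP would never be redirected to the service endpoints.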

...

In 4.2.13 -> 4.3.0-rc.0 CI today (also on GCP) [1]:

      {
        "type": "Failing",
        "status": "True",
        "lastTransitionTime": "2020-01-13T13:48:11Z",
        "reason": "ClusterOperatorNotAvailable",
        "message": "Cluster operator authentication is still updating"
      },
      {
        "type": "Progressing",
        "status": "True",
        "lastTransitionTime": "2020-01-13T13:21:48Z",
        "reason": "ClusterOperatorNotAvailable",
        "message": "Unable to apply 4.3.0-rc.0: the cluster operator authentication has not yet successfully rolled out"
      },

with [2]:

  - lastTransitionTime: "2020-01-13T13:33:02Z"
    message: 'RouteHealthDegraded: failed to GET route: dial tcp 34.74.190.39:443:
      connect: connection refused'
    reason: RouteHealthDegradedFailedGet
    status: "True"
    type: Degraded

And at that time the network operator is still running [3]:

  versions:
  - name: operator
    version: 4.2.13

so I guess this still needs to be cloned back to 4.2.z.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/214
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/214/artifacts/e2e-gcp-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-d24ac732f2fd86150091410623d388ad78196ad7f8072696e85ceaaccb187759/cluster-scoped-resources/config.openshift.io/clusteroperators/authentication.yaml

Comment 4 Alexander Constantinescu 2020-01-22 14:49:15 UTC
*** Bug 1789583 has been marked as a duplicate of this bug. ***

Comment 5 Weibin Liang 2020-01-22 14:59:16 UTC
No authentication failure found when deploying on GCP with 4.4.0-0.nightly-2019-12-13-170401


[root@dhcp-41-193 FILE]# oc get nodes
NAME                                             STATUS   ROLES    AGE   VERSION
qe-wel-m8szx-m-0.c.openshift-qe.internal         Ready    master   25m   v1.14.6+c383847f6
qe-wel-m8szx-m-1.c.openshift-qe.internal         Ready    master   25m   v1.14.6+c383847f6
qe-wel-m8szx-m-2.c.openshift-qe.internal         Ready    master   25m   v1.14.6+c383847f6
qe-wel-m8szx-w-a-kljqn.c.openshift-qe.internal   Ready    worker   14m   v1.14.6+c383847f6
qe-wel-m8szx-w-b-cprzx.c.openshift-qe.internal   Ready    worker   14m   v1.14.6+c383847f6
[root@dhcp-41-193 FILE]# oc get clusteroperator | grep authentication
authentication                             4.2.0-0.nightly-2020-01-22-023656   True        False         False      8m23s
[root@dhcp-41-193 FILE]#
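Beyond grepping the clusteroperator listing, the Degraded condition can be inspected directly. On a live cluster that would be something like the `oc` jsonpath query in the comment below (the jsonpath expression is my assumption, not from the original report); the runnable portion is a self-contained stand-in that parses the condition fields quoted from the must-gather earlier in this bug:

```shell
# On a live cluster one could run (assumed jsonpath syntax):
#
#   oc get clusteroperator authentication \
#     -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}'
#
# and expect "False" once the fix is in place. Stand-in: parse the condition
# fields as they appear in the must-gather YAML quoted above.
condition='reason: RouteHealthDegradedFailedGet
status: "True"
type: Degraded'

# Extract the value of the "status" field, stripping the surrounding quotes
status=$(printf '%s\n' "$condition" | awk -F': ' '/^status:/ {gsub(/"/, "", $2); print $2}')
echo "Degraded status: ${status}"
```

In the failing upgrade run above this condition reads `True`; after the errata it should report `False`.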

Comment 7 errata-xmlrpc 2020-02-12 12:16:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0395

Comment 8 W. Trevor King 2021-04-05 17:46:31 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475