Bug 1781763 - [backport 4.3] RouteHealthDegraded: failed to GET route: dial tcp <ip>:443: connect: connection refused because Load IP missing from node iptables rules
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.3.0
Assignee: Casey Callendrello
QA Contact: Ross Brattain
URL:
Whiteboard:
Depends On: 1765280
Blocks: 1790704
Reported: 2019-12-10 14:25 UTC by Casey Callendrello
Modified: 2020-01-23 11:18 UTC (History)
CC List: 23 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1765280
Clones: 1790704
Environment:
Last Closed: 2020-01-23 11:18:20 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift sdn pull 81 0 None closed Bug 1781763: [release-4.3] proxy: add handler with same ResyncPeriod as shared informer. 2021-02-18 16:09:58 UTC
Red Hat Product Errata RHBA-2020:0062 0 None None None 2020-01-23 11:18:42 UTC

Description Casey Callendrello 2019-12-10 14:25:32 UTC
+++ This bug was initially created as a clone of Bug #1765280 +++

Description of problem:

The authentication operator will sometimes report the following degraded condition:

    RouteHealthDegraded: failed to GET route: dial tcp <ip>:443: connect: connection refused

Observed on the following platforms in CI over the past 14 days: gcp

The nature of the error (which looks like an external IP) and the fact that it has only been observed on GCP seem like clues.
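One way to narrow this kind of failure down is to probe the route's IP directly and see whether the connection is actively refused or merely times out. The following is a minimal sketch, not the authentication operator's actual health check; the function name `probe_tcp` and its return labels are made up for illustration.

```python
import errno
import socket


def probe_tcp(host: str, port: int = 443, timeout: float = 3.0) -> str:
    """Attempt a TCP connection and classify the outcome.

    Hypothetical helper: the labels it returns are illustrative only.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer (or something in the path) answered the SYN with a reset.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except OSError as exc:
        # Anything else: unreachable network, name resolution failure, etc.
        return f"error: {errno.errorcode.get(exc.errno, exc.errno)}"
```

Running `probe_tcp("<ip>")` from a node and from outside the cluster, with the address taken from the degraded message, would show whether both vantage points see the same result.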




--- Additional comment from Casey Callendrello on 2019-12-10 14:22:45 UTC ---

Fix merged in https://github.com/openshift/openshift-sdn/pull/79. Starting the backport dance.

--- Additional comment from Casey Callendrello on 2019-12-10 14:23:38 UTC ---

meant https://github.com/openshift/sdn/pull/79

Comment 1 Casey Callendrello 2019-12-10 14:36:15 UTC
https://github.com/openshift/sdn/pull/81 filed

Comment 3 Ross Brattain 2019-12-13 01:12:10 UTC
Deployment succeeded on GCP with 4.3.0-0.nightly-2019-12-12-021332

NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.3.0-0.nightly-2019-12-12-021332   True        False         False      6h18m

Comment 4 W. Trevor King 2020-01-14 00:15:43 UTC
> Deployment succeeded on GCP with 4.3.0-0.nightly-2019-12-12-021332

I'm not clear on what the expected flake-rate for this issue is, but in 4.2.13 -> 4.3.0-rc.0 CI today (also on GCP) [1]:

      {
        "type": "Failing",
        "status": "True",
        "lastTransitionTime": "2020-01-13T13:48:11Z",
        "reason": "ClusterOperatorNotAvailable",
        "message": "Cluster operator authentication is still updating"
      },
      {
        "type": "Progressing",
        "status": "True",
        "lastTransitionTime": "2020-01-13T13:21:48Z",
        "reason": "ClusterOperatorNotAvailable",
        "message": "Unable to apply 4.3.0-rc.0: the cluster operator authentication has not yet successfully rolled out"
      },

with [2]:

  - lastTransitionTime: "2020-01-13T13:33:02Z"
    message: 'RouteHealthDegraded: failed to GET route: dial tcp 34.74.190.39:443:
      connect: connection refused'
    reason: RouteHealthDegradedFailedGet
    status: "True"
    type: Degraded

And at that time the network operator was still running 4.2.13 [3]:

  versions:
  - name: operator
    version: 4.2.13

so I guess this still needs to be cloned back to 4.2.z?

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/214
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/214/artifacts/e2e-gcp-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-d24ac732f2fd86150091410623d388ad78196ad7f8072696e85ceaaccb187759/cluster-scoped-resources/config.openshift.io/clusteroperators/authentication.yaml
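Condition dumps like the two quoted above can be filtered mechanically when triaging many CI runs. This sketch assumes the resource has been loaded as a dict (for example from `oc get clusteroperator authentication -o json`); the helper name `unhealthy_conditions` and the choice of which condition types count as trouble are assumptions for illustration, not anything the operators define.

```python
def unhealthy_conditions(resource: dict) -> list:
    """Return (type, reason, message) for conditions that signal trouble.

    Assumption: Degraded/Failing are bad when "True" and
    Available is bad when "False". Progressing is ignored.
    """
    bad_when_true = {"Degraded", "Failing"}
    bad_when_false = {"Available"}
    found = []
    for cond in resource.get("status", {}).get("conditions", []):
        ctype, status = cond.get("type"), cond.get("status")
        if (ctype in bad_when_true and status == "True") or (
            ctype in bad_when_false and status == "False"
        ):
            found.append((ctype, cond.get("reason", ""), cond.get("message", "")))
    return found


# The Degraded condition quoted in this comment, as a dict.
authentication = {
    "status": {
        "conditions": [
            {
                "type": "Degraded",
                "status": "True",
                "reason": "RouteHealthDegradedFailedGet",
                "lastTransitionTime": "2020-01-13T13:33:02Z",
                "message": "RouteHealthDegraded: failed to GET route: dial tcp "
                "34.74.190.39:443: connect: connection refused",
            }
        ]
    }
}

for ctype, reason, message in unhealthy_conditions(authentication):
    print(f"{ctype} ({reason}): {message}")
```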

Comment 5 zhaozhanqi 2020-01-15 11:53:39 UTC
I tried an upgrade from 4.2.14 to 4.3.0-rc.0 on a GCP cluster; all cluster operators upgraded successfully.

oc get clusterversion -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2020-01-15T04:18:18Z"
    generation: 2
    name: version
    resourceVersion: "146557"
    selfLink: /apis/config.openshift.io/v1/clusterversions/version
    uid: 0ef5dd00-374e-11ea-a2ae-42010a000004
  spec:
    channel: stable-4.2
    clusterID: 1d856d0b-d98b-453e-92ec-813bec9f78be
    desiredUpdate:
      force: true
      image: quay.io/openshift-release-dev/ocp-release:4.3.0-rc.0-x86_64
      version: ""
    upstream: https://api.openshift.com/api/upgrades_info/v1/graph
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2020-01-15T04:36:47Z"
      message: Done applying 4.3.0-rc.0
      status: "True"
      type: Available
    - lastTransitionTime: "2020-01-15T11:31:49Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2020-01-15T11:41:17Z"
      message: Cluster version is 4.3.0-rc.0
      status: "False"
      type: Progressing
    - lastTransitionTime: "2020-01-15T04:18:36Z"
      message: 'Unable to retrieve available updates: currently installed version
        4.3.0-rc.0 not found in the "stable-4.2" channel'
      reason: VersionNotFound
      status: "False"
      type: RetrievedUpdates
    desired:
      force: true
      image: quay.io/openshift-release-dev/ocp-release:4.3.0-rc.0-x86_64
      version: 4.3.0-rc.0
    history:
    - completionTime: "2020-01-15T11:41:17Z"
      image: quay.io/openshift-release-dev/ocp-release:4.3.0-rc.0-x86_64
      startedTime: "2020-01-15T11:04:32Z"
      state: Completed
      verified: false
      version: 4.3.0-rc.0
    - completionTime: "2020-01-15T04:36:47Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:3fabe939da31f9a31f509251b9f73d321e367aba2d09ff392c2f452f6433a95a
      startedTime: "2020-01-15T04:18:36Z"
      state: Completed
      verified: false
      version: 4.2.14
    observedGeneration: 2
    versionHash: CZiJlh_NjCQ=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
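The upgrade path can be read straight out of `status.history`, which lists the newest entry first. A minimal sketch, assuming a single ClusterVersion object loaded as a dict (for example one element of `items` from the output above, converted to JSON); `upgrade_path` is a hypothetical helper name.

```python
def upgrade_path(cluster_version: dict) -> list:
    """Return (version, state) pairs in chronological order.

    status.history is newest-first, so it is reversed here.
    """
    history = cluster_version.get("status", {}).get("history", [])
    return [(h.get("version"), h.get("state")) for h in reversed(history)]


# Trimmed to the two history entries shown in the output above.
cluster_version = {
    "status": {
        "history": [
            {"version": "4.3.0-rc.0", "state": "Completed"},
            {"version": "4.2.14", "state": "Completed"},
        ]
    }
}

print(" -> ".join(f"{v} ({s})" for v, s in upgrade_path(cluster_version)))
```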



oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0-rc.0   True        False         False      7h16m
cloud-credential                           4.3.0-rc.0   True        False         False      7h32m
cluster-autoscaler                         4.3.0-rc.0   True        False         False      7h22m
console                                    4.3.0-rc.0   True        False         False      16m
dns                                        4.3.0-rc.0   True        False         False      7h32m
image-registry                             4.3.0-rc.0   True        False         False      23m
ingress                                    4.3.0-rc.0   True        False         False      21m
insights                                   4.3.0-rc.0   True        False         False      7h32m
kube-apiserver                             4.3.0-rc.0   True        False         False      7h31m
kube-controller-manager                    4.3.0-rc.0   True        False         False      7h28m
kube-scheduler                             4.3.0-rc.0   True        False         False      7h30m
machine-api                                4.3.0-rc.0   True        False         False      7h32m
machine-config                             4.3.0-rc.0   True        False         False      7h28m
marketplace                                4.3.0-rc.0   True        False         False      15m
monitoring                                 4.3.0-rc.0   True        False         False      13m
network                                    4.3.0-rc.0   True        False         False      7h31m
node-tuning                                4.3.0-rc.0   True        False         False      21m
openshift-apiserver                        4.3.0-rc.0   True        False         False      18m
openshift-controller-manager               4.3.0-rc.0   True        False         False      7h30m
openshift-samples                          4.3.0-rc.0   True        False         False      40m
operator-lifecycle-manager                 4.3.0-rc.0   True        False         False      7h31m
operator-lifecycle-manager-catalog         4.3.0-rc.0   True        False         False      7h31m
operator-lifecycle-manager-packageserver   4.3.0-rc.0   True        False         False      15m
service-ca                                 4.3.0-rc.0   True        False         False      7h32m
service-catalog-apiserver                  4.3.0-rc.0   True        False         False      7h28m
service-catalog-controller-manager         4.3.0-rc.0   True        False         False      7h24m
storage                                    4.3.0-rc.0   True        False         False      39m
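A table like the one above can be checked automatically instead of by eye. The sketch below parses the plain-text `oc get co` output and lists operators that are not yet settled at a given version. The function names are hypothetical, and the whitespace split assumes every column is populated, which does not hold while an operator has no reported version; `oc get co -o json` is the safer input in that case.

```python
def parse_co_table(text: str) -> list:
    """Parse `oc get co` tabular output into a list of dicts keyed by header."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def not_settled(rows: list, version: str) -> list:
    """Names of operators not at `version`, unavailable, progressing, or degraded."""
    return [
        r.get("NAME")
        for r in rows
        if r.get("VERSION") != version
        or r.get("AVAILABLE") != "True"
        or r.get("PROGRESSING") != "False"
        or r.get("DEGRADED") != "False"
    ]


# Two rows copied from the output above.
table = """\
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0-rc.0   True        False         False      7h16m
network                                    4.3.0-rc.0   True        False         False      7h31m
"""

print(not_settled(parse_co_table(table), "4.3.0-rc.0"))
```

An empty list means every parsed operator reports the target version and is available, not progressing, and not degraded.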

Comment 7 errata-xmlrpc 2020-01-23 11:18:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

