Bug 1781763

Summary: [backport 4.3] RouteHealthDegraded: failed to GET route: dial tcp <ip>:443: connect: connection refused because Load IP missing from node iptables rules
Product: OpenShift Container Platform
Component: Networking
Networking sub component: openshift-sdn
Version: 4.3.0
Target Release: 4.3.0
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Hardware: Unspecified
OS: Unspecified
Reporter: Casey Callendrello <cdc>
Assignee: Casey Callendrello <cdc>
QA Contact: Ross Brattain <rbrattai>
CC: aconstan, adam.kaplan, anbhat, aos-bugs, bpeterse, ccoleman, cdc, deads, dmace, dmoessne, jchaloup, jiajliu, jlebon, kgarriso, lsm5, mfojtik, obulatov, pmuller, sdodson, spadgett, wking, yinzhou, zzhao
Clone Of: 1765280
Last Closed: 2020-01-23 11:18:20 UTC
Bug Depends On: 1765280
Bug Blocks: 1790704

Description Casey Callendrello 2019-12-10 14:25:32 UTC
+++ This bug was initially created as a clone of Bug #1765280 +++

Description of problem:

The authentication operator will sometimes report the following degraded condition:

    RouteHealthDegraded: failed to GET route: dial tcp <ip>:443: connect: connection refused

Observed on the following platforms in CI over the past 14 days: gcp

The address in the error looks like an external IP, and the failure has only been observed on GCP. Both seem like clues.
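
To make the "looks like an external IP" observation checkable, here is a minimal sketch that pulls the address out of such a message and reports whether it is a globally routable address. The helper name `refused_endpoint` is hypothetical and not part of any OpenShift tooling; it only assumes the message has the `dial tcp <ip>:<port>: connect: connection refused` shape quoted above.

```python
import ipaddress
import re

# Matches the "dial tcp <ip>:<port>: connect: connection refused" fragment.
# \s+ tolerates the line wrap that YAML output can introduce after the port.
_DIAL_RE = re.compile(
    r"dial tcp (\d{1,3}(?:\.\d{1,3}){3}):(\d+):\s+connect: connection refused"
)


def refused_endpoint(message):
    """Return (ip, port, is_external) for a refused dial, or None if absent.

    Hypothetical helper for triage; is_external is True when the address
    is globally routable according to the standard ipaddress module.
    """
    match = _DIAL_RE.search(message)
    if match is None:
        return None
    try:
        ip = ipaddress.ip_address(match.group(1))
    except ValueError:
        # Looked like dotted-quad text but is not a valid IPv4 address.
        return None
    return str(ip), int(match.group(2)), ip.is_global
```

A cluster-internal address such as a service or node IP would come back with `is_external` set to False, which would point away from the load balancer path.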


--- Additional comment from Casey Callendrello on 2019-12-10 14:22:45 UTC ---

Fix merged in https://github.com/openshift/openshift-sdn/pull/79. Starting the backport dance.

--- Additional comment from Casey Callendrello on 2019-12-10 14:23:38 UTC ---

meant https://github.com/openshift/sdn/pull/79

Comment 1 Casey Callendrello 2019-12-10 14:36:15 UTC
https://github.com/openshift/sdn/pull/81 filed

Comment 3 Ross Brattain 2019-12-13 01:12:10 UTC
Deployment succeeded on GCP with 4.3.0-0.nightly-2019-12-12-021332

NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.3.0-0.nightly-2019-12-12-021332   True        False         False      6h18m

Comment 4 W. Trevor King 2020-01-14 00:15:43 UTC
> Deployment succeeded on GCP with 4.3.0-0.nightly-2019-12-12-021332

I'm not clear on what the expected flake-rate for this issue is, but in 4.2.13 -> 4.3.0-rc.0 CI today (also on GCP) [1]:

      {
        "type": "Failing",
        "status": "True",
        "lastTransitionTime": "2020-01-13T13:48:11Z",
        "reason": "ClusterOperatorNotAvailable",
        "message": "Cluster operator authentication is still updating"
      },
      {
        "type": "Progressing",
        "status": "True",
        "lastTransitionTime": "2020-01-13T13:21:48Z",
        "reason": "ClusterOperatorNotAvailable",
        "message": "Unable to apply 4.3.0-rc.0: the cluster operator authentication has not yet successfully rolled out"
      },

with [2]:

  - lastTransitionTime: "2020-01-13T13:33:02Z"
    message: 'RouteHealthDegraded: failed to GET route: dial tcp 34.74.190.39:443:
      connect: connection refused'
    reason: RouteHealthDegradedFailedGet
    status: "True"
    type: Degraded

And at that time the network operator is still running [3]:

  versions:
  - name: operator
    version: 4.2.13

so I guess this still needs to be cloned back to 4.2.z?

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/214
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/214/artifacts/e2e-gcp-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-d24ac732f2fd86150091410623d388ad78196ad7f8072696e85ceaaccb187759/cluster-scoped-resources/config.openshift.io/clusteroperators/authentication.yaml
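
The reasoning in this comment (the authentication operator is degraded while the network operator is still on a release without the fix) can be sketched as a small check. This is only an illustration: `degraded_before_fix` is a hypothetical name, the conditions are assumed to be already loaded as a list of dicts, and the default `fixed_minor=(4, 3)` encodes the assumption that only 4.3 and later carry the fix.

```python
def degraded_before_fix(auth_conditions, network_operator_version, fixed_minor=(4, 3)):
    """Return True if authentication reports RouteHealthDegraded while the
    network operator version is older than fixed_minor.

    auth_conditions: list of dicts with "type", "status" and "message" keys,
    as found under status.conditions of the authentication ClusterOperator.
    fixed_minor: (major, minor) assumed to be the first release with the fix.
    """
    route_degraded = any(
        cond.get("type") == "Degraded"
        and cond.get("status") == "True"
        and cond.get("message", "").startswith("RouteHealthDegraded")
        for cond in auth_conditions
    )
    # Only major.minor matter here, so "4.3.0-rc.0" parses as (4, 3).
    major, minor = (int(part) for part in network_operator_version.split(".")[:2])
    return route_degraded and (major, minor) < fixed_minor
```

With the condition quoted above and a network operator at 4.2.13, the check returns True, which matches the suspicion that the 4.2.z side of the upgrade is still exposed.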

Comment 5 zhaozhanqi 2020-01-15 11:53:39 UTC
I tried an upgrade from 4.2.14 to 4.3.0-rc.0 on a GCP cluster; all cluster operators upgraded successfully.

oc get clusterversion -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2020-01-15T04:18:18Z"
    generation: 2
    name: version
    resourceVersion: "146557"
    selfLink: /apis/config.openshift.io/v1/clusterversions/version
    uid: 0ef5dd00-374e-11ea-a2ae-42010a000004
  spec:
    channel: stable-4.2
    clusterID: 1d856d0b-d98b-453e-92ec-813bec9f78be
    desiredUpdate:
      force: true
      image: quay.io/openshift-release-dev/ocp-release:4.3.0-rc.0-x86_64
      version: ""
    upstream: https://api.openshift.com/api/upgrades_info/v1/graph
  status:
    availableUpdates: null
    conditions:
    - lastTransitionTime: "2020-01-15T04:36:47Z"
      message: Done applying 4.3.0-rc.0
      status: "True"
      type: Available
    - lastTransitionTime: "2020-01-15T11:31:49Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2020-01-15T11:41:17Z"
      message: Cluster version is 4.3.0-rc.0
      status: "False"
      type: Progressing
    - lastTransitionTime: "2020-01-15T04:18:36Z"
      message: 'Unable to retrieve available updates: currently installed version
        4.3.0-rc.0 not found in the "stable-4.2" channel'
      reason: VersionNotFound
      status: "False"
      type: RetrievedUpdates
    desired:
      force: true
      image: quay.io/openshift-release-dev/ocp-release:4.3.0-rc.0-x86_64
      version: 4.3.0-rc.0
    history:
    - completionTime: "2020-01-15T11:41:17Z"
      image: quay.io/openshift-release-dev/ocp-release:4.3.0-rc.0-x86_64
      startedTime: "2020-01-15T11:04:32Z"
      state: Completed
      verified: false
      version: 4.3.0-rc.0
    - completionTime: "2020-01-15T04:36:47Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:3fabe939da31f9a31f509251b9f73d321e367aba2d09ff392c2f452f6433a95a
      startedTime: "2020-01-15T04:18:36Z"
      state: Completed
      verified: false
      version: 4.2.14
    observedGeneration: 2
    versionHash: CZiJlh_NjCQ=
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""



oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0-rc.0   True        False         False      7h16m
cloud-credential                           4.3.0-rc.0   True        False         False      7h32m
cluster-autoscaler                         4.3.0-rc.0   True        False         False      7h22m
console                                    4.3.0-rc.0   True        False         False      16m
dns                                        4.3.0-rc.0   True        False         False      7h32m
image-registry                             4.3.0-rc.0   True        False         False      23m
ingress                                    4.3.0-rc.0   True        False         False      21m
insights                                   4.3.0-rc.0   True        False         False      7h32m
kube-apiserver                             4.3.0-rc.0   True        False         False      7h31m
kube-controller-manager                    4.3.0-rc.0   True        False         False      7h28m
kube-scheduler                             4.3.0-rc.0   True        False         False      7h30m
machine-api                                4.3.0-rc.0   True        False         False      7h32m
machine-config                             4.3.0-rc.0   True        False         False      7h28m
marketplace                                4.3.0-rc.0   True        False         False      15m
monitoring                                 4.3.0-rc.0   True        False         False      13m
network                                    4.3.0-rc.0   True        False         False      7h31m
node-tuning                                4.3.0-rc.0   True        False         False      21m
openshift-apiserver                        4.3.0-rc.0   True        False         False      18m
openshift-controller-manager               4.3.0-rc.0   True        False         False      7h30m
openshift-samples                          4.3.0-rc.0   True        False         False      40m
operator-lifecycle-manager                 4.3.0-rc.0   True        False         False      7h31m
operator-lifecycle-manager-catalog         4.3.0-rc.0   True        False         False      7h31m
operator-lifecycle-manager-packageserver   4.3.0-rc.0   True        False         False      15m
service-ca                                 4.3.0-rc.0   True        False         False      7h32m
service-catalog-apiserver                  4.3.0-rc.0   True        False         False      7h28m
service-catalog-controller-manager         4.3.0-rc.0   True        False         False      7h24m
storage                                    4.3.0-rc.0   True        False         False      39m
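
Reading a table like this by eye is error-prone, so here is a rough sketch that scans the text output of `oc get co` for operators that are not at the expected version or not Available=True, Progressing=False, Degraded=False. The function name `unhealthy_operators` is hypothetical, and the parsing assumes every row has a non-empty VERSION column, which may not hold for an operator that has not reported a version yet.

```python
def unhealthy_operators(table_text, expected_version):
    """Return names of operators that fail the simple health check.

    table_text: the plain-text output of `oc get co`, header included.
    expected_version: the version every operator should report.
    Assumes whitespace-separated columns in the order
    NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE.
    """
    problems = []
    for line in table_text.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue  # blank or malformed row; ignored in this sketch
        name, version, available, progressing, degraded = fields[:5]
        healthy = (
            version == expected_version
            and available == "True"
            and progressing == "False"
            and degraded == "False"
        )
        if not healthy:
            problems.append(name)
    return problems
```

For the output above and `expected_version="4.3.0-rc.0"`, every row satisfies the check, so the returned list is empty.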

Comment 7 errata-xmlrpc 2020-01-23 11:18:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062