Bug 1765280 - RouteHealthDegraded: failed to GET route: dial tcp <ip>:443: connect: connection refused because LoadBalancer IP missing from node iptables rules
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.4.0
Assignee: Casey Callendrello
QA Contact: Ross Brattain
URL:
Whiteboard:
Duplicates: 1751916 1752358 1762139 1766664 1768256 1771732 1772188 1779710 1781727
Depends On:
Blocks: 1752358 1781763
 
Reported: 2019-10-24 17:56 UTC by Dan Mace
Modified: 2023-09-07 20:52 UTC
CC List: 27 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1781763
Environment:
Last Closed: 2020-05-04 11:14:34 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHBA-2020:0581 (last updated 2020-05-04 11:15:17 UTC)

Description Dan Mace 2019-10-24 17:56:07 UTC
Description of problem:

The authentication operator will sometimes report the following degraded condition:

    RouteHealthDegraded: failed to GET route: dial tcp <ip>:443: connect: connection refused

Observed on the following platforms in CI over the past 14 days: gcp

The nature of the error (which looks like an external IP) and the fact that it has only been observed on GCP seem like clues.



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Dan Mace 2019-10-24 19:01:50 UTC
Observed on Azure: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_release/5567/rehearse-5567-pull-ci-openshift-cloud-credential-operator-master-e2e-azure/6

The general failure mode seems to be installer failure:

  level=info msg="Waiting up to 30m0s for the cluster at https://api.ci-op-x3hpts6c-395e7.ci.azure.devcluster.openshift.com:6443 to initialize..."
  level=error msg="Cluster operator authentication Degraded is True with RouteHealthDegradedFailedGet: RouteHealthDegraded: failed to GET route: dial tcp 52.230.227.98:443: connect: connection refused"

Comment 3 Dan Mace 2019-10-28 21:29:42 UTC
I think what's happening here is that the auth operator is scheduled to a master, and the master has no iptables rules that would enable a pod to route to the external IP of a load balancer service. This would indicate that LoadBalancer Services generally (and thus Routes and Ingresses) are not reachable from a master when this error manifests. I'm going to copy Clayton, who has been tracking some of the issues in the area of master/load balancer interactions.

Comment 5 Dan Mace 2019-10-29 00:06:26 UTC
Abhinav noted that the connection refused error is especially strange, and could indicate the traffic is actually making it to the load balancer. Normally these requests would remain cluster-local through iptables rules, but if the client request originates from a master that is missing the iptables rules noted above, maybe the packet does get refused by the LB for some reason (or maybe the LB routes it back to the node, which could reject it).

Next step is to reproduce these scenarios on GCP.

Comment 6 Dan Mace 2019-10-29 12:02:02 UTC
On my new 4.3.0-0.okd-2019-10-28-235941 GCP cluster I'm seeing the correct chains on my masters and routing is fine e.g.

    -A KUBE-SERVICES -d 35.190.186.118/32 -p tcp -m comment --comment "openshift-ingress/router-default:https loadbalancer IP" -m tcp --dport 443 -j KUBE-FW-MBAZS3WDHL45BPIZ
    -A KUBE-SERVICES -d 35.190.186.118/32 -p tcp -m comment --comment "openshift-ingress/router-default:http loadbalancer IP" -m tcp --dport 80 -j KUBE-FW-HEVFQXAKPPGAL4BV

Based on the runs I've analyzed, when the issue happens in CI, these rules are missing entirely from the masters. If I manually delete the https rule from my master, I can reproduce the connection refused error:

    [root@dmace-m5kzn-m-0 /]# iptables -t nat -D KUBE-SERVICES -d 35.190.186.118/32 -p tcp -m comment --comment "openshift-ingress/router-default:https loadbalancer IP" -m tcp --dport 443 -j KUBE-FW-MBAZS3WDHL45BPIZ
    [root@dmace-m5kzn-m-0 /]# curl -k https://35.190.186.118

    curl: (7) Failed to connect to 35.190.186.118 port 443: Connection refused

I'm now convinced this is an SDN issue. On GCP and Azure, pod requests to a cloud LB through a LoadBalancer Service should remain local to the cluster and not actually reach the load balancer.

Comment 7 Casey Callendrello 2019-10-29 12:29:40 UTC
Interesting! Before we go crazy into this, is there any chance kube-proxy is doing the right thing and not configuring the rules based on any TrafficPolicy?

Comment 8 Dan Mace 2019-10-29 12:35:02 UTC
In all cases, routers are deployed on workers. On new clusters, the rules are present on masters, and sometimes the rules disappear from the masters. Given that the routers are always on workers in these cases, I'm not sure how the inconsistent application of rules could ever be correct. Given the inputs, wouldn't it be reasonable for the rule to either always exist on the masters or never exist there?

Comment 9 Casey Callendrello 2019-10-29 13:02:11 UTC
I wonder if it's because we're using ExternalTrafficPolicy=local and not setting an external IP:

 "spec": {
                "clusterIP": "172.30.241.255",
                "externalTrafficPolicy": "Local",
                "healthCheckNodePort": 31858,
                "ports": [
                    {
                        "name": "http",
                        "nodePort": 31921,
                        "port": 80,
                        "protocol": "TCP",
                        "targetPort": "http"
                    },
                    {
                        "name": "https",
                        "nodePort": 31928,
                        "port": 443,
                        "protocol": "TCP",
                        "targetPort": "https"
                    }
                ],
                "selector": {
                    "ingresscontroller.operator.openshift.io/deployment-ingresscontroller": "default"
                },
                "sessionAffinity": "None",
                "type": "LoadBalancer"
            },

Comment 10 Casey Callendrello 2019-10-29 14:29:49 UTC
Wow, this code is complicated.

I imagine you're right, Dan. If these endpoints appear and then disappear, it's unlikely that they should.

Comment 11 Casey Callendrello 2019-10-29 15:34:26 UTC
One node has weird iptables rules: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/5567/rehearse-5567-pull-ci-openshift-cloud-credential-operator-master-e2e-azure/6/artifacts/e2e-azure/network/iptables-save-sdn-tz2vf

However, looking at the logs for this pod, it doesn't seem to miss the service (we don't log every possible event). So it's not obvious what the issue is.

Either way, the first step is to merge the fix for the known informer bug.

Comment 12 Dan Mace 2019-11-01 13:10:14 UTC
*** Bug 1766664 has been marked as a duplicate of this bug. ***

Comment 14 Casey Callendrello 2019-11-01 17:02:29 UTC
We're hoping that https://github.com/openshift/sdn/pull/60 can fix this.

Comment 15 Samuel Padgett 2019-11-04 13:38:35 UTC
*** Bug 1768256 has been marked as a duplicate of this bug. ***

Comment 17 Casey Callendrello 2019-11-05 18:39:45 UTC
Sounds like that didn't fix it. Nuts.

I've asked Aniket to file a temporary PR to add more logging. We'll revert it before we ship. Then we can have a hope of debugging this.

Comment 18 liujia 2019-11-11 07:27:05 UTC
Also hit the issue from https://bugzilla.redhat.com/show_bug.cgi?id=1768256 when triggering an installation on UPI/GCP, so tracking it in this bug.

version: 4.3.0-0.nightly-2019-11-10-152059

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          69m     Unable to apply 4.3.0-0.nightly-2019-11-10-152059: the cluster operator console has not yet successfully rolled out

# oc logs pod/console-67c588bdd9-lp45j -n openshift-console
2019/11/11 07:23:58 cmd/main: cookies are secure!
2019/11/11 07:23:58 auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.jliu-bug.qe.gcp.devcluster.openshift.com/oauth/token failed: Head https://oauth-openshift.apps.jliu-bug.qe.gcp.devcluster.openshift.com: dial tcp 35.222.207.160:443: connect: connection refused
2019/11/11 07:24:08 auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.jliu-bug.qe.gcp.devcluster.openshift.com/oauth/token failed: Head https://oauth-openshift.apps.jliu-bug.qe.gcp.devcluster.openshift.com: dial tcp 35.222.207.160:443: connect: connection refused

Comment 19 Dan Mace 2019-11-12 21:40:55 UTC
*** Bug 1771732 has been marked as a duplicate of this bug. ***

Comment 20 Casey Callendrello 2019-11-13 12:17:49 UTC
Assigning to Aniket, who has been looking at this.

Alexander, can you help him get to the bottom of this?

Summary: node iptables rules should also contain the public IP endpoints of load balancers (they act like another ClusterIP). However, there seems to be a bug where this occasionally does not work on a single node (or thereabouts).

So it seems the node is missing the update in which the service's load balancer IPs are set.

Comment 21 Casey Callendrello 2019-11-13 12:20:08 UTC
The question: is this a weird informer issue, or is there somehow a bug in the code that can miss the no-load-balancer -> load-balancer transition?
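
For illustration, a minimal client-go-style sketch (hypothetical names; this is not the openshift/sdn code) of the transition in question: the proxy's service update handler has to react when status.loadBalancer.ingress goes from empty to populated, even though nothing in the spec changes.

    package main

    import (
        "log"

        corev1 "k8s.io/api/core/v1"
    )

    // lbIngressIPs collects the external IPs published under
    // status.loadBalancer.ingress for a Service.
    func lbIngressIPs(svc *corev1.Service) []string {
        var ips []string
        for _, ing := range svc.Status.LoadBalancer.Ingress {
            if ing.IP != "" {
                ips = append(ips, ing.IP)
            }
        }
        return ips
    }

    // onServiceUpdate stands in for an informer UpdateFunc. The spec
    // (clusterIP, ports, selector) is identical on both sides of the update;
    // only the load balancer status changes, and that change is what must
    // trigger reprogramming of the KUBE-SERVICES/KUBE-FW rules.
    func onServiceUpdate(oldSvc, newSvc *corev1.Service) {
        oldIPs, newIPs := lbIngressIPs(oldSvc), lbIngressIPs(newSvc)
        if len(oldIPs) == 0 && len(newIPs) > 0 {
            log.Printf("%s/%s gained load balancer IPs %v (rv %s -> %s); resync iptables",
                newSvc.Namespace, newSvc.Name, newIPs,
                oldSvc.ResourceVersion, newSvc.ResourceVersion)
        }
    }

    func main() {
        // Synthetic example mirroring the shape of the update seen in CI:
        // same spec, new ingress IP, new ResourceVersion.
        before := &corev1.Service{}
        before.Namespace, before.Name, before.ResourceVersion = "openshift-ingress", "router-default", "1000"
        after := before.DeepCopy()
        after.ResourceVersion = "1001"
        after.Status.LoadBalancer.Ingress = []corev1.LoadBalancerIngress{{IP: "35.231.203.32"}}
        onServiceUpdate(before, after)
    }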

Comment 22 Casey Callendrello 2019-11-13 12:25:58 UTC
Another failure due to this: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/358/pull-ci-openshift-cluster-network-operator-master-e2e-gcp/153

The IP is missing from https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/358/pull-ci-openshift-cluster-network-operator-master-e2e-gcp/153/artifacts/e2e-gcp/network/iptables-save-sdn-2j4lw even though the Service has it in its status:

{
            "apiVersion": "v1",
            "kind": "Service",
            "metadata": {
                "creationTimestamp": "2019-11-12T18:20:38Z",
                "finalizers": [
                    "ingress.openshift.io/operator",
                    "service.kubernetes.io/load-balancer-cleanup"
                ],
                "labels": {
                    "app": "router",
                    "ingresscontroller.operator.openshift.io/owning-ingresscontroller": "default",
                    "router": "router-default"
                },
                "name": "router-default",
                "namespace": "openshift-ingress",
                "ownerReferences": [
                    {
                        "apiVersion": "apps/v1",
                        "controller": true,
                        "kind": "Deployment",
                        "name": "router-default",
                        "uid": "fa7f5a3e-00c6-4fdc-b3bc-ea2c2d92f7b4"
                    }
                ],
                "resourceVersion": "10525",
                "selfLink": "/api/v1/namespaces/openshift-ingress/services/router-default",
                "uid": "3c79dfae-31bc-44e4-a1b2-32cb0fc2cb09"
            },
            "spec": {
                "clusterIP": "172.30.111.81",
                "externalTrafficPolicy": "Local",
                "healthCheckNodePort": 31437,
                "ports": [
                    {
                        "name": "http",
                        "nodePort": 31504,
                        "port": 80,
                        "protocol": "TCP",
                        "targetPort": "http"
                    },
                    {
                        "name": "https",
                        "nodePort": 31431,
                        "port": 443,
                        "protocol": "TCP",
                        "targetPort": "https"
                    }
                ],
                "selector": {
                    "ingresscontroller.operator.openshift.io/deployment-ingresscontroller": "default"
                },
                "sessionAffinity": "None",
                "type": "LoadBalancer"
            },
            "status": {
                "loadBalancer": {
                    "ingress": [
                        {
                            "ip": "35.231.203.32"
                        }
                    ]
                }
            }
        },
        {

Comment 23 Casey Callendrello 2019-11-13 13:07:35 UTC
*** Bug 1762139 has been marked as a duplicate of this bug. ***

Comment 26 Casey Callendrello 2019-11-14 12:51:28 UTC
*** Bug 1772188 has been marked as a duplicate of this bug. ***

Comment 27 Casey Callendrello 2019-11-14 12:52:58 UTC
Got a test failure for this:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24147/pull-ci-openshift-origin-master-e2e-gcp-builds/508

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/24147/pull-ci-openshift-origin-master-e2e-gcp-builds/508/artifacts/e2e-gcp-builds/network/iptables-save-sdn-wqzvl is missing the IP
SDN logs in https://storage.googleapis.com/origin-ci-test/pr-logs/pull/24147/pull-ci-openshift-origin-master-e2e-gcp-builds/508/artifacts/e2e-gcp-builds/pods/openshift-sdn_sdn-wqzvl_sdn.log

With the added informer logging, we do see the correct event:

I1114 11:22:23.445081    4033 informers.go:145] UPDATE &Service{ObjectMeta:{router-default  openshift-ingress /api/v1/namespaces/openshift-ingress/services/router-default d2a33c04-e25f-443c-b102-7b9987229f21 9659 0 2019-11-14 11:21:34 +0000 UTC <nil> <nil> map[app:router ingresscontroller.operator.openshift.io/owning-ingresscontroller:default router:router-default] map[] [{apps/v1 Deployment router-default bd88075a-4793-4f94-a8bd-138783e7d5d9 0xc000dd0957 <nil>}] [ingress.openshift.io/operator service.kubernetes.io/load-balancer-cleanup]  []},Spec:ServiceSpec{Ports:[]ServicePort{ServicePort{Name:http,Protocol:TCP,Port:80,TargetPort:{1 0 http},NodePort:30902,},ServicePort{Name:https,Protocol:TCP,Port:443,TargetPort:{1 0 https},NodePort:32757,},},Selector:map[string]string{ingresscontroller.operator.openshift.io/deployment-ingresscontroller: default,},ClusterIP:172.30.211.249,Type:LoadBalancer,ExternalIPs:[],SessionAffinity:None,LoadBalancerIP:,LoadBalancerSourceRanges:[],ExternalName:,ExternalTrafficPolicy:Local,HealthCheckNodePort:30071,PublishNotReadyAddresses:false,SessionAffinityConfig:nil,IPFamily:nil,},Status:ServiceStatus{LoadBalancer:LoadBalancerStatus{Ingress:[]LoadBalancerIngress{},},},} -> &Service{ObjectMeta:{router-default  openshift-ingress /api/v1/namespaces/openshift-ingress/services/router-default d2a33c04-e25f-443c-b102-7b9987229f21 10697 0 2019-11-14 11:21:34 +0000 UTC <nil> <nil> map[app:router ingresscontroller.operator.openshift.io/owning-ingresscontroller:default router:router-default] map[] [{apps/v1 Deployment router-default bd88075a-4793-4f94-a8bd-138783e7d5d9 0xc000eb56f7 <nil>}] [ingress.openshift.io/operator service.kubernetes.io/load-balancer-cleanup]  []},Spec:ServiceSpec{Ports:[]ServicePort{ServicePort{Name:http,Protocol:TCP,Port:80,TargetPort:{1 0 http},NodePort:30902,},ServicePort{Name:https,Protocol:TCP,Port:443,TargetPort:{1 0 https},NodePort:32757,},},Selector:map[string]string{ingresscontroller.operator.openshift.io/deployment-ingresscontroller: default,},ClusterIP:172.30.211.249,Type:LoadBalancer,ExternalIPs:[],SessionAffinity:None,LoadBalancerIP:,LoadBalancerSourceRanges:[],ExternalName:,ExternalTrafficPolicy:Local,HealthCheckNodePort:30071,PublishNotReadyAddresses:false,SessionAffinityConfig:nil,IPFamily:nil,},Status:ServiceStatus{LoadBalancer:LoadBalancerStatus{Ingress:[]LoadBalancerIngress{LoadBalancerIngress{IP:35.185.124.124,Hostname:,},},},},}

Time to see why this isn't being handled properly.

Comment 29 Aniket Bhat 2019-11-14 16:32:23 UTC
Assigning back to Casey.

Comment 31 Casey Callendrello 2019-11-14 16:45:09 UTC
Filed https://github.com/openshift/sdn/pull/71 to get more logging.

Comment 32 Casey Callendrello 2019-11-14 21:09:14 UTC
Update: check out https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_sdn/71/pull-ci-openshift-sdn-master-e2e-gcp/11/artifacts/e2e-gcp/pods/openshift-sdn_sdn-mddrk_sdn.log

At 19:58:39.348317 the informer debugger gets the update that adds the load balancer IP, but the service Update method does not. So either there is a bug in informers (very unlikely) or the hybrid proxier is broken. Let's test this theory.

Comment 33 Casey Callendrello 2019-11-15 08:45:33 UTC
With extra logging added (but no logic changes), CI passed twice. Either I've gotten unlucky, or this is a very sensitive heisenbug.

Comment 35 Casey Callendrello 2019-11-22 14:20:34 UTC
*** Bug 1751916 has been marked as a duplicate of this bug. ***

Comment 38 W. Trevor King 2019-12-04 00:10:17 UTC
One-off "seen here"'s are probably not all that useful.  It's less anecdotal to post flake rates like: 35 of these in the past 24h, showing up in 2% of our failed e2e jobs [1].

[1]: https://search.svc.ci.openshift.org/chart?search=RouteHealthDegraded:%20failed%20to%20GET%20route.*connection%20refused

Comment 39 W. Trevor King 2019-12-04 00:22:29 UTC
I've filed bug 1779429 about the "TLS handshake timeout" form of RouteHealthDegraded.  The job I cite there also saw an EOF form.  Dunno if they have the same underlying cause or not.

Comment 40 Samuel Padgett 2019-12-04 15:09:20 UTC
*** Bug 1779710 has been marked as a duplicate of this bug. ***

Comment 41 Samuel Padgett 2019-12-04 15:46:36 UTC
(In reply to W. Trevor King from comment #38)
> One-off "seen here"'s are probably not all that useful.  It's less anecdotal
> to post flake rates like: 35 of these in the past 24h, showing up in 2% of
> our failed e2e jobs [1].
> 
> [1]:
> https://search.svc.ci.openshift.org/chart?search=RouteHealthDegraded:
> %20failed%20to%20GET%20route.*connection%20refused

It's actually much more frequent than this because the same underlying problem can manifest with other errors as in bug 1779710. We're seeing this extremely frequently in console CI.

https://search.svc.ci.openshift.org/chart?search=Cluster%20operator%20console%20Available%20is%20False%20with%20DeploymentAvailableFailedUpdate

Comment 42 Casey Callendrello 2019-12-05 19:01:04 UTC
Got another reproduction: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_sdn/74/pull-ci-openshift-sdn-master-e2e-gcp/46/artifacts/e2e-gcp/pods/openshift-sdn_sdn-8w8qt_sdn.log

Extremely disconcerting. Added logging to the proxy handler as well as a separate debug handler. Here are the timelines of updates to router-default as well as the Old and New ResourceVersion:

proxy:
15:39:55.117801 0 -> 12628
15:39:55.143645 12628 -> 12636
15:41:48.870121 13818 -> 13818 BUG!

informer:
15:39:55.118055 0 -> 12628
15:39:55.144002 12628 -> 12636
15:40:09.501553 12636 -> 12636
15:40:39.502596 12636 -> 12636
15:40:48.880045 12636 -> 13818 (IP added here)
15:41:18.874704 13818 -> 13818

So, the missed update was around 15:40:48.880045. Something of interest: just before this point in time, the apiserver must have restarted (maybe the transition off of the bootstrap node):

E1205 15:40:47.831537    4830 reflector.go:280] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Service: unknown (get services)
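
A sketch of the kind of separate debug handler described above (assumed wiring; the actual logging was added in the temporary sdn debug patches): a second, logging-only UpdateFunc registered on the shared Service informer, whose ResourceVersion timeline can then be diffed against the proxy handler's.

    package sketch

    import (
        corev1 "k8s.io/api/core/v1"
        "k8s.io/client-go/informers"
        "k8s.io/client-go/tools/cache"
        "k8s.io/klog"
    )

    // addDebugServiceHandler registers a logging-only handler on the same
    // shared Service informer the proxy uses. Every update to
    // openshift-ingress/router-default is logged with its old -> new
    // ResourceVersion, producing a timeline to compare against the proxy
    // handler's.
    func addDebugServiceHandler(factory informers.SharedInformerFactory) {
        factory.Core().V1().Services().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
            UpdateFunc: func(oldObj, newObj interface{}) {
                oldSvc, okOld := oldObj.(*corev1.Service)
                newSvc, okNew := newObj.(*corev1.Service)
                if !okOld || !okNew {
                    return
                }
                if newSvc.Namespace == "openshift-ingress" && newSvc.Name == "router-default" {
                    klog.Infof("debug informer: router-default %s -> %s",
                        oldSvc.ResourceVersion, newSvc.ResourceVersion)
                }
            },
        })
    }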

Comment 43 Casey Callendrello 2019-12-05 22:10:02 UTC
Another even crazier one:
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_sdn/74/pull-ci-openshift-sdn-master-e2e-gcp/48/artifacts/e2e-gcp/pods/openshift-sdn_sdn-fx282_sdn.log

Our debug informer gets the ADD, but the proxy does not.

Just to test a thesis, I'm going to disable the hybrid proxy and try again.

Comment 45 Casey Callendrello 2019-12-06 09:54:31 UTC
Okay, I have a better idea of what is going on, and it doesn't look like the hybrid proxy is (directly) at fault. Let's look at https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_sdn/74/pull-ci-openshift-sdn-master-e2e-gcp/54/artifacts/e2e-gcp/pods/openshift-sdn_sdn-hm9lx_sdn.log

There is a single SharedInformer, with two handlers: a simple debugging one, and the proxy itself. At this point in time, the logging calls I've added are all in upstream code, so it's not the hybrid proxier at work here.

I see a period of time where the debugging handler gets more calls than the proxy handler. This always seems to follow apiserver disruption (i.e. bootstrap or new config rollout). For example:

07:52:53.707847 <- last call to the proxy handler
(api server disruption, reconnection)
Lots of calls to debug informer UPDATE(service)
07:53:39.556409 <- next call
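
For reference, a minimal wiring sketch of the two-handler setup being described (assumed names, not the actual sdn code): client-go fans every event out to each handler registered on a shared informer, so the debug handler and the proxy handler should always see identical streams, and a gap like the one above points at the informer/handler plumbing rather than at the proxier logic.

    package sketch

    import (
        "time"

        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/cache"
        "k8s.io/klog"
    )

    // runServiceWatchers builds one shared Service informer and attaches two
    // handlers: a debug-only handler and a stand-in for the proxy handler.
    // Both are expected to receive every update event.
    func runServiceWatchers(client kubernetes.Interface, stopCh <-chan struct{}) {
        factory := informers.NewSharedInformerFactory(client, 30*time.Minute)
        svcInformer := factory.Core().V1().Services().Informer()

        var debugUpdates, proxyUpdates int

        // Handler 1: debug-only counter.
        svcInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
            UpdateFunc: func(_, _ interface{}) {
                debugUpdates++
                klog.V(4).Infof("debug handler: %d service updates", debugUpdates)
            },
        })

        // Handler 2: stand-in for the proxy; its count should track handler 1's.
        svcInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
            UpdateFunc: func(_, _ interface{}) {
                proxyUpdates++
                klog.V(4).Infof("proxy handler: %d service updates", proxyUpdates)
            },
        })

        factory.Start(stopCh)
        cache.WaitForCacheSync(stopCh, svcInformer.HasSynced)
        <-stopCh
    }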

Comment 46 Casey Callendrello 2019-12-06 10:17:44 UTC
Changelog for the buggy service:

debug:
07:52:19.951681 0 -> 15303
07:52:20.013841 15303 -> 15313
07:52:36.589822 15313 -> 15313
07:53:06.591883 15313 -> 15313
07:53:28.806916 15313 -> 17133
07:53:58.684469 17133 -> 17133

proxy:
07:52:19.951681 0 -> 15303
07:52:20.013241 15303 -> 15313
07:54:28.684664 17133 -> 17133

Comment 47 Casey Callendrello 2019-12-06 14:09:10 UTC
We think we've figured this out:

https://github.com/openshift/sdn/pull/74/commits/5cf82fdf11fbb3fea3a18d4245a55838ebff5b65

TL;DR: shared informer bug. Yikes.

Comment 48 Casey Callendrello 2019-12-06 21:45:45 UTC
Filed https://github.com/kubernetes/kubernetes/pull/86015 to fix the client-go bug.

Comment 49 Samuel Padgett 2019-12-10 13:59:08 UTC
*** Bug 1781727 has been marked as a duplicate of this bug. ***

Comment 50 Casey Callendrello 2019-12-10 14:22:45 UTC
Fix merged in https://github.com/openshift/openshift-sdn/pull/79. Starting the backport dance.

Comment 51 Casey Callendrello 2019-12-10 14:23:38 UTC
meant https://github.com/openshift/sdn/pull/79

Comment 53 Ross Brattain 2019-12-16 17:00:02 UTC
Deployment succeeded on GCP with 4.4.0-0.nightly-2019-12-13-170401

NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.4.0-0.nightly-2019-12-13-170401   True        False         False      19m

Comment 54 Ricardo Carrillo Cruz 2020-02-05 14:31:35 UTC
*** Bug 1752358 has been marked as a duplicate of this bug. ***

Comment 56 Abu Davis 2020-03-20 17:24:48 UTC
Hello there, we are running OpenShift 4.2.20 on Azure cloud and the "Authentication" operator is in a degraded state, resulting in login problems for the applications we are running on the platform. Is this something that is going to be fixed in a future fix pack?

--
 RouteHealthDegraded: failed to GET route: dial tcp 10.124.161.68:443: connect: no route to host
WellKnownEndpointDegraded: failed to GET well-known https://10.124.161.6:6443/.well-known/oauth-authorization-server: dial tcp 10.124.161.6:6443: i/o timeout
    Reason:                MultipleConditionsMatching
--

Comment 63 errata-xmlrpc 2020-05-04 11:14:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

