Description of problem:

The authentication operator will sometimes report the following degraded condition:

RouteHealthDegraded: failed to GET route: dial tcp <ip>:443: connect: connection refused

Observed on the following platforms in CI over the past 14 days: gcp

The nature of the error (which looks like an external IP) and the fact that it has only been observed on GCP seem like clues.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Observed on Azure: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_release/5567/rehearse-5567-pull-ci-openshift-cloud-credential-operator-master-e2e-azure/6

The general failure mode seems to be installer failure:

level=info msg="Waiting up to 30m0s for the cluster at https://api.ci-op-x3hpts6c-395e7.ci.azure.devcluster.openshift.com:6443 to initialize..."
level=error msg="Cluster operator authentication Degraded is True with RouteHealthDegradedFailedGet: RouteHealthDegraded: failed to GET route: dial tcp 52.230.227.98:443: connect: connection refused"
https://ci-search-ci-search-next.svc.ci.openshift.org/?search=RouteHealthDegraded%3A+failed+to+GET+route.*connection+refused&maxAge=336h&context=2&type=all
I think what's happening here is that the auth operator is scheduled to a master, and the master has no iptables rules that would enable a pod to route to the external IP of a load balancer service. This would indicate that access to LoadBalancer Services generally (and thus Routes and Ingress) from a master is not working when this error manifests. I'm going to copy Clayton, who has been tracking some of the issues in the area of master/load balancer interactions.
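A quick way to check this on an affected master (assuming node access via `oc debug node/<name>`, and assuming kube-proxy tags these rules with the usual "loadbalancer IP" comment) would be:

# iptables-save -t nat | grep 'loadbalancer IP'

If that comes back empty on the master while the same command on a worker shows KUBE-SERVICES entries for the router's external IP, that would line up with this theory.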
Forgot to mention, I got evidence from:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_console-operator/325/pull-ci-openshift-console-operator-master-e2e-gcp/40
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.3/206
Abhinav noted that the connection refused error is especially strange, and could indicate the traffic is actually making it to the load balancer. Normally these requests would stay cluster-local thanks to the iptables rules, but if the client request originates from a master that is missing those rules, maybe the packet does get refused by the LB for some reason (or maybe the LB routes it back to the node, which could reject it). Next step is to reproduce these scenarios on GCP.
On my new 4.3.0-0.okd-2019-10-28-235941 GCP cluster I'm seeing the correct chains on my masters and routing is fine, e.g.:

-A KUBE-SERVICES -d 35.190.186.118/32 -p tcp -m comment --comment "openshift-ingress/router-default:https loadbalancer IP" -m tcp --dport 443 -j KUBE-FW-MBAZS3WDHL45BPIZ
-A KUBE-SERVICES -d 35.190.186.118/32 -p tcp -m comment --comment "openshift-ingress/router-default:http loadbalancer IP" -m tcp --dport 80 -j KUBE-FW-HEVFQXAKPPGAL4BV

Based on the runs I've analyzed, when the issue happens in CI, these rules are missing entirely from the masters. If I manually delete the https rule from my master, I can reproduce the connection refused error:

[root@dmace-m5kzn-m-0 /]# iptables -t nat -D KUBE-SERVICES -d 35.190.186.118/32 -p tcp -m comment --comment "openshift-ingress/router-default:https loadbalancer IP" -m tcp --dport 443 -j KUBE-FW-MBAZS3WDHL45BPIZ
[root@dmace-m5kzn-m-0 /]# curl -k https://35.190.186.118
curl: (7) Failed to connect to 35.190.186.118 port 443: Connection refused

I'm now convinced this is an SDN issue. On GCP and Azure, pod requests to a cloud LB through a LoadBalancer Service should remain local to the cluster and not actually reach the load balancer.
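(Side note on the manual repro above: to undo it, re-add the deleted rule with the same arguments, swapping -D for -A; re-appending is fine for a quick test, and the proxy's periodic iptables resync should restore the original rule set anyway.)

[root@dmace-m5kzn-m-0 /]# iptables -t nat -A KUBE-SERVICES -d 35.190.186.118/32 -p tcp -m comment --comment "openshift-ingress/router-default:https loadbalancer IP" -m tcp --dport 443 -j KUBE-FW-MBAZS3WDHL45BPIZ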
Interesting! Before we go too crazy digging into this, is there any chance kube-proxy is doing the right thing and intentionally not configuring the rules because of the service's traffic policy?
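One quick way to check (assuming cluster access, and assuming the default router service is named router-default as in the specs pasted below) would be:

$ oc -n openshift-ingress get svc router-default -o jsonpath='{.spec.externalTrafficPolicy}'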
In all cases, the routers are deployed on workers. On new clusters, the rules are present on masters, and sometimes the rules later disappear from the masters. Given that the routers are always on workers in these cases, I'm not sure how this inconsistent application of rules could ever be correct. Given the same inputs, shouldn't the rule either always exist on the masters or never exist there?
I wonder if it's because we're using ExternalTrafficPolicy=local and not setting an external IP:

"spec": {
    "clusterIP": "172.30.241.255",
    "externalTrafficPolicy": "Local",
    "healthCheckNodePort": 31858,
    "ports": [
        {
            "name": "http",
            "nodePort": 31921,
            "port": 80,
            "protocol": "TCP",
            "targetPort": "http"
        },
        {
            "name": "https",
            "nodePort": 31928,
            "port": 443,
            "protocol": "TCP",
            "targetPort": "https"
        }
    ],
    "selector": {
        "ingresscontroller.operator.openshift.io/deployment-ingresscontroller": "default"
    },
    "sessionAffinity": "None",
    "type": "LoadBalancer"
},
Wow, this code is complicated. I imagine you're right, Dan. If these endpoints appear and then disappear, it's unlikely that they should.
One node has weird iptables rules:

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/5567/rehearse-5567-pull-ci-openshift-cloud-credential-operator-master-e2e-azure/6/artifacts/e2e-azure/network/iptables-save-sdn-tz2vf

However, looking at the logs for this pod, it doesn't seem to have missed the service (though we don't log every possible event), so it's not obvious what the issue is. Either way, the first step is to merge the fix for the known informer bug.
*** Bug 1766664 has been marked as a duplicate of this bug. ***
Moving this to high severity since this seems to be causing clusters to fail to install (in my experience on GCP):

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24048/pull-ci-openshift-origin-master-e2e-gcp-upgrade/412
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24048/pull-ci-openshift-origin-master-e2e-gcp-builds/400
We're hoping that https://github.com/openshift/sdn/pull/60 can fix this.
*** Bug 1768256 has been marked as a duplicate of this bug. ***
Just hit this in a CI run: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1244/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/302 Is a fix expected?
Sounds like that didn't fix it. Nuts. I've asked Aniket to file a temporary PR to add more logging. We'll revert it before we ship. Then we can have a hope of debugging this.
Also hit the issue from https://bugzilla.redhat.com/show_bug.cgi?id=1768256 when triggering an installation on UPI/GCP, so tracking it in this bug.

version: 4.3.0-0.nightly-2019-11-10-152059

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          69m     Unable to apply 4.3.0-0.nightly-2019-11-10-152059: the cluster operator console has not yet successfully rolled out

# oc logs pod/console-67c588bdd9-lp45j -n openshift-console
2019/11/11 07:23:58 cmd/main: cookies are secure!
2019/11/11 07:23:58 auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.jliu-bug.qe.gcp.devcluster.openshift.com/oauth/token failed: Head https://oauth-openshift.apps.jliu-bug.qe.gcp.devcluster.openshift.com: dial tcp 35.222.207.160:443: connect: connection refused
2019/11/11 07:24:08 auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.jliu-bug.qe.gcp.devcluster.openshift.com/oauth/token failed: Head https://oauth-openshift.apps.jliu-bug.qe.gcp.devcluster.openshift.com: dial tcp 35.222.207.160:443: connect: connection refused
*** Bug 1771732 has been marked as a duplicate of this bug. ***
Assigning to Aniket, who has been looking at this. Alexander, can you help him get to the bottom of this?

Summary: node iptables rules should also cover the public IP endpoints of load balancers (they act like another ClusterIP). However, there seem to be bugs where this occasionally doesn't happen on a single node (or so). So it looks like that node is missing the update where the service's load balancer IPs are set.
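For reference, roughly what kube-proxy in iptables mode programs for this service (a hedged sketch; the chain-name hashes are placeholders, the "..." elides match arguments, and the exact layout varies by kube-proxy version):

-A KUBE-SERVICES -d <LB ingress IP>/32 -p tcp --dport 443 ... -j KUBE-FW-<hash>    (entry point, matches the LB IP)
-A KUBE-FW-<hash> ... -j KUBE-XLB-<hash>                                           (Local traffic policy)
-A KUBE-XLB-<hash> -s <cluster CIDR> ... -j KUBE-SVC-<hash>                        (pod traffic takes the cluster-internal path)
-A KUBE-SVC-<hash> ... -j KUBE-SEP-<hash>                                          (DNAT to a router endpoint)

When the first rule is missing on a node, traffic from that node to the LB IP escapes to the real cloud load balancer instead, which appears to be where the connection refused comes from.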
The question: is this a weird informer issue, or is there somehow a bug in the code that can miss the no-load-balancer -> load-balancer transition?
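To make the second half of that question concrete, here's a small sketch (a hypothetical helper, not the actual SDN code) of the transition an update handler would need to catch:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// gainedLoadBalancerIP reports whether an update represents the
// "no load balancer -> load balancer" transition, given the old/new
// objects an informer UpdateFunc would receive.
func gainedLoadBalancerIP(oldSvc, newSvc *corev1.Service) bool {
	return len(oldSvc.Status.LoadBalancer.Ingress) == 0 &&
		len(newSvc.Status.LoadBalancer.Ingress) > 0
}

func main() {
	oldSvc := &corev1.Service{}
	newSvc := &corev1.Service{
		Status: corev1.ServiceStatus{
			LoadBalancer: corev1.LoadBalancerStatus{
				Ingress: []corev1.LoadBalancerIngress{{IP: "35.185.124.124"}},
			},
		},
	}
	fmt.Println(gainedLoadBalancerIP(oldSvc, newSvc)) // true
}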
Another failure due to this:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/358/pull-ci-openshift-cluster-network-operator-master-e2e-gcp/153

The IP is missing from https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/358/pull-ci-openshift-cluster-network-operator-master-e2e-gcp/153/artifacts/e2e-gcp/network/iptables-save-sdn-2j4lw

{
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {
        "creationTimestamp": "2019-11-12T18:20:38Z",
        "finalizers": [
            "ingress.openshift.io/operator",
            "service.kubernetes.io/load-balancer-cleanup"
        ],
        "labels": {
            "app": "router",
            "ingresscontroller.operator.openshift.io/owning-ingresscontroller": "default",
            "router": "router-default"
        },
        "name": "router-default",
        "namespace": "openshift-ingress",
        "ownerReferences": [
            {
                "apiVersion": "apps/v1",
                "controller": true,
                "kind": "Deployment",
                "name": "router-default",
                "uid": "fa7f5a3e-00c6-4fdc-b3bc-ea2c2d92f7b4"
            }
        ],
        "resourceVersion": "10525",
        "selfLink": "/api/v1/namespaces/openshift-ingress/services/router-default",
        "uid": "3c79dfae-31bc-44e4-a1b2-32cb0fc2cb09"
    },
    "spec": {
        "clusterIP": "172.30.111.81",
        "externalTrafficPolicy": "Local",
        "healthCheckNodePort": 31437,
        "ports": [
            {
                "name": "http",
                "nodePort": 31504,
                "port": 80,
                "protocol": "TCP",
                "targetPort": "http"
            },
            {
                "name": "https",
                "nodePort": 31431,
                "port": 443,
                "protocol": "TCP",
                "targetPort": "https"
            }
        ],
        "selector": {
            "ingresscontroller.operator.openshift.io/deployment-ingresscontroller": "default"
        },
        "sessionAffinity": "None",
        "type": "LoadBalancer"
    },
    "status": {
        "loadBalancer": {
            "ingress": [
                {
                    "ip": "35.231.203.32"
                }
            ]
        }
    }
}
*** Bug 1762139 has been marked as a duplicate of this bug. ***
*** Bug 1772188 has been marked as a duplicate of this bug. ***
Got a test failure for this:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24147/pull-ci-openshift-origin-master-e2e-gcp-builds/508

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/24147/pull-ci-openshift-origin-master-e2e-gcp-builds/508/artifacts/e2e-gcp-builds/network/iptables-save-sdn-wqzvl is missing the IP.

SDN logs are in https://storage.googleapis.com/origin-ci-test/pr-logs/pull/24147/pull-ci-openshift-origin-master-e2e-gcp-builds/508/artifacts/e2e-gcp-builds/pods/openshift-sdn_sdn-wqzvl_sdn.log

With the added informer logging, we do see the correct event:

I1114 11:22:23.445081 4033 informers.go:145] UPDATE &Service{ObjectMeta:{router-default openshift-ingress /api/v1/namespaces/openshift-ingress/services/router-default d2a33c04-e25f-443c-b102-7b9987229f21 9659 0 2019-11-14 11:21:34 +0000 UTC <nil> <nil> map[app:router ingresscontroller.operator.openshift.io/owning-ingresscontroller:default router:router-default] map[] [{apps/v1 Deployment router-default bd88075a-4793-4f94-a8bd-138783e7d5d9 0xc000dd0957 <nil>}] [ingress.openshift.io/operator service.kubernetes.io/load-balancer-cleanup] []},Spec:ServiceSpec{Ports:[]ServicePort{ServicePort{Name:http,Protocol:TCP,Port:80,TargetPort:{1 0 http},NodePort:30902,},ServicePort{Name:https,Protocol:TCP,Port:443,TargetPort:{1 0 https},NodePort:32757,},},Selector:map[string]string{ingresscontroller.operator.openshift.io/deployment-ingresscontroller: default,},ClusterIP:172.30.211.249,Type:LoadBalancer,ExternalIPs:[],SessionAffinity:None,LoadBalancerIP:,LoadBalancerSourceRanges:[],ExternalName:,ExternalTrafficPolicy:Local,HealthCheckNodePort:30071,PublishNotReadyAddresses:false,SessionAffinityConfig:nil,IPFamily:nil,},Status:ServiceStatus{LoadBalancer:LoadBalancerStatus{Ingress:[]LoadBalancerIngress{},},},} -> &Service{ObjectMeta:{router-default openshift-ingress /api/v1/namespaces/openshift-ingress/services/router-default d2a33c04-e25f-443c-b102-7b9987229f21 10697 0 2019-11-14 11:21:34 +0000 UTC <nil> <nil> map[app:router ingresscontroller.operator.openshift.io/owning-ingresscontroller:default router:router-default] map[] [{apps/v1 Deployment router-default bd88075a-4793-4f94-a8bd-138783e7d5d9 0xc000eb56f7 <nil>}] [ingress.openshift.io/operator service.kubernetes.io/load-balancer-cleanup] []},Spec:ServiceSpec{Ports:[]ServicePort{ServicePort{Name:http,Protocol:TCP,Port:80,TargetPort:{1 0 http},NodePort:30902,},ServicePort{Name:https,Protocol:TCP,Port:443,TargetPort:{1 0 https},NodePort:32757,},},Selector:map[string]string{ingresscontroller.operator.openshift.io/deployment-ingresscontroller: default,},ClusterIP:172.30.211.249,Type:LoadBalancer,ExternalIPs:[],SessionAffinity:None,LoadBalancerIP:,LoadBalancerSourceRanges:[],ExternalName:,ExternalTrafficPolicy:Local,HealthCheckNodePort:30071,PublishNotReadyAddresses:false,SessionAffinityConfig:nil,IPFamily:nil,},Status:ServiceStatus{LoadBalancer:LoadBalancerStatus{Ingress:[]LoadBalancerIngress{LoadBalancerIngress{IP:35.185.124.124,Hostname:,},},},},}

Time to see why this isn't being handled properly.
Assigning back to Casey.
Here's another recent failure now that we have more logging: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_console/3390/pull-ci-openshift-console-master-e2e-gcp-console/3149
Filed https://github.com/openshift/sdn/pull/71 to get more logging.
Update: check out https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_sdn/71/pull-ci-openshift-sdn-master-e2e-gcp/11/artifacts/e2e-gcp/pods/openshift-sdn_sdn-mddrk_sdn.log

At 19:58:39.348317 the debug informer handler gets the update that adds the load balancer IP, but the proxy's service Update method never does. So either there is a bug in informers (very unlikely) or the hybrid proxier is broken. Let's test this theory.
With extra logging added (but no logic changes), CI passed twice. Either I've gotten unlucky, or this is a very sensitive heisenbug.
I think we hit this bug here: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.3/415
*** Bug 1751916 has been marked as a duplicate of this bug. ***
seeing it at https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-shared-vpc-4.3/161 as well.
This seems similar as well: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1285/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/550
One-off "seen here"'s are probably not all that useful. It's less anecdotal to post flake rates like: 35 of these in the past 24h, showing up in 2% of our failed e2e jobs [1]. [1]: https://search.svc.ci.openshift.org/chart?search=RouteHealthDegraded:%20failed%20to%20GET%20route.*connection%20refused
I've filed bug 1779429 about the "TLS handshake timeout" form of RouteHealthDegraded. The job I cite there also saw an EOF form. Dunno if they have the same underlying cause or not.
*** Bug 1779710 has been marked as a duplicate of this bug. ***
(In reply to W. Trevor King from comment #38)
> One-off "seen here"'s are probably not all that useful. It's less anecdotal
> to post flake rates like: 35 of these in the past 24h, showing up in 2% of
> our failed e2e jobs [1].
>
> [1]: https://search.svc.ci.openshift.org/chart?search=RouteHealthDegraded:%20failed%20to%20GET%20route.*connection%20refused

It's actually much more frequent than this because the same underlying problem can manifest with other errors, as in bug 1779710. We're seeing this extremely frequently in console CI.

https://search.svc.ci.openshift.org/chart?search=Cluster%20operator%20console%20Available%20is%20False%20with%20DeploymentAvailableFailedUpdate
Got another reproduction:
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_sdn/74/pull-ci-openshift-sdn-master-e2e-gcp/46/artifacts/e2e-gcp/pods/openshift-sdn_sdn-8w8qt_sdn.log

Extremely disconcerting. Added logging to the proxy handler as well as a separate debug handler. Here are the timelines of updates to router-default, as well as the Old and New ResourceVersion:

proxy:
15:39:55.117801  0 -> 12628
15:39:55.143645  12628 -> 12636
15:41:48.870121  13818 -> 13818  BUG!

informer:
15:39:55.118055  0 -> 12628
15:39:55.144002  12628 -> 12636
15:40:09.501553  12636 -> 12636
15:40:39.502596  12636 -> 12636
15:40:48.880045  12636 -> 13818  (IP added here)
15:41:18.874704  13818 -> 13818

So, the missed update was around 15:40:48.880045. Something of interest: just before this point in time the apiserver must have restarted (maybe transition off of bootstrap):

E1205 15:40:47.831537 4830 reflector.go:280] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Service: unknown (get services)
Another even crazier one: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_sdn/74/pull-ci-openshift-sdn-master-e2e-gcp/48/artifacts/e2e-gcp/pods/openshift-sdn_sdn-fx282_sdn.log Our debug informer gets the ADD, but the proxy does not. Just to test a thesis, I'm going to disable the hybrid proxy and try again.
Okay, I have a better idea of what is going on, and it doesn't look like the hybrid proxy is (directly) at fault.

Let's look at https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_sdn/74/pull-ci-openshift-sdn-master-e2e-gcp/54/artifacts/e2e-gcp/pods/openshift-sdn_sdn-hm9lx_sdn.log

There is a single SharedInformer, with two handlers: a simple debugging one, and the proxy itself. At this point in time, the logging calls I've added are all in upstream code, so it's not the hybrid proxier at work here.

I see a period of time where the debugging handler gets more calls than the proxy handler. This always seems to follow apiserver disruption (i.e. bootstrap or new config rollout). For example:

07:52:53.707847 <- last call to the proxy handler
(api server disruption, reconnection)
(lots of calls to the debug informer UPDATE(service))
07:53:39.556409 <- next call
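For anyone trying to follow the setup: the two handlers hang off one SharedInformer roughly like this (an illustrative sketch under that assumption, not the actual SDN wiring; each handler gets its own internal notification queue inside the shared informer, so in theory both should see the same event stream):

package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	svcInformer := factory.Core().V1().Services().Informer()

	// logUpdate builds an UpdateFunc that logs who saw which ResourceVersion
	// transition, mirroring the debug-vs-proxy timelines in the comments above.
	logUpdate := func(who string) func(oldObj, newObj interface{}) {
		return func(oldObj, newObj interface{}) {
			oldSvc, newSvc := oldObj.(*corev1.Service), newObj.(*corev1.Service)
			fmt.Printf("%s %s UPDATE %s/%s rv %s -> %s\n",
				time.Now().Format("15:04:05.000000"), who,
				newSvc.Namespace, newSvc.Name,
				oldSvc.ResourceVersion, newSvc.ResourceVersion)
		}
	}

	// Handler 1: the "debug" handler.
	svcInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{UpdateFunc: logUpdate("debug")})
	// Handler 2: stands in for the proxy handler.
	svcInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{UpdateFunc: logUpdate("proxy")})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	cache.WaitForCacheSync(stop, svcInformer.HasSynced)
	select {} // run until killed
}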
Changelog for the buggy service:

debug:
07:52:19.951681  0 -> 15303
07:52:20.013841  15303 -> 15313
07:52:36.589822  15313 -> 15313
07:53:06.591883  15313 -> 15313
07:53:28.806916  15313 -> 17133
07:53:58.684469  17133 -> 17133

proxy:
07:52:19.951681  0 -> 15303
07:52:20.013241  15303 -> 15313
07:54:28.684664  17133 -> 17133
We think we've figured this out: https://github.com/openshift/sdn/pull/74/commits/5cf82fdf11fbb3fea3a18d4245a55838ebff5b65 TL;DR: shared informer bug. Yikes.
Filed https://github.com/kubernetes/kubernetes/pull/86015 to fix the client-go bug.
*** Bug 1781727 has been marked as a duplicate of this bug. ***
Fix merged in https://github.com/openshift/openshift-sdn/pull/79. Starting the backport dance.
meant https://github.com/openshift/sdn/pull/79
Deployment succeeded on GCP with 4.4.0-0.nightly-2019-12-13-170401

NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.4.0-0.nightly-2019-12-13-170401   True        False         False      19m
*** Bug 1752358 has been marked as a duplicate of this bug. ***
Hello there, we are running OpenShift 4.2.20 on Azure cloud and the "Authentication" operator is in a degraded state, resulting in login problems for the applications we are running on the platform. Is this something that is going to be fixed in a future fix pack?

--
RouteHealthDegraded: failed to GET route: dial tcp 10.124.161.68:443: connect: no route to host
WellKnownEndpointDegraded: failed to GET well-known https://10.124.161.6:6443/.well-known/oauth-authorization-server: dial tcp 10.124.161.6:6443: i/o timeout
Reason: MultipleConditionsMatching
--
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581