Bug 1845545
| Summary: | Communication with services sometimes leads to a refused connection | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ben Eli <belimele> |
| Component: | Networking | Assignee: | Andrew McDermott <amcdermo> |
| Networking sub component: | router | QA Contact: | Hongan Li <hongli> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | aos-bugs, bbennett, belimele, ebenahar, jesusr, mmasters, nbecker, ocs-bugs, omitrani |
| Version: | 4.4 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | x86_64 | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-06-16 06:38:13 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1842252 | | |
Description
Ben Eli
2020-06-09 13:37:11 UTC
Ben Bennett (comment 2):

Are you hitting a route?

And how is your external DNS set up... we don't own the name resolution of that name, and it is resolving to a bad address.

Ohad (comment 3):

Hi Ben, maybe I can be of some help.

The issue we encountered is that the OpenShift route name resolves to a node that does not appear to host the component that handles OpenShift routes (so we get connection refused). It does not seem to be a name resolution issue, because the name resolves to an IP properly and I can trace that IP to one of the worker nodes in the cluster.

We have 3 workers in the cluster. Whenever we hit either of the other two, everything works just fine. I am not sure which component in OpenShift services requests for OpenShift routes, but it seems that this worker either does not run that component or that the component cannot accept requests.

Let me clarify another thing: the problem is not with the service choosing a proper pod. I can tell because the communication does not reach the target service at all (the one the route points at).

Is it possible that the component responsible for servicing OpenShift routes is not installed correctly on this specific worker? If so, how can we verify all of this?

Yaniv Kaul (comment 4):

(In reply to Ben Bennett from comment #2)
> Are you hitting a route?
>
> And how is your external DNS set up... we don't own the name resolution of
> that name, and it is resolving to a bad address.

It's unclear to me why this BZ was moved to 4.6. Can you add some clarification?

Miciah:

(In reply to Ohad from comment #3)
> [...]
> Is it possible that the component responsible for servicing OpenShift
> routes is not installed correctly on this specific worker?
> If so, how can we verify all of this?

The component that handles these connections is the ingress controller (also called the "router"). By default, the ingress controller has 2 pod replicas, running on 2 worker nodes. If you run `oc -n openshift-ingress-operator get ingresscontrollers/default -o yaml`, you should see "replicas: 2" in the object's spec. This is why 2 worker nodes accept connections and 1 worker node does not. I believe `oc -n openshift-ingress get pods -o wide` will list the pods along with the nodes on which they are running.
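A minimal sketch of those two checks, using the commands named above; the namespaces and object name are the ingress operator defaults, the jsonpath form is just a shortcut for reading the replica count out of the YAML, and the output will vary by cluster:

```sh
# Show the configured replica count for the default ingress controller
oc -n openshift-ingress-operator get ingresscontrollers/default -o jsonpath='{.spec.replicas}{"\n"}'

# List the router pods together with the nodes they are scheduled on;
# only the nodes shown here accept connections for routes
oc -n openshift-ingress get pods -o wide
```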
Normally a cluster has a load balancer in front of the ingress controller. On vSphere, you need to configure the load balancer as documented here: https://docs.openshift.com/container-platform/4.4/installing/installing_vsphere/installing-vsphere.html#network-topology-requirements

It looks like your *.apps.belimele-dc26.qe.rh-ocs.com DNS record is configured to resolve to the worker nodes, but the DNS record should point to the load balancer, as documented here: https://docs.openshift.com/container-platform/4.4/installing/installing_vsphere/installing-vsphere.html#installation-dns-user-infra_installing-vsphere

The load balancer should perform health probes so that it forwards only to the nodes that are actually running the ingress controller. (I do not have experience with vSphere, so I do not know the detailed configuration steps, but I hope this is the default behavior or straightforward to configure.)

Does that resolve the issue?

(In reply to Yaniv Kaul from comment #4)
> [...]
> It's unclear to me why this BZ was moved to 4.6. Can you add some
> clarification?

The issue is reported against 4.4. In general, if we need to fix something in the product, we need to fix it in the upcoming release (4.6) first, and then, if it also needs to be fixed in earlier releases, we can file additional Bugzilla reports and backport the fix accordingly.

Thank you so much, Miciah. With your great explanation, I was able to find out that it was indeed a misconfiguration on our side. I checked the DNS records, and just as you said, in our vSphere environment they point directly to all worker nodes (instead of to an LB with a prior health check), while in our AWS environment we do have an LB with health probes, which rightfully reports only 2 of the workers as healthy.

Closing as NOTABUG.
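For reference, a quick way to spot this kind of misconfiguration from a client machine is to check what the wildcard apps record resolves to and then probe a route through each resolved address. This is only a sketch; the `test` and `myroute` hostnames and the `192.0.2.10` address are illustrative placeholders, not values taken from this cluster:

```sh
# See which addresses the wildcard apps record resolves to
# (it should point at the load balancer, not at individual worker nodes)
dig +short test.apps.belimele-dc26.qe.rh-ocs.com

# Probe a route through one specific address without changing DNS;
# replace the hostname and 192.0.2.10 with real values for the cluster
curl -kI --resolve myroute.apps.belimele-dc26.qe.rh-ocs.com:443:192.0.2.10 \
  https://myroute.apps.belimele-dc26.qe.rh-ocs.com
```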