Bug 1845545
| Summary: | Communication with services sometimes leads to a refused connection | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ben Eli <belimele> |
| Component: | Networking | Assignee: | Andrew McDermott <amcdermo> |
| Networking sub component: | router | QA Contact: | Hongan Li <hongli> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | aos-bugs, bbennett, belimele, ebenahar, jesusr, mmasters, nbecker, ocs-bugs, omitrani |
| Version: | 4.4 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | x86_64 | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-06-16 06:38:13 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1842252 | | |
Description
Ben Eli
2020-06-09 13:37:11 UTC
Ben Bennett (comment 2):

Are you hitting a route?

And how is your external DNS set up... we don't own the name resolution of that name, and it is resolving to a bad address.

Ohad (comment 3):

Hi Ben, maybe I can be of some help.

The issue we encountered is that the OpenShift route name resolves to a node that does not appear to host the component that handles OpenShift routes (so we get connection refused). It does not seem to be a name resolution issue, because the name resolves to an IP properly and I can trace that IP to one of the worker nodes in the cluster.

We have 3 workers in the cluster. Whenever we hit either of the other two, everything works just fine. I am not sure which component in OpenShift services requests for OpenShift routes, but it seems that this worker either does not run that component or that the component cannot accept requests.

Let me clarify another thing: the problem is not with the service choosing a proper pod. I can tell because the communication does not reach the target service at all (the one the route points at).

Is it possible that the component responsible for servicing OpenShift routes is not installed correctly on this specific worker? If so, how can we verify all of this?

Yaniv Kaul (comment 4):

(In reply to Ben Bennett from comment #2)
> Are you hitting a route?
>
> And how is your external DNS set up... we don't own the name resolution of
> that name, and it is resolving to a bad address.

It's unclear to me why this BZ was moved to 4.6. Can you add some clarification?

Miciah:

(In reply to Ohad from comment #3)
> [...]
> Is it possible that the component responsible for servicing OpenShift
> routes is not installed correctly on this specific worker?
> If so, how can we verify all of this?

The component that handles these connections is the ingress controller (also called the "router"). By default, the ingress controller has 2 pod replicas, running on 2 worker nodes. If you run `oc -n openshift-ingress-operator get ingresscontrollers/default -o yaml`, you should see "replicas: 2" in the object's spec. This is why 2 worker nodes accept connections and 1 worker node does not. I believe `oc -n openshift-ingress get pods -o wide` will list the pods along with the nodes on which they are running.
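A minimal sketch of those two checks, using the commands named above; the namespaces and object name are the ingress operator defaults, the jsonpath form is just a shortcut for reading the replica count out of the YAML, and the output will vary by cluster:

```sh
# Show the configured replica count for the default ingress controller
oc -n openshift-ingress-operator get ingresscontrollers/default -o jsonpath='{.spec.replicas}{"\n"}'

# List the router pods together with the nodes they are scheduled on;
# only the nodes shown here accept connections for routes
oc -n openshift-ingress get pods -o wide
```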
Normally a cluster has a load balancer in front of the ingress controller. On vSphere, you need to configure the load balancer as documented here: https://docs.openshift.com/container-platform/4.4/installing/installing_vsphere/installing-vsphere.html#network-topology-requirements

It looks like your *.apps.belimele-dc26.qe.rh-ocs.com DNS record is configured to resolve to the worker nodes, but the DNS record should point to the load balancer, as documented here: https://docs.openshift.com/container-platform/4.4/installing/installing_vsphere/installing-vsphere.html#installation-dns-user-infra_installing-vsphere

The load balancer should perform health probes so that it forwards only to the nodes that are actually running the ingress controller. (I do not have experience with vSphere, so I do not know the detailed configuration steps, but I hope this is the default behavior or straightforward to configure.)

Does that resolve the issue?

(In reply to Yaniv Kaul from comment #4)
> [...]
> It's unclear to me why this BZ was moved to 4.6. Can you add some
> clarification?

The issue is reported against 4.4. In general, if we need to fix something in the product, we need to fix it in the upcoming release (4.6) first, and then, if it also needs to be fixed in earlier releases, we can file additional Bugzilla reports and backport the fix accordingly.

Thank you so much, Miciah. With your great explanation, I was able to find out that it was indeed a misconfiguration on our side. I checked the DNS records, and just as you said, in our vSphere environment they point directly to all worker nodes (instead of to an LB with a prior health check), while in our AWS environment we do have an LB with health probes, which rightfully reports only 2 of the workers as healthy.

Closing as NOTABUG.
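For reference, a quick way to spot this kind of misconfiguration from a client machine is to check what the wildcard apps record resolves to and then probe a route through each resolved address. This is only a sketch; the `test` and `myroute` hostnames and the `192.0.2.10` address are illustrative placeholders, not values taken from this cluster:

```sh
# See which addresses the wildcard apps record resolves to
# (it should point at the load balancer, not at individual worker nodes)
dig +short test.apps.belimele-dc26.qe.rh-ocs.com

# Probe a route through one specific address without changing DNS;
# replace the hostname and 192.0.2.10 with real values for the cluster
curl -kI --resolve myroute.apps.belimele-dc26.qe.rh-ocs.com:443:192.0.2.10 \
  https://myroute.apps.belimele-dc26.qe.rh-ocs.com
```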