Bug 1936556 - vSphere CI tcp no route to host in machine-controller
Summary: vSphere CI tcp no route to host in machine-controller
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: jamo luhrsen
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-08 18:04 UTC by Michael Gugino
Modified: 2023-09-15 01:03 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-04 17:18:06 UTC
Target Upstream Version:
Embargoed:


Attachments
no route to host packet capture (305.46 KB, application/vnd.tcpdump.pcap), 2021-04-07 21:29 UTC, jamo luhrsen
ssl error packet capture (567.68 KB, application/vnd.tcpdump.pcap), 2021-04-07 21:30 UTC, jamo luhrsen

Description Michael Gugino 2021-03-08 18:04:09 UTC
Description of problem:

The machine-controller pod needs to communicate with vSphere (an IP external to the cluster itself). It periodically has errors like:

"ci-op-xk77zmsb-98554-gjfbl-worker-gzmmg error: ci-op-xk77zmsb-98554-gjfbl-worker-gzmmg: reconciler failed to Update machine: failed to reconcile tags: Post "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/session": dial tcp 44.236.21.251:443: connect: no route to host"



How reproducible:
Unknown


Expected results:
Connection issues other than "No route to host" would be expected if vSphere were overloaded. For the machine-controller pod's routes to fail implies that something is wrong with the host the pod is running on or with the pod networking.
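
One illustrative way to check the node-side routing (the node name is a placeholder) is to ask the kernel how it would route traffic to the vCenter IP from a debug pod on the node running the machine-controller:

  # on a healthy node this should print a route via the default gateway,
  # e.g. out br-ex (OVN) or the uplink NIC (openshift-sdn)
  $ oc debug node/<node-running-machine-controller> -- chroot /host ip route get 44.236.21.251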


Actual results:
Passing CI run with "No route" problem: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/332/pull-ci-openshift-windows-machine-config-operator-release-4.7-vsphere-e2e-operator/1368759166595764224

Associated logs: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/332/pull-ci-openshift-windows-machine-config-operator-release-4.7-vsphere-e2e-operator/1368759166595764224/artifacts/vsphere-e2e-operator/gather-extra/artifacts/pods/openshift-machine-api_machine-api-controllers-84bc59b6dd-wlnzl_machine-controller.log


A failing CI run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/308/pull-ci-openshift-windows-machine-config-operator-release-4.7-vsphere-e2e-operator/1367635600877817856

Logs: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/308/pull-ci-openshift-windows-machine-config-operator-release-4.7-vsphere-e2e-operator/1367635600877817856/artifacts/vsphere-e2e-operator/gather-extra/artifacts/pods/openshift-machine-api_machine-api-controllers-6df5f9998c-ntw7t_machine-controller.log


I have not seen this problem in other CI jobs.

Comment 2 jamo luhrsen 2021-03-11 18:31:06 UTC
I am working to get a nightly job running for vsphere+ovn. PR here: https://github.com/openshift/release/pull/16696
Once that is in, I'll see if I can figure anything out w/ regards to this bug.

Comment 4 Aravindh Puthiyaparambil 2021-03-19 16:15:46 UTC
This issue has increased the time for PR merges in https://github.com/openshift/windows-machine-config-operator and is slowing down development. Hence I am raising the priority and severity of this bug.

Comment 5 jamo luhrsen 2021-03-19 21:28:05 UTC
Just starting to look at this now, since I saw the priority was raised.

What I noticed right away is that 7 out of the last 60 jobs are failing. These are all presubmit jobs, so they run on unmerged and possibly unreviewed PRs, and some failures could be because of that. If you drill back in the job history, it doesn't look like this job was ever really healthy, although it was passing at a slightly higher rate if you look back into Feb 2021 or older.

The periodic job is now running and I see that the 4.8 job passed its first try:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere-ovn/1372970806522417152

But that's a different CI deployment I guess. it's hybrid-vxlan for these windows presubmit jobs.

I also notice the main symptom in the description: "reconciler failed to Update machine: failed to reconcile tags" is showing up in both passing and
failing jobs, so not sure if that is a place to focus or not. Looking at the gather-extra logs given in the comments above.

I pushed an empty PR on top of master to see the CI results without any changes involved, just to gather data:
  https://github.com/openshift/windows-machine-config-operator/pull/354

That's as far as I got today. I will continue to dig into this next week.

but if this job is not healthy and mostly fails and all the PR devs are doing is /retest until they see it passing, I'd suggest removing the
job for now until it gets healthy enough to be useful.

Comment 6 Aravindh Puthiyaparambil 2021-03-19 21:37:43 UTC
@jluhrsen thanks for looking into this.

> But that's a different CI deployment I guess. it's hybrid-vxlan for these
> windows presubmit jobs.

Correct. For Windows nodes, the cluster has to use hybrid-vxlan. And for vSphere we have to use a custom VXLAN port.
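
For context, the hybrid overlay and its custom VXLAN port live in the cluster network operator config. A rough sketch of the kind of setting involved (the CIDR and port values here are illustrative, not taken from this CI job):

  $ oc patch networks.operator.openshift.io cluster --type=merge -p \
      '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"hybridOverlayConfig":{
         "hybridClusterNetwork":[{"cidr":"10.132.0.0/14","hostPrefix":23}],
         "hybridOverlayVXLANPort":9898}}}}}'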

> 
> I also notice the main symptom in the description: "reconciler failed to
> Update machine: failed to reconcile tags" is showing up in both passing and
> failing jobs, so not sure if that is a place to focus or not. Looking at the
> gather-extra logs given in the comments above.

That is the symptom.
 
> but if this job is not healthy and mostly fails and all the PR devs are
> doing is /retest until they see it passing, I'd suggest removing the
> job for now until it gets healthy enough to be useful.

I am afraid that is not an option. vSphere is the number 1 platform for Windows and we cannot risk a regression in that area.

If you need help bringing up a vSphere cluster or debugging the platform itself, please post on #vsphere-ci-triage.

Comment 7 Michael Gugino 2021-03-19 22:34:53 UTC
While the machine-api does seem to have some connection errors from time to time without any problems on other jobs, I'm only seeing "dial tcp 44.236.21.251:443: connect: no route to host" on this job, and I'm seeing it frequently.

There's no reason for the machine-api pod to ever receive a no route to host error.

Comment 8 Michael Gugino 2021-03-19 23:57:33 UTC
Looks like this might be caused by: https://bugzilla.redhat.com/show_bug.cgi?id=1935539

Comment 9 Michael Gugino 2021-03-20 00:25:09 UTC
Just to add some more evidence of failure, also seeing this on https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/454/pull-ci-openshift-ovn-kubernetes-master-e2e-vsphere-ovn/1372968332101160960

which is a different job.

Comment 10 Dan Winship 2021-03-22 13:04:34 UTC
(In reply to Michael Gugino from comment #8)
> Looks like this might be caused by:
> https://bugzilla.redhat.com/show_bug.cgi?id=1935539

probably not; that bug only affects traffic going over the VXLAN/Geneve tunnel (ie traffic going to a pod on another node), while this is about traffic from a pod to a cluster-external IP (which wouldn't have to be tunneled).

Comment 11 jamo luhrsen 2021-03-22 19:43:43 UTC
(In reply to Aravindh Puthiyaparambil from comment #6)

> > 
> > I also notice the main symptom in the description: "reconciler failed to
> > Update machine: failed to reconcile tags" is showing up in both passing and
> > failing jobs, so not sure if that is a place to focus or not. Looking at the
> > gather-extra logs given in the comments above.
> 
> That is the symptom.

The symptom is there in both a passing and a failing job, so my first guess is that it's a red herring and not where we should focus to find the problem. Maybe I'm missing something, though. I do see that the error message is more prevalent in the failing job and occurs even at the end of the log, so perhaps we do expect to see that type of message initially but not after the cluster is stable.

examples:

Passing job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/354/pull-ci-openshift-windows-machine-config-operator-master-vsphere-e2e-operator/1373018128086208512
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/354/pull-ci-openshift-windows-machine-config-operator-master-vsphere-e2e-operator/1373018128086208512/artifacts/vsphere-e2e-operator/gather-extra/artifacts/pods/openshift-machine-api_machine-api-controllers-58677b5586-2mt5k_machine-controller.log

Failing job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/341/pull-ci-openshift-windows-machine-config-operator-master-vsphere-e2e-operator/1372957497253433344
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/341/pull-ci-openshift-windows-machine-config-operator-master-vsphere-e2e-operator/1372957497253433344/artifacts/vsphere-e2e-operator/gather-extra/artifacts/pods/openshift-machine-api_machine-api-controllers-7bb577d587-zs2v4_machine-controller.log


Looking today, I noticed that only 2 of the last 20 jobs even started tests. Most were failing with something like:
"failed to fetch Master Machines: failed to load asset "Install Config": invalid "install-config.yaml" file: [platform.vsphere.apiVIP: Invalid value: "192.168.7.2": must be contained within one of the machine networks, platform.vsphere.ingressVIP: Invalid value: "192.168.7.3": must be contained within one of the machine networks]"

But that is a different problem and not related to this BZ.

I'll continue to investigate the logs and report back what I find.

Comment 12 Aravindh Puthiyaparambil 2021-03-22 20:07:05 UTC
@jluhrsen the "Install Config": invalid "install-config.yaml" file: [platform.vsphere.apiVIP: Invalid value: "192.168.7.2" error is indeed a new issue like you mentioned, and https://github.com/openshift/installer/pull/4779 has been opened to address it.

Comment 13 Joseph Callen 2021-03-24 13:57:53 UTC
I just hit this setting up a new cluster in VMC w/ovn


oc logs -f machine-api-controllers-777d558fc9-k8sdt -c machine-controller
...
I0324 13:49:48.290935       1 controller.go:168] jcallen-rbk8r-worker-jwmvz: reconciling Machine
I0324 13:49:48.290967       1 actuator.go:109] jcallen-rbk8r-worker-jwmvz: actuator checking if machine exists
E0324 13:50:18.297376       1 controller.go:271] jcallen-rbk8r-worker-jwmvz: failed to check if machine exists: jcallen-rbk8r-worker-jwmvz: failed to create scope for machine: failed to create vSphere session: error setting up new vSphere SOAP client: Post "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/sdk": dial tcp: i/o timeout
I0324 13:50:18.297554       1 controller.go:168] jcallen-rbk8r-master-0: reconciling Machine
I0324 13:50:18.297589       1 actuator.go:109] jcallen-rbk8r-master-0: actuator checking if machine exists
E0324 13:50:48.309443       1 controller.go:271] jcallen-rbk8r-master-0: failed to check if machine exists: jcallen-rbk8r-master-0: failed to create scope for machine: failed to create vSphere session: error setting up new vSphere SOAP client: Post "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/sdk": dial tcp: i/o timeout



_  /projects oc rsh machine-api-controllers-777d558fc9-k8sdt
Defaulting container name to machineset-controller.
Use 'oc describe pod/machine-api-controllers-777d558fc9-k8sdt -n openshift-machine-api' to see all of the containers in this pod.
sh-4.4$ curl https://vcenter.sddc-44-236-21-251.vmwarevmc.com/sdk
^C
sh-4.4$ curl https://vcenter.sddc-44-236-21-251.vmwarevmc.com
^C
sh-4.4$ host -t A vcenter.sddc-44-236-21-251.vmwarevmc.com                                                                                                                                                                                                                    
;; connection timed out; no servers could be reached
sh-4.4$ host -t A vcenter.sddc-44-236-21-251.vmwarevmc.com
;; connection timed out; no servers could be reached
sh-4.4$ cat /etc/resolv.conf
search openshift-machine-api.svc.cluster.local svc.cluster.local cluster.local jcallen.vmc.devcluster.openshift.com
nameserver 172.30.0.10
options ndots:5
sh-4.4$
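
A follow-up that could narrow down where resolution breaks is to point the same host lookup at specific resolvers: cluster DNS (172.30.0.10 from resolv.conf above) and the AWS VPC resolver (10.0.0.2, see comment 26). This is illustrative; whether each resolver is reachable from inside the pod depends on the network setup.

  sh-4.4$ host -t A vcenter.sddc-44-236-21-251.vmwarevmc.com 172.30.0.10
  sh-4.4$ host -t A vcenter.sddc-44-236-21-251.vmwarevmc.com 10.0.0.2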

Comment 14 Michael Gugino 2021-03-24 14:38:03 UTC
Dial timeout can be caused by a multitude of things. "No route to host" means the kernel doesn't know where to route the traffic, which is different. The underlying cause might be the same, but I want to focus on the no route to host error in this BZ if possible.
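
For anyone reproducing this, one illustrative way to tell the two cases apart on the node is to watch for inbound ICMP errors while the controller retries: unanswered SYNs surface to the application as an i/o timeout, while an ICMP error (unreachable or time-exceeded) delivered back to the socket typically surfaces as "no route to host". The interface name depends on the node (br-ex on OVN, the uplink NIC on openshift-sdn):

  $ tcpdump -ni br-ex 'icmp[icmptype] = icmp-unreach or icmp[icmptype] = icmp-timxceed'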

Comment 15 jamo luhrsen 2021-03-24 21:17:05 UTC
Just to update: I'm working to get dev access so that I can bring up my own vSphere cluster like the one used in this failing job.
I also filed a PR to gather network details in that same job:
https://github.com/openshift/release/pull/17133

I have been digging through job artifacts for the past couple of days and not coming up with any answers yet.

Comment 16 jamo luhrsen 2021-03-30 21:27:03 UTC
I finally have a dev cluster that appears to exhibit the same symptoms, so will try to debug the problem from that.

[jluhrsen@ip-10-0-5-26 ~]$ oc logs machine-api-controllers-6d7d5b7dd7-chw9m machine-controller | egrep 'no route to host' | tail -n3
E0330 20:57:18.979550       1 actuator.go:57] jluhrsen-jmdxb-master-2 error: jluhrsen-jmdxb-master-2: reconciler failed to Update machine: failed to reconcile tags: Get "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/tagging/tag": dial tcp 44.236.21.251:443: connect: no route to host
E0330 20:57:18.979592       1 controller.go:299] jluhrsen-jmdxb-master-2: error updating machine: jluhrsen-jmdxb-master-2: reconciler failed to Update machine: failed to reconcile tags: Get "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/tagging/tag": dial tcp 44.236.21.251:443: connect: no route to host, retrying in 30s seconds
E0330 21:16:34.418444       1 session.go:191] Failed to logout: Delete "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/session": dial tcp 44.236.21.251:443: connect: no route to host
[jluhrsen@ip-10-0-5-26 ~]$

Comment 17 jamo luhrsen 2021-04-01 06:06:01 UTC
(In reply to jamo luhrsen from comment #16)
> I finally have a dev cluster that appears to exhibit the same symptoms, so
> will try to debug the problem from that.
> 
> [jluhrsen@ip-10-0-5-26 ~]$ oc logs machine-api-controllers-6d7d5b7dd7-chw9m
> machine-controller | egrep 'no route to host' | tail -n3
> E0330 20:57:18.979550       1 actuator.go:57] jluhrsen-jmdxb-master-2 error:
> jluhrsen-jmdxb-master-2: reconciler failed to Update machine: failed to
> reconcile tags: Get
> "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/
> tagging/tag": dial tcp 44.236.21.251:443: connect: no route to host
> E0330 20:57:18.979592       1 controller.go:299] jluhrsen-jmdxb-master-2:
> error updating machine: jluhrsen-jmdxb-master-2: reconciler failed to Update
> machine: failed to reconcile tags: Get
> "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/
> tagging/tag": dial tcp 44.236.21.251:443: connect: no route to host,
> retrying in 30s seconds
> E0330 21:16:34.418444       1 session.go:191] Failed to logout: Delete
> "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/
> session": dial tcp 44.236.21.251:443: connect: no route to host
> [jluhrsen@ip-10-0-5-26 ~]$


while debugging, I noticed that I was not able to hit the vcenter endpoint (curl and/or wget) from the machine-controller container, but I was also seeing similar symptoms from
the node itself as well as the bastion system. This made me think there is something getting in the way outside of the cluster. I ended up re-deploying the cluster
and accidentally using openshift-sdn and access to that vcenter system was working from all places (machine-controller container, host node, and bastion system).

Is there any networking config happening outside of the cluster when deploying this setup that could be getting in the way when it's OVN + hybrid vxlan?

The traffic to hit that endpoint should be leaving the node on br-ex and I'm not sure what would happen beyond that to block it, but that's what I saw.

I was attempting to bring back up an OVN cluster to check again if we lose access to the vcenter, but running in to some infra issues I think. Errors like this:

E0401 05:59:16.526705 1789814 reflector.go:138] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://api.jluhrsen.vmc.devcluster.openshift.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&resourceVersion=206": dial tcp 172.31.250.83:6443: i/o timeout
I0401 06:00:13.035518 1789814 trace.go:205] Trace[1270937323]: "Reflector ListAndWatch" name:k8s.io/client-go/tools/watch/informerwatcher.go:146 (01-Apr-2021 05:59:43.034) (total time: 30000ms):
Trace[1270937323]: [30.000576435s] [30.000576435s] END

and I saw some slack conversations around the same.

I'd like to figure out why these deployments stopped working so I can check OVN vs openshift-sdn again to see if this is happening consistently.

Comment 18 jamo luhrsen 2021-04-07 21:28:52 UTC
more clues to what the actual problem could be:

The 'no route to host' is happening in both an OVNKubernetes and OpenShiftSDN deployment. I can reproduce the symptom directly
on a node itself so it's not anything specific to the plumbing to a pod. Also, intermittently we will see the i/o timeout.

When the timeout occurs, it's because a new connection attempt keeps sending SYNs to open a connection, never hears back, and retries until it gives up.

When the 'no route to host' is seen, it's because we are getting an ICMP time-exceeded from some intermediate system in AWS, like this:

20:56:03.398196 IP ec2-44-236-21-251.us-west-2.compute.amazonaws.com > ip-172-31-251-175.us-west-2.compute.internal: ICMP time exceeded in-transit, length 36

The test I'm running is to capture traffic to and from the vCenter host IP on the egress interface (ens192 for sdn, br-ex for ovn) with tcpdump, and to run curl in a loop against the vCenter endpoint, exiting on a non-zero exit code:

  $ tcpdump -ni ens192 host 44.236.21.251 -w dead_vcenter3.pcap &
  $ while curl  https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/session; do sleep 1; done

This seems like it would be some problem external to the deployment and possibly inside AWS? I don't know how to debug any further than
this.

Also, the curl test will sometimes run for many minutes (5+)... maybe longer. Sometimes the 'no route to host' comes back-to-back; other times, you see it once and not again for another iteration. I don't see a pattern yet, and this follows what we are seeing in our CI jobs (some pass, some fail, some have more of these logs, some less).
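
If it helps with spotting a pattern, a variant of the same loop that timestamps each failure (purely illustrative) makes it easier to line failures up against the pcap:

  $ while :; do
      if ! curl -sS -o /dev/null --max-time 60 \
          https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/session; then
        echo "$(date -u '+%F %T') curl failed"   # timestamp to correlate with the capture
      fi
      sleep 1
    done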



As a final note, once I even saw an SSL error from curl; in the capture it shows up as a RST in the middle of an open connection:
21:00:30.880338 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [S], seq 4162534308, win 29200, options [mss 1460,sackOK,TS val 1691328097 ecr 0,nop,wscale 7], length 0
21:00:31.940203 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [S], seq 4162534308, win 29200, options [mss 1460,sackOK,TS val 1691329157 ecr 0,nop,wscale 7], length 0
21:00:33.988199 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [S], seq 4162534308, win 29200, options [mss 1460,sackOK,TS val 1691331205 ecr 0,nop,wscale 7], length 0
21:00:38.020221 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [S], seq 4162534308, win 29200, options [mss 1460,sackOK,TS val 1691335237 ecr 0,nop,wscale 7], length 0
21:00:42.083858 IP ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https > ip-172-31-251-175.us-west-2.compute.internal.41436: Flags [S.], seq 3814849712, ack 4162534309, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 8], length 0
21:00:42.083910 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [.], ack 1, win 229, length 0
21:00:42.092308 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [P.], seq 1:518, ack 1, win 229, length 517
21:00:42.340207 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [P.], seq 1:518, ack 1, win 229, length 517
21:00:42.588198 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [P.], seq 1:518, ack 1, win 229, length 517
21:00:43.076218 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [P.], seq 1:518, ack 1, win 229, length 517
21:00:44.100236 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [P.], seq 1:518, ack 1, win 229, length 517
21:00:46.084194 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [P.], seq 1:518, ack 1, win 229, length 517
21:00:49.988185 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [P.], seq 1:518, ack 1, win 229, length 517
21:00:57.860213 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [P.], seq 1:518, ack 1, win 229, length 517
21:00:57.861294 IP ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https > ip-172-31-251-175.us-west-2.compute.internal.41436: Flags [.], ack 518, win 119, options [nop,nop,sack 1 {1:518}], length 0
21:01:58.276193 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [.], ack 1, win 229, length 0
21:01:58.277408 IP ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https > ip-172-31-251-175.us-west-2.compute.internal.41436: Flags [.], ack 518, win 119, length 0
21:03:00.228204 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [.], ack 1, win 229, length 0
21:03:00.229362 IP ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https > ip-172-31-251-175.us-west-2.compute.internal.41436: Flags [R], seq 3814849713, win 0, length 0


Packet captures are attached for the 'no route to host' and SSL error cases. I lost the i/o timeout packet capture when I forgot to offload it between deployments. If that's important, I can try to get it.

Comment 19 jamo luhrsen 2021-04-07 21:29:53 UTC
Created attachment 1770031 [details]
no route to host packet capture

Comment 20 jamo luhrsen 2021-04-07 21:30:16 UTC
Created attachment 1770032 [details]
ssl error packet capture

Comment 21 jamo luhrsen 2021-04-08 18:11:30 UTC
@jcallen has opened a ticket with VMware to try to get to the bottom of this:
https://console.cloud.vmware.com/csp/gateway/portal/#/support/21211690804

Comment 22 jamo luhrsen 2021-04-08 18:13:28 UTC
@aravindh, @mgugino, closing this now that we have a ticket with VMware. Please re-open if you think we should.

Comment 23 Michael Gugino 2021-04-08 18:27:44 UTC
I'd prefer this stays open until we've resolved the issue.

Comment 25 jamo luhrsen 2021-05-03 23:47:37 UTC
There was some work on the AWS side that seems to have resolved this issue. There was no progress getting this fixed on the VMWare side.
@scuppett can comment on the changes made.


The last time this occurred was 4/26:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/413/pull-ci-openshift-windows-machine-config-operator-release-4.7-vsphere-e2e-operator/1386796399454064640/artifacts/vsphere-e2e-operator/gather-extra/artifacts/pods/openshift-machine-api_machine-api-controllers-574864c947-lb4cv_machine-controller.log

Looking at the jobs since 4/29 that ended up with machine-controller logs (6 of them, as not all jobs pull those logs successfully), the 'no route to host' error is no longer seen:


https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/426/pull-ci-openshift-windows-machine-config-operator-release-4.7-vsphere-e2e-operator/1387875567482703872/artifacts/vsphere-e2e-operator/gather-extra/artifacts/pods/openshift-machine-api_machine-api-controllers-6bc6848bf7-4w2dx_machine-controller.log
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/428/pull-ci-openshift-windows-machine-config-operator-release-4.7-vsphere-e2e-operator/1387918067857625088/artifacts/vsphere-e2e-operator/gather-extra/artifacts/pods/openshift-machine-api_machine-api-controllers-846d7f9f56-lbmls_machine-controller.log
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/428/pull-ci-openshift-windows-machine-config-operator-release-4.7-vsphere-e2e-operator/1388128528104427520/artifacts/vsphere-e2e-operator/gather-extra/artifacts/pods/openshift-machine-api_machine-api-controllers-66fd9b4c9b-wp4nj_machine-controller.log
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/428/pull-ci-openshift-windows-machine-config-operator-release-4.7-vsphere-e2e-operator/1388202918317920256/artifacts/vsphere-e2e-operator/gather-extra/artifacts/pods/openshift-machine-api_machine-api-controllers-9ff8d998b-cg6qn_machine-controller.log
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/430/pull-ci-openshift-windows-machine-config-operator-release-4.7-vsphere-e2e-operator/1388224054971863040/artifacts/vsphere-e2e-operator/gather-extra/artifacts/pods/openshift-machine-api_machine-api-controllers-66b6fff874-4fz25_machine-controller.log
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/433/pull-ci-openshift-windows-machine-config-operator-release-4.7-vsphere-e2e-operator/1389251847587368960/artifacts/vsphere-e2e-operator/gather-extra/artifacts/pods/openshift-machine-api_machine-api-controllers-5f8497d755-2t4r5_machine-controller.log

Ok to close this now @mgugino ?

Comment 26 Stephen Cuppett 2021-05-04 17:18:06 UTC
To provide private resolution from the same endpoint, we ended up creating a private hosted zone in Route53.

1) Existing: DNS resolution inside VMC was using 10.0.0.2 (AWS VPC DNS)
2) Created a private hosted zone for the SDDC domain and associated it with the VPC: sddc-##-###-##-###.vmwarevmc.com, where ## represents the octets of the actual public IP in the zone.
3) Created record for the private vCenter address: vcenter.sddc-##-###-##-###.vmwarevmc.com -> 10.3.224.4

Now, VMs inside either VMC or anywhere else inside the VPC will use the 10.3.224.4 address of vcenter.

This will need to be changed/recreated when the SDDC is destroyed/recreated or if the vCenter address ever changes (we'll see how frequently that is before we automate any of it; I'd imagine we can query some APIs with a Lambda similar to the one we use to manage the route tables).
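
For reference, the manual steps above map to roughly the following AWS CLI calls (the VPC ID and hosted zone ID are placeholders; the zone name and record value are the ones described above):

  $ aws route53 create-hosted-zone \
      --name sddc-44-236-21-251.vmwarevmc.com \
      --caller-reference "vmc-private-$(date +%s)" \
      --vpc VPCRegion=us-west-2,VPCId=vpc-0123456789abcdef0 \
      --hosted-zone-config 'Comment=private resolution for VMC vCenter'

  $ aws route53 change-resource-record-sets \
      --hosted-zone-id Z0PLACEHOLDER \
      --change-batch '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{
        "Name":"vcenter.sddc-44-236-21-251.vmwarevmc.com","Type":"A","TTL":300,
        "ResourceRecords":[{"Value":"10.3.224.4"}]}}]}'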

Comment 27 Red Hat Bugzilla 2023-09-15 01:03:01 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

