Description of problem:

The machine-controller pod needs to communicate with vSphere (an IP external to the cluster itself). It periodically hits errors like:

"ci-op-xk77zmsb-98554-gjfbl-worker-gzmmg error: ci-op-xk77zmsb-98554-gjfbl-worker-gzmmg: reconciler failed to Update machine: failed to reconcile tags: Post "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/session": dial tcp 44.236.21.251:443: connect: no route to host"

How reproducible:

Unknown

Expected results:

If vSphere were overloaded, connection issues other than "no route to host" would be expected. The machine-controller pod's routes failing implies something is wrong with the host the pod is running on or with the pod networking.

Actual results:

Passing CI run with the "no route" problem:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/332/pull-ci-openshift-windows-machine-config-operator-release-4.7-vsphere-e2e-operator/1368759166595764224
Associated logs:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/332/pull-ci-openshift-windows-machine-config-operator-release-4.7-vsphere-e2e-operator/1368759166595764224/artifacts/vsphere-e2e-operator/gather-extra/artifacts/pods/openshift-machine-api_machine-api-controllers-84bc59b6dd-wlnzl_machine-controller.log

A failing CI run:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/308/pull-ci-openshift-windows-machine-config-operator-release-4.7-vsphere-e2e-operator/1367635600877817856
Logs:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/308/pull-ci-openshift-windows-machine-config-operator-release-4.7-vsphere-e2e-operator/1367635600877817856/artifacts/vsphere-e2e-operator/gather-extra/artifacts/pods/openshift-machine-api_machine-api-controllers-6df5f9998c-ntw7t_machine-controller.log

I have not seen this problem in other CI jobs.
I am working to get a nightly job running for vsphere+ovn. PR here: https://github.com/openshift/release/pull/16696 Once that is in, I'll see if I can figure anything out w/ regards to this bug.
This issue has increased the time for PR merges in https://github.com/openshift/windows-machine-config-operator and is slowing down development. Hence I am raising the priority and severity of this bug.
Just starting to look at this now, since I saw the priority was raised. What I noticed right away is that 7 out of the last 60 jobs are failing. These are all presubmit jobs, so they will be running on unmerged and possibly unreviewed PRs, and some failures could be because of that. If you drill back in the job history, it doesn't look like this job was ever really healthy, although it was passing at a slightly higher rate if you look back into Feb 2021 or earlier.

The periodic job is now running and I see that the 4.8 job passed its first try: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere-ovn/1372970806522417152 But that's a different CI deployment I guess. it's hybrid-vxlan for these windows presubmit jobs.

I also notice the main symptom in the description: "reconciler failed to Update machine: failed to reconcile tags" is showing up in both passing and failing jobs, so not sure if that is a place to focus or not. Looking at the gather-extra logs given in the comments above. I pushed an empty PR on top of master to see the CI results without any changes involved, just to gather data: https://github.com/openshift/windows-machine-config-operator/pull/354

That's as far as I got today. Will continue to dig on this next week. But if this job is not healthy and mostly fails, and all the PR devs are doing is /retest until they see it passing, I'd suggest removing the job for now until it gets healthy enough to be useful.
@jluhrsen thanks for looking into this.

> But that's a different CI deployment I guess. it's hybrid-vxlan for these
> windows presubmit jobs.

Correct. For Windows nodes, the cluster has to use hybrid-vxlan. And for vSphere we have to use a custom VXLAN port.

> I also notice the main symptom in the description: "reconciler failed to
> Update machine: failed to reconcile tags" is showing up in both passing and
> failing jobs, so not sure if that is a place to focus or not. Looking at the
> gather-extra logs given in the comments above.

That is the symptom.

> but if this job is not healthy and mostly fails and all the PR devs are
> doing is /retest until they see it passing, I'd suggest removing the
> job for now until it gets healthy enough to be useful.

I am afraid that is not an option. vSphere is the number 1 platform for Windows and we cannot risk a regression in that area. If you need help bringing up a vSphere cluster or debugging the platform itself, please post on #vsphere-ci-triage.
While the machine-api does seem to have some connection errors from time to time without any problems on other jobs, I'm only seeing "dial tcp 44.236.21.251:443: connect: no route to host" on this job, and I'm seeing it frequently. There's no reason for the machine-api pod to ever receive a no route to host error.
Looks like this might be caused by: https://bugzilla.redhat.com/show_bug.cgi?id=1935539
Just to add some more evidence of failure, also seeing this on https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/454/pull-ci-openshift-ovn-kubernetes-master-e2e-vsphere-ovn/1372968332101160960 which is a different job.
(In reply to Michael Gugino from comment #8) > Looks like this might be caused by: > https://bugzilla.redhat.com/show_bug.cgi?id=1935539 probably not; that bug only affects traffic going over the VXLAN/Geneve tunnel (ie traffic going to a pod on another node), while this is about traffic from a pod to a cluster-external IP (which wouldn't have to be tunneled).
(In reply to Aravindh Puthiyaparambil from comment #6)
> >
> > I also notice the main symptom in the description: "reconciler failed to
> > Update machine: failed to reconcile tags" is showing up in both passing and
> > failing jobs, so not sure if that is a place to focus or not. Looking at the
> > gather-extra logs given in the comments above.
>
> That is the symptom.

The symptom is there in both a passing and a failing job, so my first guess is that it's a red herring and not where we should focus to find the problem. Maybe I'm missing something though. I do see that the error message is more prevalent in the failing job and occurs even at the end of the log, so perhaps we do expect to see that type of message initially but not after the cluster is stable.

Examples:

Passing job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/354/pull-ci-openshift-windows-machine-config-operator-master-vsphere-e2e-operator/1373018128086208512
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/354/pull-ci-openshift-windows-machine-config-operator-master-vsphere-e2e-operator/1373018128086208512/artifacts/vsphere-e2e-operator/gather-extra/artifacts/pods/openshift-machine-api_machine-api-controllers-58677b5586-2mt5k_machine-controller.log

Failing job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/341/pull-ci-openshift-windows-machine-config-operator-master-vsphere-e2e-operator/1372957497253433344
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/341/pull-ci-openshift-windows-machine-config-operator-master-vsphere-e2e-operator/1372957497253433344/artifacts/vsphere-e2e-operator/gather-extra/artifacts/pods/openshift-machine-api_machine-api-controllers-7bb577d587-zs2v4_machine-controller.log

Looking today, I noticed that only 2 of the last 20 jobs even started tests. Most were failing with something like:

"failed to fetch Master Machines: failed to load asset "Install Config": invalid "install-config.yaml" file: [platform.vsphere.apiVIP: Invalid value: "192.168.7.2": must be contained within one of the machine networks, platform.vsphere.ingressVIP: Invalid value: "192.168.7.3": must be contained within one of the machine networks]"

But that is some different problem and not related to this bz. I'll continue to investigate the logs and report back what I find.
@jluhrsen the "Install Config": invalid "install-config.yaml" file: [platform.vsphere.apiVIP: Invalid value: "192.168.7.2" is indeed a new issue like you mentioned and https://github.com/openshift/installer/pull/4779 has been opened to address that.
I just hit this setting up a new cluster in VMC w/ ovn

oc logs -f machine-api-controllers-777d558fc9-k8sdt -c machine-controller
...
I0324 13:49:48.290935 1 controller.go:168] jcallen-rbk8r-worker-jwmvz: reconciling Machine
I0324 13:49:48.290967 1 actuator.go:109] jcallen-rbk8r-worker-jwmvz: actuator checking if machine exists
E0324 13:50:18.297376 1 controller.go:271] jcallen-rbk8r-worker-jwmvz: failed to check if machine exists: jcallen-rbk8r-worker-jwmvz: failed to create scope for machine: failed to create vSphere session: error setting up new vSphere SOAP client: Post "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/sdk": dial tcp: i/o timeout
I0324 13:50:18.297554 1 controller.go:168] jcallen-rbk8r-master-0: reconciling Machine
I0324 13:50:18.297589 1 actuator.go:109] jcallen-rbk8r-master-0: actuator checking if machine exists
E0324 13:50:48.309443 1 controller.go:271] jcallen-rbk8r-master-0: failed to check if machine exists: jcallen-rbk8r-master-0: failed to create scope for machine: failed to create vSphere session: error setting up new vSphere SOAP client: Post "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/sdk": dial tcp: i/o timeout

oc rsh machine-api-controllers-777d558fc9-k8sdt
Defaulting container name to machineset-controller.
Use 'oc describe pod/machine-api-controllers-777d558fc9-k8sdt -n openshift-machine-api' to see all of the containers in this pod.
sh-4.4$ curl https://vcenter.sddc-44-236-21-251.vmwarevmc.com/sdk
^C
sh-4.4$ curl https://vcenter.sddc-44-236-21-251.vmwarevmc.com
^C
sh-4.4$ host -t A vcenter.sddc-44-236-21-251.vmwarevmc.com
;; connection timed out; no servers could be reached
sh-4.4$ host -t A vcenter.sddc-44-236-21-251.vmwarevmc.com
;; connection timed out; no servers could be reached
sh-4.4$ cat /etc/resolv.conf
search openshift-machine-api.svc.cluster.local svc.cluster.local cluster.local jcallen.vmc.devcluster.openshift.com
nameserver 172.30.0.10
options ndots:5
sh-4.4$
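Since name resolution through the cluster DNS service (172.30.0.10) is timing out here as well, one more data point that might help separate a pod/cluster-DNS problem from an upstream one would be to try resolving the same name from the node itself. A rough sketch only; the node name is a placeholder, and getent is used because it ships with glibc and should be present on the RHCOS host:

$ oc debug node/<node-name> -- chroot /host getent hosts vcenter.sddc-44-236-21-251.vmwarevmc.com

If the node resolves the name fine while the pod can't even reach 172.30.0.10, that would point back at pod networking rather than at the upstream resolvers.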
dial timeout can be caused by a multitude of things. No route to host means the kernel doesn't know where to route the traffic, which is different. The underlying cause might be the same, but I want to focus on the no route to host in this BZ if possible.
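For anyone reproducing this, a quick way to tell the two failure modes apart from inside the pod is to watch curl's exit code. A sketch only; the pod name is a placeholder, and it assumes curl is available in the default container, as it was in the rsh session shown above:

$ oc -n openshift-machine-api rsh <machine-api-controllers-pod> \
    curl -sS --connect-timeout 5 https://vcenter.sddc-44-236-21-251.vmwarevmc.com/sdk
# exit code 7  -> connect() failed outright, e.g. EHOSTUNREACH ("connect: no route to host")
# exit code 28 -> the connect attempt timed out (unanswered SYNs), matching the "dial tcp: i/o timeout" errors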
just to update, I'm working to get dev access so that I can bring up my own vsphere cluster like we have in this job that is failing. I also filed a PR to gather network details in that same job: https://github.com/openshift/release/pull/17133 I have been digging through job artifacts for the past couple of days and not coming up with any answers yet.
I finally have a dev cluster that appears to exhibit the same symptoms, so will try to debug the problem from that.

[jluhrsen@ip-10-0-5-26 ~]$ oc logs machine-api-controllers-6d7d5b7dd7-chw9m machine-controller | egrep 'no route to host' | tail -n3
E0330 20:57:18.979550 1 actuator.go:57] jluhrsen-jmdxb-master-2 error: jluhrsen-jmdxb-master-2: reconciler failed to Update machine: failed to reconcile tags: Get "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/tagging/tag": dial tcp 44.236.21.251:443: connect: no route to host
E0330 20:57:18.979592 1 controller.go:299] jluhrsen-jmdxb-master-2: error updating machine: jluhrsen-jmdxb-master-2: reconciler failed to Update machine: failed to reconcile tags: Get "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/tagging/tag": dial tcp 44.236.21.251:443: connect: no route to host, retrying in 30s seconds
E0330 21:16:34.418444 1 session.go:191] Failed to logout: Delete "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/session": dial tcp 44.236.21.251:443: connect: no route to host
[jluhrsen@ip-10-0-5-26 ~]$
(In reply to jamo luhrsen from comment #16)
> I finally have a dev cluster that appears to exhibit the same symptoms, so
> will try to debug the problem from that.
>
> [jluhrsen@ip-10-0-5-26 ~]$ oc logs machine-api-controllers-6d7d5b7dd7-chw9m machine-controller | egrep 'no route to host' | tail -n3
> E0330 20:57:18.979550 1 actuator.go:57] jluhrsen-jmdxb-master-2 error: jluhrsen-jmdxb-master-2: reconciler failed to Update machine: failed to reconcile tags: Get "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/tagging/tag": dial tcp 44.236.21.251:443: connect: no route to host
> E0330 20:57:18.979592 1 controller.go:299] jluhrsen-jmdxb-master-2: error updating machine: jluhrsen-jmdxb-master-2: reconciler failed to Update machine: failed to reconcile tags: Get "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/tagging/tag": dial tcp 44.236.21.251:443: connect: no route to host, retrying in 30s seconds
> E0330 21:16:34.418444 1 session.go:191] Failed to logout: Delete "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/session": dial tcp 44.236.21.251:443: connect: no route to host
> [jluhrsen@ip-10-0-5-26 ~]$

While debugging, I noticed that I was not able to hit the vcenter endpoint (curl and/or wget) from the machine-controller container, but I was also seeing similar symptoms from the node itself as well as the bastion system. This made me think there is something getting in the way outside of the cluster. I ended up re-deploying the cluster, accidentally using openshift-sdn, and access to that vcenter system was working from all places (machine-controller container, host node, and bastion system).

Is there any networking config happening outside of the cluster when deploying this setup that could be getting in the way when it's OVN + hybrid vxlan? The traffic to hit that endpoint should be leaving the node on br-ex, and I'm not sure what would happen beyond that to block it, but that's what I saw.

I was attempting to bring back up an OVN cluster to check again whether we lose access to the vcenter, but I am running into some infra issues I think. Errors like this:

E0401 05:59:16.526705 1789814 reflector.go:138] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://api.jluhrsen.vmc.devcluster.openshift.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&resourceVersion=206": dial tcp 172.31.250.83:6443: i/o timeout
I0401 06:00:13.035518 1789814 trace.go:205] Trace[1270937323]: "Reflector ListAndWatch" name:k8s.io/client-go/tools/watch/informerwatcher.go:146 (01-Apr-2021 05:59:43.034) (total time: 30000ms):
Trace[1270937323]: [30.000576435s] [30.000576435s] END

and I saw some slack conversations around the same. I'd like to figure out why these deployments stopped working so I can check OVN vs openshift-sdn again to see if this is happening consistently.
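One sanity check that could help confirm br-ex really is the egress path for that traffic on the OVN cluster (a sketch; the node name is a placeholder):

$ oc debug node/<node-name> -- chroot /host ip route get 44.236.21.251

ip route get prints the route the kernel would actually use for that destination, including the output device and source address, so it should show directly whether the vcenter traffic is leaving via br-ex or something else.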
more clues to what the actual problem could be:

The 'no route to host' is happening in both an OVNKubernetes and an OpenShiftSDN deployment. I can reproduce the symptom directly on a node itself, so it's not anything specific to the plumbing to a pod. Also, intermittently we will see the i/o timeout. When the timeout occurs, it's because a new connection is sending new requests (SYNs) to open a connection, doesn't hear back, and retries until it gives up. When the 'no route to host' is seen, it's because we are getting an ICMP time-exceeded from some intermediate system in AWS, like this:

20:56:03.398196 IP ec2-44-236-21-251.us-west-2.compute.amazonaws.com > ip-172-31-251-175.us-west-2.compute.internal: ICMP time exceeded in-transit, length 36

The test I'm running is to run tcpdump on the vcenter host ip for the egress interface (ens192 for sdn, and br-ex for ovn) and run curl in a loop to the vcenter ip, exiting on a non-zero exit code:

$ tcpdump -ni ens192 host 44.236.21.251 -w dead_vcenter3.pcap &
$ while curl https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/session; do sleep 1; done

This seems like it would be some problem external to the deployment and possibly inside AWS? I don't know how to debug any further than this. Also, the curl test will run for many minutes (5+) sometimes... maybe longer. Sometimes the 'no route to host' will come back to back to back... other times, you see it once and not again for another iteration. I don't see a pattern yet, and this follows what we are seeing in our CI jobs (some pass, some fail, some have more of these logs, some less).

As a final note, once I even saw a RST in the middle of an open connection:

21:00:30.880338 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [S], seq 4162534308, win 29200, options [mss 1460,sackOK,TS val 1691328097 ecr 0,nop,wscale 7], length 0
21:00:31.940203 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [S], seq 4162534308, win 29200, options [mss 1460,sackOK,TS val 1691329157 ecr 0,nop,wscale 7], length 0
21:00:33.988199 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [S], seq 4162534308, win 29200, options [mss 1460,sackOK,TS val 1691331205 ecr 0,nop,wscale 7], length 0
21:00:38.020221 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [S], seq 4162534308, win 29200, options [mss 1460,sackOK,TS val 1691335237 ecr 0,nop,wscale 7], length 0
21:00:42.083858 IP ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https > ip-172-31-251-175.us-west-2.compute.internal.41436: Flags [S.], seq 3814849712, ack 4162534309, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 8], length 0
21:00:42.083910 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [.], ack 1, win 229, length 0
21:00:42.092308 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [P.], seq 1:518, ack 1, win 229, length 517
21:00:42.340207 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [P.], seq 1:518, ack 1, win 229, length 517
21:00:42.588198 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [P.], seq 1:518, ack 1, win 229, length 517
21:00:43.076218 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [P.], seq 1:518, ack 1, win 229, length 517
21:00:44.100236 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [P.], seq 1:518, ack 1, win 229, length 517
21:00:46.084194 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [P.], seq 1:518, ack 1, win 229, length 517
21:00:49.988185 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [P.], seq 1:518, ack 1, win 229, length 517
21:00:57.860213 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [P.], seq 1:518, ack 1, win 229, length 517
21:00:57.861294 IP ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https > ip-172-31-251-175.us-west-2.compute.internal.41436: Flags [.], ack 518, win 119, options [nop,nop,sack 1 {1:518}], length 0
21:01:58.276193 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [.], ack 1, win 229, length 0
21:01:58.277408 IP ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https > ip-172-31-251-175.us-west-2.compute.internal.41436: Flags [.], ack 518, win 119, length 0
21:03:00.228204 IP ip-172-31-251-175.us-west-2.compute.internal.41436 > ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https: Flags [.], ack 1, win 229, length 0
21:03:00.229362 IP ec2-44-236-21-251.us-west-2.compute.amazonaws.com.https > ip-172-31-251-175.us-west-2.compute.internal.41436: Flags [R], seq 3814849713, win 0, length 0

Packet captures are attached for the 'no route to host' and ssl error cases. I lost the i/o timeout packet capture when I forgot to offload it between deployments. If that's important, I can try to get it.
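If it's useful for the next capture, the tcpdump filter above could be widened so it also grabs ICMP errors coming from intermediate hops rather than only packets to/from the vcenter IP (a sketch; the interface and output file name are just examples):

$ tcpdump -ni br-ex 'host 44.236.21.251 or (icmp and (icmp[icmptype] == icmp-timxceed or icmp[icmptype] == icmp-unreach))' -w vcenter_icmp.pcap

That could help show whether any other hop along the path is also generating the 'time exceeded' messages, or only the vcenter address itself.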
Created attachment 1770031 [details] no route to host packet capture
Created attachment 1770032 [details] ssl error packet capture
@jcallen has opened a ticket with vmware to try to get to the bottom of this: https://console.cloud.vmware.com/csp/gateway/portal/#/support/21211690804
@aravindh, @mgugino, closing this now that we have a ticket with vmware. please re-open if you think we should.
I'd prefer this stays open until we've resolved the issue.
There was some work on the AWS side that seems to have resolved this issue. There was no progress getting this fixed on the VMware side. @scuppett can comment on the changes made.

The last time this occurred was 4/26:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/413/pull-ci-openshift-windows-machine-config-operator-release-4.7-vsphere-e2e-operator/1386796399454064640/artifacts/vsphere-e2e-operator/gather-extra/artifacts/pods/openshift-machine-api_machine-api-controllers-574864c947-lb4cv_machine-controller.log

Looking at the jobs since 4/29 that ended up with machine-controller logs (6 of them, as not all jobs are pulling those logs successfully), this 'no route to host' is no longer seen:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/426/pull-ci-openshift-windows-machine-config-operator-release-4.7-vsphere-e2e-operator/1387875567482703872/artifacts/vsphere-e2e-operator/gather-extra/artifacts/pods/openshift-machine-api_machine-api-controllers-6bc6848bf7-4w2dx_machine-controller.log
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/428/pull-ci-openshift-windows-machine-config-operator-release-4.7-vsphere-e2e-operator/1387918067857625088/artifacts/vsphere-e2e-operator/gather-extra/artifacts/pods/openshift-machine-api_machine-api-controllers-846d7f9f56-lbmls_machine-controller.log
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/428/pull-ci-openshift-windows-machine-config-operator-release-4.7-vsphere-e2e-operator/1388128528104427520/artifacts/vsphere-e2e-operator/gather-extra/artifacts/pods/openshift-machine-api_machine-api-controllers-66fd9b4c9b-wp4nj_machine-controller.log
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/428/pull-ci-openshift-windows-machine-config-operator-release-4.7-vsphere-e2e-operator/1388202918317920256/artifacts/vsphere-e2e-operator/gather-extra/artifacts/pods/openshift-machine-api_machine-api-controllers-9ff8d998b-cg6qn_machine-controller.log
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/430/pull-ci-openshift-windows-machine-config-operator-release-4.7-vsphere-e2e-operator/1388224054971863040/artifacts/vsphere-e2e-operator/gather-extra/artifacts/pods/openshift-machine-api_machine-api-controllers-66b6fff874-4fz25_machine-controller.log
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-operator/433/pull-ci-openshift-windows-machine-config-operator-release-4.7-vsphere-e2e-operator/1389251847587368960/artifacts/vsphere-e2e-operator/gather-extra/artifacts/pods/openshift-machine-api_machine-api-controllers-5f8497d755-2t4r5_machine-controller.log

Ok to close this now @mgugino ?
To provide private resolution from the same endpoint, we ended up creating a private hosted zone in Route53.

1) Existing: DNS resolution inside VMC was using 10.0.0.2 (AWS VPC DNS).
2) Created a private hosted zone for the SDDC domain and associated it with the VPC: sddc-##-###-##-###.vmwarevmc.com, where ## represents the numerics of the actual, public IP in the zone.
3) Created a record for the private vCenter address: vcenter.sddc-##-###-##-###.vmwarevmc.com -> 10.3.224.4

Now, VMs inside either VMC or anywhere else inside the VPC will use the 10.3.224.4 address of vcenter. This will need to be changed/recreated when the SDDC is destroyed/recreated or if the vcenter address ever changes (we'll see how frequently that is before we automate any of it; I'd imagine we can query some APIs with a similar lambda to what we use to manage the route tables).
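For reference, roughly what those steps look like with the AWS CLI (a sketch only; the VPC id and caller reference are placeholders, the zone and record values are taken from this bug, the TTL is arbitrary, and the actual setup was done as described above):

$ aws route53 create-hosted-zone \
    --name sddc-44-236-21-251.vmwarevmc.com \
    --caller-reference sddc-private-zone-$(date +%s) \
    --vpc VPCRegion=us-west-2,VPCId=vpc-0123456789abcdef0
# Associating a VPC at creation time makes the zone a private hosted zone.

$ aws route53 change-resource-record-sets \
    --hosted-zone-id <zone-id-from-previous-output> \
    --change-batch '{
      "Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "vcenter.sddc-44-236-21-251.vmwarevmc.com",
          "Type": "A",
          "TTL": 300,
          "ResourceRecords": [{"Value": "10.3.224.4"}]
        }
      }]
    }'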
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days