Bug 1884435 - vsphere - loopback is randomly not being added to resolver
Summary: vsphere - loopback is randomly not being added to resolver
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Ben Nemec
QA Contact: jima
URL:
Whiteboard:
Depends On:
Blocks: 1885624
Reported: 2020-10-02 00:32 UTC by Joseph Callen
Modified: 2021-01-20 21:10 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1885624
Environment:
Last Closed: 2021-01-20 21:10:13 UTC
Target Upstream Version:
Embargoed:


Attachments
bootstrap log bundle (622.28 KB, application/gzip)
2020-10-02 00:53 UTC, Joseph Callen


Links
System: Github
ID: openshift installer pull 4237
Private: 0
Priority: None
Status: closed
Summary: Bug 1884435: vsphere - add delay if resolv.conf is not available; wait for dhcp
Last Updated: 2021-02-18 14:08:59 UTC
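
The linked pull request's title mentions adding a delay when resolv.conf is not yet available. The following is only a minimal Python sketch of that kind of wait loop, not the code from the pull request (the real dispatcher script is a shell template). The function name `wait_for_file` and its `timeout` and `interval` parameters are hypothetical.

```python
import os
import time


def wait_for_file(path: str, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll until `path` is a non-empty regular file or `timeout` seconds pass.

    Hypothetical helper for illustration; returns True if the file became
    available in time and False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # The file may appear or vanish between checks, so guard the stat.
            if os.path.isfile(path) and os.path.getsize(path) > 0:
                return True
        except OSError:
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would run something like `wait_for_file("/etc/resolv.conf")` before attempting to edit the file, and skip or log the edit if it returns False.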

Description Joseph Callen 2020-10-02 00:32:24 UTC
Version:

OpenShift Installer 4.6.0-0.nightly-2020-10-01-181852
DEBUG Built from commit 540f6a9dc127936c1085511daf5961342ec1

Platform: vsphere ipi


What happened?


This script is supposed to prepend 127.0.0.1 to /etc/resolv.conf, since CoreDNS runs on the bootstrap node to provide DNS for api-int:

https://github.com/openshift/installer/blob/master/data/data/bootstrap/vsphere/files/etc/NetworkManager/dispatcher.d/30-local-dns-prepender.template



From the bootkube log:

1.Etcd: failed to list *v1.Etcd: Get "https://api-int.jcallen.vmc.devcluster.openshift.com:6443/apis/operator.openshift.io/v1/etcds?fieldSelector=metadata.name%3Dcluster&limit=500&resourceVersion=0":dial tcp: lookup api-int.jcallen.vmc.devcluster.openshift.com on 10.3.192.12:53: no such host
Oct 02 00:21:57 ip-172-31-251-83.us-west-2.compute.internal bootkube.sh[2362]: E1002 00:21:57.354155       1 reflector.go:127] k8s.io/client-go.0/tools/cache/reflector.go:156: Failed to watch *v1.Etcd: failed to list *v1.Etcd: Get "https://api-int.jcallen.vmc.devcluster.openshift.com:6443/apis/operator.openshift.io/v1/etcds?fieldSelector=metadata.name%3Dcluster&limit=500&resourceVersion=0":dial tcp: lookup api-int.jcallen.vmc.devcluster.openshift.com on 10.3.192.12:53: no such host
Oct 02 00:22:33 ip-172-31-251-83.us-west-2.compute.internal bootkube.sh[2362]: E1002 00:22:33.557896       1 reflector.go:127] k8s.io/client-go.0/tools/cache/reflector.go:156: Failed to watch *v1.Etcd: failed to list *v1.Etcd: Get "https://api-int.jcallen.vmc.devcluster.openshift.com:6443/apis/operator.openshift.io/v1/etcds?fieldSelector=metadata.name%3Dcluster&limit=500&resourceVersion=0":dial tcp: lookup api-int.jcallen.vmc.devcluster.openshift.com on 10.3.192.12:53: no such host
Oct 02 00:23:30 ip-172-31-251-83.us-west-2.compute.internal bootkube.sh[2362]: E1002 00:23:30.568147       1 reflector.go:127] k8s.io/client-go.0/tools/cache/reflector.go:156: Failed to watch *v1.Etcd: failed to list *v1.Etcd: Get "https://api-int.jcallen.vmc.devcluster.openshift.com:6443/apis/operator.openshift.io/v1/etcds?fieldSelector=metadata.name%3Dcluster&limit=500&resourceVersion=0":dial tcp: lookup api-int.jcallen.vmc.devcluster.openshift.com on 10.3.192.12:53: no such host
^C


[root@ip-172-31-251-83 ~]# journalctl -fu NetworkManager-dispatcher.service                                                                                                                            
-- Logs begin at Fri 2020-10-02 00:08:20 UTC. --
Oct 02 00:08:29 localhost systemd[1]: Starting Network Manager Script Dispatcher Service...
Oct 02 00:08:29 localhost systemd[1]: Started Network Manager Script Dispatcher Service.
Oct 02 00:08:29 localhost nm-dispatcher[1692]: <13>Oct  2 00:08:29 root: NM local-dns-prepender triggered by ens192 up.                                                                                
Oct 02 00:08:29 localhost nm-dispatcher[1692]: <13>Oct  2 00:08:29 root: NM local-dns-prepender: Checking if local DNS IP is the first entry in resolv.conf                                            
Oct 02 00:08:29 localhost nm-dispatcher[1692]: <13>Oct  2 00:08:29 root: NM local-dns-prepender: Looking for '# Generated by NetworkManager' in /etc/resolv.conf to place 'nameserver 127.0.0.1'       
^C

[root@ip-172-31-251-83 ~]# cat /etc/resolv.conf
# Generated by NetworkManager
search us-west-2.compute.internal
nameserver 10.3.192.12
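
For comparison, here is a rough sketch of the result the dispatcher log above describes: placing 'nameserver 127.0.0.1' after the '# Generated by NetworkManager' line. This is illustrative Python, not the actual 30-local-dns-prepender script (a shell template). The function name `prepend_local_dns` is hypothetical, and the fallback for a missing marker line is an assumption.

```python
MARKER = "# Generated by NetworkManager"


def prepend_local_dns(resolv_conf: str, local_ip: str = "127.0.0.1") -> str:
    """Return resolv.conf text with `nameserver <local_ip>` as first nameserver.

    Hypothetical helper for illustration only; the real dispatcher script
    may differ in detail.
    """
    lines = resolv_conf.splitlines()
    entry = f"nameserver {local_ip}"

    # Nothing to do if the local resolver is already the first nameserver.
    nameservers = [line for line in lines if line.startswith("nameserver ")]
    if nameservers and nameservers[0] == entry:
        return resolv_conf

    # Drop any later duplicate so the entry appears exactly once.
    lines = [line for line in lines if line != entry]

    if MARKER in lines:
        # Place the entry right after the NetworkManager marker line.
        idx = lines.index(MARKER) + 1
    else:
        # Assumption: without the marker, fall back to the top of the file.
        idx = 0
    lines.insert(idx, entry)
    return "\n".join(lines) + "\n"


# The resolv.conf contents observed on the failing bootstrap node.
broken = (
    "# Generated by NetworkManager\n"
    "search us-west-2.compute.internal\n"
    "nameserver 10.3.192.12\n"
)
print(prepend_local_dns(broken), end="")
```

With the contents shown above as input, the loopback entry ends up on the second line, ahead of 10.3.192.12, which is the state the bootstrap node never reached.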

Comment 1 Joseph Callen 2020-10-02 00:53:22 UTC
Created attachment 1718337 [details]
bootstrap log bundle

Comment 2 Joseph Callen 2020-10-02 00:55:31 UTC
Not sure there is much in the log; I still have the bootstrap node available, though.


DEBUG Unable to connect to the server: dial tcp: lookup api-int.jcallen.vmc.devcluster.openshift.com on 10.3.192.12:53: no such host                  
DEBUG Unable to connect to the server: dial tcp: lookup api-int.jcallen.vmc.devcluster.openshift.com on 10.3.192.12:53: no such host                  
DEBUG Gather remote logs
DEBUG Collecting info from 172.31.251.122
DEBUG Unable to connect to the server: dial tcp: lookup api-int.jcallen.vmc.devcluster.openshift.com on 10.3.192.12:53: no such host                  
DEBUG lost connection
DEBUG Warning: Permanently added '172.31.251.122' (ECDSA) to the list of known hosts.
DEBUG core.251.122: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
DEBUG Collecting info from 172.31.251.18
DEBUG lost connection
DEBUG Warning: Permanently added '172.31.251.18' (ECDSA) to the list of known hosts.
DEBUG core.251.18: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
DEBUG Collecting info from 172.31.251.144
DEBUG lost connection
DEBUG Warning: Permanently added '172.31.251.144' (ECDSA) to the list of known hosts.
DEBUG core.251.144: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
DEBUG Log bundle written to /var/home/core/log-bundle-20201002004001.tar.gz
INFO Bootstrap gather logs captured here "/projects/installer-testing/vsphere-ipi/log-bundle-20201002004001.tar.gz"                                   
FATAL Bootstrap failed to complete: failed to wait for bootstrapping to complete: timed out waiting for the condition

Comment 3 Joseph Callen 2020-10-02 12:58:02 UTC
Another example of this:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere/1311887212656201728


Oct 02 05:08:41 ip-172-31-254-133.us-west-2.compute.internal bootkube.sh[2344]: E1002 05:08:41.177442       1 reflector.go:127] k8s.io/client-go.0/tools/cache/reflector.go:156: Failed to watch *v1.Etcd: failed to list *v1.Etcd: Get "https://api-int.ci-op-64sd0h4w-0aec4.origin-ci-int-aws.dev.rhcloud.com:6443/apis/operator.openshift.io/v1/etcds?fieldSelector=metadata.name%3Dcluster&limit=500&resourceVersion=0": dial tcp: lookup api-int.ci-op-64sd0h4w-0aec4.origin-ci-int-aws.dev.rhcloud.com on 10.3.192.12:53: no such host
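
In both examples the failing lookup goes to 10.3.192.12 rather than to the loopback resolver. A small Python sketch for pulling the hostname and queried DNS server out of such log lines might look like the following. The helper name `failed_lookups` is hypothetical, and the regular expression only assumes the "lookup HOST on SERVER:PORT: no such host" wording seen in the errors in this report.

```python
import re

# Matches the resolver error text seen in the bootkube log lines.
LOOKUP_ERROR = re.compile(
    r"lookup (?P<host>\S+) on (?P<server>[\d.]+):(?P<port>\d+): no such host"
)


def failed_lookups(log_text: str) -> set:
    """Return the set of (hostname, DNS server) pairs from 'no such host' errors."""
    return {(m["host"], m["server"]) for m in LOOKUP_ERROR.finditer(log_text)}


sample = (
    "dial tcp: lookup api-int.jcallen.vmc.devcluster.openshift.com "
    "on 10.3.192.12:53: no such host"
)
for host, server in sorted(failed_lookups(sample)):
    # Anything outside 127.0.0.0/8 means the local resolver was not consulted.
    note = "" if server.startswith("127.") else " (not the local resolver)"
    print(f"{host} was looked up on {server}{note}")
```

Running this over a full bootkube log would show quickly whether any lookup for api-int reached 127.0.0.1 at all.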

Comment 5 Abhinav Dahiya 2020-10-02 16:31:35 UTC
Moving to the mDNS team that usually knows how to handle this setup.

Comment 6 Abhinav Dahiya 2020-10-02 16:32:39 UTC
If this was the wrong component please help me move it to the team that handles the hosted DNS for baremetal deployments.

Comment 7 Joseph Callen 2020-10-02 19:34:45 UTC
After just randomly going through MCO PRs I wonder if this is the _real_ fix:

https://github.com/openshift/machine-config-operator/pull/2030/files

Comment 9 jima 2020-10-23 03:33:06 UTC
Installed IPI on vSphere with 4.7.0-0.nightly-2020-10-21-001511 and it succeeded, so moving the bug to VERIFIED.

Comment 11 Ben Nemec 2021-01-20 21:10:13 UTC
This was fixed as part of a different bug, so we don't need doc text on this one.

