Bug 1884435

Summary: vsphere - loopback is randomly not being added to resolver
Product: OpenShift Container Platform Reporter: Joseph Callen <jcallen>
Component: Installer Assignee: Ben Nemec <bnemec>
Installer sub component: openshift-installer QA Contact: jima
Status: CLOSED CURRENTRELEASE Docs Contact:
Severity: high    
Priority: high CC: adahiya, asegurap, bleanhar, xtian
Version: 4.6 Keywords: Triaged
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1885624 (view as bug list) Environment:
Last Closed: 2021-01-20 21:10:13 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1885624    
Attachments:
Description Flags
bootstrap log bundle none

Description Joseph Callen 2020-10-02 00:32:24 UTC
Version:

OpenShift Installer 4.6.0-0.nightly-2020-10-01-181852
DEBUG Built from commit 540f6a9dc127936c1085511daf5961342ec1

Platform: vsphere ipi


What happened?


The following NetworkManager dispatcher script is supposed to prepend "nameserver 127.0.0.1" to /etc/resolv.conf, since CoreDNS runs on the bootstrap node to provide DNS for api-int. On this node the entry was never added:

https://github.com/openshift/installer/blob/master/data/data/bootstrap/vsphere/files/etc/NetworkManager/dispatcher.d/30-local-dns-prepender.template



from bootkube...

1.Etcd: failed to list *v1.Etcd: Get "https://api-int.jcallen.vmc.devcluster.openshift.com:6443/apis/operator.openshift.io/v1/etcds?fieldSelector=metadata.name%3Dcluster&limit=500&resourceVersion=0":dial tcp: lookup api-int.jcallen.vmc.devcluster.openshift.com on 10.3.192.12:53: no such host
Oct 02 00:21:57 ip-172-31-251-83.us-west-2.compute.internal bootkube.sh[2362]: E1002 00:21:57.354155       1 reflector.go:127] k8s.io/client-go.0/tools/cache/reflector.go:156: Failed to watch *v1.Etcd: failed to list *v1.Etcd: Get "https://api-int.jcallen.vmc.devcluster.openshift.com:6443/apis/operator.openshift.io/v1/etcds?fieldSelector=metadata.name%3Dcluster&limit=500&resourceVersion=0":dial tcp: lookup api-int.jcallen.vmc.devcluster.openshift.com on 10.3.192.12:53: no such host
Oct 02 00:22:33 ip-172-31-251-83.us-west-2.compute.internal bootkube.sh[2362]: E1002 00:22:33.557896       1 reflector.go:127] k8s.io/client-go.0/tools/cache/reflector.go:156: Failed to watch *v1.Etcd: failed to list *v1.Etcd: Get "https://api-int.jcallen.vmc.devcluster.openshift.com:6443/apis/operator.openshift.io/v1/etcds?fieldSelector=metadata.name%3Dcluster&limit=500&resourceVersion=0":dial tcp: lookup api-int.jcallen.vmc.devcluster.openshift.com on 10.3.192.12:53: no such host
Oct 02 00:23:30 ip-172-31-251-83.us-west-2.compute.internal bootkube.sh[2362]: E1002 00:23:30.568147       1 reflector.go:127] k8s.io/client-go.0/tools/cache/reflector.go:156: Failed to watch *v1.Etcd: failed to list *v1.Etcd: Get "https://api-int.jcallen.vmc.devcluster.openshift.com:6443/apis/operator.openshift.io/v1/etcds?fieldSelector=metadata.name%3Dcluster&limit=500&resourceVersion=0":dial tcp: lookup api-int.jcallen.vmc.devcluster.openshift.com on 10.3.192.12:53: no such host
^C
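
The errors above all name the upstream resolver (10.3.192.12) rather than 127.0.0.1. As a rough diagnostic sketch, the helper below pulls each distinct hostname/resolver pair out of such log lines. The function name resolvers_queried is hypothetical and not part of the installer; it only assumes the "lookup HOST on IP:53: no such host" message format shown in the log.

```shell
#!/bin/bash
# Hypothetical helper, not part of openshift-installer.
# Reads log lines on stdin and prints each distinct "HOST RESOLVER_IP" pair
# found in Go "lookup HOST on IP:53: no such host" errors.
resolvers_queried() {
    sed -n 's/.*lookup \([^ ]*\) on \([0-9.]*\):53: no such host.*/\1 \2/p' | sort -u
}
```

On the bootstrap node this could be fed from the journal, for example "journalctl -u bootkube.service | resolvers_queried" (assuming the unit is named bootkube.service). If 127.0.0.1 never appears in the output, the local CoreDNS instance is not being consulted.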


[root@ip-172-31-251-83 ~]# journalctl -fu NetworkManager-dispatcher.service                                                                                                                            
-- Logs begin at Fri 2020-10-02 00:08:20 UTC. --
Oct 02 00:08:29 localhost systemd[1]: Starting Network Manager Script Dispatcher Service...
Oct 02 00:08:29 localhost systemd[1]: Started Network Manager Script Dispatcher Service.
Oct 02 00:08:29 localhost nm-dispatcher[1692]: <13>Oct  2 00:08:29 root: NM local-dns-prepender triggered by ens192 up.                                                                                
Oct 02 00:08:29 localhost nm-dispatcher[1692]: <13>Oct  2 00:08:29 root: NM local-dns-prepender: Checking if local DNS IP is the first entry in resolv.conf                                            
Oct 02 00:08:29 localhost nm-dispatcher[1692]: <13>Oct  2 00:08:29 root: NM local-dns-prepender: Looking for '# Generated by NetworkManager' in /etc/resolv.conf to place 'nameserver 127.0.0.1'       
^C

[root@ip-172-31-251-83 ~]# cat /etc/resolv.conf
# Generated by NetworkManager
search us-west-2.compute.internal
nameserver 10.3.192.12
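
For reference, the behaviour the dispatcher log describes (check whether the local DNS IP is the first entry, then look for the "# Generated by NetworkManager" line and place "nameserver 127.0.0.1" after it) can be sketched as follows. This is an illustrative approximation, not the installer's actual 30-local-dns-prepender script; the function name prepend_local_dns and the fallback of writing the entry at the top of the file when the marker is missing are assumptions.

```shell
#!/bin/bash
# Hypothetical sketch, NOT the installer's real dispatcher script.
# prepend_local_dns FILE: make "nameserver 127.0.0.1" the first nameserver in FILE.
prepend_local_dns() {
    local resolv_conf="$1"
    local local_dns="127.0.0.1"
    local marker="# Generated by NetworkManager"
    local first tmp

    # Nothing to do if the local resolver is already the first nameserver.
    first="$(awk '$1 == "nameserver" { print $2; exit }' "$resolv_conf")"
    if [ "$first" = "$local_dns" ]; then
        return 0
    fi

    tmp="$(mktemp)"
    if grep -qxF "$marker" "$resolv_conf"; then
        # Insert the entry directly after the first marker line.
        awk -v m="$marker" -v ns="nameserver $local_dns" '
            { print }
            $0 == m && !inserted { print ns; inserted = 1 }
        ' "$resolv_conf" > "$tmp"
    else
        # Assumed fallback: no marker found, so put the entry at the top.
        { echo "nameserver $local_dns"; cat "$resolv_conf"; } > "$tmp"
    fi
    # Overwrite in place so the file keeps its original inode and permissions.
    cat "$tmp" > "$resolv_conf"
    rm -f "$tmp"
}
```

Run against a copy of the resolv.conf shown above, this would leave 127.0.0.1 ahead of 10.3.192.12. The function is idempotent, so calling it again on a later dispatcher event does not add a second entry.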

Comment 1 Joseph Callen 2020-10-02 00:53:22 UTC
Created attachment 1718337 [details]
bootstrap log bundle

Comment 2 Joseph Callen 2020-10-02 00:55:31 UTC
Not sure there is much in the log, but I still have the bootstrap node available.


DEBUG Unable to connect to the server: dial tcp: lookup api-int.jcallen.vmc.devcluster.openshift.com on 10.3.192.12:53: no such host                  
DEBUG Unable to connect to the server: dial tcp: lookup api-int.jcallen.vmc.devcluster.openshift.com on 10.3.192.12:53: no such host                  
DEBUG Gather remote logs
DEBUG Collecting info from 172.31.251.122
DEBUG Unable to connect to the server: dial tcp: lookup api-int.jcallen.vmc.devcluster.openshift.com on 10.3.192.12:53: no such host                  
DEBUG lost connection
DEBUG Warning: Permanently added '172.31.251.122' (ECDSA) to the list of known hosts.
DEBUG core.251.122: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
DEBUG Collecting info from 172.31.251.18
DEBUG lost connection
DEBUG Warning: Permanently added '172.31.251.18' (ECDSA) to the list of known hosts.
DEBUG core.251.18: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
DEBUG Collecting info from 172.31.251.144
DEBUG lost connection
DEBUG Warning: Permanently added '172.31.251.144' (ECDSA) to the list of known hosts.
DEBUG core.251.144: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
DEBUG Log bundle written to /var/home/core/log-bundle-20201002004001.tar.gz
INFO Bootstrap gather logs captured here "/projects/installer-testing/vsphere-ipi/log-bundle-20201002004001.tar.gz"                                   
FATAL Bootstrap failed to complete: failed to wait for bootstrapping to complete: timed out waiting for the condition

Comment 3 Joseph Callen 2020-10-02 12:58:02 UTC
Another example of this:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere/1311887212656201728


Oct 02 05:08:41 ip-172-31-254-133.us-west-2.compute.internal bootkube.sh[2344]: E1002 05:08:41.177442       1 reflector.go:127] k8s.io/client-go.0/tools/cache/reflector.go:156: Failed to watch *v1.Etcd: failed to list *v1.Etcd: Get "https://api-int.ci-op-64sd0h4w-0aec4.origin-ci-int-aws.dev.rhcloud.com:6443/apis/operator.openshift.io/v1/etcds?fieldSelector=metadata.name%3Dcluster&limit=500&resourceVersion=0": dial tcp: lookup api-int.ci-op-64sd0h4w-0aec4.origin-ci-int-aws.dev.rhcloud.com on 10.3.192.12:53: no such host

Comment 5 Abhinav Dahiya 2020-10-02 16:31:35 UTC
Moving to the mDNS team that usually knows how to handle this setup.

Comment 6 Abhinav Dahiya 2020-10-02 16:32:39 UTC
If this was the wrong component please help me move it to the team that handles the hosted DNS for baremetal deployments.

Comment 7 Joseph Callen 2020-10-02 19:34:45 UTC
After just randomly going through MCO PRs I wonder if this is the _real_ fix:

https://github.com/openshift/machine-config-operator/pull/2030/files

Comment 9 jima 2020-10-23 03:33:06 UTC
Installed IPI on vSphere with 4.7.0-0.nightly-2020-10-21-001511 and the installation succeeded, so moving the bug to VERIFIED.

Comment 11 Ben Nemec 2021-01-20 21:10:13 UTC
This was fixed as part of a different bug, so we don't need doc text on this one.