Bug 1812409

Summary: cannot install on BM or with dev-scripts due to missing etcd mdns records
Product: OpenShift Container Platform
Reporter: Yuval Kashtan <ykashtan>
Component: Installer
Assignee: Ben Nemec <bnemec>
Installer sub component: OpenShift on Bare Metal IPI
QA Contact: Amit Ugol <augol>
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
CC: asegurap, deads, kboumedh, m.andre
Version: 4.4
Hardware: All
OS: Unspecified
Type: Bug
Last Closed: 2020-03-11 13:50:51 UTC

Description Yuval Kashtan 2020-03-11 09:37:03 UTC
Description of problem:
We can't install a bare metal cluster, or a vbmc-based one (with dev-scripts).

After the masters are ready, no workers join the cluster.

We did some investigation and found that the root cause is that the bootstrap VM did not tear down (and was still holding the VIP?) because the cluster was not ready. The cluster was not ready because kube-apiserver was not ready, and kube-apiserver was not ready because the etcd mDNS records were missing.

Manually adding those etcd entries (in mdns-publisher) seems to have allowed the install to resume, but such entries are not needed on a 4.5 cluster (and I believe not on 4.3 either), so we suspect an issue with the etcd operator now requiring them.

Version-Release number of the following components:
```
# oc version
Client Version: 4.4.0-0.nightly-2020-03-10-194324
Server Version: 4.4.0-0.nightly-2020-03-10-194324
```

The workers' error:
```[  *** ] A start job is running for Ignition (fetch) (1h 49min 25s / no limit)[ 6567.970571] ignition[938]: GET https://192.168.111.5:22623/config/worker: attempt #1307
[ 6567.972844] ignition[938]: GET result: Internal Server Error                                                      
[*     ] A start job is running for Ignition (fetch) (1h 49min 30s / no limit)[ 6572.971612] ignition[938]: GET https://192.168.111.5:22623/config/worker: attempt #1308
[ 6572.973787] ignition[938]: GET result: Internal Server Error                                                      
[  *** ] A start job is running for Ignition (fetch) (1h 49min 35s / no limit)[ 6577.972390] ignition[938]: GET https://192.168.111.5:22623/config/worker: attempt #1309
[ 6577.975167] ignition[938]: GET result: Internal Server Error                                                      
[    **] A start job is running for Ignition (fetch) (1h 49min 40s / no limit)[ 6582.973675] ignition[938]: GET https://192.168.111.5:22623/config/worker: attempt #1310
[ 6582.976332] ignition[938]: GET result: Internal Server Error           
```
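
The machine-config-server 500s above are a downstream symptom, and the same response should be visible from any host that can reach the API VIP. A minimal check under that assumption (the URL is taken from the log above; a plain `curl -k` against it would do the same), skipping TLS verification only because the MCS certificate is not trusted from an arbitrary host:

```
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// URL taken from the Ignition log above.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	resp, err := client.Get("https://192.168.111.5:22623/config/worker")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	// Expect the same "Internal Server Error" (500) the workers see while
	// kube-apiserver cannot reach etcd.
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}
```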

The kube-apiserver error:
```W0311 08:45:52.695024       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://etcd-1.ostest.test.metalkube.org:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup etcd-1.ostest.test.metalkube.org on 192.168.111.2:53: no such host". Reconnecting...                              
W0311 08:45:54.972813       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://etcd-1.ostest.test.metalkube.org:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup etcd-1.ostest.test.metalkube.org on 192.168.111.2:53: no such host". Reconnecting...                              
W0311 08:45:55.595146       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://etcd-0.ostest.test.metalkube.org:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup etcd-0.ostest.test.metalkube.org on 192.168.111.2:53: no such host". Reconnecting...  
```
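
For anyone reproducing this, the quickest confirmation of the failure mode is to resolve the etcd names against the same resolver kube-apiserver is using (192.168.111.2, per the errors above). A minimal sketch, with the hostnames and resolver address taken from this report:

```
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Resolver address and hostnames come from the kube-apiserver errors above;
	// adjust for your cluster domain.
	resolver := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: 5 * time.Second}
			return d.DialContext(ctx, network, "192.168.111.2:53")
		},
	}
	for _, host := range []string{
		"etcd-0.ostest.test.metalkube.org",
		"etcd-1.ostest.test.metalkube.org",
		"etcd-2.ostest.test.metalkube.org",
	} {
		addrs, err := resolver.LookupHost(context.Background(), host)
		if err != nil {
			// This is the "no such host" case seen in the logs.
			fmt.Printf("%s: FAILED (%v)\n", host, err)
			continue
		}
		fmt.Printf("%s: %v\n", host, addrs)
	}
}
```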

Again, manually adding the entries resolved the issue.

This is blocking us from installing any environment for dev/QE/integration.

More info can be found in this Slack thread:
https://coreos.slack.com/archives/CFP6ST0A3/p1583914909434800

Comment 1 Antoni Segura Puimedon 2020-03-11 10:26:40 UTC
If kube-apiserver in 4.4 (but not 4.5) is hardcoding etcd-0, etcd-1, etcd-2 somewhere, we should fix it to check the SRV record instead.
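
As a sketch of what SRV-based discovery could look like, assuming etcd's conventional _etcd-server-ssl._tcp record under the cluster domain (the record name is an assumption for illustration; the domain is the one from the logs above):

```
package main

import (
	"fmt"
	"net"
)

func main() {
	// Assumed record: etcd members published under _etcd-server-ssl._tcp
	// for the cluster domain from the logs above.
	_, srvs, err := net.LookupSRV("etcd-server-ssl", "tcp", "ostest.test.metalkube.org")
	if err != nil {
		fmt.Println("SRV lookup failed:", err)
		return
	}
	for _, srv := range srvs {
		// Each target (e.g. etcd-0.ostest.test.metalkube.org.) is discovered
		// from DNS rather than hardcoded as etcd-0/etcd-1/etcd-2.
		fmt.Printf("https://%s:%d\n", srv.Target, srv.Port)
	}
}
```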

Comment 3 Karim Boumedhel 2020-03-11 10:31:44 UTC
A workaround is to do the following on each master:

- Add the following snippet at the end of /etc/mdns/config.hcl, setting the proper number depending on the master.

service {
    name = "ostest EtcdWorkstation"
    host_name = "etcd-$NUMBER.local."
    type = "_workstation._tcp"
    domain = "local."
    port = 42424
    ttl = 300
}

- Make sure the mdns-publisher static pod is restarted with:

crictl stop $(crictl ps | grep mdns | cut -f1 -d" ")

Comment 4 David Eads 2020-03-11 12:01:55 UTC
They are not hardcoded.  Since the beginning of 4.x we have harvested the names from the endpoints, and the DNS entries are expected to be present: https://github.com/openshift/cluster-kube-apiserver-operator/blob/release-4.1/pkg/operator/configobservation/etcd/observe_etcd.go#L58-L68 . In fact, prior to 4.4 it was actually impossible to even start etcd without working DNS entries, because we used the etcd DNS discovery mechanism.
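
Roughly, "harvesting the names from the endpoints" means reading the etcd endpoints object and turning the recorded hostnames into client URLs. A loose illustration of that flow with a recent client-go; the endpoints location ("kube-system"/"host-etcd") and the hardcoded DNS suffix are assumptions for illustration here, and the linked observer remains the authoritative source:

```
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumed location of the etcd endpoints object; the real observer linked
	// above derives the DNS suffix from the cluster rather than a constant.
	const namespace, name = "kube-system", "host-etcd"
	const dnsSuffix = "ostest.test.metalkube.org" // taken from the logs in this bug

	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ep, err := client.CoreV1().Endpoints(namespace).Get(context.Background(), name, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	for _, subset := range ep.Subsets {
		for _, addr := range subset.Addresses {
			// Each discovered hostname (e.g. "etcd-0") still has to resolve via
			// DNS/mDNS, which is exactly what was failing in this bug.
			fmt.Printf("https://%s.%s:2379\n", addr.Hostname, dnsSuffix)
		}
	}
}
```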

If you're having trouble with this, you probably want to figure out what is wrong with mDNS.  

Separately, in 4.4 we developed an operator that was able to remove the long-standing etcd DNS dependency and we merged the backport of a 4.5 change to the kube-apiserver-operator to use IP addresses (https://github.com/openshift/cluster-kube-apiserver-operator/pull/792), but you should sort out why your DNS is broken.  I'm not sure what else in the stack is going to break for you.

Comment 5 Martin André 2020-03-11 13:50:51 UTC
Note that this affected more than just BM. A similar error was reported in https://bugzilla.redhat.com/show_bug.cgi?id=1811530.

Closing as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1812071 tracking the backport to 4.4.

*** This bug has been marked as a duplicate of bug 1812071 ***