Description of problem: can't install a baremetal cluster (and/or vbmc-based with dev-scripts). After the masters are ready, no workers join the cluster. We did some investigation and found that the root cause is that the bootstrap VM didn't tear down (and was still holding the VIP?) because the cluster wasn't ready. It seems the cluster wasn't ready because kube-apiserver wasn't ready, and that in turn was because it was missing the etcd mDNS records. Manually adding those etcd entries (in mdns-publisher) seems to have allowed the install to resume, but such entries are not needed on a 4.5 cluster (and I believe not on 4.3 either), so we suspect an issue with the etcd operator force-requiring them right now.

Version-Release number of the following components:
```
# oc version
Client Version: 4.4.0-0.nightly-2020-03-10-194324
Server Version: 4.4.0-0.nightly-2020-03-10-194324
```

The workers' error:
```
[ *** ] A start job is running for Ignition (fetch) (1h 49min 25s / no limit)[ 6567.970571] ignition[938]: GET https://192.168.111.5:22623/config/worker: attempt #1307
[ 6567.972844] ignition[938]: GET result: Internal Server Error
[* ] A start job is running for Ignition (fetch) (1h 49min 30s / no limit)[ 6572.971612] ignition[938]: GET https://192.168.111.5:22623/config/worker: attempt #1308
[ 6572.973787] ignition[938]: GET result: Internal Server Error
[ *** ] A start job is running for Ignition (fetch) (1h 49min 35s / no limit)[ 6577.972390] ignition[938]: GET https://192.168.111.5:22623/config/worker: attempt #1309
[ 6577.975167] ignition[938]: GET result: Internal Server Error
[ **] A start job is running for Ignition (fetch) (1h 49min 40s / no limit)[ 6582.973675] ignition[938]: GET https://192.168.111.5:22623/config/worker: attempt #1310
[ 6582.976332] ignition[938]: GET result: Internal Server Error
```

The kube-apiserver error:
```
W0311 08:45:52.695024 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://etcd-1.ostest.test.metalkube.org:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup etcd-1.ostest.test.metalkube.org on 192.168.111.2:53: no such host". Reconnecting...
W0311 08:45:54.972813 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://etcd-1.ostest.test.metalkube.org:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup etcd-1.ostest.test.metalkube.org on 192.168.111.2:53: no such host". Reconnecting...
W0311 08:45:55.595146 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://etcd-0.ostest.test.metalkube.org:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup etcd-0.ostest.test.metalkube.org on 192.168.111.2:53: no such host". Reconnecting...
```

And again, manually adding the entries resolved the issue. This is blocking us from installing any environment for dev/QE/integration/whatever.

More info might be found here: https://coreos.slack.com/archives/CFP6ST0A3/p1583914909434800
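A quick way to confirm this failure mode from a master is to check whether the etcd names from the kube-apiserver log resolve at all. This is a hedged sketch, not part of the original report; the hostnames are taken from the logs above, and `getent` is assumed to go through the standard NSS lookup path (which includes mDNS when nss-mdns is configured):

```shell
# check_host: print whether a hostname resolves via the NSS lookup path
# (includes mDNS when nss-mdns is configured on the host).
check_host() {
  if getent hosts "$1" >/dev/null 2>&1; then
    echo "$1: resolves"
  else
    echo "$1: no such host"   # matches the kube-apiserver "no such host" error
  fi
}

# The names kube-apiserver is failing to look up in the logs above:
check_host etcd-0.ostest.test.metalkube.org
check_host etcd-1.ostest.test.metalkube.org
```

On an affected master these print "no such host", consistent with the grpc dial errors; after adding the mDNS entries they should resolve.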
If kube-apiserver in 4.4 (not 4.5) is hardcoding etcd-0, etcd-1, etcd-2 somewhere, we should fix it to check the SRV record instead.
A workaround is to do the following on each master:

- Add the following snippet at the end of /etc/mdns/config.hcl, setting the proper number depending on the master:

```
service {
  name      = "ostest EtcdWorkstation"
  host_name = "etcd-$NUMBER.local."
  type      = "_workstation._tcp"
  domain    = "local."
  port      = 42424
  ttl       = 300
}
```

- Make sure the mdns-publisher static pod is restarted with:

```
crictl stop $(crictl ps | grep mdns | cut -f1 -d" ")
```
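The two steps above can be sketched as a small per-master script. This is an illustrative sketch, not a tested tool: the config path and the snippet are the ones from the workaround, and the `crictl` restart is skipped when `crictl` is not on PATH:

```shell
# apply_workaround <master-number> [config-file]
# Appends the etcd service entry to the mdns-publisher config and
# restarts the mdns-publisher static pod so it rereads the config.
apply_workaround() {
  number="$1"
  config="${2:-/etc/mdns/config.hcl}"

  # Step 1: append the service snippet, substituting the master number.
  cat >> "$config" <<EOF
service {
  name      = "ostest EtcdWorkstation"
  host_name = "etcd-$number.local."
  type      = "_workstation._tcp"
  domain    = "local."
  port      = 42424
  ttl       = 300
}
EOF

  # Step 2: restart the mdns-publisher static pod (skipped when crictl
  # is unavailable, e.g. when trying this off-host).
  if command -v crictl >/dev/null 2>&1; then
    crictl stop $(crictl ps | grep mdns | cut -f1 -d" ")
  fi
}
```

Usage would be `apply_workaround 0` on master-0, `apply_workaround 1` on master-1, and so on.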
They are not hardcoded. Since the beginning of 4.x, we harvest the names from the endpoints, and the DNS entries are present: https://github.com/openshift/cluster-kube-apiserver-operator/blob/release-4.1/pkg/operator/configobservation/etcd/observe_etcd.go#L58-L68 . In fact, prior to 4.4, it was actually impossible to even start etcd without working DNS entries, because we used the etcd DNS discovery mechanism. If you're having trouble with this, you probably want to figure out what is wrong with mDNS.

Separately, in 4.4 we developed an operator that was able to remove the long-standing etcd DNS dependency, and we merged the backport of a 4.5 change to the kube-apiserver-operator to use IP addresses (https://github.com/openshift/cluster-kube-apiserver-operator/pull/792), but you should still sort out why your DNS is broken. I'm not sure what else in the stack is going to break for you.
Note that this affected more than just BM. A similar error was reported in https://bugzilla.redhat.com/show_bug.cgi?id=1811530. Closing as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1812071 tracking the backport to 4.4. *** This bug has been marked as a duplicate of bug 1812071 ***