Bug 1812409 - cannot install on BM or with dev-scripts due to missing etcd mdns records
Summary: cannot install on BM or with dev-scripts due to missing etcd mdns records
Keywords:
Status: CLOSED DUPLICATE of bug 1812071
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.4
Hardware: All
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Ben Nemec
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-03-11 09:37 UTC by Yuval Kashtan
Modified: 2020-03-11 13:50 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-11 13:50:51 UTC
Target Upstream Version:
Embargoed:



Description Yuval Kashtan 2020-03-11 09:37:03 UTC
Description of problem:
Can't install a baremetal cluster, or a vbmc-based one (with dev-scripts).

After the masters are ready, no workers join the cluster.

We did some investigation and found that the root cause is that the bootstrap VM didn't tear down (and was possibly still holding the VIP) because the cluster wasn't ready. The cluster wasn't ready because kube-apiserver wasn't ready, and kube-apiserver wasn't ready because the etcd mDNS records were missing.

Manually adding those etcd entries (in mdns-publisher) seems to have allowed the install to resume. Such entries are not needed on a 4.5 cluster (and, I believe, not on 4.3 either), so we suspect an issue with the etcd operator strictly requiring them right now.
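
For reference, a minimal Go sketch of the kind of lookup that was failing (illustrative only; the hostnames are the ones from the kube-apiserver errors below and are specific to this environment):
```
package main

import (
	"fmt"
	"net"
)

func main() {
	// The per-member etcd hostnames the kube-apiserver was trying to resolve.
	hosts := []string{
		"etcd-0.ostest.test.metalkube.org",
		"etcd-1.ostest.test.metalkube.org",
		"etcd-2.ostest.test.metalkube.org",
	}
	for _, h := range hosts {
		addrs, err := net.LookupHost(h)
		if err != nil {
			// This is the "no such host" failure seen in the apiserver logs.
			fmt.Printf("%s: lookup failed: %v\n", h, err)
			continue
		}
		fmt.Printf("%s -> %v\n", h, addrs)
	}
}
```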

Version-Release number of the following components:
```
# oc version
Client Version: 4.4.0-0.nightly-2020-03-10-194324
Server Version: 4.4.0-0.nightly-2020-03-10-194324
```

The workers' error:
```
[  *** ] A start job is running for Ignition (fetch) (1h 49min 25s / no limit)[ 6567.970571] ignition[938]: GET https://192.168.111.5:22623/config/worker: attempt #1307
[ 6567.972844] ignition[938]: GET result: Internal Server Error                                                      
[*     ] A start job is running for Ignition (fetch) (1h 49min 30s / no limit)[ 6572.971612] ignition[938]: GET https://192.168.111.5:22623/config/worker: attempt #1308
[ 6572.973787] ignition[938]: GET result: Internal Server Error                                                      
[  *** ] A start job is running for Ignition (fetch) (1h 49min 35s / no limit)[ 6577.972390] ignition[938]: GET https://192.168.111.5:22623/config/worker: attempt #1309
[ 6577.975167] ignition[938]: GET result: Internal Server Error                                                      
[    **] A start job is running for Ignition (fetch) (1h 49min 40s / no limit)[ 6582.973675] ignition[938]: GET https://192.168.111.5:22623/config/worker: attempt #1310
[ 6582.976332] ignition[938]: GET result: Internal Server Error           
```

The kube-apiserver error:
```
W0311 08:45:52.695024       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://etcd-1.ostest.test.metalkube.org:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup etcd-1.ostest.test.metalkube.org on 192.168.111.2:53: no such host". Reconnecting...
W0311 08:45:54.972813       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://etcd-1.ostest.test.metalkube.org:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup etcd-1.ostest.test.metalkube.org on 192.168.111.2:53: no such host". Reconnecting...                              
W0311 08:45:55.595146       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://etcd-0.ostest.test.metalkube.org:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup etcd-0.ostest.test.metalkube.org on 192.168.111.2:53: no such host". Reconnecting...  
```

Again, manually adding the entries resolved the issue.




This is blocking us from installing any environment for dev/QE/integration, etc.

More info can be found here:
https://coreos.slack.com/archives/CFP6ST0A3/p1583914909434800

Comment 1 Antoni Segura Puimedon 2020-03-11 10:26:40 UTC
If kube-apiserver in 4.4 (but not 4.5) is hardcoding etcd-0, etcd-1, etcd-2 somewhere, we should fix it to check the SRV record instead.
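
For illustration, a minimal sketch of what checking the SRV record could look like; the `_etcd-server-ssl._tcp` service name follows etcd's DNS discovery convention and is an assumption here, not something taken from the operator code:
```
package main

import (
	"fmt"
	"net"
)

func main() {
	// Discover etcd members via SRV instead of fixed etcd-N hostnames.
	// "_etcd-server-ssl._tcp.<domain>" is etcd's DNS discovery convention.
	const domain = "ostest.test.metalkube.org"
	_, records, err := net.LookupSRV("etcd-server-ssl", "tcp", domain)
	if err != nil {
		fmt.Printf("SRV lookup under %s failed: %v\n", domain, err)
		return
	}
	for _, r := range records {
		fmt.Printf("member: https://%s:%d\n", r.Target, r.Port)
	}
}
```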

Comment 3 Karim Boumedhel 2020-03-11 10:31:44 UTC
A workaround is to do the following on each master:

- Add the following snippet at the end of /etc/mdns/config.hcl, setting the proper number in etcd-$NUMBER depending on the master.

service {
    name = "ostest EtcdWorkstation"
    host_name = "etcd-$NUMBER.local."
    type = "_workstation._tcp"
    domain = "local."
    port = 42424
    ttl = 300
}

- Make sure the mdns-publisher static pod is restarted with:

crictl stop $(crictl ps | grep mdns | cut -f1 -d" ")
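
To double-check that the new record is actually being published after the restart, something along these lines can be used (an illustrative sketch only; it assumes the github.com/hashicorp/mdns client library, which is not part of the cluster tooling above):
```
package main

import (
	"fmt"

	"github.com/hashicorp/mdns"
)

func main() {
	// Browse for the "_workstation._tcp" service type that the config.hcl
	// snippet above publishes, and print whatever answers on the local link.
	entries := make(chan *mdns.ServiceEntry, 8)
	go func() {
		for entry := range entries {
			fmt.Printf("found: %+v\n", entry)
		}
	}()
	if err := mdns.Lookup("_workstation._tcp", entries); err != nil {
		fmt.Printf("mDNS lookup failed: %v\n", err)
	}
	close(entries)
}
```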

Comment 4 David Eads 2020-03-11 12:01:55 UTC
They are not hardcoded. Since the beginning of 4.x, we harvest the names from the endpoints and the DNS entries are present: https://github.com/openshift/cluster-kube-apiserver-operator/blob/release-4.1/pkg/operator/configobservation/etcd/observe_etcd.go#L58-L68. In fact, prior to 4.4, it was actually impossible to even start etcd without having DNS entries working, because we used the etcd DNS discovery mechanism.

If you're having trouble with this, you probably want to figure out what is wrong with mDNS.  

Separately, in 4.4 we developed an operator that was able to remove the long-standing etcd DNS dependency and we merged the backport of a 4.5 change to the kube-apiserver-operator to use IP addresses (https://github.com/openshift/cluster-kube-apiserver-operator/pull/792), but you should sort out why your DNS is broken.  I'm not sure what else in the stack is going to break for you.
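
For illustration, the harvesting described above amounts to something like the following simplified sketch; the Endpoints contents and the DNS suffix here are made up for the example, and the linked observe_etcd.go is the real logic:
```
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// etcdURLs builds per-member storage URLs from the hostnames recorded on an
// Endpoints object plus a DNS suffix, in the spirit of the operator's
// config observation (simplified; not the actual operator code).
func etcdURLs(ep *corev1.Endpoints, dnsSuffix string) []string {
	var urls []string
	for _, subset := range ep.Subsets {
		for _, addr := range subset.Addresses {
			if addr.Hostname == "" {
				continue
			}
			urls = append(urls, fmt.Sprintf("https://%s.%s:2379", addr.Hostname, dnsSuffix))
		}
	}
	return urls
}

func main() {
	// Example Endpoints shaped like a three-master cluster.
	ep := &corev1.Endpoints{
		Subsets: []corev1.EndpointSubset{{
			Addresses: []corev1.EndpointAddress{
				{Hostname: "etcd-0"},
				{Hostname: "etcd-1"},
				{Hostname: "etcd-2"},
			},
		}},
	}
	for _, u := range etcdURLs(ep, "ostest.test.metalkube.org") {
		fmt.Println(u)
	}
}
```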

Comment 5 Martin André 2020-03-11 13:50:51 UTC
Note that this affected more than just BM. A similar error was reported in https://bugzilla.redhat.com/show_bug.cgi?id=1811530.

Closing as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1812071 tracking the backport to 4.4.

*** This bug has been marked as a duplicate of bug 1812071 ***

