Bug 1663447 - etcd cluster failed to start when cluster name ends with "-"
Summary: etcd cluster failed to start when cluster name ends with "-"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.1.0
Assignee: Matthew Staebler
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-01-04 11:26 UTC by Johnny Liu
Modified: 2019-06-04 10:41 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:41:28 UTC
Target Upstream Version:




Links
System | ID | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2019:0758 | None | None | None | 2019-06-04 10:41:36 UTC

Description Johnny Liu 2019-01-04 11:26:50 UTC
Description of problem:
See the following details.

Version-Release number of the following components:
# ./openshift-install version
./openshift-install v0.8.0-master-8-g713289e20bd6afccb06f2e4ff7ed89d2483fac9a

How reproducible:
Always

Steps to Reproduce:
1. Trigger an install with the cluster name "qe-jialiu-".

Actual results:
Install failed with the following error.
INFO Waiting up to 30m0s for the Kubernetes API... 
DEBUG Still waiting for the Kubernetes API: Get https://qe-jialiu--api.qe.devcluster.openshift.com:6443/version?timeout=32s: dial tcp 3.17.117.40:6443: i/o timeout 
DEBUG Still waiting for the Kubernetes API: Get https://qe-jialiu--api.qe.devcluster.openshift.com:6443/version?timeout=32s: dial tcp 3.16.59.249:6443: connect: connection refused 
DEBUG Still waiting for the Kubernetes API: the server could not find the requested resource 
DEBUG Still waiting for the Kubernetes API: the server could not find the requested resource 
DEBUG Still waiting for the Kubernetes API: the server could not find the requested resource 

Go to bootstrap node, get the following log:
# journalctl -b -f -u bootkube.service
-- Logs begin at Fri 2019-01-04 10:51:44 UTC. --
Jan 04 11:13:15 ip-10-0-6-86 bootkube.sh[4399]: https://qe-jialiu--etcd-1.qe.devcluster.openshift.com:2379 is unhealthy: failed to connect: dial tcp 10.0.21.223:2379: getsockopt: connection refused
Jan 04 11:13:15 ip-10-0-6-86 bootkube.sh[4399]: https://qe-jialiu--etcd-2.qe.devcluster.openshift.com:2379 is unhealthy: failed to connect: dial tcp 10.0.42.31:2379: getsockopt: connection refused
Jan 04 11:13:15 ip-10-0-6-86 bootkube.sh[4399]: https://qe-jialiu--etcd-0.qe.devcluster.openshift.com:2379 is unhealthy: failed to connect: dial tcp 10.0.3.48:2379: getsockopt: connection refused
Jan 04 11:13:15 ip-10-0-6-86 bootkube.sh[4399]: Error:  unhealthy cluster
Jan 04 11:13:15 ip-10-0-6-86 bootkube.sh[4399]: etcdctl failed. Retrying in 5 seconds...


Expected results:
Installation should complete successfully. If the etcd cluster cannot work with the "qe-jialiu--" prefix, the installer should report the problem to the user and exit early.

Additional info:
After correcting the cluster name to "qe-jialiu", the installation completed successfully.

Comment 1 Alex Crawford 2019-01-09 19:18:35 UTC
I was able to confirm this. I tried to use "crawford-" as my cluster name.

On the master node, I see the following from the discovery container:

# crictl logs 8a3df72c9e097
I0109 18:52:09.698063       1 run.go:47] Version: 3.11.0-408-g09742d64-dirty
I0109 18:52:09.698592       1 run.go:57] ip addr is 192.168.126.11
E0109 18:52:09.698666       1 run.go:63] error looking up self: lookup _etcd-server-ssl._tcp.crawford-.openshift.testing: no such host
E0109 18:53:09.698965       1 run.go:63] error looking up self: lookup _etcd-server-ssl._tcp.crawford-.openshift.testing: no such host
E0109 18:54:09.698973       1 run.go:63] error looking up self: lookup _etcd-server-ssl._tcp.crawford-.openshift.testing: no such host
E0109 18:55:09.698975       1 run.go:63] error looking up self: lookup _etcd-server-ssl._tcp.crawford-.openshift.testing: no such host
E0109 18:56:09.698920       1 run.go:63] error looking up self: lookup _etcd-server-ssl._tcp.crawford-.openshift.testing: no such host
E0109 18:57:09.698970       1 run.go:63] error looking up self: lookup _etcd-server-ssl._tcp.crawford-.openshift.testing: no such host
E0109 18:57:09.699024       1 run.go:63] error looking up self: lookup _etcd-server-ssl._tcp.crawford-.openshift.testing: no such host
F0109 18:57:09.699056       1 main.go:30] Error executing etcd-setup-environment: could not find self: timed out waiting for the condition


In /var/lib/libvirt/dnsmasq/crawford-.conf I see the following entry:

srv-host=_etcd-server-ssl._tcp.crawford-.openshift.testing,crawford--etcd-0.openshift.testing,2380,0,10


I'm also able to use dig to fetch that record:

$ dig _etcd-server-ssl._tcp.crawford-.openshift.testing SRV +short
0 10 2380 crawford--etcd-0.openshift.testing.


It looks like the problem lies within registry.svc.ci.openshift.org/openshift/origin-v4.0:setup-etcd-environment (https://github.com/openshift/machine-config-operator/blob/09742d642e6846afcf1297ae6911e6bdfc88a48d/cmd/setup-etcd-environment/run.go).

Comment 2 Alex Crawford 2019-02-13 22:53:40 UTC
Abhinav, did you get a chance to dig into this further? Last I remember, we traced the problem back to the Go standard library, but maybe a trailing hyphen isn't a valid subdomain/hostname.
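
The trailing-hyphen hypothesis can be checked against the names from comment 1. The following is a minimal Go sketch, not code from the installer or the machine-config-operator: it rebuilds the SRV name and the etcd host name in the form shown in the dnsmasq entry and lists any dot-separated label that starts or ends with a hyphen. The helper names (srvName, etcdHost, badLabels) are hypothetical. Whether the Go resolver rejects such a label is not confirmed in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// srvName builds the etcd SRV record name in the form seen in the dnsmasq
// entry from comment 1: _etcd-server-ssl._tcp.<cluster>.<base domain>.
// (Hypothetical helper, for illustration only.)
func srvName(cluster, baseDomain string) string {
	return fmt.Sprintf("_etcd-server-ssl._tcp.%s.%s", cluster, baseDomain)
}

// etcdHost builds the per-member host name in the form seen in comment 1:
// <cluster>-etcd-<index>.<base domain>. (Hypothetical helper.)
func etcdHost(cluster string, index int, baseDomain string) string {
	return fmt.Sprintf("%s-etcd-%d.%s", cluster, index, baseDomain)
}

// badLabels returns the dot-separated labels of name that start or end
// with a hyphen, which RFC 1123 host name rules do not allow.
func badLabels(name string) []string {
	var bad []string
	for _, label := range strings.Split(name, ".") {
		if strings.HasPrefix(label, "-") || strings.HasSuffix(label, "-") {
			bad = append(bad, label)
		}
	}
	return bad
}

func main() {
	for _, cluster := range []string{"crawford-", "crawford"} {
		srv := srvName(cluster, "openshift.testing")
		host := etcdHost(cluster, 0, "openshift.testing")
		fmt.Printf("cluster %q\n", cluster)
		fmt.Printf("  SRV  %s  offending labels: %q\n", srv, badLabels(srv))
		fmt.Printf("  host %s  offending labels: %q\n", host, badLabels(host))
	}
}
```

With "crawford-" as the cluster name, only the SRV name contains an offending label ("crawford-"). The host name "crawford--etcd-0" has its hyphens in the middle of the label, which the same rule permits.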

Comment 5 Matthew Staebler 2019-02-14 18:13:41 UTC
Fix in https://github.com/openshift/installer/pull/1255.

The installer should not allow a cluster name that ends with a hyphen. It already validated this when the cluster name was entered at the CLI prompt, but not when an install-config.yaml was provided.
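
As an illustration of the check described here, the following Go sketch validates a cluster name against the DNS-1123 subdomain pattern quoted in the installer's error message in comment 6. It is not the code from the pull request. validateClusterName is a hypothetical helper, and the sketch assumes the same function would be called both from the interactive prompt and when loading install-config.yaml.

```go
package main

import (
	"fmt"
	"regexp"
)

// dns1123Subdomain is the pattern quoted in the installer's error message,
// anchored with ^ and $ so that it must match the whole name.
var dns1123Subdomain = regexp.MustCompile(
	`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)

// validateClusterName is a hypothetical helper: it returns an error when
// name is not a DNS-1123 subdomain. The length limit that full DNS-1123
// validation also applies is left out of this sketch.
func validateClusterName(name string) error {
	if !dns1123Subdomain.MatchString(name) {
		return fmt.Errorf("invalid cluster name %q: must start and end with a lower case alphanumeric character", name)
	}
	return nil
}

func main() {
	// The same check would run for a name typed at the prompt and for a
	// name read from install-config.yaml (assumption, see lead-in).
	for _, name := range []string{"qe-jialiu-", "qe-jialiu"} {
		if err := validateClusterName(name); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("accepted: %q\n", name)
	}
}
```

Routing both input paths through one validation function avoids the gap described above, where one path was checked and the other was not.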

Comment 6 Johnny Liu 2019-02-15 08:14:30 UTC
Verified this bug with v4.0.0-0.173.0.0-dirty; it passes.

# ./openshift-install version
./openshift-install v4.0.0-0.173.0.0-dirty

# ./openshift-install create cluster --dir demo
? Platform aws
? Region us-east-2
? Base Domain qe.devcluster.openshift.com
X Sorry, your reply was invalid: a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')
? Cluster Name qe-jialiu
? Pull Secret [? for help] ****************************************************************
WARNING Found override for OS Image. Please be warned, this is not advised 
WARNING Found override for ReleaseImage. Please be warned, this is not advised 
INFO Creating cluster...

Comment 7 Johnny Liu 2019-02-18 02:27:35 UTC
Per comment 6, moving this bug to 'VERIFIED'.

Comment 8 W. Trevor King 2019-02-27 05:31:40 UTC
And 0.13.0 is out with the fix [1].

[1]: https://github.com/openshift/installer/releases/tag/v0.13.0

Comment 11 errata-xmlrpc 2019-06-04 10:41:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

