1663447 – etcd cluster failed to start when cluster name ends with "-"

Summary: etcd cluster failed to start when cluster name ends with "-"

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.1.0
Assignee:	Matthew Staebler
QA Contact:	Johnny Liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-01-04 11:26 UTC by Johnny Liu
Modified:	2019-06-04 10:41 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-06-04 10:41:28 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:0758	0	None	None	None	2019-06-04 10:41:36 UTC

Description Johnny Liu 2019-01-04 11:26:50 UTC

Description of problem:
See the following details.

Version-Release number of the following components:
# ./openshift-install version
./openshift-install v0.8.0-master-8-g713289e20bd6afccb06f2e4ff7ed89d2483fac9a

How reproducible:
Always

Steps to Reproduce:
1. Trigger an install with the cluster name "qe-jialiu-".

Actual results:
Install failed with the following error.
INFO Waiting up to 30m0s for the Kubernetes API... 
DEBUG Still waiting for the Kubernetes API: Get https://qe-jialiu--api.qe.devcluster.openshift.com:6443/version?timeout=32s: dial tcp 3.17.117.40:6443: i/o timeout 
DEBUG Still waiting for the Kubernetes API: Get https://qe-jialiu--api.qe.devcluster.openshift.com:6443/version?timeout=32s: dial tcp 3.16.59.249:6443: connect: connection refused 
DEBUG Still waiting for the Kubernetes API: the server could not find the requested resource 
DEBUG Still waiting for the Kubernetes API: the server could not find the requested resource 
DEBUG Still waiting for the Kubernetes API: the server could not find the requested resource 

On the bootstrap node, the bootkube service logs the following:
# journalctl -b -f -u bootkube.service
-- Logs begin at Fri 2019-01-04 10:51:44 UTC. --
Jan 04 11:13:15 ip-10-0-6-86 bootkube.sh[4399]: https://qe-jialiu--etcd-1.qe.devcluster.openshift.com:2379 is unhealthy: failed to connect: dial tcp 10.0.21.223:2379: getsockopt: connection refused
Jan 04 11:13:15 ip-10-0-6-86 bootkube.sh[4399]: https://qe-jialiu--etcd-2.qe.devcluster.openshift.com:2379 is unhealthy: failed to connect: dial tcp 10.0.42.31:2379: getsockopt: connection refused
Jan 04 11:13:15 ip-10-0-6-86 bootkube.sh[4399]: https://qe-jialiu--etcd-0.qe.devcluster.openshift.com:2379 is unhealthy: failed to connect: dial tcp 10.0.3.48:2379: getsockopt: connection refused
Jan 04 11:13:15 ip-10-0-6-86 bootkube.sh[4399]: Error:  unhealthy cluster
Jan 04 11:13:15 ip-10-0-6-86 bootkube.sh[4399]: etcdctl failed. Retrying in 5 seconds...


Expected results:
The installation should complete successfully. If the etcd cluster cannot work with the "qe-jialiu--" prefix, the installer should warn the user and exit early.

Additional info:
After correcting the cluster name to "qe-jialiu", the installation completes successfully.

Comment 1 Alex Crawford 2019-01-09 19:18:35 UTC

I was able to confirm this. I tried to use "crawford-" as my cluster name.

On the master node, I see the following from the discovery container:

# crictl logs 8a3df72c9e097
I0109 18:52:09.698063       1 run.go:47] Version: 3.11.0-408-g09742d64-dirty
I0109 18:52:09.698592       1 run.go:57] ip addr is 192.168.126.11
E0109 18:52:09.698666       1 run.go:63] error looking up self: lookup _etcd-server-ssl._tcp.crawford-.openshift.testing: no such host
E0109 18:53:09.698965       1 run.go:63] error looking up self: lookup _etcd-server-ssl._tcp.crawford-.openshift.testing: no such host
E0109 18:54:09.698973       1 run.go:63] error looking up self: lookup _etcd-server-ssl._tcp.crawford-.openshift.testing: no such host
E0109 18:55:09.698975       1 run.go:63] error looking up self: lookup _etcd-server-ssl._tcp.crawford-.openshift.testing: no such host
E0109 18:56:09.698920       1 run.go:63] error looking up self: lookup _etcd-server-ssl._tcp.crawford-.openshift.testing: no such host
E0109 18:57:09.698970       1 run.go:63] error looking up self: lookup _etcd-server-ssl._tcp.crawford-.openshift.testing: no such host
E0109 18:57:09.699024       1 run.go:63] error looking up self: lookup _etcd-server-ssl._tcp.crawford-.openshift.testing: no such host
F0109 18:57:09.699056       1 main.go:30] Error executing etcd-setup-environment: could not find self: timed out waiting for the condition


In /var/lib/libvirt/dnsmasq/crawford-.conf I see the following entry:

srv-host=_etcd-server-ssl._tcp.crawford-.openshift.testing,crawford--etcd-0.openshift.testing,2380,0,10


I'm also able to use dig to fetch that record:

$ dig _etcd-server-ssl._tcp.crawford-.openshift.testing SRV +short
0 10 2380 crawford--etcd-0.openshift.testing.


It looks like the problem lies within registry.svc.ci.openshift.org/openshift/origin-v4.0:setup-etcd-environment (https://github.com/openshift/machine-config-operator/blob/09742d642e6846afcf1297ae6911e6bdfc88a48d/cmd/setup-etcd-environment/run.go).

Comment 2 Alex Crawford 2019-02-13 22:53:40 UTC

Abhinav, did you get a chance to dig into this further? Last I remember, we traced the problem back to the Go standard library, but maybe a trailing hyphen isn't valid in a subdomain/hostname.
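
To illustrate the trailing-hyphen hypothesis above, here is a minimal Go sketch that scans a DNS name for a label ending in '-'. The helper name firstLabelEndingInHyphen is hypothetical, and this is not the Go standard library's resolver code. It only shows that the SRV name from the discovery log contains such a label ("crawford-"), while the per-node etcd hostname does not.

```go
package main

import (
	"fmt"
	"strings"
)

// firstLabelEndingInHyphen is a hypothetical helper (not stdlib code).
// It returns the first dot-separated label of name that ends with '-',
// and whether such a label was found. A trailing root dot is ignored.
func firstLabelEndingInHyphen(name string) (string, bool) {
	for _, label := range strings.Split(strings.TrimSuffix(name, "."), ".") {
		if strings.HasSuffix(label, "-") {
			return label, true
		}
	}
	return "", false
}

func main() {
	// Names taken from the logs in comment 1.
	names := []string{
		"_etcd-server-ssl._tcp.crawford-.openshift.testing",
		"crawford--etcd-0.openshift.testing.",
	}
	for _, name := range names {
		label, found := firstLabelEndingInHyphen(name)
		fmt.Printf("%s: label=%q found=%t\n", name, label, found)
	}
}
```

If the resolver used by the discovery container rejects names with such a label, that would be consistent with dig returning the record while the in-container lookup reports "no such host". That remains an assumption until confirmed.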

Comment 5 Matthew Staebler 2019-02-14 18:13:41 UTC

Fix in https://github.com/openshift/installer/pull/1255.

The installer should not allow a cluster name that ends with a hyphen. It validated this when the cluster name was entered at the CLI prompt, but not when an install-config.yaml was provided.
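
A minimal sketch of the kind of check described above, assuming the DNS-1123 subdomain regex quoted in the installer's error message in comment 6. The function name validateClusterName is hypothetical, and this is not the code from the pull request. It only shows a single check that could be applied to a cluster name whether it came from the prompt or from install-config.yaml.

```go
package main

import (
	"fmt"
	"regexp"
)

// dns1123Subdomain is the regex quoted in the installer's error message,
// anchored so that the whole name must match.
var dns1123Subdomain = regexp.MustCompile(
	`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)

// validateClusterName is a hypothetical helper. It returns an error when
// name is not a DNS-1123 subdomain, for example when it ends with '-'.
func validateClusterName(name string) error {
	if !dns1123Subdomain.MatchString(name) {
		return fmt.Errorf("invalid cluster name %q: must be a DNS-1123 subdomain", name)
	}
	return nil
}

func main() {
	// Cluster names taken from this bug report.
	for _, name := range []string{"qe-jialiu-", "crawford-", "qe-jialiu"} {
		if err := validateClusterName(name); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", name)
	}
}
```

Running the same function on every input path avoids the gap reported here, where one path was checked and the other was not.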

Comment 6 Johnny Liu 2019-02-15 08:14:30 UTC

Verified this bug with v4.0.0-0.173.0.0-dirty: PASS.

# ./openshift-install version
./openshift-install v4.0.0-0.173.0.0-dirty

# ./openshift-install create cluster --dir demo
? Platform aws
? Region us-east-2
? Base Domain qe.devcluster.openshift.com
X Sorry, your reply was invalid: a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')
? Cluster Name qe-jialiu
? Pull Secret [? for help] ********
WARNING Found override for OS Image. Please be warned, this is not advised
WARNING Found override for ReleaseImage. Please be warned, this is not advised 
INFO Creating cluster...

Comment 7 Johnny Liu 2019-02-18 02:27:35 UTC

Per comment 6, moving this bug to 'VERIFIED'.

Comment 8 W. Trevor King 2019-02-27 05:31:40 UTC

And 0.13.0 is out with the fix [1].

[1]: https://github.com/openshift/installer/releases/tag/v0.13.0

Comment 11 errata-xmlrpc 2019-06-04 10:41:28 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
