Bug 1631368 - Cluster install without cloudprovider enabled fails because the etcd connection URL hostname does not match the one in the cert files
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.11.z
Assignee: Russell Teague
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-09-20 12:26 UTC by Johnny Liu
Modified: 2019-10-22 02:36 UTC
5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-14 13:23:02 UTC
Target Upstream Version:
Embargoed:


Attachments
installation log with inventory file embedded for qeos10 (3.80 MB, text/plain)
2018-09-20 12:26 UTC, Johnny Liu
installation log with inventory file embedded for snvl2 (3.18 MB, text/plain)
2018-09-20 12:26 UTC, Johnny Liu

Description Johnny Liu 2018-09-20 12:26:11 UTC
Created attachment 1485119 [details]
installation log with inventory file embedded for qeos10

Description of problem:
QE has two OSP deployments: one is "snvl2", the other is "qeos10".

Today I tried to install a cluster without cloudprovider enabled on the "snvl2" OSP and it failed, while the same install on "qeos10" is okay.

Please go through the two install logs and search for "openshift_master_etcd_hosts"; you will find the difference: one is using short hostnames, the other is using FQDN hostnames.


Version-Release number of the following components:
openshift-ansible-3.11.11-1.git.0.5d4f9d4.el7_5.noarch

How reproducible:
Always

Steps to Reproduce:
1. install a cluster without cloudprovider enabled.
2.
3.

Actual results:
Master API fails to start.
master log:
I0920 10:38:19.442091       1 plugins.go:84] Registered admission plugin "PersistentVolumeClaimResize"
I0920 10:38:19.442114       1 plugins.go:84] Registered admission plugin "StorageObjectInUseProtection"
F0920 10:38:49.447038       1 start_api.go:68] context deadline exceeded

[root@qe-jialiu-master-etcd-1 ~]# etcdctl --ca-file "${ca_file}" --cert-file "${cert_file}" --key-file "${key_file}" --endpoints ${url} cluster-health
cluster may be unhealthy: failed to list members
Error:  client: etcd cluster is unavailable or misconfigured; error #0: http: no Host in request URL
; error #1: x509: certificate is valid for qe-jialiu-master-etcd-2.openshift-snvl2.internal, not qe-jialiu-master-etcd-2
; error #2: x509: certificate is valid for qe-jialiu-master-etcd-3.openshift-snvl2.internal, not qe-jialiu-master-etcd-3
; error #3: x509: certificate is valid for qe-jialiu-master-etcd-1.openshift-snvl2.internal, not qe-jialiu-master-etcd-1

error #0: http: no Host in request URL
error #1: x509: certificate is valid for qe-jialiu-master-etcd-2.openshift-snvl2.internal, not qe-jialiu-master-etcd-2
error #2: x509: certificate is valid for qe-jialiu-master-etcd-3.openshift-snvl2.internal, not qe-jialiu-master-etcd-3
error #3: x509: certificate is valid for qe-jialiu-master-etcd-1.openshift-snvl2.internal, not qe-jialiu-master-etcd-1
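
For reference, a minimal sketch of how the variables in the etcdctl command above can be filled in on a master host; the certificate paths follow the etcdClientInfo layout shown later in comment 11, and the endpoint URL is illustrative, not taken from this cluster:

# Assumed paths and endpoint URL - adjust to the environment
ca_file=/etc/origin/master/master.etcd-ca.crt
cert_file=/etc/origin/master/master.etcd-client.crt
key_file=/etc/origin/master/master.etcd-client.key
url=https://qe-jialiu-master-etcd-1:2379
etcdctl --ca-file "${ca_file}" --cert-file "${cert_file}" --key-file "${key_file}" --endpoints "${url}" cluster-health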


Expected results:
Install passes.

Additional info:
Maybe this is related to https://bugzilla.redhat.com/show_bug.cgi?id=1623335; when verifying that bug, I was using the "qeos10" OSP, not "snvl2".

Comment 1 Johnny Liu 2018-09-20 12:26:56 UTC
Created attachment 1485120 [details]
installation log with inventory file embedded for snvl2

Comment 4 Maciej Szulik 2018-09-20 14:44:47 UTC
I examined the failed etcd logs and I'm seeing lots of:

2018-09-20 10:19:27.923385 W | etcdserver: failed to send out heartbeat on time (exceeded the 500ms timeout for 225.909034ms)
2018-09-20 10:19:27.923458 W | etcdserver: server is likely overloaded

with the heartbeat exceeding the allowed timeout by as much as 1 second!

This means there's a major problem with networking in that environment. I'm sending it over to the networking team to investigate what's going on; it might be a problem with that particular environment.
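
For context, a sketch of where the 500ms figure in those warnings comes from; the file layout and values below are assumptions based on a typical RHEL /etc/etcd/etcd.conf, not taken from this cluster:

# /etc/etcd/etcd.conf (env-style settings read by the etcd service) - assumed values
ETCD_HEARTBEAT_INTERVAL=500    # ms between leader heartbeats; matches the "500ms timeout" in the warning
ETCD_ELECTION_TIMEOUT=2500     # ms a follower waits before starting a new election
# Raising these can quiet the warnings, but "server is likely overloaded" usually points at slow disk or network.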

Comment 11 Johnny Liu 2018-09-21 03:08:19 UTC
This issue does not happen when installing a cluster with cloudprovider enabled on the snvl2 OSP.

# grep "etcdClientInfo" /etc/origin/master/master-config.yaml -A 7
etcdClientInfo:
  ca: master.etcd-ca.crt
  certFile: master.etcd-client.crt
  keyFile: master.etcd-client.key
  urls:
  - https://qe-jialiu-master-etcd-1:2379
  - https://qe-jialiu-master-etcd-2:2379
  - https://qe-jialiu-master-etcd-3:2379

[root@qe-jialiu-master-etcd-1 ~]# hostname 
qe-jialiu-master-etcd-1

[root@qe-jialiu-master-etcd-1 ~]# hostname -f
qe-jialiu-master-etcd-1.openshift-snvl2.internal
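
A hedged way to confirm the mismatch between those short-hostname URLs and the names baked into the etcd serving certificate (the certificate path is an assumption based on openshift-ansible's usual etcd layout):

# Run on an etcd host; /etc/etcd/server.crt is an assumed path
openssl x509 -in /etc/etcd/server.crt -noout -text | grep -A1 'Subject Alternative Name'
# If only the *.openshift-snvl2.internal FQDNs appear here, the short-hostname URLs in
# etcdClientInfo above cannot pass TLS verification, matching the x509 errors in the description.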

Comment 12 Vadim Rutkovsky 2018-09-21 14:09:29 UTC
Can't connect to either of these clusters anymore, so not much info can be fetched.

Could you try setting those up again? Make sure the facts on the target machines are erased so that the cloudprovider fix takes effect.

Comment 14 Scott Dodson 2018-09-21 19:05:16 UTC
There are a handful of inventory variables that differ between the two clusters which aren't directly consumed by openshift-ansible. Can you tell me what these variables affect?

# Good cluster
enable_internal_dns=true
internal_dns_subdomain=int.0921-1ub.qe.rhcloud.com


# Bad Cluster
enable_internal_dns=
internal_dns_subdomain is unset
openshift_node_dnsmasq_additional_config_file={{ lookup('env', 'WORKSPACE') }}/private-openshift-misc/v3_non-vpn/dnsmasq_additional_config-openstack_snvl2

the additional_dnsmasq_config content is
server=/openshift-snvl2.internal/192.168.100.2 


This sends some queries in the search path to a nameserver other than the default. It seems necessary, but it also seems to result in some strange behavior.

# good
# host `hostname`
preserve-jialiu-good-mrre-1.int.0921-1ub.qe.rhcloud.com has address 172.16.122.27

# bad
# host `hostname`
preserve-jialiu-bad-mrre-1.openshift-snvl2.internal has address 192.168.100.14
Host preserve-jialiu-bad-mrre-1.openshift-snvl2.internal not found: 5(REFUSED)
Host preserve-jialiu-bad-mrre-1.openshift-snvl2.internal not found: 5(REFUSED)
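
A hypothetical way to see which nameserver actually serves the internal zone (the hostname and IP are taken from the output above; the second nameserver is whichever other entry appears in /etc/resolv.conf):

# OpenStack internal DNS, the target of the dnsmasq 'server=/openshift-snvl2.internal/192.168.100.2' line
dig +short preserve-jialiu-bad-mrre-1.openshift-snvl2.internal @192.168.100.2
# Any other nameserver in the search path; expect "status: REFUSED", matching the 5(REFUSED) lines above
dig preserve-jialiu-bad-mrre-1.openshift-snvl2.internal @<other nameserver from /etc/resolv.conf>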

Comment 15 Johnny Liu 2018-09-22 15:55:07 UTC
(In reply to Scott Dodson from comment #14)
> There are a handful of inventory variables that differ between the two
> clusters which aren't directly consumed by openshift-ansible. Can you tell
> me what these variables affect?
> 
> # Good cluster
> enable_internal_dns=true
> internal_dns_subdomain=int.0921-1ub.qe.rhcloud.com

The good cluster is running on the centralci OSP, where instance names are not resolvable, so for testing QE has to add A records for these instance names in an external DNS server.
> 
> 
> # Bad Cluster
> enable_internal_dns=
> internal_dns_subdomain is unset
> openshift_node_dnsmasq_additional_config_file={{ lookup('env', 'WORKSPACE') }}/private-openshift-misc/v3_non-vpn/dnsmasq_additional_config-openstack_snvl2
> 
> the additional_dnsmasq_config content is
> server=/openshift-snvl2.internal/192.168.100.2 
> 
The bad cluster is running on the snvl2 OSP, which is set up and controlled by QE. I enabled the OpenStack native internal DNS service, so any instance name becomes resolvable, but the DNS service IP is 192.168.100.2.

Comment 16 Johnny Liu 2018-09-22 16:50:50 UTC
> This sends some queries in the search path to a nameserver other than the default. It seems necessary, but it also seems to result in some strange behavior.
>
> # good
> # host `hostname`
> preserve-jialiu-good-mrre-1.int.0921-1ub.qe.rhcloud.com has address 172.16.122.27
>
> # bad
> # host `hostname`
> preserve-jialiu-bad-mrre-1.openshift-snvl2.internal has address 192.168.100.14
> Host preserve-jialiu-bad-mrre-1.openshift-snvl2.internal not found: 5(REFUSED)
> Host preserve-jialiu-bad-mrre-1.openshift-snvl2.internal not found: 5(REFUSED)

I just tried one more installation on the snvl2 OSP with enable_internal_dns=true set, just like what was done on the centralci OSP, and the cluster was set up successfully.

Do you have any idea why this strange behavior happened and how to fix it?

Comment 18 Johnny Liu 2018-09-25 03:22:26 UTC
Summary:
install on snvl2 OSP + no cloudprovider + enable_internal_dns=false, FAIL
install on snvl2 OSP + cloudprovider + enable_internal_dns=false, PASS
install on snvl2 OSP + no cloudprovider + enable_internal_dns=true, PASS

Comment 19 Vadim Rutkovsky 2018-10-01 09:06:28 UTC
What does `enable_internal_dns` do, and why does it give wrong info about hostnames?

On AWS in our test/ci playbooks we're simply setting `hostname` to FQDN, so it might be an internal DNS server issue.

Comment 20 Johnny Liu 2018-10-08 10:20:34 UTC
(In reply to Vadim Rutkovsky from comment #19)
> What does `enable_internal_dns` do and why is it giving a wrong info about
> hostnames?


Please refer to https://bugzilla.redhat.com/show_bug.cgi?id=1631368#c15. Once `enable_internal_dns` is set, QE's playbook creates A records in the external dynnet DNS server and adds a search domain to /etc/resolv.conf to make each host's short hostname resolvable, e.g.:

On host A:
short hostname: qe-jialiu-master
An A DNS record - qe-jialiu-master.int.1008-m02.qe.rhcloud.com - will be created for it in the dynnet DNS server,
the search domain "int.1008-m02.qe.rhcloud.com" will be added to /etc/resolv.conf,
and the host's hostname will be kept as "qe-jialiu-master", so qe-jialiu-master becomes resolvable. When enable_internal_dns=true, the install completes successfully and hostname gives correct info.

When enable_internal_dns=false, we utilize OpenStack's internal DNS resolution and do not interact with the external public dynnet DNS server. All the hosts' short hostnames can be resolved by OpenStack's internal DNS server - 192.168.100.2 - so we have to use openshift_node_dnsmasq_additional_config_file to configure dnsmasq with the extra upstream DNS server. The installation then fails as in my initial report.
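
Only as an illustration (the details below are assumptions summarizing the two modes, not copied from either cluster):

# enable_internal_dns=true: external dynnet A records + search domain
#   /etc/resolv.conf on the node contains:
#     search int.1008-m02.qe.rhcloud.com
#   so the short name "qe-jialiu-master" resolves via the dynnet A record above.
#
# enable_internal_dns=false: OpenStack internal DNS reached through a dnsmasq forward
#   the additional dnsmasq config contains:
#     server=/openshift-snvl2.internal/192.168.100.2
#   so only *.openshift-snvl2.internal queries go to 192.168.100.2; other resolvers REFUSE that zone.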

> On AWS in our test/ci playbooks we're simply setting `hostname` to FQDN,
Is that a must? I think different users may configure their clusters differently; not every user likes to use the FQDN as `hostname`.

Comment 21 Scott Dodson 2018-11-01 14:09:56 UTC
Can this be tested with the recent changes related to hostname configuration?

Comment 22 Johnny Liu 2018-11-02 06:15:03 UTC
Re-tested this bug with openshift-ansible-3.11.36-1.git.0.2213c76.el7.noarch; it still failed, with exactly the same behavior as before.

Comment 23 Vadim Rutkovsky 2019-01-29 10:16:48 UTC
(In reply to Johnny Liu from comment #20)
> > On AWS in our test/ci playbooks we're simply setting `hostname` to FQDN,
> Is that a must? I think different users may configure their clusters
> differently; not every user likes to use the FQDN as `hostname`.

If you're not using a cloudprovider, `hostname -f` has to return the FQDN, otherwise the kubelet gets confused.

Does this issue still occur? Can the `no cloudprovider + enable_internal_dns=false` case be adjusted to set `hostname -f` to the FQDN?

Comment 24 Johnny Liu 2019-02-15 06:38:47 UTC
(In reply to Vadim Rutkovsky from comment #23)
> Does this issue still occur?
Re-tested this bug with openshift-ansible-3.11.83-1.git.0.937d518.el7.noarch; still reproduced.

TASK [Gather Cluster facts] ****************************************************
changed: [dhcp-89-156.sjc.redhat.com] => {"ansible_facts": {"openshift": {"common": {"all_hostnames": ["qe-jialiu311-master-etcd-nfs-1", "dhcp-89-156.sjc.redhat.com", "192.168.100.6"], "config_base": "/etc/origin", "dns_domain": "cluster.local", "generate_no_proxy_hosts": true, "hostname": "qe-jialiu311-master-etcd-nfs-1", "internal_hostnames": ["qe-jialiu311-master-etcd-nfs-1", "192.168.100.6"], "ip": "192.168.100.6", "kube_svc_ip": "172.30.0.1", "portal_net": "172.30.0.0/16", "public_hostname": "dhcp-89-156.sjc.redhat.com", "public_ip": "192.168.100.6", "raw_hostname": "qe-jialiu311-master-etcd-nfs-1"}, "current_config": {}}}, "changed": true}

# etcdctl --ca-file "${ca_file}" --cert-file "${cert_file}" --key-file "${key_file}" --endpoints ${url} cluster-health
cluster may be unhealthy: failed to list members
Error:  client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate is valid for qe-jialiu311-master-etcd-nfs-1.openshift-snvl2.internal, not qe-jialiu311-master-etcd-nfs-1
; error #1: http: no Host in request URL

error #0: x509: certificate is valid for qe-jialiu311-master-etcd-nfs-1.openshift-snvl2.internal, not qe-jialiu311-master-etcd-nfs-1
error #1: http: no Host in request URL

> Can the `no cloudprovider + enable_internal_dns=false` case be adjusted to set `hostname -f` to the FQDN?
The output of `hostname -f` == FQDN

# hostname
qe-jialiu311-master-etcd-nfs-1

# hostname -f
qe-jialiu311-master-etcd-nfs-1.openshift-snvl2.internal

# host `hostname`
qe-jialiu311-master-etcd-nfs-1.openshift-snvl2.internal has address 192.168.100.6
Host qe-jialiu311-master-etcd-nfs-1.openshift-snvl2.internal not found: 5(REFUSED)
Host qe-jialiu311-master-etcd-nfs-1.openshift-snvl2.internal not found: 5(REFUSED)

I am curious why the hostname in the etcd certificate is mismatched with the etcd endpoint URL's hostname configured in the master config in the same install.

Comment 25 Vadim Rutkovsky 2019-02-28 15:19:33 UTC
(In reply to Johnny Liu from comment #24)
> > Can the `no cloudprovider + enable_internal_dns=false` case be adjusted to set `hostname -f` to the FQDN?
> The output of `hostname -f` == FQDN
> 
> # hostname
> qe-jialiu311-master-etcd-nfs-1
> 
> # hostname -f
> qe-jialiu311-master-etcd-nfs-1.openshift-snvl2.internal

My bad, I meant `hostname == hostname -f`. Could you try that again?
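
If the environment allows it, a minimal sketch of one way to make `hostname` equal to `hostname -f` on each node before running the installer (assuming the FQDN already resolves, as comment 24 shows):

# Run on every node before installation
hostnamectl set-hostname "$(hostname -f)"
hostname    # should now print the FQDN, e.g. qe-jialiu311-master-etcd-nfs-1.openshift-snvl2.internal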

Comment 26 Scott Dodson 2019-03-14 13:23:02 UTC
The problem is believed to be related to the fact that `hostname` is not the same as `hostname -f`, which is a requirement of 3.10 and 3.11. With no customer cases attached, we're closing this.

