1626812 – Install failed due to etcd connection url hostname is mismatched with the one in cert files

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	3.11.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	3.11.0
Assignee:	Vadim Rutkovsky
QA Contact:	Weihua Meng
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-09-09 01:37 UTC by Weihua Meng
Modified:	2018-12-21 15:23 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: Any OpenStack install assumed that the OpenStack cloudprovider would be enabled. Consequence: OpenStack metadata was used to set hostnames, which broke upgrades on installs that did not have the cloudprovider enabled. Fix: OpenStack metadata is used only when the OpenStack cloudprovider is enabled. Result: Upgrades on OpenStack with custom hostnames and the cloudprovider disabled succeed.
Clone Of:
Clones:	1626935
Environment:
Last Closed:	2018-12-21 15:23:44 UTC
Target Upstream Version:
Embargoed:

Attachments
installation log with inventory file embedded for .28 build (3.81 MB, text/plain) 2018-09-11 03:41 UTC, Johnny Liu
installation log with inventory file embedded for .32 build (2.33 MB, text/plain) 2018-09-11 03:42 UTC, Johnny Liu

Description Weihua Meng 2018-09-09 01:37:08 UTC

Description of problem:
Install of 3.11.0-0.32.0 failed because the API pod keeps restarting.

Version-Release number of the following components:
openshift-ansible-3.11.0-0.32.0.git.0.b27b349.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Install OCP with openshift-ansible-3.11.0-0.32.0.git.0.b27b349.el7.noarch

Actual results:
Install failed

Expected results:
Install succeeds

Additional info:
  1. Hosts:    host-xxxx.host.centralci.eng.rdu2.redhat.com
     Play:     Configure masters
     Task:     Report control plane errors
     Message:  Control plane pods didn't come up

Comment 8 Johnny Liu 2018-09-11 03:41:21 UTC

Created attachment 1482256 [details]
installation log with inventory file embedded for .28 build

Comment 9 Johnny Liu 2018-09-11 03:42:10 UTC

Created attachment 1482257 [details]
installation log with inventory file embedded for .32 build

Comment 10 Johnny Liu 2018-09-11 05:14:05 UTC

This issue looks like a mismatch between the hostname facts used for etcd and those used for the master.

Comment 17 Vadim Rutkovsky 2018-09-11 14:17:42 UTC

It seems it's caused by https://github.com/openshift/openshift-ansible/pull/9876. Created a PR to revert that in master: https://github.com/openshift/openshift-ansible/pull/9999

Comment 18 Wei Sun 2018-09-13 02:10:58 UTC

3.11 PR 9980 has been merged into openshift-ansible-3.11.2-1; please check the bug.

Comment 19 Weihua Meng 2018-09-14 02:16:09 UTC

Fixed.

openshift-ansible-3.11.4-1.git.0.d727082.el7_5.noarch

Kernel Version: 3.10.0-862.11.6.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.5 (Maipo)

Comment 20 Johnny Liu 2018-09-20 11:52:57 UTC

QE has two OSP deployments: one is "snvl2", the other is "qeos10".

Today, when I tried to install a cluster without cloudprovider enabled on the "snvl2" OSP, the install failed with the same issue as in the initial report.
[root@qe-jialiu-master-etcd-1 ~]# root_path="/etc/origin/master";
[root@qe-jialiu-master-etcd-1 ~]# ca_file=${root_path}/$(grep -A 6 "etcdClientInfo:" /etc/origin/master/master-config.yaml | grep "ca" | awk -F": " '{print $2}');
[root@qe-jialiu-master-etcd-1 ~]# cert_file=${root_path}/$(grep -A 6 "etcdClientInfo:" /etc/origin/master/master-config.yaml | grep "certFile" | awk -F": " '{print $2}');
[root@qe-jialiu-master-etcd-1 ~]# key_file=${root_path}/$(grep -A 6 "etcdClientInfo:" /etc/origin/master/master-config.yaml | grep "keyFile" | awk -F": " '{print $2}');
[root@qe-jialiu-master-etcd-1 ~]# for i in `grep -A 8  "etcdClientInfo:" /etc/origin/master/master-config.yaml | grep -A 3 "urls:" | grep  -v "urls:" | awk -F"- " '{print $2}'`; do url="$url,$i"; done
[root@qe-jialiu-master-etcd-1 ~]# etcdctl --ca-file "${ca_file}" --cert-file "${cert_file}" --key-file "${key_file}" --endpoints ${url} cluster-health
cluster may be unhealthy: failed to list members
Error:  client: etcd cluster is unavailable or misconfigured; error #0: http: no Host in request URL
; error #1: x509: certificate is valid for qe-jialiu-master-etcd-2.openshift-snvl2.internal, not qe-jialiu-master-etcd-2
; error #2: x509: certificate is valid for qe-jialiu-master-etcd-3.openshift-snvl2.internal, not qe-jialiu-master-etcd-3
; error #3: x509: certificate is valid for qe-jialiu-master-etcd-1.openshift-snvl2.internal, not qe-jialiu-master-etcd-1

error #0: http: no Host in request URL
error #1: x509: certificate is valid for qe-jialiu-master-etcd-2.openshift-snvl2.internal, not qe-jialiu-master-etcd-2
error #2: x509: certificate is valid for qe-jialiu-master-etcd-3.openshift-snvl2.internal, not qe-jialiu-master-etcd-3
error #3: x509: certificate is valid for qe-jialiu-master-etcd-1.openshift-snvl2.internal, not qe-jialiu-master-etcd-1
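
A hedged reading of the output above: error #0 probably comes from the loop itself rather than from the cluster. `url` starts empty, so `url="$url,$i"` yields a list with a leading comma, and the empty first endpoint has no host. Errors #1 to #3 are the actual mismatch: the endpoint URLs carry short hostnames while the certificates list FQDNs. The sketch below shows one way to build the endpoint list without the leading comma and to compare a URL's host with a list of certificate names. The helper names `join_endpoints`, `url_host`, and `host_in_names` are hypothetical and are not part of openshift-ansible or etcdctl.

```shell
#!/bin/bash
# Illustrative helpers only; the names are made up for this sketch.

# Join endpoint URLs with commas, without a leading separator.
join_endpoints() {
  local IFS=,
  echo "$*"
}

# Extract the host part of a URL such as https://host:2379.
# (IPv6 literals are not handled.)
url_host() {
  local rest="${1#*://}"   # drop the scheme
  rest="${rest%%/*}"       # drop any path
  echo "${rest%%:*}"       # drop the port
}

# Return 0 when the host (first argument) equals one of the
# certificate names given as the remaining arguments.
host_in_names() {
  local host="$1" name
  shift
  for name in "$@"; do
    [ "$name" = "$host" ] && return 0
  done
  return 1
}
```

The names to compare against could be read from a certificate with something like `openssl x509 -in "$cert" -noout -text | grep -o 'DNS:[^,]*'`, where `$cert` is a placeholder for the etcd server certificate path, which this report does not state.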


But an install on "qeos10" shows no such issue.

Please go through the two install logs and search for "openshift_master_etcd_hosts"; you will find the difference: one uses short hostnames, the other uses FQDN hostnames.
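
The log comparison described above could be sketched roughly as follows. This is only an illustration: the exact layout of the `openshift_master_etcd_hosts` lines in the attached logs is not shown in this report, so the helper just prints the distinct lines that mention the variable, and `has_domain` is a crude test for "contains a dot" rather than a full FQDN check. Both function names are hypothetical.

```shell
#!/bin/bash
# Illustrative helpers only; the names are made up for this sketch.

# Print the distinct lines mentioning openshift_master_etcd_hosts
# from one or more log files, with leading whitespace removed.
etcd_hosts_lines() {
  grep -h 'openshift_master_etcd_hosts' "$@" | sed 's/^[[:space:]]*//' | sort -u
}

# Return 0 when the hostname contains a dot (likely an FQDN),
# 1 when it looks like a short hostname.
has_domain() {
  case "$1" in
    *.*) return 0 ;;
    *)   return 1 ;;
  esac
}
```

Running `etcd_hosts_lines` on each of the two logs and comparing the output should make the short-hostname versus FQDN difference visible.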

When I was verifying https://bugzilla.redhat.com/show_bug.cgi?id=1623335, I was using the "qeos10" OSP, not "snvl2".

Comment 21 Vadim Rutkovsky 2018-09-20 12:02:53 UTC

(In reply to Johnny Liu from comment #20)
> QE have two deployment of OSP, one is "snvl2", another is "qeos10".

Please open a new issue for that. There is no version, inventory, or playbook log here to find out what's wrong with the URLs or etcd certs.

Comment 22 Johnny Liu 2018-09-20 12:27:32 UTC

(In reply to Vadim Rutkovsky from comment #21)
> (In reply to Johnny Liu from comment #20)
> > QE have two deployment of OSP, one is "snvl2", another is "qeos10".
> 
> Please open a new issue for that. There is no version, inventory or playbook
> logs to find out what's wrong with URLs or etcd certs

Done. https://bugzilla.redhat.com/show_bug.cgi?id=1631368

Comment 23 Luke Meyer 2018-12-21 15:23:44 UTC

Closing bugs that were verified and targeted for GA but for some reason were not picked up by errata. This bug fix should be present in current 3.11 release content.