Bug 1626812 - Install failed due to etcd connection url hostname is mismatched with the one in cert files
Summary: Install failed due to etcd connection url hostname is mismatched with the one...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.11.0
Assignee: Vadim Rutkovsky
QA Contact: Weihua Meng
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-09-09 01:37 UTC by Weihua Meng
Modified: 2018-12-21 15:23 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Any OpenStack install assumed the OpenStack cloudprovider would be enabled.
Consequence: OpenStack metadata was used to set hostnames, breaking upgrades on installs that didn't have the cloudprovider enabled.
Fix: OpenStack metadata is used only when the OpenStack cloudprovider is enabled.
Result: Upgrades on OpenStack with custom hostnames and the cloudprovider disabled succeed.
Clone Of:
: 1626935 (view as bug list)
Environment:
Last Closed: 2018-12-21 15:23:44 UTC
Target Upstream Version:
Embargoed:


Attachments
installation log with inventory file embedded for .28 build (3.81 MB, text/plain)
2018-09-11 03:41 UTC, Johnny Liu
installation log with inventory file embedded for .32 build (2.33 MB, text/plain)
2018-09-11 03:42 UTC, Johnny Liu

Description Weihua Meng 2018-09-09 01:37:08 UTC
Description of problem:
Install of 3.11.0-0.32.0 failed because the API pod keeps restarting.

Version-Release number of the following components:
openshift-ansible-3.11.0-0.32.0.git.0.b27b349.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. install OCP with openshift-ansible-3.11.0-0.32.0.git.0.b27b349.el7.noarch

Actual results:
Install failed

Expected results:
Install succeeds

Additional info:
  1. Hosts:    host-xxxx.host.centralci.eng.rdu2.redhat.com
     Play:     Configure masters
     Task:     Report control plane errors
     Message:  Control plane pods didn't come up

Comment 8 Johnny Liu 2018-09-11 03:41:21 UTC
Created attachment 1482256
installation log with inventory file embedded for .28 build

Comment 9 Johnny Liu 2018-09-11 03:42:10 UTC
Created attachment 1482257
installation log with inventory file embedded for .32 build

Comment 10 Johnny Liu 2018-09-11 05:14:05 UTC
This issue looks like a mismatch between the hostname facts used for etcd and for the master.

Comment 17 Vadim Rutkovsky 2018-09-11 14:17:42 UTC
It seems it's caused by https://github.com/openshift/openshift-ansible/pull/9876. I created a PR to revert that in master: https://github.com/openshift/openshift-ansible/pull/9999

Comment 18 Wei Sun 2018-09-13 02:10:58 UTC
3.11 PR 9980 has been merged into openshift-ansible-3.11.2-1. Please check the bug.

Comment 19 Weihua Meng 2018-09-14 02:16:09 UTC
Fixed.

openshift-ansible-3.11.4-1.git.0.d727082.el7_5.noarch

Kernel Version: 3.10.0-862.11.6.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.5 (Maipo)

Comment 20 Johnny Liu 2018-09-20 11:52:57 UTC
QE has two OSP deployments: one is "snvl2", the other is "qeos10".

Today, when I tried to install a cluster without cloudprovider enabled on the "snvl2" OSP, it failed with the same issue as in the initial report:
[root@qe-jialiu-master-etcd-1 ~]# root_path="/etc/origin/master";
[root@qe-jialiu-master-etcd-1 ~]# ca_file=${root_path}/$(grep -A 6 "etcdClientInfo:" /etc/origin/master/master-config.yaml | grep "ca" | awk -F": " '{print $2}');
[root@qe-jialiu-master-etcd-1 ~]# cert_file=${root_path}/$(grep -A 6 "etcdClientInfo:" /etc/origin/master/master-config.yaml | grep "certFile" | awk -F": " '{print $2}');
[root@qe-jialiu-master-etcd-1 ~]# key_file=${root_path}/$(grep -A 6 "etcdClientInfo:" /etc/origin/master/master-config.yaml | grep "keyFile" | awk -F": " '{print $2}');
[root@qe-jialiu-master-etcd-1 ~]# for i in `grep -A 8  "etcdClientInfo:" /etc/origin/master/master-config.yaml | grep -A 3 "urls:" | grep  -v "urls:" | awk -F"- " '{print $2}'`; do url="$url,$i"; done
[root@qe-jialiu-master-etcd-1 ~]# etcdctl --ca-file "${ca_file}" --cert-file "${cert_file}" --key-file "${key_file}" --endpoints ${url} cluster-health
cluster may be unhealthy: failed to list members
Error:  client: etcd cluster is unavailable or misconfigured; error #0: http: no Host in request URL
; error #1: x509: certificate is valid for qe-jialiu-master-etcd-2.openshift-snvl2.internal, not qe-jialiu-master-etcd-2
; error #2: x509: certificate is valid for qe-jialiu-master-etcd-3.openshift-snvl2.internal, not qe-jialiu-master-etcd-3
; error #3: x509: certificate is valid for qe-jialiu-master-etcd-1.openshift-snvl2.internal, not qe-jialiu-master-etcd-1
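
A side note on error #0 in the output above: it is probably not a certificate problem. The loop that builds ${url} starts from an empty variable, so the endpoint list begins with a comma, and etcdctl most likely treats the empty first entry as an endpoint with no host. The x509 errors are the real mismatch and are unaffected by this. A minimal sketch of joining the URLs without the leading comma, assuming hypothetical example.com endpoints and a hypothetical join_endpoints helper (the real URLs come from master-config.yaml):

```shell
# Join endpoint URLs with commas, skipping empty entries, so the result
# never starts with a comma (unlike url="$url,$i" on an empty $url).
join_endpoints() {
    joined=""
    for ep in "$@"; do
        [ -n "$ep" ] || continue
        # ${joined:+$joined,} expands to "$joined," only when joined is non-empty.
        joined="${joined:+$joined,}$ep"
    done
    printf '%s\n' "$joined"
}

# Hypothetical endpoints, for illustration only.
buggy=""
for ep in https://etcd-1.example.com:2379 https://etcd-2.example.com:2379; do
    buggy="$buggy,$ep"   # same pattern as the loop in the transcript
done
fixed=$(join_endpoints https://etcd-1.example.com:2379 https://etcd-2.example.com:2379)

echo "buggy: $buggy"
echo "fixed: $fixed"
```

Passing the joined value to --endpoints should leave only the per-host errors in the cluster-health output.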


But an install on "qeos10" shows no such issue.

Please go through the two install logs and search for "openshift_master_etcd_hosts"; you will find the difference: one uses short hostnames, the other uses FQDN hostnames.
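
To make the short-versus-FQDN distinction concrete, here is a small sketch that classifies a hostname by whether it contains a domain part. The is_short_hostname helper is hypothetical and only checks for a dot; the two sample names are the ones from the x509 errors in this comment.

```shell
# Succeed when the name has no domain part (contains no dot).
is_short_hostname() {
    case "$1" in
        *.*) return 1 ;;
        *)   return 0 ;;
    esac
}

# Names taken from the x509 errors: the certificate carries the FQDN,
# while the connection URL used the short name.
for name in qe-jialiu-master-etcd-1 qe-jialiu-master-etcd-1.openshift-snvl2.internal; do
    if is_short_hostname "$name"; then
        echo "$name: short hostname"
    else
        echo "$name: FQDN"
    fi
done
```

A certificate issued for the FQDN will not validate a connection made with the short name, which is what the errors report.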

When I verified https://bugzilla.redhat.com/show_bug.cgi?id=1623335, I was using the "qeos10" OSP, not "snvl2".

Comment 21 Vadim Rutkovsky 2018-09-20 12:02:53 UTC
(In reply to Johnny Liu from comment #20)
> QE have two deployment of OSP, one is "snvl2", another is "qeos10".

Please open a new issue for that. There are no version, inventory, or playbook logs here to find out what's wrong with the URLs or etcd certs.

Comment 22 Johnny Liu 2018-09-20 12:27:32 UTC
(In reply to Vadim Rutkovsky from comment #21)
> (In reply to Johnny Liu from comment #20)
> > QE have two deployment of OSP, one is "snvl2", another is "qeos10".
> 
> Please open a new issue for that. There is no version, inventory or playbook
> logs to find out what's wrong with URLs or etcd certs

Done. https://bugzilla.redhat.com/show_bug.cgi?id=1631368

Comment 23 Luke Meyer 2018-12-21 15:23:44 UTC
Closing bugs that were verified and targeted for GA but for some reason were not picked up by errata. This bug fix should be present in current 3.11 release content.

