Bug 1626812

Summary:

Install failed because the etcd connection URL hostname does not match the hostname in the cert files

Product:

OpenShift Container Platform

Reporter:

Weihua Meng <wmeng>

Component:

Installer

Assignee:

Vadim Rutkovsky <vrutkovs>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Weihua Meng <wmeng>

Severity:

high

Priority:

high

Version:

3.11.0

CC:

aos-bugs, jialiu, jokerman, mmccomas, shlao, vrutkovs, wmeng, wsun

Keywords:

Regression

Target Release:

3.11.0

Hardware:

Unspecified

OS:

Unspecified

Doc Type:

Bug Fix

Doc Text:

Cause: Any OpenStack install assumed the OpenStack cloudprovider would be enabled.
Consequence: OpenStack metadata was used to set hostnames, breaking upgrades on installs that didn't have cloudprovider enabled.
Fix: OpenStack metadata is used only when the OpenStack cloudprovider is enabled.
Result: Upgrades on OpenStack with custom hostnames and cloudprovider disabled succeed.

Clones:

1626935 (view as bug list)

Last Closed:

2018-12-21 15:23:44 UTC

Type:

Bug

Attachments:

Description	Flags
installation log with inventory file embedded for .28 build	none
installation log with inventory file embedded for .32 build	none

Description Weihua Meng 2018-09-09 01:37:08 UTC

Description of problem:
Install of 3.11.0-0.32.0 failed because the API pod keeps restarting

Version-Release number of the following components:
openshift-ansible-3.11.0-0.32.0.git.0.b27b349.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. install OCP with openshift-ansible-3.11.0-0.32.0.git.0.b27b349.el7.noarch

Actual results:
Install failed

Expected results:
Install succeeds

Additional info:
  1. Hosts:    host-xxxx.host.centralci.eng.rdu2.redhat.com
     Play:     Configure masters
     Task:     Report control plane errors
     Message:  Control plane pods didn't come up

Comment 8 Johnny Liu 2018-09-11 03:41:21 UTC

Created attachment 1482256 [details]
installation log with inventory file embedded for .28 build

Comment 9 Johnny Liu 2018-09-11 03:42:10 UTC

Created attachment 1482257 [details]
installation log with inventory file embedded for .32 build

Comment 10 Johnny Liu 2018-09-11 05:14:05 UTC

This issue looks like mismatched facts for hostname used for etcd and master.

Comment 17 Vadim Rutkovsky 2018-09-11 14:17:42 UTC

It seems it's caused by https://github.com/openshift/openshift-ansible/pull/9876. Created a PR to revert that in master - https://github.com/openshift/openshift-ansible/pull/9999

Comment 18 Wei Sun 2018-09-13 02:10:58 UTC

The 3.11 PR 9980 has been merged into openshift-ansible-3.11.2-1; please check the bug.

Comment 19 Weihua Meng 2018-09-14 02:16:09 UTC

Fixed.

openshift-ansible-3.11.4-1.git.0.d727082.el7_5.noarch

Kernel Version: 3.10.0-862.11.6.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.5 (Maipo)

Comment 20 Johnny Liu 2018-09-20 11:52:57 UTC

QE has two OSP deployments: one is "snvl2", the other is "qeos10".

Today, when I tried to install a cluster without cloudprovider enabled on the "snvl2" OSP, it failed with the same issue as in the initial report.
[root@qe-jialiu-master-etcd-1 ~]# root_path="/etc/origin/master";
[root@qe-jialiu-master-etcd-1 ~]# ca_file=${root_path}/$(grep -A 6 "etcdClientInfo:" /etc/origin/master/master-config.yaml | grep "ca" | awk -F": " '{print $2}');
[root@qe-jialiu-master-etcd-1 ~]# cert_file=${root_path}/$(grep -A 6 "etcdClientInfo:" /etc/origin/master/master-config.yaml | grep "certFile" | awk -F": " '{print $2}');
[root@qe-jialiu-master-etcd-1 ~]# key_file=${root_path}/$(grep -A 6 "etcdClientInfo:" /etc/origin/master/master-config.yaml | grep "keyFile" | awk -F": " '{print $2}');
[root@qe-jialiu-master-etcd-1 ~]# for i in `grep -A 8  "etcdClientInfo:" /etc/origin/master/master-config.yaml | grep -A 3 "urls:" | grep  -v "urls:" | awk -F"- " '{print $2}'`; do url="$url,$i"; done
[root@qe-jialiu-master-etcd-1 ~]# etcdctl --ca-file "${ca_file}" --cert-file "${cert_file}" --key-file "${key_file}" --endpoints ${url} cluster-health
cluster may be unhealthy: failed to list members
Error:  client: etcd cluster is unavailable or misconfigured; error #0: http: no Host in request URL
; error #1: x509: certificate is valid for qe-jialiu-master-etcd-2.openshift-snvl2.internal, not qe-jialiu-master-etcd-2
; error #2: x509: certificate is valid for qe-jialiu-master-etcd-3.openshift-snvl2.internal, not qe-jialiu-master-etcd-3
; error #3: x509: certificate is valid for qe-jialiu-master-etcd-1.openshift-snvl2.internal, not qe-jialiu-master-etcd-1

error #0: http: no Host in request URL
error #1: x509: certificate is valid for qe-jialiu-master-etcd-2.openshift-snvl2.internal, not qe-jialiu-master-etcd-2
error #2: x509: certificate is valid for qe-jialiu-master-etcd-3.openshift-snvl2.internal, not qe-jialiu-master-etcd-3
error #3: x509: certificate is valid for qe-jialiu-master-etcd-1.openshift-snvl2.internal, not qe-jialiu-master-etcd-1
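
Error #0 differs from the x509 errors. It is most likely a side effect of the loop above: `url` starts out empty, so `url="$url,$i"` produces a list with a leading comma, and the empty first entry has no host. The sketch below reproduces that on a sample file and shows a comma-joined variant. The stanza layout, the `https://` scheme and port 2379 are assumptions for illustration, not taken from the affected master-config.yaml; only the short hostnames come from the errors above.

```shell
#!/bin/sh
# Sample etcdClientInfo stanza (assumed layout, scheme and port).
config=$(mktemp)
cat > "$config" <<'EOF'
etcdClientInfo:
  ca: master.etcd-ca.crt
  certFile: master.etcd-client.crt
  keyFile: master.etcd-client.key
  urls:
  - https://qe-jialiu-master-etcd-1:2379
  - https://qe-jialiu-master-etcd-2:2379
  - https://qe-jialiu-master-etcd-3:2379
EOF

# Same extraction pipeline as in the session above.
list_urls() {
  grep -A 8 "etcdClientInfo:" "$1" | grep -A 3 "urls:" | grep -v "urls:" | awk -F"- " '{print $2}'
}

# Original loop: url is empty at first, so the result starts with ",".
url=""
for i in $(list_urls "$config"); do
  url="$url,$i"
done

# Variant: put commas only between entries.
endpoints=$(list_urls "$config" | paste -sd, -)

echo "original: $url"
echo "joined:   $endpoints"
rm -f "$config"
```

Passing the joined list to `--endpoints` should remove error #0 only. The three x509 errors remain, because the URLs still use short hostnames while the certificates were issued for the FQDNs.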


But an install on "qeos10" has no such issue.

Please go through the two install logs and search for "openshift_master_etcd_hosts". You will find the difference: one uses short hostnames, the other uses FQDN hostnames.

When I was verifying https://bugzilla.redhat.com/show_bug.cgi?id=1623335, I was using the "qeos10" OSP, not "snvl2".
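
The short-versus-FQDN difference can be checked on a list of etcd endpoints with a small helper. This is a minimal sketch: `classify_host` is a hypothetical function, the `https://` scheme and port 2379 are assumed, and treating any hostname that contains a dot as an FQDN is a simplification.

```shell
#!/bin/sh
# Print the host part of an endpoint URL and whether it looks like a
# short hostname or an FQDN (simplification: "contains a dot").
classify_host() (
  host=${1#*://}        # drop the scheme
  host=${host%%[:/]*}   # drop the port and any path
  case "$host" in
    *.*) echo "$host fqdn" ;;
    *)   echo "$host short" ;;
  esac
)

# Example endpoints built from the hostnames in the x509 errors above.
for endpoint in \
  "https://qe-jialiu-master-etcd-1:2379" \
  "https://qe-jialiu-master-etcd-1.openshift-snvl2.internal:2379"
do
  classify_host "$endpoint"
done
```

An endpoint reported as short cannot pass certificate verification when the certificate lists only the FQDN, which is the mismatch seen on "snvl2".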

Comment 21 Vadim Rutkovsky 2018-09-20 12:02:53 UTC

(In reply to Johnny Liu from comment #20)
> QE has two OSP deployments: one is "snvl2", the other is "qeos10".

Please open a new issue for that. There is no version, inventory or playbook logs to find out what's wrong with URLs or etcd certs

Comment 22 Johnny Liu 2018-09-20 12:27:32 UTC

(In reply to Vadim Rutkovsky from comment #21)
> (In reply to Johnny Liu from comment #20)
> > QE has two OSP deployments: one is "snvl2", the other is "qeos10".
> 
> Please open a new issue for that. There is no version, inventory or playbook
> logs to find out what's wrong with URLs or etcd certs

Done. https://bugzilla.redhat.com/show_bug.cgi?id=1631368

Comment 23 Luke Meyer 2018-12-21 15:23:44 UTC

Closing bugs that were verified and targeted for GA but for some reason were not picked up by errata. This bug fix should be present in current 3.11 release content.