Bug 1566629
| Field | Value |
|---|---|
| Summary | Installer failing - x509: certificate is not valid |
| Product | OpenShift Container Platform |
| Component | Installer |
| Version | 3.10.0 |
| Target Release | 3.10.0 |
| Status | CLOSED DUPLICATE |
| Severity | medium |
| Priority | unspecified |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | aos-scalability-310 |
| Reporter | Vikas Laad <vlaad> |
| Assignee | Mike Fiedler <mifiedle> |
| QA Contact | Johnny Liu <jialiu> |
| CC | aos-bugs, jeder, jokerman, mifiedle, mmccomas, vlaad, wmeng |
| Type | Bug |
| Last Closed | 2018-05-02 14:17:53 UTC |
Description
Vikas Laad
2018-04-12 16:09:55 UTC
Created attachment 1420921 [details]
ansible log
This is happening because kube_svc_ip is not being calculated correctly based on portal_net.

At first I thought that we'd lost the kube_svc_ip being appended to openshift.common.all_hostnames, but it must be somewhere else in the facts refactoring. I've tested our function that generates the first IP for a network; it should be working as designed. My initial feeling is that this is related to bootstrapping in some way.

@Vikas The node labels and selectors are using obsolete versions. You'll need to adjust your inventory to consume the new node labeling scheme in 3.10. I'm still investigating this issue.

(In reply to Michael Gugino from comment #5)
> @Vikas
>
> The node labels and selectors are using obsolete versions. You'll need to
> adjust your inventory to consume the new node labeling scheme in 3.10.

OK, I will change the inventory file.

I don't see any obvious reason for this to happen. Can you provide debugging output with -vvv?

Created attachment 1421781 [details]
ansible log with -vvv
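
For reference, the new node labeling scheme mentioned above replaces per-host node labels with node groups. A minimal inventory sketch, assuming the stock 3.10 node group names; the hostnames are placeholders, not values from this bug:

```ini
# Sketch only: default 3.10 node group names; hostnames are placeholders.
[nodes]
master1.example.com  openshift_node_group_name='node-config-master'
infra1.example.com   openshift_node_group_name='node-config-infra'
node1.example.com    openshift_node_group_name='node-config-compute'
```

The verbose output requested above can be captured by re-running the deploy playbook with -vvv, for example `ansible-playbook -vvv -i <inventory> playbooks/deploy_cluster.yml` (playbook path assumes the 3.10 openshift-ansible layout).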
172.30.0.0/16 is the default for openshift_portal_net if not otherwise set. It appears this cluster was already built (at least partially) and the portal net has changed. At the very least, the attached log shows that the certs are already present. I will try your specific portal net settings to see if those values are being disregarded.

I was unable to replicate this error with the portal nets provided. Can you try deploying against an entirely new set of hosts?

Hi Michael, there were a few BZs filed, https://bugzilla.redhat.com/show_bug.cgi?id=1568031 and https://bugzilla.redhat.com/show_bug.cgi?id=1565442. I will try again after these blocker bugs are fixed.

With v3.10.0-0.21.0 the router and registry deployments still fail with this error. The inventory sets:

- openshift_master_portal_net=172.24.0.0/14
- openshift_portal_net=172.24.0.0/14
- osm_cluster_network_cidr=172.20.0.0/14

These values have always worked in the past. Will try taking the defaults for these.

Taking the defaults for the network CIDRs resulted in a working cluster for me. We use the modified values in comment 12 for cluster horizontal scalability testing.

I'm still looking for -vvv output from a cluster that doesn't already have certs generated. Rerunning the installer does not generate new certs if they are already present.

I'll run an install with -vvv now. It will possibly be two different ansible-playbook invocations due to https://bugzilla.redhat.com/show_bug.cgi?id=1568583 forcing a reboot after control-plane startup. But I'm on it.

I think I narrowed this down some. It appears to be an issue with public vs. private hostnames and nothing to do with the CIDRs.

Install method 1 - fails with this bz. Corresponds to inventory.1 and ansible.log.1:
- no openshift_master_cluster_hostname - take the default
- no openshift_master_cluster_public_hostname - take the default
- each host in the inventory uses the public AWS hostname and specifies openshift_public_hostname for each host (which might be redundant but has worked in the past)

Install method 2 - succeeds for the same set of hosts. Corresponds to inventory.2 and ansible.log.2:
- openshift_master_cluster_hostname set to the AWS private hostname
- openshift_master_cluster_public_hostname set to the AWS public hostname
- each host in the inventory uses the AWS internal hostname

I'll attach a tarball with the logs and (redacted) inventories.

Created attachment 1423692 [details]
ansible -vvv log for failing and successful installs
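
To make the difference between the two methods concrete, here is a minimal sketch of an inventory following install method 2 (the one that succeeded). The hostnames are placeholders, not the actual cluster values:

```ini
# Install method 2 (sketch): cluster hostnames set explicitly,
# hosts listed by their AWS internal (private) hostnames.
[OSEv3:vars]
openshift_master_cluster_hostname=ip-10-0-0-10.ec2.internal
openshift_master_cluster_public_hostname=ec2-54-0-0-10.compute-1.amazonaws.com

[masters]
ip-10-0-0-10.ec2.internal

[nodes]
ip-10-0-0-10.ec2.internal
ip-10-0-0-11.ec2.internal
```

Method 1, by contrast, omits both openshift_master_cluster_* variables and lists each host by its public AWS hostname, with openshift_public_hostname set per host.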
Well, hold up on comment 16 and the associated logs. This is a different failure and may be related to stale .kube. Apologies. Re-running the failure case again with a completely clean system.

Comment 15 was incorrect. With the "bad inventory" (inventory.2) from that comment, I am also able to install successfully. I am removing TestBlocker and lowering the severity. The next puddle will have several fixes for the incorrect control-plane and ose-node images. If we cannot reproduce on that puddle, we should close this one out.

Assigning to myself to attempt to re-create on the next puddle.

We hit this yesterday on the latest openshift-ansible master, but my hunch is that it is a side effect of often having to reboot the master halfway through the install because of node NotReady issues such as bug 1568583.

I'm relatively certain that the root cause is bug 1571992.

*** This bug has been marked as a duplicate of bug 1571992 ***