Bug 1566629 - Installer failing - x509: certificate is not valid
Summary: Installer failing - x509: certificate is not valid
Keywords:
Status: CLOSED DUPLICATE of bug 1571992
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.10.0
Assignee: Mike Fiedler
QA Contact: Johnny Liu
URL:
Whiteboard: aos-scalability-310
Depends On:
Blocks:
 
Reported: 2018-04-12 16:09 UTC by Vikas Laad
Modified: 2018-05-02 14:17 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-02 14:17:53 UTC
Target Upstream Version:
Embargoed:


Attachments
ansible log (936.23 KB, text/plain) - 2018-04-12 16:13 UTC, Vikas Laad
ansible log with -vvv (1.13 MB, text/plain) - 2018-04-14 15:10 UTC, Vikas Laad
ansible -vvv log for failing and successful installs (647.64 KB, application/x-gzip) - 2018-04-18 18:25 UTC, Mike Fiedler

Description Vikas Laad 2018-04-12 16:09:55 UTC
Description of problem:
Installer is failing at the following step

TASK [openshift_service_catalog : Verify that the web console is running]  
FAILED - RETRYING: Verify that the web console is running (60 retries left).
FAILED - RETRYING: Verify that the web console is running (59 retries left).
FAILED - RETRYING: Verify that the web console is running (58 retries left).
FAILED - RETRYING: Verify that the web console is running (57 retries left).

api-server pod is in CrashLoopBackOff
kube-service-catalog    apiserver-tqb6n                                                  0/1       CrashLoopBackOff   6          8m 
kube-service-catalog    controller-manager-78rx5                                         1/1       Running            0          42m                                              

Logs from apiserver pod
root@ip-172-31-16-115: ~ # oc logs -n kube-service-catalog    apiserver-tqb6n
I0412 16:03:51.795294       1 feature_gate.go:184] feature gates: map[OriginatingIdentity:true]
I0412 16:03:51.795465       1 hyperkube.go:188] Service Catalog version v0.0.0-master+$Format:%h$ (built 2018-04-11T19:40:47Z)
W0412 16:03:52.016238       1 authentication.go:229] Unable to get configmap/extension-apiserver-authentication in kube-system.  Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
Error: Get https://172.24.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: x509: certificate is valid for 172.30.0.1, 172.31.16.115, 52.39.19.98, not 172.24.0.1
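
One way to confirm exactly which IPs and hostnames the API serving certificate covers is to dump its SANs on a master; the path below is the usual OCP 3.x location and is an assumption on my part, so adjust it if your layout differs:

openssl x509 -noout -text -in /etc/origin/master/master.server.crt | grep -A1 'Subject Alternative Name'

Based on the error above, the cert appears to carry the default service IP 172.30.0.1 rather than 172.24.0.1.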


Version-Release number of the following components:
openshift-ansible head is f93c5ce8ec9fef0044e9dbf9e5fc82271e26f3a6

rpm -q ansible
ansible-2.4.3.0-1.el7ae.noarch

ansible --version
ansible 2.4.3.0
  config file = /root/openshift-ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Feb 20 2018, 09:19:12) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

How reproducible:
Always

Steps to Reproduce:
1. run deploy_cluster playbook with attached inventory
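
For reference, a typical verbose invocation of that step would look like the line below (the playbook path matches the openshift-ansible checkout noted above; the inventory path is a placeholder):

ansible-playbook -vvv -i /path/to/inventory playbooks/deploy_cluster.yml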

Actual results:
Installer fails; please see the attached ansible log and inventory.

Expected results:
Installer should complete.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 2 Vikas Laad 2018-04-12 16:13:40 UTC
Created attachment 1420921 [details]
ansible log

Comment 3 Scott Dodson 2018-04-12 16:54:58 UTC
This is happening because kube_svc_ip is not being calculated correctly based on portal_net. At first I thought we had lost the step that appends kube_svc_ip to openshift.common.all_hostnames, but the problem must be somewhere else in the facts refactoring.
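
For reference, the kube service IP is simply the first host address of openshift_portal_net, so with the non-default CIDR it should come out as 172.24.0.1 rather than the default 172.30.0.1 that the cert error lists. A quick sanity check (this assumes the python-ipaddress backport is installed on the Python 2 host; it is only an illustration, not the installer's own code):

python -c "import ipaddress; print(ipaddress.ip_network(u'172.30.0.0/16')[1])"   # 172.30.0.1 (default portal_net)
python -c "import ipaddress; print(ipaddress.ip_network(u'172.24.0.0/14')[1])"   # 172.24.0.1 (what the apiserver pod dials)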

Comment 4 Michael Gugino 2018-04-12 17:22:26 UTC
I've tested our function that generates the first IP for a network; it appears to be working as designed.

My initial feeling is that this is related to bootstrapping in some way.

Comment 5 Michael Gugino 2018-04-12 18:49:07 UTC
@Vikas

The node labels and selectors are using obsolete versions.  You'll need to adjust your inventory to consume the new node labeling scheme in 3.10.

I'm still investigating this issue.

Comment 6 Vikas Laad 2018-04-12 19:01:40 UTC
(In reply to Michael Gugino from comment #5)
> @Vikas
> 
> The node labels and selectors are using obsolete versions.  You'll need to
> adjust your inventory to consume the new node labeling scheme in 3.10.

OK, I will change the inventory file.

Comment 7 Michael Gugino 2018-04-13 19:42:44 UTC
I don't see any obvious reason for this to happen.  Can you provide debugging output with -vvv?

Comment 8 Vikas Laad 2018-04-14 15:10:21 UTC
Created attachment 1421781 [details]
ansible log with -vvv

Comment 9 Michael Gugino 2018-04-16 18:01:40 UTC
172.30.0.0/16 is the default for openshift_portal_net if not otherwise set.

It appears this cluster was already built (at least partially) and the portal net has changed.  At the very least, the attached log shows that the certs are already present.

I will try your specific portal net settings to see if those values are being disregarded.
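
If it helps narrow that down, checking the dates on the existing master serving cert should show whether it predates the portal_net change (standard 3.x path assumed):

openssl x509 -noout -dates -in /etc/origin/master/master.server.crt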

Comment 10 Michael Gugino 2018-04-16 18:54:54 UTC
I was unable to replicate this error with the portal nets provided.

Can you try deploying against an entirely new set of hosts?

Comment 11 Vikas Laad 2018-04-16 19:01:37 UTC
Hi Michael,

There were a few BZs filed: https://bugzilla.redhat.com/show_bug.cgi?id=1568031 and https://bugzilla.redhat.com/show_bug.cgi?id=1565442.

I will try again after these blocker bugs are fixed.

Comment 12 Mike Fiedler 2018-04-17 14:04:08 UTC
With v3.10.0-0.21.0 the router and registry deployments still fail with this error.

openshift_master_portal_net=172.24.0.0/14
openshift_portal_net=172.24.0.0/14
osm_cluster_network_cidr=172.20.0.0/14

These values have always worked in the past. I will try taking the defaults for these.
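
A quick way to see which service IP a cluster actually ended up with (and therefore what the serving cert needs to cover) is to read the clusterIP of the built-in kubernetes service; with the values above it should be 172.24.0.1:

oc get svc kubernetes -n default -o jsonpath='{.spec.clusterIP}'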

Comment 13 Mike Fiedler 2018-04-18 00:18:56 UTC
Taking the defaults for the network CIDRs resulted in a working cluster for me. We use the modified values in comment 12 for cluster horizontal scalability testing.

Comment 14 Michael Gugino 2018-04-18 16:54:00 UTC
I'm still looking for -vvv output from a cluster that doesn't already have certs generated.  Rerunning the installer does not generate new certs if they are already present.

Comment 15 Mike Fiedler 2018-04-18 16:58:20 UTC
I'll run an install with -vvv now. It will possibly be two different ansible-playbook invocations, due to https://bugzilla.redhat.com/show_bug.cgi?id=1568583 forcing a reboot after control-plane startup, but I'm on it.

Comment 16 Mike Fiedler 2018-04-18 18:20:50 UTC
I think I've narrowed this down somewhat. It appears to be an issue with public vs. private hostnames and nothing to do with the CIDRs.

=====================
Install method 1 - fails with this bz.  Corresponds to inventory.1 and ansible.log.1

no openshift_master_cluster_hostname - take the default
no openshift_master_cluster_public_hostname - take the default

Each host in the inventory uses its public AWS hostname and also specifies openshift_public_hostname (which might be redundant but has worked in the past).

====================

Install method 2 - succeeds for the same set of hosts.  Corresponds to inventory.2 and ansible.log.2

openshift_master_cluster_hostname set to AWS private hostname
openshift_master_cluster_public_hostname set to AWS public hostname
 
Each host in the inventory uses its AWS internal hostname.

I'll attach a tarball with the logs and (redacted) inventories.
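
For anyone reproducing method 2, the two hostname values can be pulled straight from the EC2 metadata service on the master (the metadata paths are standard; the returned values are whatever AWS assigns):

curl -s http://169.254.169.254/latest/meta-data/local-hostname    # value for openshift_master_cluster_hostname
curl -s http://169.254.169.254/latest/meta-data/public-hostname   # value for openshift_master_cluster_public_hostname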

Comment 17 Mike Fiedler 2018-04-18 18:25:22 UTC
Created attachment 1423692 [details]
ansible -vvv log for failing and successful installs

Comment 18 Mike Fiedler 2018-04-18 18:32:36 UTC
Well, hold up on comment 16 and the associated logs. This is a different failure and may be related to a stale .kube. Apologies. I am re-running the failure case again with a completely clean system.

Comment 19 Mike Fiedler 2018-04-18 19:34:27 UTC
Comment 15 was incorrect.  With the "bad inventory" (inventory.2) from that comment, I am also able to install successfully.   I am removing TestBlocker and lowering the severity.   The next puddle will have several fixes for the incorrect control-plane and ose-node images.  If we cannot reproduce on that puddle, we should close this one out.

Comment 20 Mike Fiedler 2018-04-20 17:56:41 UTC
Assigning to myself to attempt a re-create on the next puddle. We hit this yesterday on the latest openshift-ansible master, but my hunch is that it is a side effect of often having to reboot the master halfway through the install because of node NotReady issues such as bug 1568583.

Comment 21 Scott Dodson 2018-05-02 14:17:53 UTC
I'm relatively certain that the root cause is bug 1571992.

*** This bug has been marked as a duplicate of bug 1571992 ***

