Bug 1566629
| Field | Value |
|---|---|
| Summary | Installer failing - x509: certificate is not valid |
| Product | OpenShift Container Platform |
| Component | Installer |
| Version | 3.10.0 |
| Target Release | 3.10.0 |
| Status | CLOSED DUPLICATE |
| Severity | medium |
| Priority | unspecified |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | aos-scalability-310 |
| Reporter | Vikas Laad <vlaad> |
| Assignee | Mike Fiedler <mifiedle> |
| QA Contact | Johnny Liu <jialiu> |
| CC | aos-bugs, jeder, jokerman, mifiedle, mmccomas, vlaad, wmeng |
| Type | Bug |
| Last Closed | 2018-05-02 14:17:53 UTC |
Description
Vikas Laad
2018-04-12 16:09:55 UTC
Created attachment 1420921 [details]
ansible log
This is happening because kube_svc_ip is not being calculated correctly based on portal_net.

At first I thought that we'd lost the kube_svc_ip being appended to openshift.common.all_hostnames, but it must be somewhere else in the facts refactoring. I've tested our function that generates the first IP for a network; it should be working as designed. My initial feeling is that this is related to bootstrapping in some way.

@Vikas The node labels and selectors are using obsolete versions. You'll need to adjust your inventory to consume the new node labeling scheme in 3.10. I'm still investigating this issue.

(In reply to Michael Gugino from comment #5)
> @Vikas
>
> The node labels and selectors are using obsolete versions. You'll need to
> adjust your inventory to consume the new node labeling scheme in 3.10.

OK, I will change the inventory file.

I don't see any obvious reason for this to happen. Can you provide debugging output with -vvv?

Created attachment 1421781 [details]
ansible log with -vvv
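
For reference, the new node labeling scheme mentioned above replaces per-host node labels with node groups. A minimal inventory sketch, assuming the stock 3.10 node group names; the hostnames are placeholders, not values from this bug:

```ini
# Sketch only: default 3.10 node group names; hostnames are placeholders.
[nodes]
master1.example.com  openshift_node_group_name='node-config-master'
infra1.example.com   openshift_node_group_name='node-config-infra'
node1.example.com    openshift_node_group_name='node-config-compute'
```

The verbose output requested above can be captured by re-running the deploy playbook with -vvv, for example `ansible-playbook -vvv -i <inventory> playbooks/deploy_cluster.yml` (playbook path assumes the 3.10 openshift-ansible layout).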
172.30.0.0/16 is the default for openshift_portal_net if not otherwise set. It appears this cluster was already built (at least partially) and the portal net has changed. At the very least, the attached log shows that the certs are already present. I will try your specific portal net settings to see if those values are being disregarded.

I was unable to replicate this error with the portal nets provided. Can you try deploying against an entirely new set of hosts?

Hi Michael, there were a few BZs filed, https://bugzilla.redhat.com/show_bug.cgi?id=1568031 and https://bugzilla.redhat.com/show_bug.cgi?id=1565442. I will try again after these blocker bugs are fixed.

With v3.10.0-0.21.0 the router and registry deployments still fail with this error. The inventory sets:

- openshift_master_portal_net=172.24.0.0/14
- openshift_portal_net=172.24.0.0/14
- osm_cluster_network_cidr=172.20.0.0/14

These values have always worked in the past. Will try taking the defaults for these.

Taking the defaults for the network CIDRs resulted in a working cluster for me. We use the modified values in comment 12 for cluster horizontal scalability testing.

I'm still looking for -vvv output from a cluster that doesn't already have certs generated. Rerunning the installer does not generate new certs if they are already present.

I'll run an install with -vvv now. It will possibly be two different ansible-playbook invocations due to https://bugzilla.redhat.com/show_bug.cgi?id=1568583 forcing a reboot after control-plane startup. But I'm on it.

I think I narrowed this down some. It appears to be an issue with public vs. private hostnames and nothing to do with the CIDRs.

Install method 1 - fails with this bz. Corresponds to inventory.1 and ansible.log.1:
- no openshift_master_cluster_hostname - take the default
- no openshift_master_cluster_public_hostname - take the default
- each host in the inventory uses the public AWS hostname and specifies openshift_public_hostname for each host (which might be redundant but has worked in the past)

Install method 2 - succeeds for the same set of hosts. Corresponds to inventory.2 and ansible.log.2:
- openshift_master_cluster_hostname set to the AWS private hostname
- openshift_master_cluster_public_hostname set to the AWS public hostname
- each host in the inventory uses the AWS internal hostname

I'll attach a tarball with the logs and (redacted) inventories.

Created attachment 1423692 [details]
ansible -vvv log for failing and successful installs
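
To make the difference between the two methods concrete, here is a minimal sketch of an inventory following install method 2 (the one that succeeded). The hostnames are placeholders, not the actual cluster values:

```ini
# Install method 2 (sketch): cluster hostnames set explicitly,
# hosts listed by their AWS internal (private) hostnames.
[OSEv3:vars]
openshift_master_cluster_hostname=ip-10-0-0-10.ec2.internal
openshift_master_cluster_public_hostname=ec2-54-0-0-10.compute-1.amazonaws.com

[masters]
ip-10-0-0-10.ec2.internal

[nodes]
ip-10-0-0-10.ec2.internal
ip-10-0-0-11.ec2.internal
```

Method 1, by contrast, omits both openshift_master_cluster_* variables and lists each host by its public AWS hostname, with openshift_public_hostname set per host.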
Well, hold up on comment 16 and the associated logs. This is a different failure and may be related to stale .kube. Apologies. Re-running the failure case again with a completely clean system.

Comment 15 was incorrect. With the "bad inventory" (inventory.2) from that comment, I am also able to install successfully. I am removing TestBlocker and lowering the severity. The next puddle will have several fixes for the incorrect control-plane and ose-node images. If we cannot reproduce on that puddle, we should close this one out.

Assigning to myself to attempt to re-create on the next puddle.

We hit this yesterday on the latest openshift-ansible master, but my hunch is that it is a side effect of often having to reboot the master halfway through the install because of node NotReady issues such as bug 1568583.

I'm relatively certain that the root cause is bug 1571992.

*** This bug has been marked as a duplicate of bug 1571992 ***