Description of problem:
Installer is failing at the following step:
TASK [openshift_service_catalog : Verify that the web console is running]
FAILED - RETRYING: Verify that the web console is running (60 retries left).
FAILED - RETRYING: Verify that the web console is running (59 retries left).
FAILED - RETRYING: Verify that the web console is running (58 retries left).
FAILED - RETRYING: Verify that the web console is running (57 retries left).
api-server pod is in CrashLoopBackOff
kube-service-catalog apiserver-tqb6n 0/1 CrashLoopBackOff 6 8m
kube-service-catalog controller-manager-78rx5 1/1 Running 0 42m
Logs from apiserver pod
root@ip-172-31-16-115: ~ # oc logs -n kube-service-catalog apiserver-tqb6n
I0412 16:03:51.795294 1 feature_gate.go:184] feature gates: map[OriginatingIdentity:true]
I0412 16:03:51.795465 1 hyperkube.go:188] Service Catalog version v0.0.0-master+$Format:%h$ (built 2018-04-11T19:40:47Z)
W0412 16:03:52.016238 1 authentication.go:229] Unable to get configmap/extension-apiserver-authentication in kube-system. Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
Error: Get https://172.24.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: x509: certificate is valid for 172.30.0.1, 172.31.16.115, 184.108.40.206, not 172.24.0.1
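The x509 error above means the API serving certificate's SAN list does not include the service IP the pod dialed (172.24.0.1); the cert was issued for the default service IP (172.30.0.1). One way to diagnose this is to inspect the certificate's Subject Alternative Names. A hedged sketch follows; the real cert path shown in the comment is the usual 3.x location but is an assumption to verify on your install, and the runnable demo below uses a throwaway self-signed cert instead:

```shell
# On a real master you would inspect the serving cert directly, e.g.:
#   openssl x509 -noout -text -in /etc/origin/master/master.server.crt \
#     | grep -A1 'Subject Alternative Name'
# (path is the usual OpenShift 3.x location; adjust for your install)

# Self-contained demo with a throwaway cert carrying the default service IP
# as a SAN (requires OpenSSL 1.1.1+ for -addext):
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/demo.key -out /tmp/demo.crt -subj "/CN=demo" \
  -addext "subjectAltName=IP:172.30.0.1" 2>/dev/null

# Print the SAN block; for this demo cert it lists IP Address:172.30.0.1
openssl x509 -noout -text -in /tmp/demo.crt | grep -A1 'Subject Alternative Name'
```

If the service IP the client is dialing does not appear in that output, you get exactly the "certificate is valid for ..." error shown above.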
Version-Release number of the following components:
openshift-ansible head is f93c5ce8ec9fef0044e9dbf9e5fc82271e26f3a6
rpm -q ansible
config file = /root/openshift-ansible/ansible.cfg
configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python2.7/site-packages/ansible
executable location = /usr/bin/ansible
python version = 2.7.5 (default, Feb 20 2018, 09:19:12) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]
Steps to Reproduce:
1. Run the deploy_cluster playbook with the attached inventory.

Actual results:
Installer fails; please see the ansible log and inventory attached.

Expected results:
Installer should complete.
Please attach logs from ansible-playbook with the -vvv flag
Created attachment 1420921 [details]
This is happening because kube_svc_ip is not being calculated correctly from portal_net. At first I thought we had lost the step that appends kube_svc_ip to openshift.common.all_hostnames, but the problem must be somewhere else in the facts refactoring.
I've tested our function that generates the first IP for a network; it appears to be working as designed.
My initial feeling is that this is related to bootstrapping in some way.
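For reference, the kube service IP is the first usable host address of openshift_portal_net, i.e. the network address plus one. A minimal shell sketch of that calculation (not the installer's actual code; it assumes the CIDR is given as an aligned network address, as portal_net values are, and that the network is larger than the last octet):

```shell
# first_svc_ip: print the first usable host IP of a CIDR such as portal_net.
# Sketch only: simply increments the final octet of the network address.
first_svc_ip() {
  base=${1%/*}          # strip the /prefix length
  last=${base##*.}      # final octet of the network address
  echo "${base%.*}.$((last + 1))"
}

first_svc_ip 172.30.0.0/16   # default portal_net -> 172.30.0.1
first_svc_ip 172.24.0.0/14   # a custom portal net (prefix length assumed)
                             # yielding the 172.24.0.1 seen in the log
```

This matches the certificate error in the description: the cert was issued for 172.30.0.1 (the default portal net's first host), while the pod dialed 172.24.0.1 (the custom portal net's first host).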
The node labels and selectors are using obsolete versions. You'll need to adjust your inventory to consume the new node labeling scheme in 3.10.
I'm still investigating this issue.
(In reply to Michael Gugino from comment #5)
> The node labels and selectors are using obsolete versions. You'll need to
> adjust your inventory to consume the new node labeling scheme in 3.10.
OK, I will change the inventory file.
I don't see any obvious reason for this to happen. Can you provide debugging output with -vvv ?
Created attachment 1421781 [details]
ansible log with -vvv
172.30.0.0/16 is the default for openshift_portal_net if not otherwise set.
It appears this cluster was already built (at least partially) and the portal net has changed. At the very least, the attached log shows that the certs are already present.
I will try your specific portal net settings to see if those values are being disregarded.
I was unable to replicate this error with the portal nets provided.
Can you try deploying against an entirely new set of hosts?
There were a few related BZs filed: https://bugzilla.redhat.com/show_bug.cgi?id=1568031 and https://bugzilla.redhat.com/show_bug.cgi?id=1565442
I will try again after these blocker bugs are fixed.
With v3.10.0-0.21.0 the router and registry deployments still fail with this error.
These values have always worked in the past. Will try taking the defaults for these.
Taking the defaults for the network CIDRs resulted in a working cluster for me. We use the modified values in comment 12 for cluster horizontal scalability testing.
I'm still looking for -vvv output from a cluster that doesn't already have certs generated. Rerunning the installer does not generate new certs if they are already present.
I'll run an install with -vvv now. It will possibly be two different ansible-playbook invocations due to https://bugzilla.redhat.com/show_bug.cgi?id=1568583 forcing a reboot after control-plane startup. But, I'm on it.
I think I narrowed this down some. It appears to be an issue with public vs private hostnames and nothing to do with the CIDRs.
Install method 1 - fails with this bz. Corresponds to inventory.1 and ansible.log.1:
- no openshift_master_cluster_hostname (take the default)
- no openshift_master_cluster_public_hostname (take the default)
- each host in the inventory uses the public AWS hostname and specifies openshift_public_hostname (which may be redundant but has worked in the past)
Install method 2 - succeeds for the same set of hosts. Corresponds to inventory.2 and ansible.log.2:
- openshift_master_cluster_hostname set to the AWS private hostname
- openshift_master_cluster_public_hostname set to the AWS public hostname
- each host in the inventory uses the AWS internal hostname
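A hypothetical inventory sketch of the working method 2 settings (all hostnames below are placeholders, not the actual hosts from the attached, redacted inventories):

```ini
[OSEv3:vars]
# Internal name the control plane uses for the API (placeholder value)
openshift_master_cluster_hostname=master.internal.example.com
# Externally reachable name for the API (placeholder value)
openshift_master_cluster_public_hostname=master.public.example.com

[masters]
# Hosts are addressed by their AWS internal hostname (placeholder value)
ip-10-0-0-1.ec2.internal
```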
I'll attach a tarball with the logs and (redacted) inventories
Created attachment 1423692 [details]
ansible -vvv log for failing and successful installs
Well, hold up on comment 16 and the associated logs. This is a different failure and may be related to stale .kube. Apologies. Re-running the failure case again with a completely clean system.
Comment 15 was incorrect. With the "bad inventory" (inventory.2) from that comment, I am also able to install successfully. I am removing TestBlocker and lowering the severity. The next puddle will have several fixes for the incorrect control-plane and ose-node images. If we cannot reproduce on that puddle, we should close this one out.
Assigning to myself to attempt a re-create on the next puddle. We hit this yesterday on the latest openshift-ansible master, but my hunch is that it is a side effect of often having to reboot the master halfway through the install because of node NotReady issues such as bug 1568583.
I'm relatively certain that the root cause is bug 1571992.
*** This bug has been marked as a duplicate of bug 1571992 ***