> Description of problem:
The advanced installer of Red Hat OpenShift Container Platform 3.6 fails if the Service Catalog (Tech Preview) and SDN ovs-multitenant driver are enabled together.

> Version-Release number of selected component (if applicable):
RHEL 7.4 + OCP 3.6.1

> How reproducible:
Easily

> Steps to Reproduce:
1. Add the following options:
     os_sdn_network_plugin_name='redhat/openshift-ovs-multitenant'
     openshift_enable_service_catalog=true
2. Run the advanced installer.

> Actual results:
TASK [openshift_service_catalog : Make kube-service-catalog project network global] ***
fatal: [master-1.rhocp.acme.io]: FAILED! => {
    "changed": true,
    "cmd": [
        "oc",
        "adm",
        "pod-network",
        "make-projects-global",
        "kube-service-catalog"
    ],
    "delta": "0:00:00.408251",
    "end": "2017-09-01 21:29:00.065358",
    "failed": true,
    "rc": 1,
    "start": "2017-09-01 21:28:59.657107"
}

STDERR:

error: Removing network isolation for project "kube-service-catalog" failed, error: netnamespaces.network.openshift.io "kube-service-catalog" not found
        to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/config.retry

PLAY RECAP *********************************************************************
infra-1.rhocp.acme.io      : ok=233  changed=62   unreachable=0    failed=0
localhost                  : ok=13   changed=0    unreachable=0    failed=0
master-1.rhocp.acme.io     : ok=1066 changed=307  unreachable=0    failed=1
master-2.rhocp.acme.io     : ok=495  changed=138  unreachable=0    failed=0
master-3.rhocp.acme.io     : ok=495  changed=137  unreachable=0    failed=0
nfs.rhocp.acme.io          : ok=97   changed=18   unreachable=0    failed=0
node-1.rhocp.acme.io       : ok=233  changed=62   unreachable=0    failed=0
node-2.rhocp.acme.io       : ok=233  changed=62   unreachable=0    failed=0

Failure summary:

  1. Host:     master-1.rhocp.acme.io
     Play:     Service Catalog
     Task:     openshift_service_catalog : Make kube-service-catalog project network global
     Message:  ???

> Expected results:
PLAY RECAP *********************************************************************
infra-1.rhocp.acme.io      : ok=243  changed=64   unreachable=0    failed=0
localhost                  : ok=13   changed=0    unreachable=0    failed=0
master-1.rhocp.acme.io     : ok=1128 changed=347  unreachable=0    failed=0
master-2.rhocp.acme.io     : ok=505  changed=140  unreachable=0    failed=0
master-3.rhocp.acme.io     : ok=505  changed=139  unreachable=0    failed=0
nfs.rhocp.acme.io          : ok=97   changed=18   unreachable=0    failed=0
node-1.rhocp.acme.io       : ok=243  changed=64   unreachable=0    failed=0
node-2.rhocp.acme.io       : ok=243  changed=64   unreachable=0    failed=0

> Additional info:
Using the same inventory but with ovs-subnet as SDN driver, the problem doesn't happen.
Created attachment 1321594 openshift-ansible log with ovs-multitenant
Created attachment 1321595 openshift-ansible log with ovs-subnet
Created attachment 1321596 openshift-ansible inventory
[root@master-1 ~]# rpm -q openshift-ansible
openshift-ansible-3.6.173.0.5-3.git.0.522a92a.el7.noarch
[root@master-1 ~]# rpm -q ansible
ansible-2.3.1.0-3.el7.noarch
[root@master-1 ~]# ansible --version
ansible 2.3.1.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = Default w/o overrides
  python version = 2.7.5 (default, May 3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]
QE cannot reproduce this bug. The failure says that the "kube-service-catalog" project does not exist, but the project should have been created by the prior task, "Set Service Catalog namespace", and according to the installation log that task succeeded. Could you run `oc get project | grep 'kube-service-catalog'` to check it?
I was able to reproduce this behavior in another install:

TASK [openshift_service_catalog : Make kube-service-catalog project network global] ***
fatal: [xpaas-master-1]: FAILED! => {
    "changed": true,
    "cmd": [
        "oc",
        "adm",
        "pod-network",
        "make-projects-global",
        "kube-service-catalog"
    ],
    "delta": "0:00:00.368130",
    "end": "2017-09-18 00:08:38.309265",
    "failed": true,
    "rc": 1,
    "start": "2017-09-18 00:08:37.941135"
}

STDERR:

error: Removing network isolation for project "kube-service-catalog" failed, error: netnamespaces.network.openshift.io "kube-service-catalog" not found
        to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/config.retry

PLAY RECAP *********************************************************************
localhost                  : ok=12   changed=0    unreachable=0    failed=0
xpaas-infra-1              : ok=248  changed=46   unreachable=0    failed=0
xpaas-master-1             : ok=1065 changed=267  unreachable=0    failed=1
xpaas-master-2             : ok=522  changed=106  unreachable=0    failed=0
xpaas-master-3             : ok=522  changed=106  unreachable=0    failed=0
xpaas-node-1               : ok=248  changed=46   unreachable=0    failed=0
xpaas-node-2               : ok=248  changed=46   unreachable=0    failed=0
xpaas-node-3               : ok=248  changed=46   unreachable=0    failed=0

Failure summary:

  1. Host:     xpaas-master-1
     Play:     Service Catalog
     Task:     openshift_service_catalog : Make kube-service-catalog project network global
     Message:  ???

It looks like Ansible is not letting OpenShift finish creating the project/namespace. Running the command you asked for just after the error, I can see the project:

[root@xpaas-master-1 cloud-user]# oc get project
NAME                   DISPLAY NAME   STATUS
default                               Active
kube-public                           Active
kube-service-catalog                  Active
kube-system                           Active
logging                               Active
management-infra                      Active
openshift                             Active
openshift-infra                       Active
As an additional comment, if you rerun ansible-playbook after that error, you will get a certificate error, forcing you to start from scratch (snapshot/new env). This behavior is described at: https://docs.openshift.com/container-platform/3.6/install_config/install/advanced_install.html#installer-known-issues
Today QE hit the same issue while testing 3.7 (we still have not managed to reproduce it on 3.6). After the failure happened, we logged in to the master and ran the same command, "oc adm pod-network make-projects-global kube-service-catalog", and it succeeded.

So this issue seems to be caused by latency: after the "kube-service-catalog" namespace is created, its corresponding "kube-service-catalog" netnamespace is still being created and is not active yet. At that moment the installer runs "oc adm pod-network make-projects-global", which tries to access the not-yet-available "kube-service-catalog" netnamespace, so it fails.

The recommended fix is to add one more task that checks that the "kube-service-catalog" netnamespace is active before running the "oc adm pod-network make-projects-global" command.
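
For illustration only, here is a minimal shell sketch of that kind of check. It assumes that `oc get netnamespace <project>` exits non-zero until the netnamespace exists. The function names `retry_until_ok` and `make_project_global` and the `ATTEMPTS`/`DELAY` variables are hypothetical and are not taken from openshift-ansible; this is not the installer's actual fix, just the wait-then-run idea expressed with the commands already shown in this report.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: retry_until_ok <attempts> <delay_seconds> <command> [args...]
retry_until_ok() {
    local attempts="$1"
    local delay="$2"
    shift 2
    local i=1
    while [ "$i" -le "$attempts" ]; do
        # Output is discarded; only the exit status of the command matters.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Sleep between attempts, but not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Hypothetical wrapper: wait for the project's netnamespace to exist,
# then remove network isolation for the project.
# ATTEMPTS and DELAY are illustrative defaults, not tuned values.
make_project_global() {
    local project="$1"
    if ! retry_until_ok "${ATTEMPTS:-30}" "${DELAY:-2}" \
            oc get netnamespace "$project"; then
        echo "netnamespace \"$project\" did not appear in time" >&2
        return 1
    fi
    oc adm pod-network make-projects-global "$project"
}
```

On a master, this would be called as `make_project_global kube-service-catalog`. The wait is bounded so that a netnamespace that never appears still produces a clear failure instead of hanging the run.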
https://github.com/openshift/openshift-ansible/pull/5530
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188