Bug 1487959
Summary: Service Catalog fails to install with ovs-multitenant SDN driver enabled.
Product: OpenShift Container Platform
Reporter: Davi Garcia <dvercill>
Component: Installer
Assignee: ewolinet
Status: CLOSED ERRATA
QA Contact: Johnny Liu <jialiu>
Severity: high
Priority: unspecified
Version: 3.6.1
CC: aos-bugs, jokerman, mmccomas, pasik, xiuwang
Target Milestone: ---
Target Release: 3.7.0
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Doc Text:
Cause: When enabling API aggregation with the ovs-multitenant SDN driver, the installer did not wait for the project to become available as a netnamespace.
Consequence: The attempt to make the project global would fail.
Fix: The installer now waits after creating the project to make sure it is also available as a netnamespace.
Result: The play is able to make the project global correctly and the install completes.
Story Points: ---
Last Closed: 2017-11-28 22:09:17 UTC
Type: Bug
Regression: ---
Description
Davi Garcia
2017-09-03 20:15:28 UTC
Created attachment 1321594 [details]
openshift-ansible log with ovs-multitenant
Created attachment 1321595 [details]
openshift-ansible log with ovs-subnet
Created attachment 1321596 [details]
openshift-ansible inventory
[root@master-1 ~]# rpm -q openshift-ansible
openshift-ansible-3.6.173.0.5-3.git.0.522a92a.el7.noarch
[root@master-1 ~]# rpm -q ansible
ansible-2.3.1.0-3.el7.noarch
[root@master-1 ~]# ansible --version
ansible 2.3.1.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = Default w/o overrides
  python version = 2.7.5 (default, May 3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]

QE cannot reproduce this bug. The failure says the "kube-service-catalog" project does not exist, but that project should have been created by the prior task, "Set Service Catalog namespace", which succeeded according to the installation log. Could you run `oc get project | grep 'kube-service-catalog'` to check?

I was able to reproduce this behavior in another install:

TASK [openshift_service_catalog : Make kube-service-catalog project network global] ***
fatal: [xpaas-master-1]: FAILED! => {
    "changed": true,
    "cmd": [
        "oc",
        "adm",
        "pod-network",
        "make-projects-global",
        "kube-service-catalog"
    ],
    "delta": "0:00:00.368130",
    "end": "2017-09-18 00:08:38.309265",
    "failed": true,
    "rc": 1,
    "start": "2017-09-18 00:08:37.941135"
}

STDERR:
error: Removing network isolation for project "kube-service-catalog" failed, error: netnamespaces.network.openshift.io "kube-service-catalog" not found

to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/config.retry

PLAY RECAP *********************************************************************
localhost       : ok=12    changed=0    unreachable=0   failed=0
xpaas-infra-1   : ok=248   changed=46   unreachable=0   failed=0
xpaas-master-1  : ok=1065  changed=267  unreachable=0   failed=1
xpaas-master-2  : ok=522   changed=106  unreachable=0   failed=0
xpaas-master-3  : ok=522   changed=106  unreachable=0   failed=0
xpaas-node-1    : ok=248   changed=46   unreachable=0   failed=0
xpaas-node-2    : ok=248   changed=46   unreachable=0   failed=0
xpaas-node-3    : ok=248   changed=46   unreachable=0   failed=0

Failure summary:

1. Host:    xpaas-master-1
   Play:    Service Catalog
   Task:    openshift_service_catalog : Make kube-service-catalog project network global
   Message: ???

It looks like Ansible is not giving OpenShift time to finish creating the project/namespace. Running the command you asked for just after the error, I can see the project:

[root@xpaas-master-1 cloud-user]# oc get project
NAME                   DISPLAY NAME   STATUS
default                               Active
kube-public                           Active
kube-service-catalog                  Active
kube-system                           Active
logging                               Active
management-infra                      Active
openshift                             Active
openshift-infra                       Active

As an additional note, if you rerun ansible-playbook after that error you will hit a certificate error, forcing you to start from scratch again (snapshot/new environment). This behavior is described at:
https://docs.openshift.com/container-platform/3.6/install_config/install/advanced_install.html#installer-known-issues

Today QE ran into the same issue during 3.7 testing (we still have not been able to reproduce it in 3.6). After the failure happened, we logged into the master and ran the same command, "oc adm pod-network make-projects-global kube-service-catalog", and it succeeded. So this issue appears to be caused by a timing race: after the "kube-service-catalog" namespace is created, its corresponding "kube-service-catalog" netnamespace is not active yet (it is still being created). At that moment the installer runs "oc adm pod-network make-projects-global", which tries to access the not-yet-available "kube-service-catalog" netnamespace, so it fails.

So the recommended fix is to add one more task that checks the "kube-service-catalog" netnamespace is active before running the "oc adm pod-network make-projects-global" command.
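As a rough illustration of what such a wait task could look like in the openshift-ansible role (a minimal sketch only; the task names, retry counts, and use of the command module are assumptions, not necessarily the actual merged change):

    # Hypothetical sketch: poll until the netnamespace created alongside the
    # project is visible, so the next task does not race against the SDN.
    - name: Wait for kube-service-catalog netnamespace to exist
      command: oc get netnamespace kube-service-catalog
      register: netns_check
      until: netns_check.rc == 0
      retries: 30
      delay: 1
      changed_when: false

    # Only after the netnamespace exists is it safe to make the project global.
    - name: Make kube-service-catalog project network global
      command: oc adm pod-network make-projects-global kube-service-catalog

The until/retries loop simply reruns `oc get netnamespace` until it exits 0, which matches the suggestion above of confirming the netnamespace is active before calling make-projects-global.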
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188