Bug 1487959 - Service Catalog fails to install with ovs-multitenant SDN driver enabled.
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.6.1
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Target Release: 3.7.0
Assigned To: ewolinet
QA Contact: Johnny Liu
Depends On:
Blocks:
 
Reported: 2017-09-03 16:15 EDT by Davi Garcia
Modified: 2017-11-28 17:09 EST (History)
5 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When enabling API aggregation with the ovs-multitenant SDN driver, the installer did not wait for the project to be ready as a netnamespace.
Consequence: When the installer tried to make the project global, it would fail.
Fix: The installer now waits after creating the project to make sure it is also available as a netnamespace.
Result: The play is able to correctly make it a global project, and the installation completes.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-11-28 17:09:17 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
openshift-ansible log with ovs-multitenant (910.91 KB, text/plain), 2017-09-03 16:18 EDT, Davi Garcia
openshift-ansible log with ovs-subnet (924.99 KB, text/plain), 2017-09-03 16:18 EDT, Davi Garcia
openshift-ansible inventory (2.25 KB, text/plain), 2017-09-03 16:19 EDT, Davi Garcia


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:3188 normal SHIPPED_LIVE Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update 2017-11-28 21:34:54 EST

Description Davi Garcia 2017-09-03 16:15:28 EDT
> Description of problem:

The advanced installer of Red Hat OpenShift Container Platform 3.6 fails if the Service Catalog (Tech Preview) and SDN ovs-multitenant driver are enabled together.

> Version-Release number of selected component (if applicable):

RHEL 7.4 + OCP 3.6.1

> How reproducible:

Easily

> Steps to Reproduce:

1. Add the following options:
   os_sdn_network_plugin_name='redhat/openshift-ovs-multitenant'
   openshift_enable_service_catalog=true
2. Run the advanced installer.
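In inventory terms, the two options from step 1 go in the cluster variables section. A minimal fragment might look like this (the [OSEv3:vars] group name is the standard openshift-ansible convention and is assumed here, not copied from the attached inventory):

```ini
[OSEv3:vars]
; SDN driver and Service Catalog (Tech Preview) options that trigger the failure
os_sdn_network_plugin_name='redhat/openshift-ovs-multitenant'
openshift_enable_service_catalog=true
```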

> Actual results:

TASK [openshift_service_catalog : Make kube-service-catalog project network global] ***
fatal: [master-1.rhocp.acme.io]: FAILED! => {
    "changed": true, 
    "cmd": [
        "oc", 
        "adm", 
        "pod-network", 
        "make-projects-global", 
        "kube-service-catalog"
    ], 
    "delta": "0:00:00.408251", 
    "end": "2017-09-01 21:29:00.065358", 
    "failed": true, 
    "rc": 1, 
    "start": "2017-09-01 21:28:59.657107"
}

STDERR:

error: Removing network isolation for project "kube-service-catalog" failed, error: netnamespaces.network.openshift.io "kube-service-catalog" not found
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/config.retry

PLAY RECAP *********************************************************************
infra-1.rhocp.acme.io      : ok=233  changed=62   unreachable=0    failed=0   
localhost                  : ok=13   changed=0    unreachable=0    failed=0   
master-1.rhocp.acme.io     : ok=1066 changed=307  unreachable=0    failed=1   
master-2.rhocp.acme.io     : ok=495  changed=138  unreachable=0    failed=0   
master-3.rhocp.acme.io     : ok=495  changed=137  unreachable=0    failed=0   
nfs.rhocp.acme.io          : ok=97   changed=18   unreachable=0    failed=0   
node-1.rhocp.acme.io       : ok=233  changed=62   unreachable=0    failed=0   
node-2.rhocp.acme.io       : ok=233  changed=62   unreachable=0    failed=0   


Failure summary:

  1. Host:     master-1.rhocp.acme.io
     Play:     Service Catalog
     Task:     openshift_service_catalog : Make kube-service-catalog project network global
     Message:  ???


> Expected results:

PLAY RECAP *********************************************************************
infra-1.rhocp.acme.io      : ok=243  changed=64   unreachable=0    failed=0   
localhost                  : ok=13   changed=0    unreachable=0    failed=0   
master-1.rhocp.acme.io     : ok=1128 changed=347  unreachable=0    failed=0   
master-2.rhocp.acme.io     : ok=505  changed=140  unreachable=0    failed=0   
master-3.rhocp.acme.io     : ok=505  changed=139  unreachable=0    failed=0   
nfs.rhocp.acme.io          : ok=97   changed=18   unreachable=0    failed=0   
node-1.rhocp.acme.io       : ok=243  changed=64   unreachable=0    failed=0   
node-2.rhocp.acme.io       : ok=243  changed=64   unreachable=0    failed=0   

> Additional info:

Using the same inventory but with ovs-subnet as the SDN driver, the problem does not happen.
Comment 1 Davi Garcia 2017-09-03 16:18 EDT
Created attachment 1321594 [details]
openshift-ansible log with ovs-multitenant
Comment 2 Davi Garcia 2017-09-03 16:18 EDT
Created attachment 1321595 [details]
openshift-ansible log with ovs-subnet
Comment 3 Davi Garcia 2017-09-03 16:19 EDT
Created attachment 1321596 [details]
openshift-ansible inventory
Comment 4 Davi Garcia 2017-09-03 16:21:37 EDT
[root@master-1 ~]# rpm -q openshift-ansible
openshift-ansible-3.6.173.0.5-3.git.0.522a92a.el7.noarch

[root@master-1 ~]# rpm -q ansible
ansible-2.3.1.0-3.el7.noarch

[root@master-1 ~]# ansible --version
ansible 2.3.1.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = Default w/o overrides
  python version = 2.7.5 (default, May  3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]
Comment 5 Johnny Liu 2017-09-05 01:57:03 EDT
QE cannot reproduce this bug. The failure says the "kube-service-catalog" project does not exist, yet the project should have been created in the prior task, "Set Service Catalog namespace", and in the installation log that task succeeded.

Could you run `oc get project | grep 'kube-service-catalog'` to check it?
Comment 6 Davi Garcia 2017-09-18 00:15:01 EDT
I was able to reproduce this behavior in another install:

TASK [openshift_service_catalog : Make kube-service-catalog project network global] ***
fatal: [xpaas-master-1]: FAILED! => {
    "changed": true, 
    "cmd": [
        "oc", 
        "adm", 
        "pod-network", 
        "make-projects-global", 
        "kube-service-catalog"
    ], 
    "delta": "0:00:00.368130", 
    "end": "2017-09-18 00:08:38.309265", 
    "failed": true, 
    "rc": 1, 
    "start": "2017-09-18 00:08:37.941135"
}

STDERR:

error: Removing network isolation for project "kube-service-catalog" failed, error: netnamespaces.network.openshift.io "kube-service-catalog" not found
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/config.retry

PLAY RECAP *********************************************************************
localhost                  : ok=12   changed=0    unreachable=0    failed=0   
xpaas-infra-1              : ok=248  changed=46   unreachable=0    failed=0   
xpaas-master-1             : ok=1065 changed=267  unreachable=0    failed=1   
xpaas-master-2             : ok=522  changed=106  unreachable=0    failed=0   
xpaas-master-3             : ok=522  changed=106  unreachable=0    failed=0   
xpaas-node-1               : ok=248  changed=46   unreachable=0    failed=0   
xpaas-node-2               : ok=248  changed=46   unreachable=0    failed=0   
xpaas-node-3               : ok=248  changed=46   unreachable=0    failed=0   


Failure summary:

  1. Host:     xpaas-master-1
     Play:     Service Catalog
     Task:     openshift_service_catalog : Make kube-service-catalog project network global
     Message:  ???

It looks like Ansible is not giving OpenShift time to finish creating the project/namespace. Running the command you asked for just after the error, I can see the project:

[root@xpaas-master-1 cloud-user]# oc get project 
NAME                   DISPLAY NAME   STATUS
default                               Active
kube-public                           Active
kube-service-catalog                  Active
kube-system                           Active
logging                               Active
management-infra                      Active
openshift                             Active
openshift-infra                       Active
Comment 7 Davi Garcia 2017-09-18 01:11:56 EDT
As an additional comment, if you rerun the ansible-playbook after that error, you will get a certificate error, forcing you to start from scratch again (snapshot/new env). This behavior is described at:
https://docs.openshift.com/container-platform/3.6/install_config/install/advanced_install.html#installer-known-issues
Comment 8 Johnny Liu 2017-09-25 05:46:28 EDT
Today QE was running 3.7 testing and encountered the same issue (we still have had no chance to reproduce it in 3.6). After the failure happened, we logged into the master and ran the same command, "oc adm pod-network make-projects-global kube-service-catalog", and it succeeded.

So it seems this issue is caused by a timing race: after the "kube-service-catalog" namespace is created, its corresponding "kube-service-catalog" netnamespace is not yet active and is still being created. At that moment the installer runs "oc adm pod-network make-projects-global", which tries to access the unavailable "kube-service-catalog" netnamespace, so it fails.

So the recommended fix is to add one more task that checks the "kube-service-catalog" netnamespace is active before running the "oc adm pod-network make-projects-global" command.
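The wait-then-check idea above can be sketched in shell form. This is a sketch only: the `retry_until` helper name and the retry counts are illustrative, and the actual fix in openshift-ansible implements the wait as a playbook task rather than a shell script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
retry_until() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Illustrative usage (commented out because it needs a live cluster):
# wait for the netnamespace to exist, then remove network isolation.
# retry_until 30 2 oc get netnamespace kube-service-catalog \
#   && oc adm pod-network make-projects-global kube-service-catalog
```

The key point is that `oc get netnamespace kube-service-catalog` exits non-zero while the netnamespace is still being created, so polling it closes the window in which `make-projects-global` would fail.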
Comment 14 errata-xmlrpc 2017-11-28 17:09:17 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188
