Bug 1487959 - Service Catalog fails to install with ovs-multitenant SDN driver enabled.
Summary: Service Catalog fails to install with ovs-multitenant SDN driver enabled.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.6.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.7.0
Assignee: ewolinet
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-09-03 20:15 UTC by Davi Garcia
Modified: 2017-11-28 22:09 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When enabling API aggregation with the ovs-multitenant SDN driver, we did not wait for the project to be ready as a netnamespace.
Consequence: When we tried to make the project global, it would fail.
Fix: We now wait after creating the project to make sure it is also available as a netnamespace.
Result: The play can correctly make it a global project and the install completes.
Clone Of:
Environment:
Last Closed: 2017-11-28 22:09:17 UTC
Target Upstream Version:
Embargoed:


Attachments
openshift-ansible log with ovs-multitenant (910.91 KB, text/plain) - 2017-09-03 20:18 UTC, Davi Garcia
openshift-ansible log with ovs-subnet (924.99 KB, text/plain) - 2017-09-03 20:18 UTC, Davi Garcia
openshift-ansible inventory (2.25 KB, text/plain) - 2017-09-03 20:19 UTC, Davi Garcia


Links
Red Hat Product Errata RHSA-2017:3188 (normal, SHIPPED_LIVE): Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update - last updated 2017-11-29 02:34:54 UTC

Description Davi Garcia 2017-09-03 20:15:28 UTC
> Description of problem:

The advanced installer of Red Hat OpenShift Container Platform 3.6 fails if the Service Catalog (Tech Preview) and SDN ovs-multitenant driver are enabled together.

> Version-Release number of selected component (if applicable):

RHEL 7.4 + OCP 3.6.1

> How reproducible:

Easily

> Steps to Reproduce:

1. Add the following options to the Ansible inventory:
   os_sdn_network_plugin_name='redhat/openshift-ovs-multitenant'
   openshift_enable_service_catalog=true
2. Run the advanced installer.
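
For illustration, an example invocation (the inventory path is a placeholder; the playbook path matches the RPM install implied by the retry hint in the output below):

   # Run the 3.6 advanced (byo) install playbook against your inventory.
   ansible-playbook -i /path/to/inventory \
       /usr/share/ansible/openshift-ansible/playbooks/byo/config.yml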

> Actual results:

TASK [openshift_service_catalog : Make kube-service-catalog project network global] ***
fatal: [master-1.rhocp.acme.io]: FAILED! => {
    "changed": true, 
    "cmd": [
        "oc", 
        "adm", 
        "pod-network", 
        "make-projects-global", 
        "kube-service-catalog"
    ], 
    "delta": "0:00:00.408251", 
    "end": "2017-09-01 21:29:00.065358", 
    "failed": true, 
    "rc": 1, 
    "start": "2017-09-01 21:28:59.657107"
}

STDERR:

error: Removing network isolation for project "kube-service-catalog" failed, error: netnamespaces.network.openshift.io "kube-service-catalog" not found
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/config.retry

PLAY RECAP *********************************************************************
infra-1.rhocp.acme.io      : ok=233  changed=62   unreachable=0    failed=0   
localhost                  : ok=13   changed=0    unreachable=0    failed=0   
master-1.rhocp.acme.io     : ok=1066 changed=307  unreachable=0    failed=1   
master-2.rhocp.acme.io     : ok=495  changed=138  unreachable=0    failed=0   
master-3.rhocp.acme.io     : ok=495  changed=137  unreachable=0    failed=0   
nfs.rhocp.acme.io          : ok=97   changed=18   unreachable=0    failed=0   
node-1.rhocp.acme.io       : ok=233  changed=62   unreachable=0    failed=0   
node-2.rhocp.acme.io       : ok=233  changed=62   unreachable=0    failed=0   


Failure summary:

  1. Host:     master-1.rhocp.acme.io
     Play:     Service Catalog
     Task:     openshift_service_catalog : Make kube-service-catalog project network global
     Message:  ???


> Expected results:

PLAY RECAP *********************************************************************
infra-1.rhocp.acme.io      : ok=243  changed=64   unreachable=0    failed=0   
localhost                  : ok=13   changed=0    unreachable=0    failed=0   
master-1.rhocp.acme.io     : ok=1128 changed=347  unreachable=0    failed=0   
master-2.rhocp.acme.io     : ok=505  changed=140  unreachable=0    failed=0   
master-3.rhocp.acme.io     : ok=505  changed=139  unreachable=0    failed=0   
nfs.rhocp.acme.io          : ok=97   changed=18   unreachable=0    failed=0   
node-1.rhocp.acme.io       : ok=243  changed=64   unreachable=0    failed=0   
node-2.rhocp.acme.io       : ok=243  changed=64   unreachable=0    failed=0   

> Additional info:

Using the same inventory but with ovs-subnet as SDN driver, the problem doesn't happen.

Comment 1 Davi Garcia 2017-09-03 20:18:07 UTC
Created attachment 1321594 [details]
openshift-ansible log with ovs-multitenant

Comment 2 Davi Garcia 2017-09-03 20:18:53 UTC
Created attachment 1321595 [details]
openshift-ansible log with ovs-subnet

Comment 3 Davi Garcia 2017-09-03 20:19:36 UTC
Created attachment 1321596 [details]
openshift-ansible inventory

Comment 4 Davi Garcia 2017-09-03 20:21:37 UTC
[root@master-1 ~]# rpm -q openshift-ansible
openshift-ansible-3.6.173.0.5-3.git.0.522a92a.el7.noarch

[root@master-1 ~]# rpm -q ansible
ansible-2.3.1.0-3.el7.noarch

[root@master-1 ~]# ansible --version
ansible 2.3.1.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = Default w/o overrides
  python version = 2.7.5 (default, May  3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]

Comment 5 Johnny Liu 2017-09-05 05:57:03 UTC
QE cannot reproduce this bug. The failure says the "kube-service-catalog" project does not exist, yet the project should have been created in the prior task - "Set Service Catalog namespace" - and in the installation log that task succeeded.

Could you run `oc get project | grep 'kube-service-catalog'` to check it?

Comment 6 Davi Garcia 2017-09-18 04:15:01 UTC
I was able to reproduce this behavior in another install:

TASK [openshift_service_catalog : Make kube-service-catalog project network global] ***
fatal: [xpaas-master-1]: FAILED! => {
    "changed": true, 
    "cmd": [
        "oc", 
        "adm", 
        "pod-network", 
        "make-projects-global", 
        "kube-service-catalog"
    ], 
    "delta": "0:00:00.368130", 
    "end": "2017-09-18 00:08:38.309265", 
    "failed": true, 
    "rc": 1, 
    "start": "2017-09-18 00:08:37.941135"
}

STDERR:

error: Removing network isolation for project "kube-service-catalog" failed, error: netnamespaces.network.openshift.io "kube-service-catalog" not found
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/config.retry

PLAY RECAP *********************************************************************
localhost                  : ok=12   changed=0    unreachable=0    failed=0   
xpaas-infra-1              : ok=248  changed=46   unreachable=0    failed=0   
xpaas-master-1             : ok=1065 changed=267  unreachable=0    failed=1   
xpaas-master-2             : ok=522  changed=106  unreachable=0    failed=0   
xpaas-master-3             : ok=522  changed=106  unreachable=0    failed=0   
xpaas-node-1               : ok=248  changed=46   unreachable=0    failed=0   
xpaas-node-2               : ok=248  changed=46   unreachable=0    failed=0   
xpaas-node-3               : ok=248  changed=46   unreachable=0    failed=0   


Failure summary:

  1. Host:     xpaas-master-1
     Play:     Service Catalog
     Task:     openshift_service_catalog : Make kube-service-catalog project network global
     Message:  ???

Looks like Ansible is not giving OpenShift enough time to finish creating the project/namespace. Running the command you asked for just after the error, I can see the project:

[root@xpaas-master-1 cloud-user]# oc get project 
NAME                   DISPLAY NAME   STATUS
default                               Active
kube-public                           Active
kube-service-catalog                  Active
kube-system                           Active
logging                               Active
management-infra                      Active
openshift                             Active
openshift-infra                       Active

Comment 7 Davi Garcia 2017-09-18 05:11:56 UTC
As an additional comment, if you rerun ansible-playbook after that error, you will get a certificate error, forcing you to start over from scratch (snapshot/new environment). This behavior is described at:
https://docs.openshift.com/container-platform/3.6/install_config/install/advanced_install.html#installer-known-issues

Comment 8 Johnny Liu 2017-09-25 09:46:28 UTC
Today QE was running 3.7 testing and encountered the same issue (we still have not had a chance to reproduce it in 3.6). After the failure happened, we logged into the master and ran the same command - "oc adm pod-network make-projects-global kube-service-catalog" - and it succeeded.

So it seems this issue is caused by a timing race: after the "kube-service-catalog" namespace is created, its corresponding "kube-service-catalog" netnamespace is not yet active (it is still being created). At that moment the installer runs "oc adm pod-network make-projects-global", which tries to access the unavailable "kube-service-catalog" netnamespace, so it fails.

The recommended fix is to add one more task that checks that the "kube-service-catalog" netnamespace is active before running the "oc adm pod-network make-projects-global" command.
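
For illustration, a shell-level sketch of such a wait (an editorial sketch, not the actual openshift-ansible change; "oc get netnamespace" queries the netnamespaces.network.openshift.io resource named in the error message):

   # Poll until the project's netnamespace exists, then make it global.
   until oc get netnamespace kube-service-catalog >/dev/null 2>&1; do
       sleep 2
   done
   oc adm pod-network make-projects-global kube-service-catalog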

Comment 14 errata-xmlrpc 2017-11-28 22:09:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

