Bug 1669131 - Issue with openshift-ovs-multitenant network plugin while installing service catalog
Summary: Issue with openshift-ovs-multitenant network plugin while installing service catalog
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.11.z
Assignee: Scott Dodson
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-01-24 12:18 UTC by Mohit
Modified: 2019-02-14 18:31 UTC
CC List: 17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-01 15:36:21 UTC
Target Upstream Version:
Embargoed:



Description Mohit 2019-01-24 12:18:05 UTC
Issue with the openshift-ovs-multitenant network plugin while installing the service catalog: the install fails to create the kube-service-catalog netnamespace.

Playbook: playbooks/openshift-service-catalog/config.yml

It fails on the task below:

TASK [openshift_service_catalog : Waiting for netnamespace kube-service-catalog to be ready] **********************************************************
FAILED - RETRYING: Waiting for netnamespace kube-service-catalog to be ready (30 retries left).Result was: {
    "attempts": 1, 
    "changed": false, 
    "invocation": {
        "module_args": {
            "all_namespaces": null, 
            "content": null, 
            "debug": false, 
            "delete_after": false, 
            "field_selector": null, 
            "files": null, 
            "force": false, 
            "kind": "netnamespace", 
            "kubeconfig": "/etc/origin/master/admin.kubeconfig", 
            "name": "kube-service-catalog", 
            "namespace": "default", 
            "selector": null, 
            "state": "list"
        }
    }, 
    "results": {
        "cmd": "/bin/oc get netnamespace kube-service-catalog -o json -n default", 
        "results": [
            {}
        ], 
        "returncode": 0, 
        "stderr": "Error from server (NotFound): netnamespaces.network.openshift.io \"kube-service-catalog\" not found\n", 
        "stdout": ""
    }, 
    "retries": 31, 
    "state": "list"
}

Comment 3 Mohit 2019-01-24 13:01:27 UTC
Openshift Version : 3.11.51

Comment 5 Scott Dodson 2019-01-24 13:52:52 UTC
Here's the code that creates the project/namespace and then, if multitenant is in use, waits for the netnamespace.

https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_service_catalog/tasks/install.yml#L8-L29

Comment 6 Scott Dodson 2019-01-24 13:59:17 UTC
This is being run after the initial cluster installation, so the cluster may be under load and creation may take longer than 30 seconds to complete. Please update the delay value to 10 and try again. If that doesn't resolve it, we need to gather controller logs to debug why the netnamespace isn't being created.
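
For reference, the wait task pairs retries with delay, so raising the delay stretches the overall window from roughly 30 seconds to roughly five minutes. A minimal sketch of the shape of the task (not a verbatim copy of install.yml; the success condition shown here is illustrative):

# Sketch only -- not a verbatim copy of
# roles/openshift_service_catalog/tasks/install.yml.
- name: Waiting for netnamespace kube-service-catalog to be ready
  oc_obj:
    state: list
    kind: netnamespace
    name: kube-service-catalog
    kubeconfig: /etc/origin/master/admin.kubeconfig
  register: get_netns
  # the role's real success condition differs; this check is illustrative
  until: get_netns.results.results | first != {}
  retries: 30
  delay: 10   # raised per this comment; 30 retries x 10s allows ~5 minutes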

Comment 19 Jay Boyd 2019-01-30 17:55:27 UTC
Re comment #15, the install failing at [template_service_broker : Register TSB with broker] is a known issue: https://bugzilla.redhat.com/show_bug.cgi?id=1661569

A simple workaround: in this maintenance release we added liveness and readiness probes on the Service Catalog pods, and you could try removing them (perhaps the pods are being restarted frequently; I'm guessing, as there are no diagnostics here to suggest that). That is, in openshift-ansible/roles/openshift_service_catalog/templates/api_server.j2, locate the readinessProbe and livenessProbe sections, delete them, and run the playbook again. There are similar probes in the controller_manager template, but if that pod were being restarted it wouldn't cause this issue.
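
For orientation, the sections to remove have the usual Kubernetes probe shape; this is only an illustrative sketch, and the exact paths, ports, and thresholds in api_server.j2 may differ:

# Illustrative shape of the blocks to delete from api_server.j2;
# field values here are examples, not the template's actual settings.
containers:
- name: apiserver
  readinessProbe:        # delete this whole block
    httpGet:
      path: /healthz
      port: 6443
      scheme: HTTPS
  livenessProbe:         # delete this whole block as well
    httpGet:
      path: /healthz
      port: 6443
      scheme: HTTPS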

Comment 20 Jay Boyd 2019-01-30 18:11:44 UTC
Perhaps the explanation in comment #19 was too brief: the real fix adds a wait for the Service Catalog API Server and Controller Manager rollouts to finish before proceeding to install any brokers. That fix is in the maintenance release due out any moment. The workaround removes the health check probes, which were recently added. The issue you are hitting in comment #15 is caused by timing.

Additionally, in the code you are running, the Service Catalog health and readiness checks are combined into a single probe, which may cause unnecessary pod restarts. I advise removing the health checks for now. This has been addressed and corrected in the next maintenance release (https://bugzilla.redhat.com/show_bug.cgi?id=1648458).

I don't have any insight into the failure to create the netnamespace; updating the wait delay as advised in comment #6 makes sense.
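
As a rough illustration of the kind of rollout wait described above (the DaemonSet name, namespace, and readiness check here are assumptions, not the fix that actually shipped):

# Sketch of a post-rollout wait, not the change that shipped in openshift-ansible.
# Assumes the catalog API server runs as a DaemonSet named "apiserver"
# in the kube-service-catalog namespace.
- name: Wait for the Service Catalog API server pods to become ready
  command: >
    oc get daemonset apiserver -n kube-service-catalog
    -o jsonpath={.status.numberReady}
  register: catalog_api_ready
  until: catalog_api_ready.stdout | int > 0
  retries: 30
  delay: 10
  changed_when: false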

Comment 21 Suresh 2019-01-31 10:05:38 UTC
(In reply to Jay Boyd from comment #20)

We still faced the same issue after removing the liveness and readiness probes.

The issue was with the network. The Service Catalog controller manager logged the following:

controller_manager.go:232] error running controllers: failed to get api versions from server: failed to get supported resources from server: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request, servicecatalog.k8s.io/v1beta1: an error on the server ("unable to set dialer for kube-service-catalog/apiserver as rest transport is of type *transport.debuggingRoundTripper") has prevented the request from succeeding

We checked the metrics pod and saw "no route to host" errors when it accessed the kubernetes service. We then tried to access the service from the node where the pod was running, and it worked fine.

We learned that there had been earlier network issues between the master and compute nodes because firewall ports were not set up properly; they were later corrected. It may be that the network change was never picked up by the metrics pod, so it kept failing while accessing the kubernetes service.

So we restarted the metrics pod, and this time it came up fine without errors, but while accessing the API service we were still getting the error below:

("unable to set dialer for kube-service-catalog/apiserver as rest transport is of type *transport.debuggingRoundTripper")

We checked and found this may be related to the increased loglevel, so we reduced the loglevel to 2 and restarted the master API and controller services; after that we were able to access the API service.
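
For anyone wanting to script that change, a rough Ansible sketch, assuming the loglevel is controlled by the DEBUG_LOGLEVEL setting in /etc/origin/master/master.env and that the master-restart helper is available, as on 3.11 static-pod masters:

# Sketch of the manual loglevel change expressed as Ansible tasks.
- name: Lower the master debug loglevel
  lineinfile:
    path: /etc/origin/master/master.env
    regexp: '^DEBUG_LOGLEVEL='
    line: 'DEBUG_LOGLEVEL=2'

- name: Restart the master API and controllers static pods
  command: "master-restart {{ item }}"
  loop:
    - api
    - controllers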

After that we ran the service catalog install playbook, and it completed fine this time.

Also, all the projects that were stuck in the Terminating state cleared on their own without any further action.

I think we can close this bug as NOTABUG.

Comment 23 Jay Boyd 2019-01-31 13:00:41 UTC
Service Catalog is not dependent on metrics.

However, it appears to be an aggregated API failure. The SC controller manager attempts to get the list of APIs from the cluster API server to ensure that the Service Catalog APIs were properly registered by the SC API server. The master API server fails, saying that it (again, the master API) does not have connectivity to metrics. This causes the SC Controller Manager to error out, saying it is unable to get the list of APIs.

Correcting the metrics networking issue allows aggregated APIs to function properly, which in turn lets the Service Catalog controller manager verify that the service catalog APIs were registered.

Comment 27 Miranda Shutt 2019-02-14 17:27:37 UTC
Wait wait, how is this NOTABUG?  Basically, if the DEBUG_LOGLEVEL in master.env is 9, this is broken...

Comment 28 Miranda Shutt 2019-02-14 17:28:19 UTC
Just to be clear, we had the same problem with our 3.10 to 3.11 upgrade, and we're using `ovs-subnet`. It seems more related to the LOGLEVEL than to any other issue found.

Comment 29 Scott Dodson 2019-02-14 18:31:21 UTC
In this instance the cluster was found to have a broken DNS configuration, which triggered the behavior. The topic was discussed more thoroughly in private comments, which cannot be made public.

If you're encountering something similar, please raise a new case and bug with support.

