Issue with the openshift-ovs-multitenant network plugin while installing the service catalog: it fails to create the kube-service-catalog netnamespace.

Playbook: playbooks/openshift-service-catalog/config.yml

It fails on the task below.

TASK [openshift_service_catalog : Waiting for netnamespace kube-service-catalog to be ready] **********************************************************
FAILED - RETRYING: Waiting for netnamespace kube-service-catalog to be ready (30 retries left). Result was: {
    "attempts": 1,
    "changed": false,
    "invocation": {
        "module_args": {
            "all_namespaces": null,
            "content": null,
            "debug": false,
            "delete_after": false,
            "field_selector": null,
            "files": null,
            "force": false,
            "kind": "netnamespace",
            "kubeconfig": "/etc/origin/master/admin.kubeconfig",
            "name": "kube-service-catalog",
            "namespace": "default",
            "selector": null,
            "state": "list"
        }
    },
    "results": {
        "cmd": "/bin/oc get netnamespace kube-service-catalog -o json -n default",
        "results": [
            {}
        ],
        "returncode": 0,
        "stderr": "Error from server (NotFound): netnamespaces.network.openshift.io \"kube-service-catalog\" not found\n",
        "stdout": ""
    },
    "retries": 31,
    "state": "list"
}
OpenShift version: 3.11.51
Here's the code that creates the project/namespace and then, if multitenant is in use, waits for the netnamespace: https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_service_catalog/tasks/install.yml#L8-L29
This is being run after the initial cluster installation, so the cluster may be under load and it may take longer than 30 seconds for the creation to complete. Please update the delay value to 10 and try again. If that doesn't resolve it, we need to gather controller logs to debug why the netnamespace isn't being created.
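For reference, here is a minimal sketch of that wait task with the suggested longer delay. It assumes the oc_obj-based lookup shown in the output above; the register variable name and the exact until condition are illustrative, so keep whatever the role in your checkout actually uses and only raise the delay:

# Hedged sketch only -- match it against the real task in
# roles/openshift_service_catalog/tasks/install.yml before editing.
- name: Waiting for netnamespace kube-service-catalog to be ready
  oc_obj:
    kind: netnamespace
    name: kube-service-catalog
    state: list
  register: get_netnamespace   # illustrative variable name
  # Keep retrying until the lookup returns a non-empty object.
  until: get_netnamespace.results.returncode == 0 and get_netnamespace.results.results[0] | length > 0
  retries: 30
  delay: 10   # suggested value; 30 retries x 10s gives the cluster roughly 5 minutes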
Re comment #15, the install failing at [template_service_broker : Register TSB with broker]: this is a known issue: https://bugzilla.redhat.com/show_bug.cgi?id=1661569

A simple workaround: in this maintenance release we added liveness & readiness probes on the Service Catalog pods, and you could try removing them (perhaps the pods are being restarted frequently; that's a guess, as there are no diagnostics here to suggest it). That is, in openshift-ansible/roles/openshift_service_catalog/templates/api_server.j2, locate readinessProbe and livenessProbe, delete those sections, and then run the playbook again. There are similar probes in the controller_manager template, but if that pod were being restarted it wouldn't cause this issue.
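If it helps, here is an illustrative (not exact) example of the kind of sections to delete from the apiserver container spec in that template; the probe paths, ports, and thresholds below are placeholders, so remove whatever readinessProbe and livenessProbe blocks your copy of api_server.j2 actually contains:

        # Illustrative placeholders only -- the real template's paths, ports,
        # and thresholds may differ. Remove both blocks in their entirety.
        readinessProbe:
          httpGet:
            path: /healthz
            port: 6443
            scheme: HTTPS
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /healthz
            port: 6443
            scheme: HTTPS
          failureThreshold: 3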
Perhaps comment #19 was too brief on the explanation: the real fix adds a wait for the Service Catalog API Server and Controller Manager rollouts to finish before proceeding to install any brokers. That fix is in the maintenance release due out any moment. The workaround removes the health check probes, which were recently added. The issue you are hitting in comment #15 is caused by timing.

Additionally, in the code you are running, the Service Catalog health and readiness checks are combined into a single probe, which may cause unnecessary pod restarts. I advise removing the health checks for now. This has been addressed and corrected in the next maintenance release (https://bugzilla.redhat.com/show_bug.cgi?id=1648458).

I don't have any insight into the issue with failing to create the netnamespace; updating the wait delay as advised in comment #6 makes sense.
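To make the "real fix" a bit more concrete, here is a rough sketch of the kind of rollout wait it adds before any broker is registered; the task names, the daemonset resource names, and the use of the command module here are my illustration, not the actual patch:

# Rough sketch of the idea only, not the actual change in the maintenance
# release: block until both Service Catalog components have rolled out
# before the broker registration tasks run.
- name: Wait for the Service Catalog API server rollout to finish
  command: oc rollout status daemonset/apiserver -n kube-service-catalog --watch=false
  register: apiserver_rollout   # illustrative name
  until: "'successfully rolled out' in apiserver_rollout.stdout"
  retries: 30
  delay: 10

- name: Wait for the Service Catalog controller manager rollout to finish
  command: oc rollout status daemonset/controller-manager -n kube-service-catalog --watch=false
  register: controller_rollout   # illustrative name
  until: "'successfully rolled out' in controller_rollout.stdout"
  retries: 30
  delay: 10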
(In reply to Jay Boyd from comment #20)
> Perhaps comment #19 was too brief on the explanation: the real fix adds a
> wait for the Service Catalog API Server and Controller Manager rollouts to
> finish before proceeding to install any brokers. That fix is in the
> maintenance release due out any moment. The workaround removes the health
> check probes, which were recently added. The issue you are hitting in
> comment #15 is caused by timing.
>
> Additionally, in the code you are running, the Service Catalog health and
> readiness checks are combined into a single probe, which may cause
> unnecessary pod restarts. I advise removing the health checks for now.
> This has been addressed and corrected in the next maintenance release
> (https://bugzilla.redhat.com/show_bug.cgi?id=1648458).
>
> I don't have any insight into the issue with failing to create the
> netnamespace; updating the wait delay as advised in comment #6 makes sense.

We still faced the same issue after removing the liveness and readiness probes. The issue was with the network.

controller_manager.go:232] error running controllers: failed to get api versions from server: failed to get supported resources from server: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request, servicecatalog.k8s.io/v1beta1: an error on the server ("unable to set dialer for kube-service-catalog/apiserver as rest transport is of type *transport.debuggingRoundTripper") has prevented the request from succeeding

We checked the metrics pod and saw "no route to host" errors while it was accessing the kubernetes service. We then tried to access the service from the node where the pod was running, and it worked fine. We learned that there had earlier been network issues between the master and compute nodes because firewall ports were not set up properly; those were later corrected. It may be that the network changes were never picked up by the metrics pod, so it kept failing to reach the kubernetes service. We restarted the metrics pod and this time it came up fine without errors, but while accessing the API service we were still getting the error below.

("unable to set dialer for kube-service-catalog/apiserver as rest transport is of type *transport.debuggingRoundTripper")

We found this may be related to the increased loglevel, so we reduced the loglevel to 2 and restarted the master API and controller services, and then we were able to access the apiservice. After that we ran the service catalog install playbook and it completed fine this time. Also, all the projects that were stuck in the Terminating state were cleared on their own without our taking any action.

I think we can close this bug as NOTABUG.
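For the record, roughly the loglevel change we made, expressed as a hedged Ansible sketch; the master.env path and the master-restart invocation are assumptions about a standard 3.11 master, so verify them on your own hosts:

# Hedged sketch of the change described above; check the file path and the
# restart helper on your masters before using.
- name: Lower the master loglevel to 2
  lineinfile:
    path: /etc/origin/master/master.env
    regexp: '^DEBUG_LOGLEVEL='
    line: 'DEBUG_LOGLEVEL=2'

- name: Restart the master API static pod
  command: /usr/local/bin/master-restart api

- name: Restart the master controllers static pod
  command: /usr/local/bin/master-restart controllers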
Service Catalog is not dependent on metrics. However, this appears to be an aggregated API failure. The SC controller manager attempts to get the list of APIs from the cluster API server to ensure that the Service Catalog APIs were properly registered by the SC API server. The master API server fails, saying that it (again, the master API) does not have connectivity to metrics. This causes the SC controller manager to error out, saying it is unable to get the list of APIs. Correcting the metrics networking issue enables aggregated APIs to function properly, thus enabling the Service Catalog controller manager to verify that the Service Catalog APIs were registered.
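For anyone who wants to confirm this chain on their own cluster, a hedged diagnostic sketch; v1beta1.servicecatalog.k8s.io is the standard APIService name for Service Catalog, but adjust it if your cluster differs:

# Hedged diagnostic sketch: dump the aggregated APIService and inspect
# status.conditions -- the Available condition should read "True" once the
# master API can reach the Service Catalog apiserver endpoints.
- name: Check the Service Catalog aggregated APIService
  command: oc get apiservice v1beta1.servicecatalog.k8s.io -o yaml
  register: sc_apiservice
  changed_when: false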
Wait wait, how is this NOTABUG? Basically, if the DEBUG_LOGLEVEL in master.env is 9, this is broken...
Just to be clear, we had the same problem with our 3.10 to 3.11 upgrade, and we're using `ovs-subnet`. It seems related more to the LOGLEVEL than to any other issue found.
In this instance the cluster was found to have a broken DNS configuration, which triggered the behavior. The topic was discussed more thoroughly in private comments that cannot be made public. If you're encountering something similar, please raise a new case and bug with support.