Description of problem:
When batch-deleting projects, there is a chance that a project gets stuck in Terminating status and cannot be deleted anymore. The same condition then also applies to newly created projects.

Version-Release number of selected component (if applicable):
v3.11.0-0.25.0

How reproducible:
Not sure

Steps to Reproduce:
1. Set up a multi-node OCP cluster with metrics and service catalog enabled
2. Try to delete the projects after the cluster is running
# oc delete project kube-service-catalog openshift-infra openshift-monitoring openshift-metrics-server
project.project.openshift.io "kube-service-catalog" deleted
project.project.openshift.io "openshift-monitoring" deleted
project.project.openshift.io "openshift-metrics-server" deleted
Error from server (Forbidden): namespaces "openshift-infra" is forbidden: this namespace may not be deleted
3. Create a new project with a pod in it
4. Delete the newly created project
5. Check the project list on the cluster

Actual results:
The projects get stuck in Terminating status and cannot be deleted anymore.

# oc get project
NAME                                DISPLAY NAME   STATUS
b6lcr                                              Terminating
default                                            Active
gug4h                                              Terminating
kube-public                                        Active
kube-service-catalog                               Terminating
kube-system                                        Active
management-infra                                   Active
openshift                                          Active
openshift-console                                  Active
openshift-infra                                    Active
openshift-logging                                  Active
openshift-metrics-server                           Terminating
openshift-monitoring                               Terminating
openshift-node                                     Active
openshift-sdn                                      Active
openshift-template-service-broker                  Active
openshift-web-console                              Active
operator-lifecycle-manager                         Active

# oc delete project gug4h --grace-period=0 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
Error from server (Conflict): Operation cannot be fulfilled on namespaces "gug4h": The system is ensuring all content is removed from this namespace. Upon completion, this namespace will automatically be purged by the system.

The following errors were found in the master log:
E0904 09:58:51.428331       1 controller.go:111] loading OpenAPI spec for "v1beta1.servicecatalog.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
E0904 09:58:52.428346       1 controller.go:111] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
E0904 09:58:59.518485       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 09:58:59.519317       1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 09:59:09.565352       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 09:59:09.566307       1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 09:59:19.608121       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 09:59:19.609055       1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 09:59:29.650503       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 09:59:29.651441       1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 09:59:39.694453       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 09:59:39.695271       1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 09:59:49.735241       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 09:59:49.736329       1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 09:59:59.778107       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 09:59:59.779072       1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 10:00:09.820465       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 10:00:09.821296       1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 10:00:19.863278       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 10:00:19.864219       1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 10:00:29.906536       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 10:00:29.907540       1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 10:00:39.949135       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 10:00:39.950163       1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 10:00:49.994807       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 10:00:49.996372       1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request

Expected results:
Should be able to delete the projects.

Additional info:
After running the following commands, the cluster came back to normal:
[root@qe-bmeng-311-master-etcd-nfs-001 ~]# oc get apiservice v1beta1.metrics.k8s.io -o yaml > v1beta1.metrics.k8s.io.apiservice
[root@qe-bmeng-311-master-etcd-nfs-001 ~]# oc get apiservice v1beta1.servicecatalog.k8s.io -o yaml > v1beta1.servicecatalog.k8s.io.apiservice
[root@qe-bmeng-311-master-etcd-nfs-001 ~]# oc delete apiservice v1beta1.metrics.k8s.io v1beta1.servicecatalog.k8s.io
apiservice.apiregistration.k8s.io "v1beta1.metrics.k8s.io" deleted
apiservice.apiregistration.k8s.io "v1beta1.servicecatalog.k8s.io" deleted
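As an additional diagnostic sketch (not something that was run on this cluster): before deleting the APIService objects as in the workaround above, it can help to check why the aggregator marks them unavailable. The jsonpath expressions below are only illustrative and may need adjusting per cluster:

# oc get apiservices
# oc get apiservice v1beta1.metrics.k8s.io -o jsonpath='{.status.conditions[?(@.type=="Available")].message}{"\n"}'
# oc get apiservice v1beta1.servicecatalog.k8s.io -o jsonpath='{.status.conditions[?(@.type=="Available")].message}{"\n"}'

The Available condition message should indicate whether the backing service or its endpoints are missing, which here is expected after the hosting namespaces were deleted.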
Created attachment 1480742 [details] full master log
Created attachment 1480743 [details] v1beta1.metrics.k8s.io.apiservice
Created attachment 1480744 [details] v1beta1.servicecatalog.k8s.io.apiservice
> Target Release: --- → 4.0.0
Customers may run into a situation in which other apiservers are not available (like https://bugzilla.redhat.com/show_bug.cgi?id=1623108#c0), but they may not know to delete the apiservices. Thus this bug could be used for tracking 3.11 OCP, and IMO it needs to be addressed in 3.11.0 to tolerate other apiservers' failures.
(In reply to Xingxing Xia from comment #4)
> > Target Release: --- → 4.0.0
> Customers may run into a situation in which other apiservers are not available
> (like https://bugzilla.redhat.com/show_bug.cgi?id=1623108#c0), but they may
> not know to delete the apiservices.
> Thus this bug could be used for tracking 3.11 OCP, and IMO it needs to be
> addressed in 3.11.0 to tolerate other apiservers' failures.

Customers can set up monitoring of apiservices and alert on/fix the failing API server if possible. There is not much we can do when this happens. We can't tolerate an unreachable API server for many reasons (e.g. GC won't work properly without being able to delete all orphaned resources, etc.). The fix for this BZ should be to update the documentation with the commands to run to check that all API servers are up and running (oc wait), and additionally to plumb all the places in the installer that depend on a fully working API server to wait. This is not a 3.11 blocker (I believe the behavior was the same in 3.10; we just moved to static pods).
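A minimal sketch of the kind of oc wait check suggested above, assuming the client supports --all and the Available condition on APIService objects (older 3.x clients may differ):

$ oc wait --for=condition=Available apiservice --all --timeout=120s

If any aggregated API server is down, the command times out and names the unavailable APIService, which is the signal to fix (or file a bug against) that component before deleting projects.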
Why would you manually delete a thing that the ansible installer put in place? That doesn't seem like a good idea.
(In reply to David Eads from comment #6)
> Why would you manually delete a thing that the ansible installer put in
> place? That doesn't seem like a good idea.

Even without manual deletion, when some apiservice becomes problematic for whatever reason, the problem will occur. Today the symptom described in this bug was found on a nextgen 4.0 env in the following situation:

$ oc get apiservices -o=custom-columns="name:.metadata.name,status:.status.conditions[0].status"
name                             status
...
v1beta1.metrics.k8s.io           False
...
v2beta1.autoscaling              True

Updating bug fields to highlight this.
This is expected behaviour by the control plane: if aggregated apiservers are down, we cannot safely delete namespaces, as only the aggregated apiserver knows how to clean its objects up in etcd. Imagine we ignored that: namespace "foo" is deleted successfully, then a user recreates "foo", and meanwhile the apiserver comes back. The user would then see old objects in "foo", maybe objects they did not even create themselves. That would be a security issue.
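To make the mechanism above concrete, a hedged diagnostic sketch (not from the original report): the namespace controller has to enumerate every API group to prove a namespace is empty, and the same enumeration can be done by hand for a stuck namespace (the namespace name is a placeholder):

$ oc api-resources --verbs=list --namespaced -o name \
    | xargs -n 1 oc get --show-kind --ignore-not-found -n <stuck-namespace>

When an aggregated apiserver is down, this listing fails for its API groups, which is exactly why the controller cannot declare the namespace empty and finish the deletion.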
*** Bug 1670994 has been marked as a duplicate of this bug. ***
The root causes of bug #1670994 and this one seem to be different. Bug 1670994 was caused by a Prometheus adapter TLS rotation issue; it should actually be marked as a duplicate of a different BZ. I will try to get bz1670994 corrected.
Same issue under a multitenant env:

$ oc get clusternetwork
NAME      CLUSTER NETWORK   SERVICE NETWORK   PLUGIN NAME
default   10.128.0.0/14     172.30.0.0/16     redhat/openshift-ovs-multitenant

$ oc get ns | grep Terminating
openshift-monitoring   Terminating   4h55m
test34                 Terminating   3h49m

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-25-234632   True        False         4h43m   Cluster version is 4.0.0-0.nightly-2019-02-25-234632

Actually, there is nothing left in the Terminating namespaces:

$ oc get all -n test34
No resources found.

$ oc get all -n openshift-monitoring
No resources found.
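As a hedged follow-up sketch (an assumption, not part of the comment above): oc get all only covers a subset of resource types, and even an apparently empty namespace keeps the kubernetes finalizer until the controller can verify every API group. Something like the following jsonpath (illustrative only) shows the finalizer and deletion timestamp on one of the stuck namespaces:

$ oc get ns test34 -o jsonpath='{.spec.finalizers}{"\n"}{.metadata.deletionTimestamp}{"\n"}'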
*** Bug 1671600 has been marked as a duplicate of this bug. ***
Per comment 16 and comment 19, this behavior is expected and there is no fix. When it happens again, bugs should be filed against the apiservices (components) that are the root cause, e.g. bug 1668632 and bug 1679511. This bug isn't seen in recent builds, so moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758