Bug 1625194

Summary: Projects get stuck in Terminating status for a long time [comment 0 actually Not A Bug per comment 25]
Product: OpenShift Container Platform Reporter: Meng Bo <bmeng>
Component: Master    Assignee: Michal Fojtik <mfojtik>
Status: CLOSED ERRATA QA Contact: Xingxing Xia <xxia>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 3.11.0    CC: akrzos, aos-bugs, deads, decarr, jiazha, jokerman, jrosenta, juzhao, mfojtik, mifiedle, mjahangi, mmccomas, sponnaga, sttts, wmeng, wsun
Target Milestone: ---   
Target Release: 4.1.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: aos-scalability-40
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-06-04 10:40:34 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1664187    
Attachments (Description, Flags):
full master log (none)
v1beta1.metrics.k8s.io.apiservice (none)
v1beta1.servicecatalog.k8s.io.apiservice (none)

Description Meng Bo 2018-09-04 11:10:49 UTC
Description of problem:
When deleting projects in a batch, there is a chance that a project gets stuck in Terminating status and cannot be deleted anymore. The same condition also applies to newly created projects.

Version-Release number of selected component (if applicable):
v3.11.0-0.25.0

How reproducible:
not sure

Steps to Reproduce:
1. Setup multi node OCP cluster with metrics and service catalog enabled
2. Try to delete the projects once the cluster is running
# oc delete project kube-service-catalog openshift-infra openshift-monitoring openshift-metrics-server
project.project.openshift.io "kube-service-catalog" deleted
project.project.openshift.io "openshift-monitoring" deleted
project.project.openshift.io "openshift-metrics-server" deleted
Error from server (Forbidden): namespaces "openshift-infra" is forbidden: this namespace may not be deleted

3. Create a new project with a pod in it
4. Delete the newly created project
5. Check the project list on the cluster

Actual results:
The projects get stuck in Terminating status and cannot be deleted anymore

# oc get project
NAME                                DISPLAY NAME   STATUS
b6lcr                                              Terminating
default                                            Active
gug4h                                              Terminating
kube-public                                        Active
kube-service-catalog                               Terminating
kube-system                                        Active
management-infra                                   Active
openshift                                          Active
openshift-console                                  Active
openshift-infra                                    Active
openshift-logging                                  Active
openshift-metrics-server                           Terminating
openshift-monitoring                               Terminating
openshift-node                                     Active
openshift-sdn                                      Active
openshift-template-service-broker                  Active
openshift-web-console                              Active
operator-lifecycle-manager                         Active

# oc delete project gug4h --grace-period=0 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
Error from server (Conflict): Operation cannot be fulfilled on namespaces "gug4h": The system is ensuring all content is removed from this namespace.  Upon completion, this namespace will automatically be purged by the system.


The following errors were found in the master log:
E0904 09:58:51.428331       1 controller.go:111] loading OpenAPI spec for "v1beta1.servicecatalog.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
E0904 09:58:52.428346       1 controller.go:111] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
E0904 09:58:59.518485       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 09:58:59.519317       1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 09:59:09.565352       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 09:59:09.566307       1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 09:59:19.608121       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 09:59:19.609055       1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 09:59:29.650503       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 09:59:29.651441       1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 09:59:39.694453       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 09:59:39.695271       1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 09:59:49.735241       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 09:59:49.736329       1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 09:59:59.778107       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 09:59:59.779072       1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 10:00:09.820465       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 10:00:09.821296       1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 10:00:19.863278       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 10:00:19.864219       1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 10:00:29.906536       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 10:00:29.907540       1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 10:00:39.949135       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 10:00:39.950163       1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 10:00:49.994807       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0904 10:00:49.996372       1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request


Expected results:
Should be able to delete the projects.



Additional info:
After the following steps, the cluster came back to normal:
    [root@qe-bmeng-311-master-etcd-nfs-001 ~]# oc get apiservice v1beta1.metrics.k8s.io -o yaml > v1beta1.metrics.k8s.io.apiservice
    [root@qe-bmeng-311-master-etcd-nfs-001 ~]# oc get apiservice v1beta1.servicecatalog.k8s.io -o yaml > v1beta1.servicecatalog.k8s.io.apiservice
    [root@qe-bmeng-311-master-etcd-nfs-001 ~]# oc delete apiservice v1beta1.metrics.k8s.io v1beta1.servicecatalog.k8s.io
    apiservice.apiregistration.k8s.io "v1beta1.metrics.k8s.io" deleted
    apiservice.apiregistration.k8s.io "v1beta1.servicecatalog.k8s.io" deleted
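
If the metrics and service catalog services become healthy again later, the saved manifests could presumably be used to re-create the deleted APIService objects; cluster-populated fields such as status, resourceVersion and uid in the exported YAML may need to be stripped first:
    $ oc create -f v1beta1.metrics.k8s.io.apiservice
    $ oc create -f v1beta1.servicecatalog.k8s.io.apiservice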

Comment 1 Meng Bo 2018-09-04 11:12:05 UTC
Created attachment 1480742 [details]
full master log

Comment 2 Meng Bo 2018-09-04 11:17:36 UTC
Created attachment 1480743 [details]
v1beta1.metrics.k8s.io.apiservice

Comment 3 Meng Bo 2018-09-04 11:17:55 UTC
Created attachment 1480744 [details]
v1beta1.servicecatalog.k8s.io.apiservice

Comment 4 Xingxing Xia 2018-09-05 09:19:45 UTC
> Target Release: --- → 4.0.0
Customers may run into a situation in which other apiservers are not available (like https://bugzilla.redhat.com/show_bug.cgi?id=1623108#c0), but they may not know to delete the apiservices.
Thus this bug could be used for tracking 3.11 OCP, and IMO it needs to be addressed in 3.11.0 to tolerate other apiservers' failure.

Comment 5 Michal Fojtik 2018-09-10 11:17:51 UTC
(In reply to Xingxing Xia from comment #4)
> > Target Release: --- → 4.0.0
> Customers may run into a situation in which other apiservers are not
> available (like https://bugzilla.redhat.com/show_bug.cgi?id=1623108#c0),
> but they may not know to delete the apiservices.
> Thus this bug could be used for tracking 3.11 OCP, and IMO it needs to be
> addressed in 3.11.0 to tolerate other apiservers' failure.

Customers can set up monitoring of apiservices and alert on / fix the failing API server if possible. There is not much we can do when this happens. We can't tolerate an unreachable API server, for many reasons (e.g. GC won't work properly without being able to delete all orphaned resources, etc.).

The fix for this BZ should be updating the documentation with the commands to run to check that all API servers are up and running (oc wait), and additionally plumbing all the places in the installer that depend on a fully working API server so that they wait.

This is not a 3.11 blocker (I believe the behavior was the same in 3.10; we just moved to static pods).
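
For reference, a minimal sketch of such a readiness check, assuming the standard Available condition on APIService objects:
    $ oc wait --for=condition=Available apiservice/v1beta1.metrics.k8s.io --timeout=120s
    $ oc wait --for=condition=Available apiservice/v1beta1.servicecatalog.k8s.io --timeout=120s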

Comment 6 David Eads 2018-09-10 11:58:39 UTC
Why would you manually delete a thing that the ansible installer put in place?  That doesn't seem like a good idea.

Comment 9 Xingxing Xia 2019-01-21 02:46:55 UTC
(In reply to David Eads from comment #6)
> Why would you manually delete a thing that the ansible installer put in
> place?  That doesn't seem like a good idea.
Even without manual deletion, when some apiservice becomes problematic for whatever reason, the problem will occur. Today the behavior described in this bug was found on a next-gen 4.0 env in the following situation:
    $  oc get apiservices -o=custom-columns="name:.metadata.name,status:.status.conditions[0].status"
    name                                                   status
    ...
    v1beta1.metrics.k8s.io                                 False
    ...
    v2beta1.autoscaling                                    True

Updating bug fields to highlight.

Comment 11 Stefan Schimanski 2019-01-21 09:29:46 UTC
This is expected behaviour by the control plane: if aggregated apiservers are down, we cannot safely delete namespaces as only the aggregated apiserver knows how to clean objects up in etcd.

Imagine we ignored that: namespace "foo" is deleted successfully. Then a user recreates "foo". Meanwhile the apiserver comes back. The user would then see old objects in "foo", maybe objects they did not even create themselves. That would be a security issue.
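
A quick diagnostic sketch, using names from this report, to see what the namespace controller is still waiting on: list the aggregated APIs that are not Available and inspect the stuck namespace itself.
    $ oc get apiservices -o=custom-columns="name:.metadata.name,status:.status.conditions[0].status" | grep -i false
    $ oc get namespace gug4h -o yaml    # spec.finalizers / status.phase show it is still being finalized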

Comment 12 Junqi Zhao 2019-02-21 10:47:09 UTC
*** Bug 1670994 has been marked as a duplicate of this bug. ***

Comment 15 Mike Fiedler 2019-02-22 13:55:05 UTC
The root causes of bug 1670994 and this one seem to be different. Bug 1670994 was caused by a Prometheus adapter TLS rotation issue. It should actually be marked as a duplicate of a different BZ. I will try to get bug 1670994 corrected.

Comment 21 Junqi Zhao 2019-02-26 08:14:52 UTC
Same issue in a multitenant env
$ oc get clusternetwork
NAME      CLUSTER NETWORK   SERVICE NETWORK   PLUGIN NAME
default   10.128.0.0/14     172.30.0.0/16     redhat/openshift-ovs-multitenant

$ oc get ns | grep Terminating
openshift-monitoring                          Terminating   4h55m
test34                                        Terminating   3h49m

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-25-234632   True        False         4h43m   Cluster version is 4.0.0-0.nightly-2019-02-25-234632

Actually, there is nothing left in the Terminating namespaces:
$ oc get all -n test34
No resources found.

$ oc get all -n openshift-monitoring
No resources found.
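
Note that "oc get all" only covers a small, whitelisted set of resource types. A fuller check, assuming oc api-resources is available in this client version, would be to enumerate every namespaced, listable type:
    $ oc api-resources --verbs=list --namespaced -o name | xargs -n 1 oc get --show-kind --ignore-not-found -n test34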

Comment 24 Michal Fojtik 2019-03-07 09:58:46 UTC
*** Bug 1671600 has been marked as a duplicate of this bug. ***

Comment 25 Xingxing Xia 2019-03-11 01:59:37 UTC
Per comment 16 and comment 19, the behavior reported in this bug is expected and there is no fix. When it happens again, bugs should be filed against the other apiservices (components) that are the root cause, e.g. bug 1668632 and bug 1679511.
This bug isn't seen in recent builds, so moving to VERIFIED.

Comment 28 errata-xmlrpc 2019-06-04 10:40:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758