Bug 1625194
| Field | Value |
|---|---|
| Summary | Projects get stuck in Terminating status for long time [comment 0 actually Not A Bug per comment 25] |
| Product | OpenShift Container Platform |
| Component | Master |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | urgent |
| Version | 3.11.0 |
| Target Milestone | --- |
| Target Release | 4.1.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | aos-scalability-40 |
| Reporter | Meng Bo <bmeng> |
| Assignee | Michal Fojtik <mfojtik> |
| QA Contact | Xingxing Xia <xxia> |
| CC | akrzos, aos-bugs, deads, decarr, jiazha, jokerman, jrosenta, juzhao, mfojtik, mifiedle, mjahangi, mmccomas, sponnaga, sttts, wmeng, wsun |
| Doc Type | If docs needed, set a value |
| Story Points | --- |
| Last Closed | 2019-06-04 10:40:34 UTC |
| Type | Bug |
| Regression | --- |
| Bug Blocks | 1664187 |
Description

Meng Bo 2018-09-04 11:10:49 UTC

Created attachment 1480742 [details]: full master log
Created attachment 1480743 [details]: v1beta1.metrics.k8s.io.apiservice
Created attachment 1480744 [details]: v1beta1.servicecatalog.k8s.io.apiservice
> Target Release: --- → 4.0.0

Customers may run into a situation in which other apiservers are not available (like https://bugzilla.redhat.com/show_bug.cgi?id=1623108#c0), but they may not know to delete apiservices. Thus this bug could be used for tracking 3.11 OCP, and IMO it needs to be addressed in 3.11.0 to tolerate other apiservers' failure.

(In reply to Xingxing Xia from comment #4)
> Customers may run into a situation in which other apiservers are not available
> (like https://bugzilla.redhat.com/show_bug.cgi?id=1623108#c0), but they may
> not know to delete apiservices. Thus this bug could be used for tracking
> 3.11 OCP, and IMO it needs to be addressed in 3.11.0 to tolerate other
> apiservers' failure.

Customers can set up monitoring of apiservices and alert on, and if possible fix, the failing API server. There is not much we can do when this happens. We can't tolerate an unreachable API server, for many reasons (e.g. GC won't work properly without being able to delete all orphan resources, etc.). The fix for this BZ should be to update the documentation with the commands to run to check that all API servers are up and running (`oc wait`), and additionally to plumb all places in the installer that depend on a fully working API server to wait. This is not a 3.11 blocker (I believe the behavior was the same in 3.10; we just moved to static pods).

Why would you manually delete a thing that the ansible installer put in place? That doesn't seem like a good idea.

(In reply to David Eads from comment #6)
> Why would you manually delete a thing that the ansible installer put in
> place? That doesn't seem like a good idea.

Even without manual deletion, when some apiservice becomes problematic for whatever reason, the problem will occur. Today the symptom in this bug's description was seen on a nextgen 4.0 env in the following situation:

```
$ oc get apiservices -o=custom-columns="name:.metadata.name,status:.status.conditions[0].status"
name                     status
...
v1beta1.metrics.k8s.io   False
...
v2beta1.autoscaling      True
```

Updating bug fields to highlight.

This is expected behaviour by the control plane: if aggregated apiservers are down, we cannot safely delete namespaces, as only the aggregated apiserver knows how to clean its objects up in etcd. Imagine we ignored that: namespace "foo" is deleted successfully, then a user recreates "foo", and meanwhile the apiserver comes back. The user would then see old objects in "foo", maybe objects they had not even created themselves. That would be a security issue.

*** Bug 1670994 has been marked as a duplicate of this bug. ***

The root cause of bug #1670994 and this one seem to be different. 1670994 was caused by a prometheus adapter TLS rotation issue. It should actually be marked as a duplicate of a different BZ. I will try to get bz1670994 corrected.

Same issue under a multitenant env:

```
$ oc get clusternetwork
NAME      CLUSTER NETWORK   SERVICE NETWORK   PLUGIN NAME
default   10.128.0.0/14     172.30.0.0/16     redhat/openshift-ovs-multitenant

$ oc get ns | grep Terminating
openshift-monitoring   Terminating   4h55m
test34                 Terminating   3h49m

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-25-234632   True        False         4h43m   Cluster version is 4.0.0-0.nightly-2019-02-25-234632
```

Actually, there is nothing in the Terminating namespaces:

```
$ oc get all -n test34
No resources found.

$ oc get all -n openshift-monitoring
No resources found.
```

*** Bug 1671600 has been marked as a duplicate of this bug. ***

Per comment 16 and comment 19, this bug is expected behaviour and there is no fix. When it happens again, bugs should be filed against the other apiservices (components) that are the root cause, e.g. bug 1668632 and bug 1679511. This bug isn't seen in recent builds, so moving to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
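The checks discussed in the comments (listing APIService availability with custom columns, and the `oc wait` gate proposed for documentation and installer use) can be combined into a short diagnostic. This is a sketch, not an official procedure: it assumes an authenticated `oc` session against a running cluster, and the 120s timeout is an arbitrary illustrative value.

```shell
# Show every APIService with its Available condition, as in the output
# captured in the comments above. A False here usually explains namespaces
# stuck in Terminating, since the control plane cannot finalize a namespace
# while an aggregated apiserver is unreachable.
oc get apiservices \
  -o=custom-columns="NAME:.metadata.name,AVAILABLE:.status.conditions[0].status"

# Block until all APIServices report Available=True, or fail after the
# timeout. This is the "oc wait" style check suggested in the thread for
# scripts and installer steps that need a fully working API server.
oc wait apiservice --all --for=condition=Available --timeout=120s

# List only the namespaces currently stuck in Terminating.
oc get namespaces --field-selector=status.phase=Terminating
```

If `oc wait` fails, the APIService it names points at the component to investigate (and, per the comments, the bug to file), rather than deleting the APIService object the installer put in place.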