Bug 1670994
| Summary: | Projects stuck in terminating state for overnight clusters | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Alex Krzos <akrzos> |
| Component: | Monitoring | Assignee: | Frederic Branczyk <fbranczy> |
| Status: | CLOSED ERRATA | QA Contact: | Mike Fiedler <mifiedle> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 4.1.0 | CC: | akamra, aos-bugs, dhansen, dwhatley, erich, hongkliu, jeder, jiazha, jokerman, jtaleric, juzhao, lserven, mifiedle, mloibl, mmccomas, ncredi, nelluri, schoudha, surbania, wabouham, wking, wsun, xtian |
| Target Milestone: | --- | Keywords: | Reopened, TestBlocker |
| Target Release: | 4.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | aos-scalability-40 | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-06-04 10:42:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Alex Krzos
2019-01-30 13:35:59 UTC
Seeing projects stuck in the Terminating state for every new project created and deleted, on my 4.0 RHCOS cluster (3 master, 3 worker nodes):

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-01-29-025207   True        False         1d      Cluster version is 4.0.0-0.nightly-2019-01-29-025207

"image": "registry.svc.ci.openshift.org/ocp/release@sha256:aa2c0365957e6c7733fc3dfd21d9f06b95e7664b325620a19becfc5a665caf68",
"version": "4.0.0-0.nightly-2019-01-29-025207"

One project took longer than 24 hours to terminate.

We're also seeing this when running openshift-tests run kubernetes/conformance. oc get apiservices did not show any unavailable services.

Seeing the same issue with the latest build, 4.0.0-0.nightly-2019-01-30-145955. Adding the TestBlocker keyword - this blocks 4.0 reliability testing.

This is blocking OCP 4.0 large-scale testing on AWS as well.

This issue is forcing the migration-eng team to restart clusters every time we need to clean a namespace for testing purposes. A workaround or solution would be appreciated!

A workaround that seems to work for most is oc delete pod -n openshift-monitoring prometheus-adapter-<id> each time it wedges.

The workaround worked for me. My cluster was up for ~10 hours.

Hit the issue with the rook-ceph project: 2 out of 2. Tried the following; neither worked:
1. Delete the prometheus-adapter pod: # oc delete pod -n openshift-monitoring prometheus-adapter-76cc66755b-b4bs9
2. Reboot all nodes in the cluster.

I see a ton of the following in the logs:

```
E0131 04:28:08.119575       1 memcache.go:134] couldn't get resource list for metrics.k8s.io/v1beta1: Unauthorized
E0131 04:28:18.237959       1 memcache.go:134] couldn't get resource list for metrics.k8s.io/v1beta1: Unauthorized
E0131 04:28:28.344253       1 memcache.go:134] couldn't get resource list for metrics.k8s.io/v1beta1: Unauthorized
E0131 04:28:38.444677       1 memcache.go:134] couldn't get resource list for metrics.k8s.io/v1beta1: Unauthorized
E0131 04:28:48.538360       1 memcache.go:134] couldn't get resource list for metrics.k8s.io/v1beta1: Unauthorized
```
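For anyone triaging this on a live cluster, the commands below restate the diagnosis and workaround from the comments above as one sequence. This is a sketch, not a supported fix: the APIService name v1beta1.metrics.k8s.io is inferred from the Unauthorized errors against metrics.k8s.io/v1beta1, and prometheus-adapter-<id> is a placeholder for the actual pod name on the cluster.

```
# Confirm namespaces are stuck in Terminating.
oc get ns | grep Terminating

# Check the aggregated metrics API backed by prometheus-adapter; failed
# discovery of metrics.k8s.io/v1beta1 is what blocks namespace finalization.
oc get apiservice v1beta1.metrics.k8s.io

# Workaround from the thread: delete the wedged prometheus-adapter pod(s)
# so the deployment recreates them. <id> is a placeholder for the pod suffix.
oc -n openshift-monitoring get pods | grep prometheus-adapter
oc -n openshift-monitoring delete pod prometheus-adapter-<id>

# Watch the stuck namespaces drain once discovery succeeds again.
oc get ns -w | grep Terminating
```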
(In reply to Michal Fojtik from comment #11) Same reason as Bug 1674372; the workaround is still:

$ oc -n openshift-monitoring delete deploy prometheus-adapter

The bug will be fixed soon.

Still a lot of Terminating namespaces with:

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-20-194410   True        False         5h23m   Cluster version is 4.0.0-0.nightly-2019-02-20-194410

# oc get ns | grep Terminating
0fucs   Terminating   40m
0j7qq   Terminating   42m
0k-0c   Terminating   5m52s
13pfh   Terminating   34m
1m-ta   Terminating   29m
1psoa   Terminating   27m
2ukns   Terminating   5m6s
30r3w   Terminating   41m
3f2ga   Terminating   10m
3onv1   Terminating   16m
4m1h9   Terminating   44m
5b6m4   Terminating   17m
720y9   Terminating   52m
7duzi   Terminating   46m
7gj2b   Terminating   22m

# oc get ns | grep Terminating | wc -l
102

The workaround ($ oc -n openshift-monitoring delete deploy prometheus-adapter) does not help.

As this issue now lies with the master and is tracked by Bug 1625194, closing Bug 1670994 as a duplicate.

*** This bug has been marked as a duplicate of bug 1625194 ***

Not the Service Catalog issue. Verified on 4.0.0-0.nightly-2019-03-06-074438.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
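As a hypothetical way to verify a fixed build the same way the comments do (counting namespaces stuck in Terminating), a small polling loop such as the one below can be left running while test projects are created and deleted. The 30-second interval is an arbitrary choice for illustration and is not part of the original report.

```
#!/usr/bin/env bash
# Poll the number of namespaces stuck in Terminating. On a fixed build the
# count should drop back toward 0 shortly after test namespaces are deleted,
# rather than growing into the hundreds as reported above.
while true; do
    count=$(oc get ns --no-headers | awk '$2 == "Terminating"' | wc -l)
    echo "$(date -u +%H:%M:%S) Terminating namespaces: ${count}"
    sleep 30   # arbitrary polling interval
done
```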