Description of problem:
On longer-running clusters and tests, deleting a project from a testing cluster that has been left on overnight leaves the project stuck in the Terminating state.

Version-Release number of selected component (if applicable):
OCP 4.0, 4.0.0-0.nightly-2019-01-25-205123

How reproducible:
Always

Steps to Reproduce:
1. Build cluster
2. Deploy project with pods
3. Leave cluster overnight
4. Attempt to delete projects

Actual results / Example:
root@ip-172-31-32-21: ~/ocp-automation # oc get projects
NAME                            DISPLAY NAME   STATUS
controller                                     Active
default                                        Active
...
openshift-service-cert-signer                  Active
pbench                                         Active
uperf-1                                        Active

root@ip-172-31-32-21: ~/ocp-automation # oc delete project uperf-1
project.project.openshift.io "uperf-1" deleted
E0130 13:11:56.989423    2593 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=11, ErrCode=NO_ERROR, debug=""
E0130 13:12:36.280512    2593 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug=""

root@ip-172-31-32-21: ~/ocp-automation # oc get projects
NAME                            DISPLAY NAME   STATUS
controller                                     Active
default                                        Active
...
openshift-service-cert-signer                  Active
pbench                                         Active
uperf-1                                        Terminating

Expected results:
Project is deleted quickly.

Additional info:
The example cluster had been online for 22 hours.

root@ip-172-31-32-21: ~/ocp-automation # oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-13-144.us-west-2.compute.internal    Ready    master   22h   v1.12.4+50c2f2340a
ip-10-0-134-33.us-west-2.compute.internal    Ready    worker   21h   v1.12.4+50c2f2340a
ip-10-0-138-95.us-west-2.compute.internal    Ready    infra    21h   v1.12.4+50c2f2340a
ip-10-0-140-10.us-west-2.compute.internal    Ready    pbench   21h   v1.12.4+50c2f2340a
ip-10-0-144-56.us-west-2.compute.internal    Ready    infra    21h   v1.12.4+50c2f2340a
ip-10-0-152-20.us-west-2.compute.internal    Ready    worker   21h   v1.12.4+50c2f2340a
ip-10-0-167-221.us-west-2.compute.internal   Ready    infra    21h   v1.12.4+50c2f2340a
ip-10-0-168-150.us-west-2.compute.internal   Ready    worker   21h   v1.12.4+50c2f2340a
ip-10-0-19-77.us-west-2.compute.internal     Ready    master   22h   v1.12.4+50c2f2340a
ip-10-0-42-28.us-west-2.compute.internal     Ready    master   22h   v1.12.4+50c2f2340a
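For triage, a minimal sketch of how one might inspect a stuck project; the namespace name uperf-1 comes from the example above, and the exact finalizer and condition output will vary per cluster:

```
# Dump the namespace to inspect its remaining finalizers and status conditions
oc get namespace uperf-1 -o yaml

# Just the spec.finalizers list; the namespace controller must clear this
# before the namespace can actually be removed
oc get namespace uperf-1 -o jsonpath='{.spec.finalizers}{"\n"}'

# Check the aggregated API services, since namespace deletion requires the
# controller to enumerate every API group
oc get apiservices
```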
Seeing projects stuck in Terminating state for every new project created and deleted, on my 4.0 RHCOS cluster (3 master, 3 worker nodes):

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-01-29-025207   True        False         1d      Cluster version is 4.0.0-0.nightly-2019-01-29-025207

"image": "registry.svc.ci.openshift.org/ocp/release@sha256:aa2c0365957e6c7733fc3dfd21d9f06b95e7664b325620a19becfc5a665caf68",
"version": "4.0.0-0.nightly-2019-01-29-025207"

One project took longer than 24 hours to terminate.
We're also seeing this when running openshift-tests run kubernetes/conformance. oc get apiservices did not show any unavailable services.
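For reference, a hedged way to print the Available condition for each APIService explicitly (a sketch using a jsonpath filter; table output differs between client versions):

```
# Print each APIService name alongside its Available condition status
oc get apiservices -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Available")].status}{"\n"}{end}'
```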
Seeing the same issue with the latest build - 4.0.0-0.nightly-2019-01-30-145955.
Adding TestBlocker keyword - this blocks 4.0 reliability testing.
This is blocking OCP 4.0 large scale testing on AWS as well.
This issue is forcing the migration-eng team to restart clusters every time we need to clean up a namespace for testing. A workaround or solution would be appreciated!
A workaround that seems to work for most people is to run oc delete pod -n openshift-monitoring prometheus-adapter-<id> each time it wedges.
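A small sketch of the workaround that looks up the current pod name rather than hard-coding the <id> suffix (it assumes the pod name still begins with prometheus-adapter-):

```
# Delete whatever prometheus-adapter pod(s) currently exist; the deployment
# recreates them automatically
oc -n openshift-monitoring get pods \
  | awk '/^prometheus-adapter-/ {print $1}' \
  | xargs -r oc -n openshift-monitoring delete pod
```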
The workaround worked for me. My cluster was up for ~10 hours.
Hit the issue with the rook-ceph project: 2 out of 2. Tried the following; neither worked:
1. Delete the prometheus-adapter pod:
   # oc delete pod -n openshift-monitoring prometheus-adapter-76cc66755b-b4bs9
2. Reboot all nodes in the cluster.
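When neither step helps, a hedged way to see what is actually left inside the wedged namespace (rook-ceph here), which can point at the API group that is blocking deletion:

```
# List every namespaced resource type and dump any instances remaining in rook-ceph
oc api-resources --verbs=list --namespaced -o name \
  | xargs -n1 oc get -n rook-ceph --ignore-not-found --no-headers
```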
I see a ton of these in the logs:

```
E0131 04:28:08.119575       1 memcache.go:134] couldn't get resource list for metrics.k8s.io/v1beta1: Unauthorized
E0131 04:28:18.237959       1 memcache.go:134] couldn't get resource list for metrics.k8s.io/v1beta1: Unauthorized
E0131 04:28:28.344253       1 memcache.go:134] couldn't get resource list for metrics.k8s.io/v1beta1: Unauthorized
E0131 04:28:38.444677       1 memcache.go:134] couldn't get resource list for metrics.k8s.io/v1beta1: Unauthorized
E0131 04:28:48.538360       1 memcache.go:134] couldn't get resource list for metrics.k8s.io/v1beta1: Unauthorized
```
(In reply to Michal Fojtik from comment #11)
> I see a ton of these in the logs:
>
> ```
> E0131 04:28:08.119575       1 memcache.go:134] couldn't get resource list for metrics.k8s.io/v1beta1: Unauthorized
> E0131 04:28:18.237959       1 memcache.go:134] couldn't get resource list for metrics.k8s.io/v1beta1: Unauthorized
> E0131 04:28:28.344253       1 memcache.go:134] couldn't get resource list for metrics.k8s.io/v1beta1: Unauthorized
> E0131 04:28:38.444677       1 memcache.go:134] couldn't get resource list for metrics.k8s.io/v1beta1: Unauthorized
> E0131 04:28:48.538360       1 memcache.go:134] couldn't get resource list for metrics.k8s.io/v1beta1: Unauthorized
> ```

Same root cause as Bug 1674372; the workaround is still:

$ oc -n openshift-monitoring delete deploy prometheus-adapter

That bug will be fixed soon.
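To confirm that the metrics API is what is blocking namespace deletion, one could inspect the aggregated APIService directly (a sketch; the condition message will differ per cluster):

```
# Show the metrics APIService and its status conditions; if it is unavailable or
# rejecting requests, the namespace controller's discovery fails and namespaces
# remain in Terminating
oc get apiservice v1beta1.metrics.k8s.io -o yaml
```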
Still a lot of Terminating namespaces with:

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-20-194410   True        False         5h23m   Cluster version is 4.0.0-0.nightly-2019-02-20-194410

# oc get ns | grep Terminating
0fucs   Terminating   40m
0j7qq   Terminating   42m
0k-0c   Terminating   5m52s
13pfh   Terminating   34m
1m-ta   Terminating   29m
1psoa   Terminating   27m
2ukns   Terminating   5m6s
30r3w   Terminating   41m
3f2ga   Terminating   10m
3onv1   Terminating   16m
4m1h9   Terminating   44m
5b6m4   Terminating   17m
720y9   Terminating   52m
7duzi   Terminating   46m
7gj2b   Terminating   22m

# oc get ns | grep Terminating | wc -l
102

The workaround does not help:
$ oc -n openshift-monitoring delete deploy prometheus-adapter
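With this many stuck namespaces, a quick, hedged loop to list each Terminating namespace with its remaining finalizers:

```
# Print the name and spec.finalizers of every namespace reported as Terminating
for ns in $(oc get ns --no-headers | awk '$2 == "Terminating" {print $1}'); do
  echo "== ${ns}"
  oc get ns "${ns}" -o jsonpath='{.spec.finalizers}{"\n"}'
done
```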
As this issue is now related to the master component and is tracked by Bug 1625194, closing Bug 1670994 as a DUPLICATE.

*** This bug has been marked as a duplicate of bug 1625194 ***
Not the Service Catalog issue.
Verified on 4.0.0-0.nightly-2019-03-06-074438
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758