Bug 1670994
| Summary: | Projects stuck in terminating state for overnight clusters | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Alex Krzos <akrzos> |
| Component: | Monitoring | Assignee: | Frederic Branczyk <fbranczy> |
| Status: | CLOSED ERRATA | QA Contact: | Mike Fiedler <mifiedle> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 4.1.0 | CC: | akamra, aos-bugs, dhansen, dwhatley, erich, hongkliu, jeder, jiazha, jokerman, jtaleric, juzhao, lserven, mifiedle, mloibl, mmccomas, ncredi, nelluri, schoudha, surbania, wabouham, wking, wsun, xtian |
| Target Milestone: | --- | Keywords: | Reopened, TestBlocker |
| Target Release: | 4.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | aos-scalability-40 | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-06-04 10:42:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Alex Krzos
2019-01-30 13:35:59 UTC
Seeing projects stuck in the Terminating state for every new project created and deleted, on my 4.0 RHCOS cluster (3 master, 3 worker nodes):

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-01-29-025207   True        False         1d      Cluster version is 4.0.0-0.nightly-2019-01-29-025207

"image": "registry.svc.ci.openshift.org/ocp/release@sha256:aa2c0365957e6c7733fc3dfd21d9f06b95e7664b325620a19becfc5a665caf68",
"version": "4.0.0-0.nightly-2019-01-29-025207"

One project took longer than 24 hours to terminate.

We're also seeing this when running openshift-tests run kubernetes/conformance. oc get apiservices did not show any unavailable services.

Seeing the same issue with the latest build, 4.0.0-0.nightly-2019-01-30-145955. Adding the TestBlocker keyword - this blocks 4.0 reliability testing.

This is blocking OCP 4.0 large-scale testing on AWS as well.

This issue is forcing the migration-eng team to restart clusters every time we need to clean a namespace for testing purposes. A workaround or solution would be appreciated!

A workaround that seems to work for most is oc delete pod -n openshift-monitoring prometheus-adapter-<id> each time it wedges.

The workaround worked for me. My cluster was up for ~10 hours.

Hit the issue with the rook-ceph project: 2 out of 2. Tried the following; neither worked:
1. Delete the prometheus-adapter pod: # oc delete pod -n openshift-monitoring prometheus-adapter-76cc66755b-b4bs9
2. Reboot all nodes in the cluster.

I see a ton of the following in the logs:

```
E0131 04:28:08.119575       1 memcache.go:134] couldn't get resource list for metrics.k8s.io/v1beta1: Unauthorized
E0131 04:28:18.237959       1 memcache.go:134] couldn't get resource list for metrics.k8s.io/v1beta1: Unauthorized
E0131 04:28:28.344253       1 memcache.go:134] couldn't get resource list for metrics.k8s.io/v1beta1: Unauthorized
E0131 04:28:38.444677       1 memcache.go:134] couldn't get resource list for metrics.k8s.io/v1beta1: Unauthorized
E0131 04:28:48.538360       1 memcache.go:134] couldn't get resource list for metrics.k8s.io/v1beta1: Unauthorized
```
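For anyone triaging this on a live cluster, the commands below restate the diagnosis and workaround from the comments above as one sequence. This is a sketch, not a supported fix: the APIService name v1beta1.metrics.k8s.io is inferred from the Unauthorized errors against metrics.k8s.io/v1beta1, and prometheus-adapter-<id> is a placeholder for the actual pod name on the cluster.

```
# Confirm namespaces are stuck in Terminating.
oc get ns | grep Terminating

# Check the aggregated metrics API backed by prometheus-adapter; failed
# discovery of metrics.k8s.io/v1beta1 is what blocks namespace finalization.
oc get apiservice v1beta1.metrics.k8s.io

# Workaround from the thread: delete the wedged prometheus-adapter pod(s)
# so the deployment recreates them. <id> is a placeholder for the pod suffix.
oc -n openshift-monitoring get pods | grep prometheus-adapter
oc -n openshift-monitoring delete pod prometheus-adapter-<id>

# Watch the stuck namespaces drain once discovery succeeds again.
oc get ns -w | grep Terminating
```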
(In reply to Michal Fojtik from comment #11) Same reason as Bug 1674372; the workaround is still:

$ oc -n openshift-monitoring delete deploy prometheus-adapter

The bug will be fixed soon.

Still a lot of Terminating namespaces with:

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-20-194410   True        False         5h23m   Cluster version is 4.0.0-0.nightly-2019-02-20-194410

# oc get ns | grep Terminating
0fucs   Terminating   40m
0j7qq   Terminating   42m
0k-0c   Terminating   5m52s
13pfh   Terminating   34m
1m-ta   Terminating   29m
1psoa   Terminating   27m
2ukns   Terminating   5m6s
30r3w   Terminating   41m
3f2ga   Terminating   10m
3onv1   Terminating   16m
4m1h9   Terminating   44m
5b6m4   Terminating   17m
720y9   Terminating   52m
7duzi   Terminating   46m
7gj2b   Terminating   22m

# oc get ns | grep Terminating | wc -l
102

The workaround ($ oc -n openshift-monitoring delete deploy prometheus-adapter) does not help.

As this issue now lies with the master and is tracked by Bug 1625194, closing Bug 1670994 as a duplicate.

*** This bug has been marked as a duplicate of bug 1625194 ***

Not the Service Catalog issue. Verified on 4.0.0-0.nightly-2019-03-06-074438.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
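As a hypothetical way to verify a fixed build the same way the comments do (counting namespaces stuck in Terminating), a small polling loop such as the one below can be left running while test projects are created and deleted. The 30-second interval is an arbitrary choice for illustration and is not part of the original report.

```
#!/usr/bin/env bash
# Poll the number of namespaces stuck in Terminating. On a fixed build the
# count should drop back toward 0 shortly after test namespaces are deleted,
# rather than growing into the hundreds as reported above.
while true; do
    count=$(oc get ns --no-headers | awk '$2 == "Terminating"' | wc -l)
    echo "$(date -u +%H:%M:%S) Terminating namespaces: ${count}"
    sleep 30   # arbitrary polling interval
done
```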