Bug 1670994 - Projects stuck in terminating state for overnight clusters
Summary: Projects stuck in terminating state for overnight clusters
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.1.0
Assignee: Frederic Branczyk
QA Contact: Mike Fiedler
URL:
Whiteboard: aos-scalability-40
Depends On:
Blocks:
 
Reported: 2019-01-30 13:35 UTC by Alex Krzos
Modified: 2019-06-04 10:42 UTC
CC List: 23 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:42:28 UTC
Target Upstream Version:




Links:
Red Hat Product Errata RHBA-2019:0758 (last updated 2019-06-04 10:42:36 UTC)

Description Alex Krzos 2019-01-30 13:35:59 UTC
Description of problem:
On longer-running clusters, deleting a project from a test cluster that has been left up overnight leaves the project stuck in the Terminating state.

Version-Release number of selected component (if applicable):
OCP 4.0
4.0.0-0.nightly-2019-01-25-205123


How reproducible:
Always

Steps to Reproduce:
1. Build a cluster
2. Deploy a project with pods
3. Leave the cluster up overnight
4. Attempt to delete the project (a reproduction sketch follows below)
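
A minimal shell sketch of the reproduction; the project name (uperf-1) matches the output below, but the pod image and workload are illustrative assumptions. The workload itself is not significant, only that the cluster ages overnight before the delete:

```
# Project name (uperf-1) is from the report; the image is an illustrative assumption.
oc new-project uperf-1
oc run uperf --image=registry.access.redhat.com/ubi8/ubi-minimal --restart=Never -n uperf-1 -- sleep 86400

# Leave the cluster up overnight, then:
oc delete project uperf-1
oc get project uperf-1    # expected: NotFound; observed here: STATUS Terminating
```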

Actual results / example:

root@ip-172-31-32-21: ~/ocp-automation # oc get projects
NAME                                         DISPLAY NAME   STATUS
controller                                                  Active
default                                                     Active
...
openshift-service-cert-signer                               Active
pbench                                                      Active
uperf-1                                                     Active
root@ip-172-31-32-21: ~/ocp-automation # oc delete project uperf-1
project.project.openshift.io "uperf-1" deleted
E0130 13:11:56.989423    2593 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=11, ErrCode=NO_ERROR, debug=""
E0130 13:12:36.280512    2593 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug=""
root@ip-172-31-32-21: ~/ocp-automation # oc get projects
NAME                                         DISPLAY NAME   STATUS
controller                                                  Active
default                                                     Active
...
openshift-service-cert-signer                               Active
pbench                                                      Active
uperf-1                                                     Terminating



Expected results:
The project is deleted promptly and removed from the project list.

Additional info:

The example cluster had been online for 22 hours:

root@ip-172-31-32-21: ~/ocp-automation # oc get nodes
NAME                                         STATUS    ROLES     AGE       VERSION
ip-10-0-13-144.us-west-2.compute.internal    Ready     master    22h       v1.12.4+50c2f2340a
ip-10-0-134-33.us-west-2.compute.internal    Ready     worker    21h       v1.12.4+50c2f2340a
ip-10-0-138-95.us-west-2.compute.internal    Ready     infra     21h       v1.12.4+50c2f2340a
ip-10-0-140-10.us-west-2.compute.internal    Ready     pbench    21h       v1.12.4+50c2f2340a
ip-10-0-144-56.us-west-2.compute.internal    Ready     infra     21h       v1.12.4+50c2f2340a
ip-10-0-152-20.us-west-2.compute.internal    Ready     worker    21h       v1.12.4+50c2f2340a
ip-10-0-167-221.us-west-2.compute.internal   Ready     infra     21h       v1.12.4+50c2f2340a
ip-10-0-168-150.us-west-2.compute.internal   Ready     worker    21h       v1.12.4+50c2f2340a
ip-10-0-19-77.us-west-2.compute.internal     Ready     master    22h       v1.12.4+50c2f2340a
ip-10-0-42-28.us-west-2.compute.internal     Ready     master    22h       v1.12.4+50c2f2340a

Comment 1 Walid A. 2019-01-31 14:01:36 UTC
Seeing projects stuck in the Terminating state for every new project created and deleted on my 4.0 RHCOS cluster (3 masters, 3 workers):

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-01-29-025207   True        False         1d      Cluster version is 4.0.0-0.nightly-2019-01-29-025207

"image": "registry.svc.ci.openshift.org/ocp/release@sha256:aa2c0365957e6c7733fc3dfd21d9f06b95e7664b325620a19becfc5a665caf68",
"version": "4.0.0-0.nightly-2019-01-29-025207"

One project took longer than 24 hours to terminate.

Comment 2 Mike Fiedler 2019-01-31 14:20:53 UTC
We're also seeing this when running openshift-tests run kubernetes/conformance. oc get apiservices did not show any unavailable services.
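
For context (general Kubernetes behavior, not stated in this report): the namespace controller must successfully list every resource type, including aggregated APIs such as metrics.k8s.io served by prometheus-adapter, before it can confirm a namespace is empty and finish deleting it, so an unavailable or unauthorized aggregated API can wedge all namespace deletion. A quick check, assuming only the standard apiservice naming:

```
# List aggregated APIs; any entry with AVAILABLE=False is suspect.
oc get apiservices

# apiservice objects are named <version>.<group>; this is the one
# served by prometheus-adapter:
oc get apiservice v1beta1.metrics.k8s.io -o yaml
```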

Comment 4 Naga Ravi Chaitanya Elluri 2019-02-01 16:19:24 UTC
Seeing the same issue with the latest build - 4.0.0-0.nightly-2019-01-30-145955.

Comment 5 Mike Fiedler 2019-02-04 15:38:20 UTC
Adding TestBlocker keyword - this blocks 4.0 reliability testing.

Comment 6 Naga Ravi Chaitanya Elluri 2019-02-04 15:46:53 UTC
This is blocking OCP 4.0 large scale testing on AWS as well.

Comment 7 Derek Whatley 2019-02-12 16:59:06 UTC
This issue forces the migration-eng team to restart clusters every time we need to clean up a namespace for testing.

A workaround or solution would be appreciated!

Comment 8 Mike Fiedler 2019-02-12 18:02:08 UTC
A workaround that seems to work for most people is to run oc delete pod -n openshift-monitoring prometheus-adapter-<id> each time it wedges.
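
The <id> suffix is generated per pod, so it has to be looked up each time; a sketch (the grep filter is a convenience, not an exact label selector):

```
# Look up the current pod name (the <id> suffix is generated per pod):
oc get pods -n openshift-monitoring | grep prometheus-adapter

# Delete it; the deployment controller schedules a replacement:
oc delete pod -n openshift-monitoring prometheus-adapter-<id>
```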

Comment 9 Daneyon Hansen 2019-02-14 05:45:07 UTC
The workaround worked for me. My cluster was up for ~10 hours.

Comment 10 Hongkai Liu 2019-02-14 21:29:01 UTC
Hit the issue with the rook-ceph project: 2 out of 2 attempts.

Tried the following; neither worked:
1. Deleting the prometheus-adapter pod:
# oc delete pod -n openshift-monitoring prometheus-adapter-76cc66755b-b4bs9
2. Rebooting all nodes in the cluster

Comment 11 Michal Fojtik 2019-02-15 11:11:54 UTC
I see a ton of the following in the logs:

```
E0131 04:28:08.119575       1 memcache.go:134] couldn't get resource list for metrics.k8s.io/v1beta1: Unauthorized
E0131 04:28:18.237959       1 memcache.go:134] couldn't get resource list for metrics.k8s.io/v1beta1: Unauthorized
E0131 04:28:28.344253       1 memcache.go:134] couldn't get resource list for metrics.k8s.io/v1beta1: Unauthorized
E0131 04:28:38.444677       1 memcache.go:134] couldn't get resource list for metrics.k8s.io/v1beta1: Unauthorized
E0131 04:28:48.538360       1 memcache.go:134] couldn't get resource list for metrics.k8s.io/v1beta1: Unauthorized
```

Comment 12 Junqi Zhao 2019-02-15 11:42:56 UTC
(In reply to Michal Fojtik from comment #11)
> I see a ton of [repeated memcache.go:134 "couldn't get resource list for metrics.k8s.io/v1beta1: Unauthorized" errors] in the logs.

Same root cause as Bug 1674372; the workaround is still:
$ oc -n openshift-monitoring delete deploy prometheus-adapter

The bug will be fixed soon.
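
A sketch of that deployment-level workaround with a follow-up check; the recovery behavior is an assumption on my part, not stated in this report:

```
oc -n openshift-monitoring delete deploy prometheus-adapter

# Assumption (not stated in this report): the cluster monitoring operator
# recreates the deployment. Watch for the new pod, then confirm the
# metrics API answers again:
oc -n openshift-monitoring get pods -w
oc get apiservice v1beta1.metrics.k8s.io
```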

Comment 16 Junqi Zhao 2019-02-21 08:52:33 UTC
Still a lot of Terminating namespaces with:
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE     STATUS
version   4.0.0-0.nightly-2019-02-20-194410   True        False         5h23m     Cluster version is 4.0.0-0.nightly-2019-02-20-194410

# oc get ns | grep Terminating
0fucs                                         Terminating   40m
0j7qq                                         Terminating   42m
0k-0c                                         Terminating   5m52s
13pfh                                         Terminating   34m
1m-ta                                         Terminating   29m
1psoa                                         Terminating   27m
2ukns                                         Terminating   5m6s
30r3w                                         Terminating   41m
3f2ga                                         Terminating   10m
3onv1                                         Terminating   16m
4m1h9                                         Terminating   44m
5b6m4                                         Terminating   17m
720y9                                         Terminating   52m
7duzi                                         Terminating   46m
7gj2b                                         Terminating   22m

# oc get ns | grep Terminating | wc -l
102

The workaround does not help:
$ oc -n openshift-monitoring delete deploy prometheus-adapter
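
When the known workaround stops helping, inspecting a stuck namespace directly can show what is blocking it; a sketch, using an illustrative namespace name from the listing above. On a cluster of this vintage the status may show only the phase; newer Kubernetes releases add explicit deletion-failure conditions:

```
# Namespace name (0fucs) is taken from the listing above as an example.
# Check which finalizers are still set and any status detail:
oc get namespace 0fucs -o yaml
```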

Comment 19 Junqi Zhao 2019-02-21 10:47:09 UTC
As this issue is now related to the master component and is tracked by Bug 1625194, closing Bug 1670994 as a DUPLICATE.

*** This bug has been marked as a duplicate of bug 1625194 ***

Comment 23 Jian Zhang 2019-02-25 01:52:33 UTC
This is not a Service Catalog issue.

Comment 25 Mike Fiedler 2019-03-06 15:40:13 UTC
Verified on 4.0.0-0.nightly-2019-03-06-074438

Comment 32 errata-xmlrpc 2019-06-04 10:42:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

