1727090 – e2e flake: namespace was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed

Keywords:
Status:	CLOSED DUPLICATE of bug 1752982
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	4.2.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.3.0
Assignee:	Robert Krawitz
QA Contact:	Sunil Choudhary
Docs Contact:
URL:
Whiteboard:
Duplicates (7):	1696470 1713135 1726802 1734427 1743929 1752591 1752811
Depends On:
Blocks:

Reported:	2019-07-04 12:30 UTC by Frederic Branczyk
Modified:	2023-09-14 05:31 UTC
CC List:	17 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-09-18 14:15:37 UTC
Target Upstream Version:
Embargoed:

Attachments
e2e-profile.png (82.53 KB, image/png) 2019-08-15 21:06 UTC, Seth Jennings	no flags

Description Frederic Branczyk 2019-07-04 12:30:37 UTC

The test titled: [sig-auth] ServiceAccounts should allow opting out of API token automount [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s] failed

in the job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.2/1571

```
fail [k8s.io/kubernetes/test/e2e/framework/framework.go:338]: Jul  4 10:26:05.143: Couldn't delete ns: "svcaccounts-8171": namespace svcaccounts-8171 was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed (&errors.errorString{s:"namespace svcaccounts-8171 was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed"})
```

Because this is a timeout during namespace deletion, the failure is unlikely to be caused by the test itself. It would still help next time if the failure message printed the finalizers that are still left on the namespace.
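
A minimal sketch of the kind of diagnostic suggested above, assuming the namespace object has been saved as JSON (for example from `oc get namespace <name> -o json`). The function name `remaining_finalizers` and the sample object are hypothetical; `spec.finalizers`, `metadata.finalizers`, and `status.phase` are the standard namespace fields it reads.

```python
import json


def remaining_finalizers(namespace_json: str) -> dict:
    """Return the phase and the finalizers still recorded on a namespace object."""
    ns = json.loads(namespace_json)
    return {
        "phase": ns.get("status", {}).get("phase"),
        # Namespace-level finalizers (typically "kubernetes").
        "spec": ns.get("spec", {}).get("finalizers", []),
        # Generic object finalizers, if any controller added them.
        "metadata": ns.get("metadata", {}).get("finalizers", []),
    }


# Hypothetical sample: a namespace that is still Terminating.
sample = json.dumps({
    "metadata": {"name": "svcaccounts-8171"},
    "spec": {"finalizers": ["kubernetes"]},
    "status": {"phase": "Terminating"},
})

info = remaining_finalizers(sample)
print(info)
```

Printing this alongside the timeout error would show whether a finalizer is still pending when the wait gives up.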

Comment 2 Seth Jennings 2019-07-30 15:35:33 UTC

*** Bug 1734427 has been marked as a duplicate of this bug. ***

Comment 3 Seth Jennings 2019-08-02 21:14:46 UTC

The real question is why does it only happen on this test? Every test creates a new test namespace and deletes it. Why is it only this test where the pod deletion takes a long time?

Asking myself and anyone else who might know.

Comment 4 Seth Jennings 2019-08-02 21:48:25 UTC

Also seems to happen occasionally on
[sig-cli] Kubectl client [k8s.io] Simple pod should contain last line of the log [Suite:openshift/conformance/parallel] [Suite:k8s]

Comment 5 Seth Jennings 2019-08-12 16:13:46 UTC

*** Bug 1713135 has been marked as a duplicate of this bug. ***

Comment 6 Seth Jennings 2019-08-15 21:06:48 UTC

Created attachment 1604194
e2e-profile.png

I ran this against my local cluster with `openshift run openshift/conformance/parallel` and the cluster got super unstable.

Prometheus died in the middle, so I lost my metrics.  The only metrics I have are from the hypervisor running the VMs that serve as the cluster nodes.

test start at 2:06
test end at 3:06 but there is a huge backlog of terminating namespaces (~30)
load stays at 100% until 3:22 (+16m)
load at about 75% until 3:32 when load goes back to normal (+26m)
only e2e-provisioning namespace stuck terminating

This was on three 4 vCPU/8 GB masters and two 4 vCPU/6 GB workers.  I didn't see any evidence of memory thrashing.

So I am able to reproduce this.  The trick will be keeping monitoring up so I can figure out what is going on.

Comment 7 Seth Jennings 2019-08-15 21:16:43 UTC

Just to clarify, my monitoring stack was evicted when the worker nodes went NotReady afaict

$ oc get event | grep Evict
72m         Normal    TaintManagerEviction     pod/alertmanager-main-0                         Marking for deletion Pod openshift-monitoring/alertmanager-main-0
81m         Normal    TaintManagerEviction     pod/alertmanager-main-1                         Marking for deletion Pod openshift-monitoring/alertmanager-main-1
72m         Normal    TaintManagerEviction     pod/alertmanager-main-2                         Marking for deletion Pod openshift-monitoring/alertmanager-main-2
72m         Normal    TaintManagerEviction     pod/grafana-768f56b5c5-v8mh8                    Marking for deletion Pod openshift-monitoring/grafana-768f56b5c5-v8mh8
72m         Normal    TaintManagerEviction     pod/kube-state-metrics-6c9974679c-jzg75         Marking for deletion Pod openshift-monitoring/kube-state-metrics-6c9974679c-jzg75
81m         Normal    TaintManagerEviction     pod/kube-state-metrics-6c9974679c-xrrgb         Marking for deletion Pod openshift-monitoring/kube-state-metrics-6c9974679c-xrrgb
72m         Normal    TaintManagerEviction     pod/openshift-state-metrics-84fd7f5c9c-gfrqj    Marking for deletion Pod openshift-monitoring/openshift-state-metrics-84fd7f5c9c-gfrqj
81m         Normal    TaintManagerEviction     pod/openshift-state-metrics-84fd7f5c9c-nsnd4    Marking for deletion Pod openshift-monitoring/openshift-state-metrics-84fd7f5c9c-nsnd4
72m         Normal    TaintManagerEviction     pod/prometheus-adapter-596cf8794c-4htc9         Marking for deletion Pod openshift-monitoring/prometheus-adapter-596cf8794c-4htc9
81m         Normal    TaintManagerEviction     pod/prometheus-adapter-596cf8794c-8gsz7         Marking for deletion Pod openshift-monitoring/prometheus-adapter-596cf8794c-8gsz7
72m         Normal    TaintManagerEviction     pod/prometheus-adapter-596cf8794c-t22cn         Marking for deletion Pod openshift-monitoring/prometheus-adapter-596cf8794c-t22cn
81m         Normal    TaintManagerEviction     pod/prometheus-k8s-0                            Marking for deletion Pod openshift-monitoring/prometheus-k8s-0
72m         Normal    TaintManagerEviction     pod/prometheus-k8s-1                            Marking for deletion Pod openshift-monitoring/prometheus-k8s-1
72m         Normal    TaintManagerEviction     pod/prometheus-operator-764cbd99f9-xhxl8        Marking for deletion Pod openshift-monitoring/prometheus-operator-764cbd99f9-xhxl8
81m         Normal    TaintManagerEviction     pod/telemeter-client-798455f9fd-58875           Marking for deletion Pod openshift-monitoring/telemeter-client-798455f9fd-58875
72m         Normal    TaintManagerEviction     pod/telemeter-client-798455f9fd-xk9l8           Marking for deletion Pod openshift-monitoring/telemeter-client-798455f9fd-xk9l8

$ oc get event --all-namespaces | grep Node
default                             48m         Normal    NodeHasSufficientMemory   node/worker-0                                   Node worker-0 status is now: NodeHasSufficientMemory
default                             48m         Normal    NodeHasNoDiskPressure     node/worker-0                                   Node worker-0 status is now: NodeHasNoDiskPressure
default                             48m         Normal    NodeHasSufficientPID      node/worker-0                                   Node worker-0 status is now: NodeHasSufficientPID
default                             48m         Normal    NodeReady                 node/worker-0                                   Node worker-0 status is now: NodeReady
default                             88m         Normal    NodeNotReady              node/worker-0                                   Node worker-0 status is now: NodeNotReady
default                             56m         Normal    NodeNotReady              node/worker-0                                   Node worker-0 status is now: NodeNotReady
default                             40m         Normal    NodeHasSufficientMemory   node/worker-1                                   Node worker-1 status is now: NodeHasSufficientMemory
default                             40m         Normal    NodeHasNoDiskPressure     node/worker-1                                   Node worker-1 status is now: NodeHasNoDiskPressure
default                             40m         Normal    NodeHasSufficientPID      node/worker-1                                   Node worker-1 status is now: NodeHasSufficientPID
default                             39m         Normal    NodeReady                 node/worker-1                                   Node worker-1 status is now: NodeReady
default                             79m         Normal    NodeNotReady              node/worker-1                                   Node worker-1 status is now: NodeNotReady
default                             40m         Normal    NodeNotReady              node/worker-1                                   Node worker-1 status is now: NodeNotReady

Going to try with larger workers.  Though this should not happen :-/
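
The node events above are not listed in chronological order, which makes the NotReady/Ready sequence hard to read. A small sketch that reorders them oldest first, assuming the column layout shown above (namespace, age, type, reason, object) and ages expressed only in minutes. The helper names are hypothetical.

```python
def parse_age_minutes(age: str) -> int:
    """Convert an age such as '88m' to minutes; other units are not handled here."""
    if not age.endswith("m"):
        raise ValueError(f"unsupported age format: {age!r}")
    return int(age[:-1])


def node_timeline(lines):
    """Return (age_minutes, object, reason) for readiness events, oldest first."""
    events = []
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        _namespace, age, _type, reason, obj = fields[:5]
        if reason in ("NodeReady", "NodeNotReady"):
            events.append((parse_age_minutes(age), obj, reason))
    # A larger age means the event happened earlier.
    return sorted(events, key=lambda e: e[0], reverse=True)


# Three of the worker-0 lines from the listing above.
sample_lines = [
    "default  48m  Normal  NodeReady     node/worker-0  Node worker-0 status is now: NodeReady",
    "default  88m  Normal  NodeNotReady  node/worker-0  Node worker-0 status is now: NodeNotReady",
    "default  56m  Normal  NodeNotReady  node/worker-0  Node worker-0 status is now: NodeNotReady",
]

timeline = node_timeline(sample_lines)
for age, obj, reason in timeline:
    print(f"{age}m ago  {obj}  {reason}")
```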

Comment 8 Maciej Szulik 2019-08-16 13:40:57 UTC

*** Bug 1726802 has been marked as a duplicate of this bug. ***

Comment 9 Xingxing Xia 2019-08-19 09:55:00 UTC

https://storage.googleapis.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/49/build-log.txt
https://storage.googleapis.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/42/build-log.txt
https://storage.googleapis.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/30/build-log.txt
In the recent jobs above, the test case "[Feature:Builds][webhook] TestWebhook [Suite:openshift/conformance/parallel]" also hits:
namespace e2e-test-build-webhooks-7g7vj was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed (&errors.errorString{s:"namespace e2e-test-build-webhooks-7g7vj was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed

Comment 11 zhou ying 2019-08-21 07:37:28 UTC

*** Bug 1743929 has been marked as a duplicate of this bug. ***

Comment 13 Tomáš Nožička 2019-08-26 15:29:17 UTC

*** Bug 1696470 has been marked as a duplicate of this bug. ***

Comment 14 Micah Abbott 2019-09-05 13:29:19 UTC

The GCP CI is seeing similar timeout errors in the "ServiceAccounts" test from comment #1 and also in the "Image change build triggers" tests:

- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/177
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/180

Comment 20 sheng.lao 2019-09-16 10:04:09 UTC

Another one: https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-aws-upi-4.2/115

Comment 21 Tomáš Nožička 2019-09-17 08:54:19 UTC

*** Bug 1752591 has been marked as a duplicate of this bug. ***

Comment 22 Adam Kaplan 2019-09-17 14:14:48 UTC

*** Bug 1752811 has been marked as a duplicate of this bug. ***

Comment 23 David Eads 2019-09-17 14:47:52 UTC

Opened https://github.com/openshift/origin/pull/23813 .  It looks like pods are slow to be cleaned up by the kubelet, and we back off for 5 minutes, which causes us to miss the check window in e2e.
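
To illustrate the timing argument above, here is a rough sketch under stated assumptions. It models namespace cleanup as being retried at a fixed interval, which simplifies whatever backoff the controller actually uses. The 5-minute backoff comes from this comment; the 180-second check window and the 20-second pod cleanup time are hypothetical values chosen only for the example.

```python
import math


def removal_time(pods_gone_at: float, backoff: float) -> float:
    """Time of the first cleanup attempt at or after the moment the pods are gone.

    Attempts are assumed to happen at t = 0, backoff, 2 * backoff, ...
    """
    return math.ceil(pods_gone_at / backoff) * backoff


def deleted_within_window(pods_gone_at: float, backoff: float, window: float) -> bool:
    """True if the namespace would be removed before the e2e check gives up."""
    return removal_time(pods_gone_at, backoff) <= window


BACKOFF_SECONDS = 5 * 60   # from the comment above
CHECK_WINDOW_SECONDS = 180  # hypothetical e2e wait, not taken from the source

# Pods that linger for only 20 seconds still push removal to the next attempt.
slow = deleted_within_window(20, BACKOFF_SECONDS, CHECK_WINDOW_SECONDS)
fast = deleted_within_window(0, BACKOFF_SECONDS, CHECK_WINDOW_SECONDS)
print(f"pods gone at 20s: deleted in window = {slow}")
print(f"pods gone at 0s:  deleted in window = {fast}")
```

Under this model, even a short delay in pod cleanup costs a full backoff interval, which is consistent with the namespace being empty but not yet removed when the check times out.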

Comment 24 Robert Krawitz 2019-09-18 14:15:37 UTC


*** This bug has been marked as a duplicate of bug 1752982 ***

Comment 25 Red Hat Bugzilla 2023-09-14 05:31:20 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days
