Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1727090

Summary: e2e flake: namespace was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed
Product: OpenShift Container Platform
Reporter: Frederic Branczyk <fbranczy>
Component: Node
Assignee: Robert Krawitz <rkrawitz>
Status: CLOSED DUPLICATE
QA Contact: Sunil Choudhary <schoudha>
Severity: medium
Docs Contact:
Priority: medium
Version: 4.2.0
CC: adahiya, aos-bugs, ccoleman, deads, jcallen, jcantril, jokerman, lsm5, mfojtik, miabbott, mmccomas, nagrawal, shlao, sjenning, tnozicka, wking, yinzhou
Target Milestone: ---
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-09-18 14:15:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments: e2e-profile.png (flags: none)

Description Frederic Branczyk 2019-07-04 12:30:37 UTC
The test titled: [sig-auth] ServiceAccounts should allow opting out of API token automount [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s] failed

in the job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.2/1571

```
fail [k8s.io/kubernetes/test/e2e/framework/framework.go:338]: Jul  4 10:26:05.143: Couldn't delete ns: "svcaccounts-8171": namespace svcaccounts-8171 was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed (&errors.errorString{s:"namespace svcaccounts-8171 was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed"})
```

Since this is a timeout on namespace deletion, the test itself is unlikely to be at fault. However, it would be useful for next time if the finalizers still left on the namespace were printed.
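A minimal sketch of the kind of diagnostic suggested above, assuming the namespace object is available as a parsed JSON document (for example from `oc get namespace <name> -o json`). The helper names and the sample object are hypothetical; only the namespace name comes from the failure message, and the finalizer values shown are illustrative, not taken from the failed job.

```python
def remaining_finalizers(namespace: dict) -> dict:
    """Collect the finalizers still recorded on a Namespace object.

    A Namespace can carry finalizers in two places: metadata.finalizers
    and spec.finalizers. Both are reported, together with status.phase.
    """
    metadata = namespace.get("metadata", {})
    spec = namespace.get("spec", {})
    status = namespace.get("status", {})
    return {
        "name": metadata.get("name", "<unknown>"),
        "phase": status.get("phase", "<unknown>"),
        "metadata_finalizers": list(metadata.get("finalizers") or []),
        "spec_finalizers": list(spec.get("finalizers") or []),
    }


def describe(namespace: dict) -> str:
    """Format a one-line summary suitable for a test failure message."""
    info = remaining_finalizers(namespace)
    return (
        f"namespace {info['name']} phase={info['phase']} "
        f"metadata.finalizers={info['metadata_finalizers']} "
        f"spec.finalizers={info['spec_finalizers']}"
    )


# Hypothetical sample object; the finalizer list is an assumption.
SAMPLE = {
    "metadata": {"name": "svcaccounts-8171"},
    "spec": {"finalizers": ["kubernetes"]},
    "status": {"phase": "Terminating"},
}

if __name__ == "__main__":
    print(describe(SAMPLE))
```

Appending a line like this to the "was not deleted with limit" error would show at a glance whether a finalizer is still holding the namespace.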

Comment 2 Seth Jennings 2019-07-30 15:35:33 UTC
*** Bug 1734427 has been marked as a duplicate of this bug. ***

Comment 3 Seth Jennings 2019-08-02 21:14:46 UTC
The real question is why does it only happen on this test? Every test creates a new test namespace and deletes it. Why is it only this test where the pod deletion takes a long time?

Asking myself and anyone else that might know.

Comment 4 Seth Jennings 2019-08-02 21:48:25 UTC
Also seems to happen occasionally on
[sig-cli] Kubectl client [k8s.io] Simple pod should contain last line of the log [Suite:openshift/conformance/parallel] [Suite:k8s]

Comment 5 Seth Jennings 2019-08-12 16:13:46 UTC
*** Bug 1713135 has been marked as a duplicate of this bug. ***

Comment 6 Seth Jennings 2019-08-15 21:06:48 UTC
Created attachment 1604194
e2e-profile.png

I ran this against my local cluster with `openshift run openshift/conformance/parallel` and the cluster got super unstable.

Prometheus died partway through, so I lost my metrics. The only metrics I have are from the hypervisor running the VMs that are the cluster nodes.

test start at 2:06
test end at 3:06 but there is a huge backlog of terminating namespaces (~30)
load stays at 100% until 3:22 (+16m)
load at about 75% until 3:32 when load goes back to normal (+26m)
only e2e-provisioning namespace stuck terminating

This was on three masters (4 vCPU/8 GB each) and two workers (4 vCPU/6 GB each). I didn't see any evidence of memory thrashing.

So I am able to reproduce this.  The trick will be keeping monitoring up so I can figure out what is going on.

Comment 7 Seth Jennings 2019-08-15 21:16:43 UTC
Just to clarify, as far as I can tell my monitoring stack was evicted when the worker nodes went NotReady:

$ oc get event | grep Evict
72m         Normal    TaintManagerEviction     pod/alertmanager-main-0                         Marking for deletion Pod openshift-monitoring/alertmanager-main-0
81m         Normal    TaintManagerEviction     pod/alertmanager-main-1                         Marking for deletion Pod openshift-monitoring/alertmanager-main-1
72m         Normal    TaintManagerEviction     pod/alertmanager-main-2                         Marking for deletion Pod openshift-monitoring/alertmanager-main-2
72m         Normal    TaintManagerEviction     pod/grafana-768f56b5c5-v8mh8                    Marking for deletion Pod openshift-monitoring/grafana-768f56b5c5-v8mh8
72m         Normal    TaintManagerEviction     pod/kube-state-metrics-6c9974679c-jzg75         Marking for deletion Pod openshift-monitoring/kube-state-metrics-6c9974679c-jzg75
81m         Normal    TaintManagerEviction     pod/kube-state-metrics-6c9974679c-xrrgb         Marking for deletion Pod openshift-monitoring/kube-state-metrics-6c9974679c-xrrgb
72m         Normal    TaintManagerEviction     pod/openshift-state-metrics-84fd7f5c9c-gfrqj    Marking for deletion Pod openshift-monitoring/openshift-state-metrics-84fd7f5c9c-gfrqj
81m         Normal    TaintManagerEviction     pod/openshift-state-metrics-84fd7f5c9c-nsnd4    Marking for deletion Pod openshift-monitoring/openshift-state-metrics-84fd7f5c9c-nsnd4
72m         Normal    TaintManagerEviction     pod/prometheus-adapter-596cf8794c-4htc9         Marking for deletion Pod openshift-monitoring/prometheus-adapter-596cf8794c-4htc9
81m         Normal    TaintManagerEviction     pod/prometheus-adapter-596cf8794c-8gsz7         Marking for deletion Pod openshift-monitoring/prometheus-adapter-596cf8794c-8gsz7
72m         Normal    TaintManagerEviction     pod/prometheus-adapter-596cf8794c-t22cn         Marking for deletion Pod openshift-monitoring/prometheus-adapter-596cf8794c-t22cn
81m         Normal    TaintManagerEviction     pod/prometheus-k8s-0                            Marking for deletion Pod openshift-monitoring/prometheus-k8s-0
72m         Normal    TaintManagerEviction     pod/prometheus-k8s-1                            Marking for deletion Pod openshift-monitoring/prometheus-k8s-1
72m         Normal    TaintManagerEviction     pod/prometheus-operator-764cbd99f9-xhxl8        Marking for deletion Pod openshift-monitoring/prometheus-operator-764cbd99f9-xhxl8
81m         Normal    TaintManagerEviction     pod/telemeter-client-798455f9fd-58875           Marking for deletion Pod openshift-monitoring/telemeter-client-798455f9fd-58875
72m         Normal    TaintManagerEviction     pod/telemeter-client-798455f9fd-xk9l8           Marking for deletion Pod openshift-monitoring/telemeter-client-798455f9fd-xk9l8

$ oc get event --all-namespaces | grep Node
default                             48m         Normal    NodeHasSufficientMemory   node/worker-0                                   Node worker-0 status is now: NodeHasSufficientMemory
default                             48m         Normal    NodeHasNoDiskPressure     node/worker-0                                   Node worker-0 status is now: NodeHasNoDiskPressure
default                             48m         Normal    NodeHasSufficientPID      node/worker-0                                   Node worker-0 status is now: NodeHasSufficientPID
default                             48m         Normal    NodeReady                 node/worker-0                                   Node worker-0 status is now: NodeReady
default                             88m         Normal    NodeNotReady              node/worker-0                                   Node worker-0 status is now: NodeNotReady
default                             56m         Normal    NodeNotReady              node/worker-0                                   Node worker-0 status is now: NodeNotReady
default                             40m         Normal    NodeHasSufficientMemory   node/worker-1                                   Node worker-1 status is now: NodeHasSufficientMemory
default                             40m         Normal    NodeHasNoDiskPressure     node/worker-1                                   Node worker-1 status is now: NodeHasNoDiskPressure
default                             40m         Normal    NodeHasSufficientPID      node/worker-1                                   Node worker-1 status is now: NodeHasSufficientPID
default                             39m         Normal    NodeReady                 node/worker-1                                   Node worker-1 status is now: NodeReady
default                             79m         Normal    NodeNotReady              node/worker-1                                   Node worker-1 status is now: NodeNotReady
default                             40m         Normal    NodeNotReady              node/worker-1                                   Node worker-1 status is now: NodeNotReady
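To make the timing of the evictions easier to read, here is a small sketch that groups the `TaintManagerEviction` events by their age column. It assumes the whitespace-separated layout of the `oc get event | grep Evict` output above (age, type, reason, object, message); the function name is hypothetical and the sample is a four-line excerpt of that output. It does not handle the `--all-namespaces` listing, which has an extra leading namespace column.

```python
from collections import defaultdict

# Excerpt of the eviction events listed above.
EVENTS = """\
72m         Normal    TaintManagerEviction     pod/alertmanager-main-0                         Marking for deletion Pod openshift-monitoring/alertmanager-main-0
81m         Normal    TaintManagerEviction     pod/alertmanager-main-1                         Marking for deletion Pod openshift-monitoring/alertmanager-main-1
81m         Normal    TaintManagerEviction     pod/prometheus-k8s-0                            Marking for deletion Pod openshift-monitoring/prometheus-k8s-0
72m         Normal    TaintManagerEviction     pod/prometheus-k8s-1                            Marking for deletion Pod openshift-monitoring/prometheus-k8s-1
"""


def group_by_age(text: str, reason: str = "TaintManagerEviction") -> dict:
    """Map each age value (e.g. '72m') to the objects evicted at that age."""
    groups = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        # Expected columns: age, type, reason, object, message...
        if len(fields) < 4 or fields[2] != reason:
            continue
        age, obj = fields[0], fields[3]
        groups[age].append(obj)
    return dict(groups)


if __name__ == "__main__":
    for age, objects in sorted(group_by_age(EVENTS).items()):
        print(age, ", ".join(objects))
```

Run against the full listing, this shows the evictions falling into two groups, one at 81m and one at 72m.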

Going to try with larger workers.  Though this should not happen :-/

Comment 8 Maciej Szulik 2019-08-16 13:40:57 UTC
*** Bug 1726802 has been marked as a duplicate of this bug. ***

Comment 9 Xingxing Xia 2019-08-19 09:55:00 UTC
https://storage.googleapis.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/49/build-log.txt
https://storage.googleapis.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/42/build-log.txt
https://storage.googleapis.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/30/build-log.txt
In the recent jobs above, the test case "[Feature:Builds][webhook] TestWebhook [Suite:openshift/conformance/parallel]" also hits:
namespace e2e-test-build-webhooks-7g7vj was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed (&errors.errorString{s:"namespace e2e-test-build-webhooks-7g7vj was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed

Comment 11 zhou ying 2019-08-21 07:37:28 UTC
*** Bug 1743929 has been marked as a duplicate of this bug. ***

Comment 13 Tomáš Nožička 2019-08-26 15:29:17 UTC
*** Bug 1696470 has been marked as a duplicate of this bug. ***

Comment 14 Micah Abbott 2019-09-05 13:29:19 UTC
The GCP CI is seeing similar timeout errors in the "ServiceAccounts" test from comment #1 and also in the "Image change build triggers" tests:

- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/177
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/180

Comment 21 Tomáš Nožička 2019-09-17 08:54:19 UTC
*** Bug 1752591 has been marked as a duplicate of this bug. ***

Comment 22 Adam Kaplan 2019-09-17 14:14:48 UTC
*** Bug 1752811 has been marked as a duplicate of this bug. ***

Comment 23 David Eads 2019-09-17 14:47:52 UTC
Opened https://github.com/openshift/origin/pull/23813 . It looks like pods are slow to be cleaned up by the kubelet, and we back off for 5 minutes, which causes us to miss the check window in e2e.
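The timing argument can be illustrated with a deliberately simplified model. It assumes namespace deletion is attempted at t=0 and then retried at a fixed backoff interval, and that the namespace is only removed on the first attempt made after its pods are gone. This is not the namespace controller's actual retry logic. The 300 s backoff comes from the comment above; the 40 s pod cleanup time, the 180 s check window, and the 15 s alternative backoff are hypothetical values chosen for illustration.

```python
import math


def namespace_removed_at(pods_gone_s: float, backoff_s: float) -> float:
    """Time of the first deletion attempt at or after the pods are gone.

    Attempts are modeled at t = 0, backoff_s, 2 * backoff_s, ...
    """
    if pods_gone_s <= 0:
        return 0.0
    return math.ceil(pods_gone_s / backoff_s) * backoff_s


def misses_window(pods_gone_s: float, backoff_s: float, window_s: float) -> bool:
    """True if the namespace is still present when the check window closes."""
    return namespace_removed_at(pods_gone_s, backoff_s) > window_s


if __name__ == "__main__":
    # Hypothetical: pods take 40 s to terminate, the test waits 180 s.
    print(misses_window(pods_gone_s=40, backoff_s=300, window_s=180))
    print(misses_window(pods_gone_s=40, backoff_s=15, window_s=180))
```

Under these assumptions the namespace is already empty at 40 s but is not retried until 300 s, which is consistent with the "namespace is empty but is not yet removed" message; with a shorter backoff the same cleanup delay would fit inside the window.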

Comment 24 Robert Krawitz 2019-09-18 14:15:37 UTC

*** This bug has been marked as a duplicate of bug 1752982 ***

Comment 25 Red Hat Bugzilla 2023-09-14 05:31:20 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days