Bug 1713135 - Some namespaces aren't being cleaned up in e2e tests post-rebase
Summary: Some namespaces aren't being cleaned up in e2e tests post-rebase
Keywords:
Status: CLOSED DUPLICATE of bug 1727090
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.2.0
Assignee: Seth Jennings
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-05-23 00:19 UTC by Clayton Coleman
Modified: 2019-08-12 16:13 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-08-12 16:13:46 UTC
Target Upstream Version:
Embargoed:



Description Clayton Coleman 2019-05-23 00:19:05 UTC
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.2/299#openshift-tests-sig-cli-kubectl-client-k8sio-simple-pod-should-contain-last-line-of-the-log-suiteopenshiftconformanceparallel-suitek8s

May 22 20:15:07.017: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
May 22 20:15:09.963: INFO: Couldn't delete ns: "kubectl-754": namespace kubectl-754 was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed (&errors.errorString{s:"namespace kubectl-754 was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed"})
May 22 20:15:09.965: INFO: Running AfterSuite actions on all nodes
May 22 20:15:09.965: INFO: Running AfterSuite actions on node 1

This is likely an issue with the namespace controller and needs triage.

Medium flake rate
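
For context, the message above comes from the test framework giving up while waiting for the namespace object to disappear after the delete call. A minimal sketch of that kind of wait using client-go (not the upstream e2e framework code; the package/function names and the 5-minute budget are illustrative):

package nsdebug

import (
    "context"
    "time"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/client-go/kubernetes"
)

// waitForNamespaceDeleted requests deletion and then polls until the namespace
// object is actually gone (NotFound) or the budget runs out - the situation
// reported in this bug.
func waitForNamespaceDeleted(ctx context.Context, c kubernetes.Interface, name string) error {
    if err := c.CoreV1().Namespaces().Delete(ctx, name, metav1.DeleteOptions{}); err != nil && !apierrors.IsNotFound(err) {
        return err
    }
    return wait.PollImmediate(2*time.Second, 5*time.Minute, func() (bool, error) {
        _, err := c.CoreV1().Namespaces().Get(ctx, name, metav1.GetOptions{})
        if apierrors.IsNotFound(err) {
            return true, nil // fully removed
        }
        if err != nil {
            return false, err
        }
        return false, nil // still present (likely Terminating), keep waiting
    })
}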

Comment 1 Stefan Schimanski 2019-08-01 10:13:54 UTC
Similar errors are found very often:

https://search.svc.ci.openshift.org/?search=was+not+deleted+with+limit&maxAge=168h&context=2&type=all

Comment 2 Lukasz Szaszkiewicz 2019-08-06 13:55:08 UTC
I have been looking at this issue today. First and foremost, namespaces do get deleted - the `ns` controller that runs as part of `kcm` prints the status into the log file. I think we could lower the severity of this issue.

I have analyzed logs from a few test runs and it seems that the namespaces couldn't be deleted right away because there were some pending pods. This conflicts with the error msg from the tests, "timed out waiting for the condition, namespace is empty but is not yet removed".
I think the `ns` controller fell behind because it had retried many times and kept increasing its next sync period (backoff).
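
To illustrate the backoff effect, here is a hedged sketch using the exponential failure rate limiter from client-go's workqueue package (the queue key and iteration count are made up; the namespace controller's queue uses a rate limiter of this kind, but this is not its actual code):

package main

import (
    "fmt"
    "time"

    "k8s.io/client-go/util/workqueue"
)

func main() {
    // Base delay 5ms, doubling per failure up to a 1000s cap - the defaults
    // used by workqueue.DefaultControllerRateLimiter.
    rl := workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second)
    item := "e2e-test-build-webhooks-xvsrq" // hypothetical queue key

    for i := 1; i <= 20; i++ {
        // Each call records another failure and returns the next requeue delay.
        fmt.Printf("failure %2d -> requeue after %v\n", i, rl.When(item))
    }
}

By roughly the 17th failure the delay already exceeds five minutes, which is consistent with the gap between 01:34:25 and 01:39:28 in the timeline below.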

For example, let's put the events from https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/2161/pull-ci-openshift-installer-master-e2e-aws/7079/ on a timeline:

1. At 01:29:09.065 the test decided to destroy the namespace
2. ns controller tried to remove the namespace from 01:29:40 to 01:34:25 but couldn't, due to "unexpected items still remain in namespace: e2e-test-build-webhooks-xvsrq for gvr: /v1, Resource=pods" (see the sketch after this timeline for a way to enumerate such leftovers)
3. At 01:39:11.605 the test gave up - "Couldn't delete ns: "e2e-test-build-webhooks-xvsrq": namespace e2e-test-build-webhooks-xvsrq was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed"
4. At 01:39:28 (next sync run) the ns controller removed the namespace - "namespace_controller.go:171] Namespace has been deleted e2e-test-build-webhooks-xvsrq"
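
For anyone retracing step 2, here is a rough sketch (assuming a ready *rest.Config; the package and function names are mine) of enumerating what is still left inside a stuck namespace using the discovery and dynamic clients - the same kind of check that produces the "unexpected items still remain" message:

package nsdebug

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/discovery"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/rest"
)

// listLeftovers prints every object that still exists in the given namespace,
// across all namespaced resource types the cluster serves.
func listLeftovers(ctx context.Context, cfg *rest.Config, ns string) error {
    disc, err := discovery.NewDiscoveryClientForConfig(cfg)
    if err != nil {
        return err
    }
    dyn, err := dynamic.NewForConfig(cfg)
    if err != nil {
        return err
    }
    // Discovery may return partial results together with an error; only give
    // up if nothing was discovered at all.
    lists, err := disc.ServerPreferredNamespacedResources()
    if err != nil && len(lists) == 0 {
        return err
    }
    for _, l := range lists {
        gv, err := schema.ParseGroupVersion(l.GroupVersion)
        if err != nil {
            continue
        }
        for _, r := range l.APIResources {
            gvr := gv.WithResource(r.Name)
            objs, err := dyn.Resource(gvr).Namespace(ns).List(ctx, metav1.ListOptions{})
            if err != nil {
                continue // resource may not support list; skip it
            }
            for _, o := range objs.Items {
                fmt.Printf("still present: %s %s/%s\n", gvr.String(), ns, o.GetName())
            }
        }
    }
    return nil
}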


I have opened https://github.com/openshift/origin/pull/23557 to see why the pods cannot be deleted.

Comment 4 Lukasz Szaszkiewicz 2019-08-12 07:39:59 UTC
I'm going to assign the issue to the Node team as I haven't found anything suspicious on the server side, and I think it has something to do with how the kubelet or the underlying container runtime handles container creation/deletion/reporting. I think it is worth knowing why some containers sometimes need more time to be deleted.
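
A small, hypothetical helper (not part of openshift-tests; the package name, function name and timeouts are illustrative) that could be adapted to measure how long a pod actually takes to disappear after deletion:

package nsdebug

import (
    "context"
    "fmt"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/watch"
    "k8s.io/client-go/kubernetes"
)

// timePodRemoval deletes a pod and reports how long it takes for the object to
// actually vanish, which approximates the node-side (kubelet/runtime) deletion
// latency being questioned above.
func timePodRemoval(ctx context.Context, c kubernetes.Interface, ns, pod string) (time.Duration, error) {
    // Start watching before issuing the delete so the DELETED event cannot be missed.
    w, err := c.CoreV1().Pods(ns).Watch(ctx, metav1.ListOptions{
        FieldSelector: "metadata.name=" + pod,
    })
    if err != nil {
        return 0, err
    }
    defer w.Stop()

    start := time.Now()
    if err := c.CoreV1().Pods(ns).Delete(ctx, pod, metav1.DeleteOptions{}); err != nil {
        return 0, err
    }

    for ev := range w.ResultChan() {
        if ev.Type == watch.Deleted {
            return time.Since(start), nil
        }
    }
    return 0, fmt.Errorf("watch for pod %s/%s ended before a DELETED event", ns, pod)
}

Run against pods like the "pushbuild-*-build" ones mentioned below, it would show whether the delay sits in the kubelet-side cleanup.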


The issue can be easily reproduced by running the "TestWebhook" tests, for example: openshift-tests run openshift/conformance --dry-run | grep -E "\sTestWebhook\s" | openshift-tests run -f -
I have also attached the logs from a faulty run where "pushbuild-2-build" and "pushbuild-1-build" weren't removed right away.

Comment 5 Seth Jennings 2019-08-12 16:13:46 UTC

*** This bug has been marked as a duplicate of bug 1727090 ***

