1814282 – Storage e2es leaving namespaces/pods around

Bug 1814282 - Storage e2es leaving namespaces/pods around

Summary: Storage e2es leaving namespaces/pods around

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Storage
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	4.6.0
Assignee:	OpenShift Storage Bugzilla Bot
QA Contact:	Wei Duan
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1814241 (view as bug list)
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2020-03-17 14:47 UTC by Ben Parees
Modified:	2022-08-25 20:56 UTC (History)
CC List:	11 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-08-25 20:56:02 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift origin pull 24981	0	None	closed	Bug 1814282: Storage e2es leaving namespaces/pods around	2020-10-19 21:56:09 UTC

Description Ben Parees 2020-03-17 14:47:48 UTC

Description of problem:
After running our e2e suite on a cluster, various e2e namespaces+pods are left around from the storage tests, in a "terminating" state:


oc get projects | grep e2e |  awk '{print $1}'
e2e-csi-mock-volumes-3476
e2e-csi-mock-volumes-7010
e2e-provisioning-396
e2e-provisioning-6966
e2e-provisioning-7022
e2e-provisioning-7859
e2e-snapshotting-2162
e2e-snapshotting-6772
e2e-snapshotting-8998
e2e-snapshotting-9284
e2e-volume-expand-2690



These tests need to be fixed to fully clean themselves up if they are leaving resources around with finalizers which prevent the namespace from being deleted by the general cleanup logic.


This is being set to urgent as it's unacceptable for namespaces to get  hung in terminating, but if investigation determines the issue is a rogue finalizer that's introduced by the test itself, the severity can be reduced, but we still need it to be possible to run e2es w/o leaving junk around on the cluster.

Comment 1 Christian Huffman 2020-03-17 14:58:51 UTC

This may be a duplicate to https://bugzilla.redhat.com/show_bug.cgi?id=1808123

Comment 2 Christian Huffman 2020-03-17 20:23:47 UTC

I've been running e2e conformance/parallel tests on my cluster, and the only namespace I see left behind are due to VolumeSnapshots that aren't cleaned up. The PR in https://bugzilla.redhat.com/show_bug.cgi?id=1808123 addresses that issue.

Is it possible to be given access to a cluster that has these namespaces still around?

Comment 5 Ben Parees 2020-03-18 14:59:25 UTC

given the focus around this, can we get regular updates on the efforts to address it?

Thanks

Comment 6 Christian Huffman 2020-03-18 15:04:39 UTC

The provisioning/snapshotting namespaces is caused by a known issue in https://bugzilla.redhat.com/show_bug.cgi?id=1808123 . There's an open PR at https://github.com/kubernetes-csi/external-snapshotter/pull/275 which will address this, and I'm working on getting it merged upstream.

I'm not certain the source of the pvc-protection Finalizer remaining on the PVCs/Pods yet, and have reached out to other members of the team.

Comment 8 Hemant Kumar 2020-03-18 18:34:32 UTC

The csi-mock pod is not deletable because it looks like CSI driver gets deleted before pod had a chance to be deleted. This causes `UnmountVolume` calls to fail with grpc errors and hence pod remains stuck in terminating stuck.

It looks like we are not attempting to even delete the pod when this happens. Some parts of e2e do rely on namespace deletion triggering pod deletion but this does not work for pods that are using CSI volumes because driver will be removed before namespace can be deleted and hence pod deletion will not work.

I am looking into which pod specifically did not get deleted via e2e but I think we should in general reconsider how cleanup works for these tests.

Comment 9 Hemant Kumar 2020-03-18 23:47:11 UTC

for csi-mock tests the test run is very interesting:

The pod that was started for testing was started on:

Mar 12 22:07:53 ip-10-0-130-231 hyperkube[1644]: I0312 22:07:53.680591    1644 volume_manager.go:387] All volumes are attached and mounted for pod "pvc-volume-tester-ccdlh_e2e-csi-mock-volumes-7004(89980b6e-10a9-4f1c-8ba5-bf8e5f01ba6a)"


But it does not get deleted until:

Mar 14 04:05:06 ip-10-0-130-231 hyperkube[1644]: I0314 04:05:06.151453    1644 kubelet.go:1924] SyncLoop (DELETE, "api"): "pvc-volume-tester-ccdlh_e2e-csi-mock-volumes-7004(89980b6e-10a9-4f1c-8ba5-bf8e5f01ba6a)"

So something tells me the e2e binary crashed without running AfterEach block properly and then someone tried to delete the namespace manually? Looking at the test it does look like the test was OOM killed:

error: 54 fail, 904 pass, 1332 skip (1h32m47s)
2020/03/18 07:37:02 Container test in pod endurance-e2e-aws failed, exit code 1, reason OOMKilled
2020/03/18 07:37:09 Copied 263.17MB of artifacts from endurance-e2e-aws to /logs/artifacts/endurance-test
2020/03/18 07:37:09 No custom metadata found and prow metadata already exists. Not updating the metadata.
2020/03/18 07:37:10 Ran for 1h36m16s
error: could not run steps: step endurance-test failed: template pod "endurance-e2e-aws" failed: the pod ci-op-r3xrdw8b/endurance-e2e-aws failed after 1h34m40s (failed containers: test): ContainerFailed one or more containers exited

Container test exited with code 1, reason OOMKilled


For CSI e2e tests since driver and pods that use CSI volume are in same namespace, if you delete the namespace - the driver could be deleted before the pod that was using the volume and hence Unmount for pod will fail and pod will fail to delete. This is working as expected at least for mock tests.  For snapshot tests there is a separate BZ - https://bugzilla.redhat.com/show_bug.cgi?id=1808123 


I do not think this presents a critical bug in openshift (apart from snapshot stuff).

Comment 10 Ben Parees 2020-03-19 01:00:15 UTC

it looks like the endurance e2e template was using lower resource settings than the general e2e template (the general e2e must have been bumped at some point).  I will raise the endurance template accordingly which may help here.

https://github.com/openshift/release/pull/7773

Can you provide steps to clean up the endurance cluster?  I would prefer not to tear it down if possible.

Comment 11 Clayton Coleman 2020-03-19 01:18:11 UTC

How would a customer recover a broken node like this?  If a CSI driver got evicted, what would we recommend?  If the test is launching a CSI driver as a pod, it should stop doing that and instead use a daemonset on the node.  The daemonset should declare a dependency on the test pods (or some other artifact that requires them to be torn down before the CSI driver is torn down.

Comment 13 Jan Safranek 2020-03-19 09:57:48 UTC

> How would a customer recover a broken node like this?

Any serious CSI driver runs daemonsets + appropriate priority class. In e2e we use csi-mock and csi-hostpath as regular pods though. This should be fixed.

Comment 14 Ryan Phillips 2020-03-19 15:28:25 UTC

*** Bug 1814393 has been marked as a duplicate of this bug. ***

Comment 16 W. Trevor King 2020-04-07 19:49:30 UTC

*** Bug 1814241 has been marked as a duplicate of this bug. ***

Comment 17 W. Trevor King 2020-04-07 19:56:36 UTC

Marking this as an update blocker, because it can leave nodes in SchedulingDisabled and block further deployment rollout (more on that in bug 1814241). Repeating-ish the impact-statement request from [1]:

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context. The UpgradeBlocker flag has been added to this bug, it will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking for which cluster flavors?
Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact? Is it serious enough to warrant blocking edges?
Up to 2 minute disruption in edge routing
Up to 90seconds of API downtime
etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
Issue resolves itself after five minutes
Admin uses oc to fix things
Admin must SSH to hosts, restore from backups, or other non standard admin activities

In [2], Ben lays out the impact of the SchedulingDisabled symptom:

Worker nodes become unscheduleable, meaning some daemonsets never achieve readiness again and some cluster capacity is unavailable. Not sure if this could also affect a master node, which would have a much bigger impact.

But the stuck-in-terminating namespace issue may have other impacts beyond SchedulingDisabled.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1814241#c1
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1814241#c2

Comment 18 Hemant Kumar 2020-04-08 04:18:02 UTC

1. Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking for which cluster flavors?

This is entirely a e2e setup issue and afaik no user should be impacted.

2. What is the impact?  Is it serious enough to warrant blocking edges?

Same as above

3. How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

The remediation is to delete the pod that is stuck in "terminating" state. 

But anyways to fix the e2e issues I have opened - https://github.com/kubernetes/kubernetes/pull/89944

Just to be clear - this will not *completely* fix the issue but what it does is, gives testsuite to cleanup itself if e2e process is killed.

Comment 19 Lalatendu Mohanty 2020-04-08 12:47:29 UTC

As this issue not going to impact customers and only have impact on e2e tests, I am adding testblocker keyword and removing upgradeblocker

Comment 20 Lalatendu Mohanty 2020-04-08 12:50:54 UTC

As this issue not going to impact customers and only have impact on e2e tests, I am adding testblocker keyword and removing upgradeblocker

Comment 21 Hemant Kumar 2020-04-08 15:52:21 UTC

Just to clarify - you have to force delete the pod in terminating state via:

~> oc delete pod <> --grace-period=0 --force

Comment 23 Jan Safranek 2020-04-09 13:19:21 UTC

To summarize our investigation, it seems that openshift-tests died while running some CSI tests, without running defer() or AfterEach of the tests. Later on, something deleted e2e namespaces (it could be even openshift-tests itself reacting to SIGINT - its AfterSuite just blindly deletes namespaces without defers or AfterEach to be called, see https://github.com/onsi/ginkgo/issues/222).

Unfortunately, deleting CSI e2e namespaces can (and will) kill CSI driver pods in the namespace *before* test pods that use volumes. Kubelet can't unmount CSI volumes of these test pods and keeps them Terminating forever -> whole namespace is Terminating forever && node cannot be fully drained.

We considered using finalizers and a controller that would coordinate deletion of pods in a namespace so test pods are deleted first and CSI drivers after that. It ended up quite complicated and possibly error prone for a simple e2e test. Similarly complicated is using two namespaces (CSI driver + tests) and deleting them in defined order.

Ideas how to kill pods in a namespace in defined order are welcome. Ensuring that openshift-tests does not die looks like the best solution.

Comment 24 Jan Safranek 2020-04-21 11:31:09 UTC

I checked endurance-4.4 jobs, there are no controller-manager nor API server logs available in the job results, so we can't check what went wrong during the tests.

Adding must-gather (and other log gathering) to endurance jobs: https://github.com/openshift/release/pull/8448

Comment 25 Hemant Kumar 2020-05-01 21:27:22 UTC

This is kind of *very* hard to fix at least in a way that will ensure that testpods are *always* deleted before driver pod dies.  I have tried several more things today:

1. I put a finalizer on statefulset/daemonset that creates the driver pod, but when corresponding namespace is deleted then even though statefulset deletion is blocked, the driver pod still gets deleted. So that does not fixes the problem.
2. Next we put finalizer on driver pod directly. But when namespace is deleted then container that runs driver still gets killed even though pod has a finalizer - https://github.com/kubernetes/kubernetes/issues/73584

So, what we can do next? May be we can put a pre-stop hook in the pod and have it check for finalizer on running driver pod before allowing container to be stopped. This will require I think rebuilding some of driver images.

Comment 26 Hemant Kumar 2020-05-01 21:33:53 UTC

Looking at Endurance run tests - https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/endurance-e2e-aws-4.4/42 I can still see that run is being interrupted.

Received interrupt.  Running AfterSuite...
^C again to terminate immediately
Apr 29 06:55:52.710: INFO: Running AfterSuite actions on all nodes
Apr 29 06:55:52.710: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-provisioning-3002" for this suite.
Apr 29 06:55:52.814: INFO: Running AfterSuite actions on node 1

Do we know who/what is sending those interrupts?

Comment 27 Ben Parees 2020-05-02 00:05:31 UTC

> Do we know who/what is sending those interrupts?



discussed in slack but putting here for anyone else who comes along:  this appears to be the test suite hitting the 15 minute ginkgo suite timeout.
https://coreos.slack.com/archives/CBQHQFU0N/p1588377361358900?thread_ts=1588198750.349800&cid=CBQHQFU0N

Comment 28 Ted Yu 2020-05-02 20:10:22 UTC

Is it possible to reduce the number of ginkgo.It's in the underlying TestSuite so that timeout is not reached ?

Comment 29 Hemant Kumar 2020-05-05 04:06:04 UTC

openshift-tests runs each It() in its own process, so reducing the number of Its is not going to help I think. It is just that depending on cluster, the timings can be variable..

BTW - I have pushed my changes which uses two namespaces. One for driver and another for test pods. The problem is, it is still somewhat tricky. The thing is - AfterEach and AfterSuite race with each other and because same function can be called via AfterEach and AfterSuite as well, one can set state of the suite to a state that makes another one to panic. This in itself is not a show stopper but it looks ugly. Also same cleanup function might be called twice because of the race. 

Ideally whenever AfterEach and AfterSuite are going to modify framework.Framework or any other shared state, they should be protected by locks. But that is very invasive change and might break other things. 

Here is my changes - https://github.com/gnufied/kubernetes/commits/fix-csi-e2e-orphans

Comment 30 Hemant Kumar 2020-05-06 20:15:11 UTC

Alright based on above branch - I have another working PR that fixes this issue for good - https://github.com/kubernetes/kubernetes/pull/90773 . Once it merges in upstream, I will backport this to origin.

Comment 31 Hemant Kumar 2020-05-27 15:44:01 UTC

This should not be a 4.5 blocker. The PR I have opened while does remove csi-mock volume pods/namespaces, there seems to be some other left over pods:

> fail [github.com/openshift/origin/test/extended/operators/cluster.go:65]: May 20 03:58:17.618: Resources still in terminating state: ns/e2e-snapshotting-4376, ns/e2e-snapshotting-4376, ns/e2e-snapshotting-4376, ns/e2e-snapshotting-4376, pod/execpoddwjws, pod/execpoddwjws, pod/execpod2tczc, ns/e2e-snapshotting-4376, ns/e2e-snapshotting-4376, ns/e2e-snapshotting-4376, ns/e2e-snapshotting-4376, ns/e2e-snapshotting-4376, ns/e2e-snapshotting-4376, ns/e2e-snapshotting-4376, ns/e2e-snapshotting-4376, ns/e2e-snapshotting-4376, pod/execpodg79kj, pod/execpodg79kj, pod/execpodg79kj, ns/e2e-snapshotting-4376, ns/e2e-snapshotting-4376, ns/e2e-
snapshotting-4376, ns/e2e-snapshotting-4376, ns/e2e-snapshotting-4376, ns/e2e-snapshotting-4376, ns/e2e-snapshotting-4376

I am going to investigate those and fix them too. We could merge my openshift PR (after dropping the upstream commit which got dropped) and it will be slightly better than before and file a new bug for snapshotting namespaces being left around.

Comment 32 Lalatendu Mohanty 2020-06-02 12:39:46 UTC

Removing the Upgrades keyword as it does not impact customers.

Comment 33 W. Trevor King 2020-06-02 16:29:58 UTC

I'd added Upgrades in comment 17, because "the e2e suite leaks pods that can block drains which can block updates" applies to anyone who runs the e2e suite.  And there's nothing that says "only we can run the e2e suite", customers can run it on their clusters too.  But I'm fine if we want to address that via MCO failed-to-drain alerting instead of in each bug that could lead to drain failures.

Comment 34 Hemant Kumar 2020-06-02 16:48:14 UTC

I have splitted the PR that fixes defer and AfterSuite race in another PR in upstream - https://github.com/kubernetes/kubernetes/pull/91689

I need to update openshift PR once it is merged.

Comment 35 Hemant Kumar 2020-06-02 22:28:19 UTC

I investigated snapshot namespace being left in terminating state and filed - https://github.com/kubernetes/kubernetes/issues/91702 . I am just trying to confirm if they already fixed this in upstream or we will make a fix.

Comment 37 Hemant Kumar 2020-06-18 16:25:31 UTC

This PR should fix e2e-volume-expand and e2e-mock-volume namespaces left around. We will tackle snapshot e2es via https://bugzilla.redhat.com/show_bug.cgi?id=1848633

Comment 38 Hemant Kumar 2020-06-18 21:59:24 UTC

We are hoping that https://github.com/openshift/origin/pull/24981 will merge soon and we will be able to close this bug.

Comment 41 Chao Yang 2020-07-06 01:17:38 UTC

Passed on 4.6.0-0.nightly-2020-06-30-020342
Create namespace and create some volumesnapshot. Then delete ns successfully

Note You need to log in before you can comment on or make changes to this bug.