Bug 1703875 - [stability] During e2e runs, some pods take multiple minutes to be terminated, causing namespace deletion test to flake 1/3 of the time [NEEDINFO]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: urgent
Target Milestone: ---
Target Release: 4.1.0
Assignee: Bradley Childs
QA Contact: Liang Xia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-04-29 01:15 UTC by Clayton Coleman
Modified: 2019-06-04 10:48 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:48:10 UTC
Target Upstream Version:
Flags: lxia: needinfo? (mawong)


Attachments
v=4 kubelet logs (70.81 KB, text/plain), 2019-05-02 21:10 UTC, Matthew Wong
test output (5.29 KB, text/plain), 2019-05-02 21:10 UTC, Matthew Wong
v=6 controller-manager logs (1.90 MB, text/plain), 2019-05-02 21:11 UTC, Matthew Wong


Links
Red Hat Product Errata RHBA-2019:0758, last updated 2019-06-04 10:48:17 UTC

Description Clayton Coleman 2019-04-29 01:15:29 UTC
About 1/3 to 1/4 of e2e-aws-serial runs fail due to the following test:

Namespaces [Serial] should ensure that all pods are removed when a namespace is deleted [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]

fail [k8s.io/kubernetes/test/e2e/apimachinery/namespace.go:161]: Expected error:
    <*errors.errorString | 0xc0002be3c0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
not to have occurred
		
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.1/324#openshift-tests-sig-api-machinery-namespaces-serial-should-ensure-that-all-pods-are-removed-when-a-namespace-is-deleted-conformance-suiteopenshiftconformanceserialminimal-suitek8s is one example, but this occurs in most jobs and sometimes flakes.

The controller manager had many lines like:

E0427 14:41:47.133745       1 namespace_controller.go:148] unexpected items still remain in namespace: e2e-tests-nsdeletetest-krlmt for gvr: /v1, Resource=pods

and it took more than 10 minutes for the pod or pods to be deleted, which indicates a possible node hang. This also occurs in non-serial tests; for instance, in this e2e-aws run:

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.1/327#openshift-tests-sig-cli-kubectl-client-k8sio-simple-pod-should-contain-last-line-of-the-log-suiteopenshiftconformanceparallel-suitek8s, the namespace e2e-tests-kubectl-ntk9f took more than 10 minutes to delete while the namespace controller was waiting for the pods to disappear.

Marking as urgent until we know why this is happening, whether it's due to a bad aggregated api or a more subtle bug.
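
For reference, the wait that is dragging on boils down to something like the following client-go sketch (illustrative only, assuming a current client-go API; the namespace name and polling parameters are made up, this is not the controller's or the test's actual code): delete the namespace, then poll until no pods remain in it.

```
// Illustrative sketch only: mirrors, in client-go terms, what the namespace
// controller and the conformance test end up waiting for. The namespace name
// and polling parameters are hypothetical.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()
	ns := "e2e-tests-nsdeletetest-example" // hypothetical namespace

	// Delete the namespace; the namespace controller then has to wait for every
	// pod in it to terminate before the namespace object itself can go away.
	if err := client.CoreV1().Namespaces().Delete(ctx, ns, metav1.DeleteOptions{}); err != nil {
		panic(err)
	}

	// Poll until no pods remain. If a pod is stuck in Terminating for 10+
	// minutes, this is exactly the wait that drags on and makes the test flake.
	err = wait.PollImmediate(5*time.Second, 10*time.Minute, func() (bool, error) {
		pods, err := client.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{})
		if err != nil {
			return false, nil // keep retrying on transient errors
		}
		return len(pods.Items) == 0, nil
	})
	fmt.Println("wait result:", err)
}
```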

Comment 1 Seth Jennings 2019-04-30 13:42:11 UTC
Seems like the pod is still trying to start, but the secret it needs has been deleted.

$ grep krlmt workers-journal  | grep -v event.go
Apr 27 14:39:35 ip-10-0-154-216 crio[922]: 2019-04-27T14:39:35Z [verbose] Add: e2e-tests-nsdeletetest-krlmt:test-pod:openshift-sdn:eth0 {"cniVersion":"0.3.1","interfaces":[{"name":"eth0","sandbox":"/proc/36874/ns/net"}],"ips":[{"version":"4","interface":0,"address":"10.129.2.42/23"}],"routes":[{"dst":"0.0.0.0/0","gw":"10.129.2.1"},{"dst":"224.0.0.0/4"},{"dst":"10.128.0.0/14"}],"dns":{}}
Apr 27 14:39:44 ip-10-0-154-216 crio[922]: 2019-04-27T14:39:44Z [verbose] Add: e2e-tests-nsdeletetest-krlmt:test-pod-uninitialized:openshift-sdn:eth0 {"cniVersion":"0.3.1","interfaces":[{"name":"eth0","sandbox":"/proc/37208/ns/net"}],"ips":[{"version":"4","interface":0,"address":"10.129.2.43/23"}],"routes":[{"dst":"0.0.0.0/0","gw":"10.129.2.1"},{"dst":"224.0.0.0/4"},{"dst":"10.128.0.0/14"}],"dns":{}}
Apr 27 14:39:45 ip-10-0-154-216 hyperkube[990]: E0427 14:39:45.175154     990 secret.go:194] Couldn't get secret e2e-tests-nsdeletetest-krlmt/default-token-bbjvq: secrets "default-token-bbjvq" not found
Apr 27 14:39:45 ip-10-0-154-216 crio[922]: 2019-04-27T14:39:45Z [verbose] Del: e2e-tests-nsdeletetest-krlmt:test-pod:openshift-sdn:eth0 {"cniVersion":"0.3.1","name":"openshift-sdn","type":"openshift-sdn"}
Apr 27 14:39:45 ip-10-0-154-216 hyperkube[990]: E0427 14:39:45.769866     990 secret.go:194] Couldn't get secret e2e-tests-nsdeletetest-krlmt/default-token-bbjvq: secrets "default-token-bbjvq" not found
Apr 27 14:39:46 ip-10-0-154-216 hyperkube[990]: E0427 14:39:46.874511     990 secret.go:194] Couldn't get secret e2e-tests-nsdeletetest-krlmt/default-token-bbjvq: secrets "default-token-bbjvq" not found
Apr 27 14:39:48 ip-10-0-154-216 hyperkube[990]: E0427 14:39:48.895958     990 secret.go:194] Couldn't get secret e2e-tests-nsdeletetest-krlmt/default-token-bbjvq: secrets "default-token-bbjvq" not found
Apr 27 14:39:52 ip-10-0-154-216 hyperkube[990]: E0427 14:39:52.911273     990 secret.go:194] Couldn't get secret e2e-tests-nsdeletetest-krlmt/default-token-bbjvq: secrets "default-token-bbjvq" not found
Apr 27 14:40:00 ip-10-0-154-216 hyperkube[990]: E0427 14:40:00.936053     990 secret.go:194] Couldn't get secret e2e-tests-nsdeletetest-krlmt/default-token-bbjvq: secrets "default-token-bbjvq" not found
Apr 27 14:40:16 ip-10-0-154-216 hyperkube[990]: E0427 14:40:16.987451     990 secret.go:194] Couldn't get secret e2e-tests-nsdeletetest-krlmt/default-token-bbjvq: secrets "default-token-bbjvq" not found
Apr 27 14:40:49 ip-10-0-154-216 hyperkube[990]: E0427 14:40:49.090840     990 secret.go:194] Couldn't get secret e2e-tests-nsdeletetest-krlmt/default-token-bbjvq: secrets "default-token-bbjvq" not found
Apr 27 14:41:48 ip-10-0-154-216 hyperkube[990]: E0427 14:41:48.111090     990 kubelet.go:1687] Unable to mount volumes for pod "test-pod-uninitialized_e2e-tests-nsdeletetest-krlmt(48403558-68fa-11e9-b949-0acc20960aea)": timeout expired waiting for volumes to attach or mount for pod "e2e-tests-nsdeletetest-krlmt"/"test-pod-uninitialized". list of unmounted volumes=[default-token-bbjvq]. list of unattached volumes=[default-token-bbjvq]; skipping pod
Apr 27 14:41:48 ip-10-0-154-216 hyperkube[990]: E0427 14:41:48.111173     990 pod_workers.go:190] Error syncing pod 48403558-68fa-11e9-b949-0acc20960aea ("test-pod-uninitialized_e2e-tests-nsdeletetest-krlmt(48403558-68fa-11e9-b949-0acc20960aea)"), skipping: timeout expired waiting for volumes to attach or mount for pod "e2e-tests-nsdeletetest-krlmt"/"test-pod-uninitialized". list of unmounted volumes=[default-token-bbjvq]. list of unattached volumes=[default-token-bbjvq]
Apr 27 14:41:48 ip-10-0-154-216 crio[922]: 2019-04-27T14:41:48Z [verbose] Del: e2e-tests-nsdeletetest-krlmt:test-pod-uninitialized:openshift-sdn:eth0 {"cniVersion":"0.3.1","name":"openshift-sdn","type":"openshift-sdn"}
Apr 27 14:41:50 ip-10-0-154-216 hyperkube[990]: E0427 14:41:50.391419     990 kubelet_pods.go:1088] Failed killing the pod "test-pod-uninitialized": failed to "KillContainer" for "nginx" with KillContainerError: "rpc error: code = Unknown desc = container with ID starting with 3d27d978c889a8b86c77e6c6661d8f65848ebcccca08fd4fab3f90e2e4ab30cc not found: ID does not exist"

Comment 3 Matthew Wong 2019-04-30 18:03:36 UTC
kubelet's syncPod will periodically try to mount/remount volumes by calling volumeManager.WaitForAttachAndMount(pod) unless the pod is terminated:
https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet.go#L1682

where "terminated" in this case is expected to mean "(pod.DeletionTimestamp != nil && notRunning(status.ContainerStatuses))":
https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet_pods.go#L864

But if volumeManager.WaitForAttachAndMount(pod) fails, syncPod returns an error and containerRuntime.SyncPod never gets called, which I think means the container status never gets updated (someone please confirm), so we will repeatedly call volumeManager.WaitForAttachAndMount(pod) and fail.

I think the solution is to loosen the condition for calling volumeManager.WaitForAttachAndMount(pod) from
(pod.DeletionTimestamp != nil && notRunning(status.ContainerStatuses))
to just
pod.DeletionTimestamp != nil
Even if the pod is still running, if the intention is to delete it there is no reason to ensure its volumes are mounted.
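
To make the two conditions concrete, here is a minimal self-contained sketch with stand-in types (these are not the real kubelet structures, and the function names are mine): `currentCondition` mirrors the existing "terminated" check, `proposedCondition` the loosened one that only looks at the deletion timestamp.

```
// Simplified sketch of the two checks discussed above, using stand-in types
// rather than the real kubelet structures.
package main

import (
	"fmt"
	"time"
)

type containerStatus struct {
	name    string
	running bool
}

type pod struct {
	deletionTimestamp *time.Time
	containerStatuses []containerStatus
}

// notRunning mirrors the helper referenced in kubelet_pods.go: true only when
// no container is still running.
func notRunning(statuses []containerStatus) bool {
	for _, s := range statuses {
		if s.running {
			return false
		}
	}
	return true
}

// currentCondition: the pod only counts as terminated (so WaitForAttachAndMount
// is skipped) once it is both marked for deletion and fully stopped.
func currentCondition(p pod) bool {
	return p.deletionTimestamp != nil && notRunning(p.containerStatuses)
}

// proposedCondition: the loosened check, skip waiting on volume mounts as soon
// as the pod is marked for deletion.
func proposedCondition(p pod) bool {
	return p.deletionTimestamp != nil
}

func main() {
	now := time.Now()
	p := pod{
		deletionTimestamp: &now,
		containerStatuses: []containerStatus{{name: "nginx", running: true}},
	}
	fmt.Println("terminated (current): ", currentCondition(p))  // false: container still running
	fmt.Println("terminated (proposed):", proposedCondition(p)) // true: deletion already requested
}
```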

Comment 4 Matthew Wong 2019-04-30 22:12:36 UTC
NVM all that, the case where deletionTimestamp isn't nil should already be caught a few lines above: https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet.go#L1577, so the WaitForAttachAndMount failures are not blocking anything.

From kubelet's perspective, the pod has no deletionTimestamp on it, so it never tries to kill the pod and instead keeps syncing it, ensuring volumes are mounted, etc.

In https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.1/327#openshift-tests-sig-cli-kubectl-client-k8sio-simple-pod-should-contain-last-line-of-the-log-suiteopenshiftconformanceparallel-suitek8s, there aren't any storage-related errors. 

There is this crio message just before/around the time the namespace is deleted; then there aren't any messages about this pod for the next ~10 minutes...
Apr 27 22:55:48 ip-10-0-143-119 crio[921]: 2019-04-27T22:55:48Z [verbose] Del: e2e-tests-kubectl-ntk9f:run-log-test:openshift-sdn:eth0 {"cniVersion":"0.3.1","name":"openshift-sdn","type":"openshift-sdn"}

Then unmount is finally attempted...
Apr 27 23:06:29 ip-10-0-143-119 hyperkube[987]: I0427 23:06:29.505175     987 reconciler.go:181] operationExecutor.UnmountVolume started for volume "default-token-qrgkn" (UniqueName: "kubernetes.io/secret/0bab8a36-6941-11e9-a0db-0add3357e498-default-token-qrgkn") pod "0bab8a36-6941-11e9-a0db-0add3357e498" (UID: "0bab8a36-6941-11e9-a0db-0add3357e498")

Comment 5 Matthew Wong 2019-05-02 21:10:07 UTC
Created attachment 1561998 [details]
v=4 kubelet logs

Reproduced the e2e-aws-serial case, attached are v=4 kubelet logs.

I think the issue is in kubelet when the pod is deleted right as its containers are starting. The e2e-aws-serial test is prone to this because it starts the pod essentially in parallel with deleting the namespace: https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/test/e2e/apimachinery/namespace.go#L148. kubelet sees the API pod deletion, but in the meantime the containers start, and then kubelet repeatedly refuses to delete the pod because the containers are running. The container doesn't die until 2+ minutes later; no idea why.

```
May 02 20:11:11 ip-10-0-131-11 hyperkube[84119]: I0502 20:11:11.314688   84119 kubelet.go:1915] SyncLoop (ADD, "api"): "test-pod-uninitialized_e2e-tests-nsdeletetest-pr6br(6df570c9-6d16-11e9-91ed-029042c002c0)"

May 02 20:11:18 ip-10-0-131-11 hyperkube[84119]: I0502 20:11:18.819251   84119 kubelet.go:1931] SyncLoop (DELETE, "api"): "test-pod-uninitialized_e2e-tests-nsdeletetest-pr6br(6df570c9-6d16-11e9-91ed-029042c002c0)"

May 02 20:11:19 ip-10-0-131-11 hyperkube[84119]: I0502 20:11:19.093283   84119 kubelet.go:1960] SyncLoop (PLEG): "test-pod-uninitialized_e2e-tests-nsdeletetest-pr6br(6df570c9-6d16-11e9-91ed-029042c002c0)", event: &pleg.PodLifecycleEvent{ID:"6df570c9-6d16-11e9-91ed-029042c002c0", Type:"ContainerStarted", Data:"25ca27efd02b23dfb16eb38293a045f336e43fb8a61453ff282479a8c0ce7c03"}

May 02 20:11:19 ip-10-0-131-11 hyperkube[84119]: I0502 20:11:19.104436   84119 kubelet_pods.go:912] Pod "test-pod-uninitialized_e2e-tests-nsdeletetest-pr6br(6df570c9-6d16-11e9-91ed-029042c002c0)" is terminated, but some containers have not been cleaned up: {ID:{Type:cri-o ID:25ca27efd02b23dfb16eb38293a045f336e43fb8a61453ff282479a8c0ce7c03} Name:nginx State:running CreatedAt:2019-05-02 20:11:18.923659089 +0000 UTC StartedAt:2019-05-02 20:11:18.950548473 +0000 UTC FinishedAt:0001-01-01 00:00:00 +0000 UTC ExitCode:0 Image:k8s.gcr.io/pause:3.1 ImageID:k8s.gcr.io/pause@sha256:59eec8837a4d942cc19a52b8c09ea75121acc38114a2c68b98983ce9356b8610 Hash:4050409653 RestartCount:0 Reason: Message:}

May 02 20:13:23 ip-10-0-131-11 hyperkube[84119]: I0502 20:13:23.357683   84119 kubelet.go:1960] SyncLoop (PLEG): "test-pod-uninitialized_e2e-tests-nsdeletetest-pr6br(6df570c9-6d16-11e9-91ed-029042c002c0)", event: &pleg.PodLifecycleEvent{ID:"6df570c9-6d16-11e9-91ed-029042c002c0", Type:"ContainerDied", Data:"25ca27efd02b23dfb16eb38293a045f336e43fb8a61453ff282479a8c0ce7c03"}
```

Namespace was deleted around 16:11:13, timed out after 1m30s:
```
STEP: Creating an uninitialized pod in the namespace
May  2 16:11:11.320: INFO: error from create uninitialized namespace: <nil>
STEP: Deleting the namespace
STEP: Waiting for the namespace to be removed.
May  2 16:12:43.409: INFO: Unexpected error occurred: timed out waiting for the condition
```
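
The race can be approximated with a small client-go sketch (again illustrative only, assuming a current client-go API; the namespace, pod spec, and timings are hypothetical and this is not the upstream test code): create the pod and delete its namespace almost back to back, then give the namespace roughly 1m30s to disappear, as the conformance test does.

```
// Illustrative reproduction sketch of the race: create a pod and delete its
// namespace almost immediately, then wait ~1m30s for the namespace to vanish.
// Not the upstream test code; names and timings are hypothetical.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()
	ns := "nsdeletetest-race" // hypothetical namespace, assumed to already exist

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "test-pod"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{Name: "nginx", Image: "k8s.gcr.io/pause:3.1"}},
		},
	}

	// Create the pod, then delete the namespace right away: kubelet can observe
	// the pod DELETE while the container is still being started.
	if _, err := client.CoreV1().Pods(ns).Create(ctx, pod, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
	if err := client.CoreV1().Namespaces().Delete(ctx, ns, metav1.DeleteOptions{}); err != nil {
		panic(err)
	}

	// The conformance test budgets roughly 1m30s for the namespace to disappear;
	// if the pod lingers in Terminating, this times out and the test flakes.
	err = wait.PollImmediate(2*time.Second, 90*time.Second, func() (bool, error) {
		_, err := client.CoreV1().Namespaces().Get(ctx, ns, metav1.GetOptions{})
		return apierrors.IsNotFound(err), nil
	})
	fmt.Println("namespace removal result:", err)
}
```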

Comment 6 Matthew Wong 2019-05-02 21:10:42 UTC
Created attachment 1561999 [details]
test output

Comment 7 Matthew Wong 2019-05-02 21:11:34 UTC
Created attachment 1562000 [details]
v=6 controller-manager logs

Comment 8 Matthew Wong 2019-05-02 21:34:54 UTC
To sum up, I still don't know what is happening, but now there are more verbose logs.

> Seem like the pod is still trying to start but the secret it needs has been deleted.

We can rule this out, because those log messages happen after the pod's container has started, meaning the *initial* mount succeeded. The logs refer only to the periodic *re*mounts, which continue for a while but are not the reason the pod can't be killed; kubelet is refusing to kill the pod for some other reason (it doesn't see a deletionTimestamp on the pod and/or the containers are still running).

Comment 9 Matthew Wong 2019-05-03 01:16:44 UTC
Opened https://bugzilla.redhat.com/show_bug.cgi?id=1703875 to unblock serial job.

What I've missed so far is that the failing pod is always "uninitialized." The test seems to assume the pod won't start because it's uninitialized, so it only gives 1m30s from the moment the pod is `Created` for the pod to start and then stop, which is too optimistic.

This patch does not address the random failures in other non-serial tests; those still need digging into.

Comment 10 Matthew Wong 2019-05-03 01:19:23 UTC
https://github.com/openshift/origin/pull/22755 ***

sorry for spam.

Comment 16 Liang Xia 2019-05-13 03:31:17 UTC
Checked builds 780 to 784 in the CI job https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.1/
and did not see the e2e test fail or flake.

Move to verified.

Comment 18 errata-xmlrpc 2019-06-04 10:48:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

