Bug 1703875
| Field | Value |
| --- | --- |
| Summary | [stability] During e2e runs, some pods take multiple minutes to be terminated, causing namespace deletion test to flake 1/3 of the time |
| Product | OpenShift Container Platform |
| Reporter | Clayton Coleman <ccoleman> |
| Component | Storage |
| Assignee | Bradley Childs <bchilds> |
| Status | CLOSED ERRATA |
| QA Contact | Liang Xia <lxia> |
| Severity | urgent |
| Priority | medium |
| Version | 4.1.0 |
| Target Release | 4.1.0 |
| CC | aos-bugs, aos-storage-staff, bbennett, chaoyang, gblomqui, jokerman, lxia, mmccomas, wehe |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | If docs needed, set a value |
| Type | Bug |
| Last Closed | 2019-06-04 10:48:10 UTC |
Description
Clayton Coleman
2019-04-29 01:15:29 UTC
Seems like the pod is still trying to start, but the secret it needs has been deleted.

```
$ grep krlmt workers-journal | grep -v event.go
Apr 27 14:39:35 ip-10-0-154-216 crio[922]: 2019-04-27T14:39:35Z [verbose] Add: e2e-tests-nsdeletetest-krlmt:test-pod:openshift-sdn:eth0 {"cniVersion":"0.3.1","interfaces":[{"name":"eth0","sandbox":"/proc/36874/ns/net"}],"ips":[{"version":"4","interface":0,"address":"10.129.2.42/23"}],"routes":[{"dst":"0.0.0.0/0","gw":"10.129.2.1"},{"dst":"224.0.0.0/4"},{"dst":"10.128.0.0/14"}],"dns":{}}
Apr 27 14:39:44 ip-10-0-154-216 crio[922]: 2019-04-27T14:39:44Z [verbose] Add: e2e-tests-nsdeletetest-krlmt:test-pod-uninitialized:openshift-sdn:eth0 {"cniVersion":"0.3.1","interfaces":[{"name":"eth0","sandbox":"/proc/37208/ns/net"}],"ips":[{"version":"4","interface":0,"address":"10.129.2.43/23"}],"routes":[{"dst":"0.0.0.0/0","gw":"10.129.2.1"},{"dst":"224.0.0.0/4"},{"dst":"10.128.0.0/14"}],"dns":{}}
Apr 27 14:39:45 ip-10-0-154-216 hyperkube[990]: E0427 14:39:45.175154 990 secret.go:194] Couldn't get secret e2e-tests-nsdeletetest-krlmt/default-token-bbjvq: secrets "default-token-bbjvq" not found
Apr 27 14:39:45 ip-10-0-154-216 crio[922]: 2019-04-27T14:39:45Z [verbose] Del: e2e-tests-nsdeletetest-krlmt:test-pod:openshift-sdn:eth0 {"cniVersion":"0.3.1","name":"openshift-sdn","type":"openshift-sdn"}
Apr 27 14:39:45 ip-10-0-154-216 hyperkube[990]: E0427 14:39:45.769866 990 secret.go:194] Couldn't get secret e2e-tests-nsdeletetest-krlmt/default-token-bbjvq: secrets "default-token-bbjvq" not found
Apr 27 14:39:46 ip-10-0-154-216 hyperkube[990]: E0427 14:39:46.874511 990 secret.go:194] Couldn't get secret e2e-tests-nsdeletetest-krlmt/default-token-bbjvq: secrets "default-token-bbjvq" not found
Apr 27 14:39:48 ip-10-0-154-216 hyperkube[990]: E0427 14:39:48.895958 990 secret.go:194] Couldn't get secret e2e-tests-nsdeletetest-krlmt/default-token-bbjvq: secrets "default-token-bbjvq" not found
Apr 27 14:39:52 ip-10-0-154-216 hyperkube[990]: E0427 14:39:52.911273 990 secret.go:194] Couldn't get secret e2e-tests-nsdeletetest-krlmt/default-token-bbjvq: secrets "default-token-bbjvq" not found
Apr 27 14:40:00 ip-10-0-154-216 hyperkube[990]: E0427 14:40:00.936053 990 secret.go:194] Couldn't get secret e2e-tests-nsdeletetest-krlmt/default-token-bbjvq: secrets "default-token-bbjvq" not found
Apr 27 14:40:16 ip-10-0-154-216 hyperkube[990]: E0427 14:40:16.987451 990 secret.go:194] Couldn't get secret e2e-tests-nsdeletetest-krlmt/default-token-bbjvq: secrets "default-token-bbjvq" not found
Apr 27 14:40:49 ip-10-0-154-216 hyperkube[990]: E0427 14:40:49.090840 990 secret.go:194] Couldn't get secret e2e-tests-nsdeletetest-krlmt/default-token-bbjvq: secrets "default-token-bbjvq" not found
Apr 27 14:41:48 ip-10-0-154-216 hyperkube[990]: E0427 14:41:48.111090 990 kubelet.go:1687] Unable to mount volumes for pod "test-pod-uninitialized_e2e-tests-nsdeletetest-krlmt(48403558-68fa-11e9-b949-0acc20960aea)": timeout expired waiting for volumes to attach or mount for pod "e2e-tests-nsdeletetest-krlmt"/"test-pod-uninitialized". list of unmounted volumes=[default-token-bbjvq]. list of unattached volumes=[default-token-bbjvq]; skipping pod
Apr 27 14:41:48 ip-10-0-154-216 hyperkube[990]: E0427 14:41:48.111173 990 pod_workers.go:190] Error syncing pod 48403558-68fa-11e9-b949-0acc20960aea ("test-pod-uninitialized_e2e-tests-nsdeletetest-krlmt(48403558-68fa-11e9-b949-0acc20960aea)"), skipping: timeout expired waiting for volumes to attach or mount for pod "e2e-tests-nsdeletetest-krlmt"/"test-pod-uninitialized". list of unmounted volumes=[default-token-bbjvq]. list of unattached volumes=[default-token-bbjvq]
Apr 27 14:41:48 ip-10-0-154-216 crio[922]: 2019-04-27T14:41:48Z [verbose] Del: e2e-tests-nsdeletetest-krlmt:test-pod-uninitialized:openshift-sdn:eth0 {"cniVersion":"0.3.1","name":"openshift-sdn","type":"openshift-sdn"}
Apr 27 14:41:50 ip-10-0-154-216 hyperkube[990]: E0427 14:41:50.391419 990 kubelet_pods.go:1088] Failed killing the pod "test-pod-uninitialized": failed to "KillContainer" for "nginx" with KillContainerError: "rpc error: code = Unknown desc = container with ID starting with 3d27d978c889a8b86c77e6c6661d8f65848ebcccca08fd4fab3f90e2e4ab30cc not found: ID does not exist"
```

Kubelet's syncPod will periodically try to mount/remount volumes by calling volumeManager.WaitForAttachAndMount(pod) unless the pod is terminated: https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet.go#L1682, where "terminated" in this case we expect to mean `(pod.DeletionTimestamp != nil && notRunning(status.ContainerStatuses))`: https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet_pods.go#L864

But if volumeManager.WaitForAttachAndMount(pod) fails, syncPod returns an error and containerRuntime.SyncPod never gets called, which I think means the container status never gets updated (someone please confirm), so we will repeatedly try and fail to volumeManager.WaitForAttachAndMount(pod). I think the solution is to loosen the condition for calling volumeManager.WaitForAttachAndMount(pod) from `(pod.DeletionTimestamp != nil && notRunning(status.ContainerStatuses))` to just `pod.DeletionTimestamp != nil`: even if the pod is still running, if the intention is to delete it, there is no reason to ensure its volumes are mounted.

Never mind all that: the case where deletionTimestamp isn't nil should already be caught a few lines above (https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet.go#L1577), so the WaitForAttachAndMount failures are not blocking anything. From kubelet's perspective the pod has no deletionTimestamp on it, so it never tries to kill the pod and instead keeps syncing it, ensuring volumes are mounted, etc.
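For readers following along, here is a minimal sketch of the termination check being discussed. It is a simplification for illustration only, not the actual kubelet source linked above; the package and function names are assumptions.

```go
// Package sketch illustrates (in simplified form) the kubelet condition
// discussed above: a pod only counts as "terminated" when it has a
// deletionTimestamp AND none of its containers are still running.
package sketch

import (
	v1 "k8s.io/api/core/v1"
)

// notRunning returns true only when every container status reports a
// non-running state (Waiting or Terminated).
func notRunning(statuses []v1.ContainerStatus) bool {
	for _, s := range statuses {
		if s.State.Terminated == nil && s.State.Waiting == nil {
			return false // at least one container is still running
		}
	}
	return true
}

// podIsTerminated mirrors the condition referenced in the comment above:
// both the deletionTimestamp and the "no running containers" check must
// hold before volume (re)mounting is skipped.
func podIsTerminated(pod *v1.Pod, statuses []v1.ContainerStatus) bool {
	return pod.DeletionTimestamp != nil && notRunning(statuses)
}
```

Both conditions must hold; per the analysis above, in this failure kubelet never even sees a deletionTimestamp on the pod, so the condition stays false and the pod keeps being synced as if it were live.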
In https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.1/327#openshift-tests-sig-cli-kubectl-client-k8sio-simple-pod-should-contain-last-line-of-the-log-suiteopenshiftconformanceparallel-suitek8s there aren't any storage-related errors. There is this crio message just before/around when the namespace is deleted, then there aren't any messages about this pod for the next ~10 minutes:

```
Apr 27 22:55:48 ip-10-0-143-119 crio[921]: 2019-04-27T22:55:48Z [verbose] Del: e2e-tests-kubectl-ntk9f:run-log-test:openshift-sdn:eth0 {"cniVersion":"0.3.1","name":"openshift-sdn","type":"openshift-sdn"}
```

Then unmount is finally attempted:

```
Apr 27 23:06:29 ip-10-0-143-119 hyperkube[987]: I0427 23:06:29.505175 987 reconciler.go:181] operationExecutor.UnmountVolume started for volume "default-token-qrgkn" (UniqueName: "kubernetes.io/secret/0bab8a36-6941-11e9-a0db-0add3357e498-default-token-qrgkn") pod "0bab8a36-6941-11e9-a0db-0add3357e498" (UID: "0bab8a36-6941-11e9-a0db-0add3357e498")
```

Created attachment 1561998 [details]
v=4 kubelet logs

Reproduced the e2e-aws-serial case; attached are v=4 kubelet logs. I think the issue is in kubelet when the pod is deleted right as its containers are starting.

The e2e-aws-serial test is prone to this because it starts the pod basically in parallel with deleting the namespace: https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/test/e2e/apimachinery/namespace.go#L148. Kubelet sees the API pod deletion, but in the meantime the containers start, and kubelet then repeatedly refuses to delete the pod because its containers are running. The container doesn't die until 2+ minutes later; no idea why.

```
May 02 20:11:11 ip-10-0-131-11 hyperkube[84119]: I0502 20:11:11.314688 84119 kubelet.go:1915] SyncLoop (ADD, "api"): "test-pod-uninitialized_e2e-tests-nsdeletetest-pr6br(6df570c9-6d16-11e9-91ed-029042c002c0)"
May 02 20:11:18 ip-10-0-131-11 hyperkube[84119]: I0502 20:11:18.819251 84119 kubelet.go:1931] SyncLoop (DELETE, "api"): "test-pod-uninitialized_e2e-tests-nsdeletetest-pr6br(6df570c9-6d16-11e9-91ed-029042c002c0)"
May 02 20:11:19 ip-10-0-131-11 hyperkube[84119]: I0502 20:11:19.093283 84119 kubelet.go:1960] SyncLoop (PLEG): "test-pod-uninitialized_e2e-tests-nsdeletetest-pr6br(6df570c9-6d16-11e9-91ed-029042c002c0)", event: &pleg.PodLifecycleEvent{ID:"6df570c9-6d16-11e9-91ed-029042c002c0", Type:"ContainerStarted", Data:"25ca27efd02b23dfb16eb38293a045f336e43fb8a61453ff282479a8c0ce7c03"}
May 02 20:11:19 ip-10-0-131-11 hyperkube[84119]: I0502 20:11:19.104436 84119 kubelet_pods.go:912] Pod "test-pod-uninitialized_e2e-tests-nsdeletetest-pr6br(6df570c9-6d16-11e9-91ed-029042c002c0)" is terminated, but some containers have not been cleaned up: {ID:{Type:cri-o ID:25ca27efd02b23dfb16eb38293a045f336e43fb8a61453ff282479a8c0ce7c03} Name:nginx State:running CreatedAt:2019-05-02 20:11:18.923659089 +0000 UTC StartedAt:2019-05-02 20:11:18.950548473 +0000 UTC FinishedAt:0001-01-01 00:00:00 +0000 UTC ExitCode:0 Image:k8s.gcr.io/pause:3.1 ImageID:k8s.gcr.io/pause@sha256:59eec8837a4d942cc19a52b8c09ea75121acc38114a2c68b98983ce9356b8610 Hash:4050409653 RestartCount:0 Reason: Message:}
May 02 20:13:23 ip-10-0-131-11 hyperkube[84119]: I0502 20:13:23.357683 84119 kubelet.go:1960] SyncLoop (PLEG): "test-pod-uninitialized_e2e-tests-nsdeletetest-pr6br(6df570c9-6d16-11e9-91ed-029042c002c0)", event: &pleg.PodLifecycleEvent{ID:"6df570c9-6d16-11e9-91ed-029042c002c0", Type:"ContainerDied", Data:"25ca27efd02b23dfb16eb38293a045f336e43fb8a61453ff282479a8c0ce7c03"}
```

Namespace was deleted around 16:11:13, and the test timed out after 1m30s:

```
STEP: Creating an uninitialized pod in the namespace
May 2 16:11:11.320: INFO: error from create uninitialized namespace: <nil>
STEP: Deleting the namespace
STEP: Waiting for the namespace to be removed.
May 2 16:12:43.409: INFO: Unexpected error occurred: timed out waiting for the condition
```
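To make the race easier to picture, below is a rough sketch of the flow described above: create a pod, delete its namespace almost immediately, then poll until the namespace is gone or a timeout expires. The function name, timeouts, pod spec, and client-go calls here are my own illustration, not the code in namespace.go (which additionally creates the pod as "uninitialized").

```go
// Illustrative only: an assumed helper showing the create-pod /
// delete-namespace race the serial test exercises.
package sketch

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

func deleteNamespaceRace(ctx context.Context, c kubernetes.Interface, ns string) error {
	// Create a pod in the namespace. The test assumes it will never start,
	// but on a real cluster kubelet can bring its container up within seconds.
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "test-pod-uninitialized", Namespace: ns},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{Name: "nginx", Image: "k8s.gcr.io/pause:3.1"}},
		},
	}
	if _, err := c.CoreV1().Pods(ns).Create(ctx, pod, metav1.CreateOptions{}); err != nil {
		return err
	}

	// Delete the namespace right away; this races with kubelet starting the
	// pod's container on the node.
	if err := c.CoreV1().Namespaces().Delete(ctx, ns, metav1.DeleteOptions{}); err != nil {
		return err
	}

	// Wait for the namespace to disappear. A short timeout (roughly 1m30s in
	// the failing test) flakes when pod teardown takes several minutes.
	return wait.PollImmediate(2*time.Second, 90*time.Second, func() (bool, error) {
		_, err := c.CoreV1().Namespaces().Get(ctx, ns, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // namespace fully removed
		}
		return false, nil // still terminating (or transient error); keep polling
	})
}
```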
Created attachment 1561999 [details]
test output
Created attachment 1562000 [details]
v=6 controller-manager logs
To sum up, I still don't know what is happening, but now there are more verbose logs.
> Seems like the pod is still trying to start, but the secret it needs has been deleted.
We can rule this out, because those log messages happen after the pod's container has started, meaning the *initial* mount succeeded. The logs refer only to the periodic *re*mounts, which continue for a while but are not the reason the pod can't be killed. Kubelet is refusing to kill the pod for some other reason (it doesn't see a deletionTimestamp on the pod and/or the containers are still running).
Opened https://bugzilla.redhat.com/show_bug.cgi?id=1703875 to unblock the serial job. What I'd missed so far is that the failing pod is always "uninitialized": the test seems to assume the pod won't start because it's uninitialized, so it gives only 1m30s from the moment the pod is `Created` for the pod to start and stop, which is too optimistic. This patch does not address the random failures in other, non-serial tests; those still need digging into. https://github.com/openshift/origin/pull/22755

Sorry for the spam.

Checked with builds 780~784 in CI job https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.1/ and did not see the e2e test fail or flake. Moving to verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.