About 1/3 to 1/4 of e2e-aws-serial runs fail due to the following test:

Namespaces [Serial] should ensure that all pods are removed when a namespace is deleted [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]

```
fail [k8s.io/kubernetes/test/e2e/apimachinery/namespace.go:161]: Expected error:
    <*errors.errorString | 0xc0002be3c0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
not to have occurred
```

One example: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.1/324#openshift-tests-sig-api-machinery-namespaces-serial-should-ensure-that-all-pods-are-removed-when-a-namespace-is-deleted-conformance-suiteopenshiftconformanceserialminimal-suitek8s

The failure occurs in most jobs and sometimes only flakes. The controller manager had many lines like:

```
E0427 14:41:47.133745 1 namespace_controller.go:148] unexpected items still remain in namespace: e2e-tests-nsdeletetest-krlmt for gvr: /v1, Resource=pods
```

and it took more than 10 minutes for the pod or pods to be deleted, which indicates a possible node hang.

This also occurs in non-serial tests. For instance, in this e2e-aws run:

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.1/327#openshift-tests-sig-cli-kubectl-client-k8sio-simple-pod-should-contain-last-line-of-the-log-suiteopenshiftconformanceparallel-suitek8s

the namespace e2e-tests-kubectl-ntk9f took more than 10 minutes to delete and the namespace controller was waiting for pods to disappear.

Marking as urgent until we know why this is happening, whether it's due to a bad aggregated API or a more subtle bug.
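For anyone triaging a stuck namespace like this, here is a minimal client-go sketch (not part of the bug report; it assumes a kubeconfig at the default path and a recent client-go with context-aware calls) that lists the pods the namespace controller is still waiting on:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Namespace name from the failing run above; substitute your own.
	const ns = "e2e-tests-nsdeletetest-krlmt"

	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// The namespace controller complains about /v1, Resource=pods, so list
	// the pods that are still hanging around and their deletion timestamps.
	pods, err := client.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Printf("%s phase=%s deletionTimestamp=%v\n", p.Name, p.Status.Phase, p.DeletionTimestamp)
	}
}
```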
Seems like the pod is still trying to start, but the secret it needs has been deleted.

```
$ grep krlmt workers-journal | grep -v event.go
Apr 27 14:39:35 ip-10-0-154-216 crio[922]: 2019-04-27T14:39:35Z [verbose] Add: e2e-tests-nsdeletetest-krlmt:test-pod:openshift-sdn:eth0 {"cniVersion":"0.3.1","interfaces":[{"name":"eth0","sandbox":"/proc/36874/ns/net"}],"ips":[{"version":"4","interface":0,"address":"10.129.2.42/23"}],"routes":[{"dst":"0.0.0.0/0","gw":"10.129.2.1"},{"dst":"224.0.0.0/4"},{"dst":"10.128.0.0/14"}],"dns":{}}
Apr 27 14:39:44 ip-10-0-154-216 crio[922]: 2019-04-27T14:39:44Z [verbose] Add: e2e-tests-nsdeletetest-krlmt:test-pod-uninitialized:openshift-sdn:eth0 {"cniVersion":"0.3.1","interfaces":[{"name":"eth0","sandbox":"/proc/37208/ns/net"}],"ips":[{"version":"4","interface":0,"address":"10.129.2.43/23"}],"routes":[{"dst":"0.0.0.0/0","gw":"10.129.2.1"},{"dst":"224.0.0.0/4"},{"dst":"10.128.0.0/14"}],"dns":{}}
Apr 27 14:39:45 ip-10-0-154-216 hyperkube[990]: E0427 14:39:45.175154 990 secret.go:194] Couldn't get secret e2e-tests-nsdeletetest-krlmt/default-token-bbjvq: secrets "default-token-bbjvq" not found
Apr 27 14:39:45 ip-10-0-154-216 crio[922]: 2019-04-27T14:39:45Z [verbose] Del: e2e-tests-nsdeletetest-krlmt:test-pod:openshift-sdn:eth0 {"cniVersion":"0.3.1","name":"openshift-sdn","type":"openshift-sdn"}
Apr 27 14:39:45 ip-10-0-154-216 hyperkube[990]: E0427 14:39:45.769866 990 secret.go:194] Couldn't get secret e2e-tests-nsdeletetest-krlmt/default-token-bbjvq: secrets "default-token-bbjvq" not found
Apr 27 14:39:46 ip-10-0-154-216 hyperkube[990]: E0427 14:39:46.874511 990 secret.go:194] Couldn't get secret e2e-tests-nsdeletetest-krlmt/default-token-bbjvq: secrets "default-token-bbjvq" not found
Apr 27 14:39:48 ip-10-0-154-216 hyperkube[990]: E0427 14:39:48.895958 990 secret.go:194] Couldn't get secret e2e-tests-nsdeletetest-krlmt/default-token-bbjvq: secrets "default-token-bbjvq" not found
Apr 27 14:39:52 ip-10-0-154-216 hyperkube[990]: E0427 14:39:52.911273 990 secret.go:194] Couldn't get secret e2e-tests-nsdeletetest-krlmt/default-token-bbjvq: secrets "default-token-bbjvq" not found
Apr 27 14:40:00 ip-10-0-154-216 hyperkube[990]: E0427 14:40:00.936053 990 secret.go:194] Couldn't get secret e2e-tests-nsdeletetest-krlmt/default-token-bbjvq: secrets "default-token-bbjvq" not found
Apr 27 14:40:16 ip-10-0-154-216 hyperkube[990]: E0427 14:40:16.987451 990 secret.go:194] Couldn't get secret e2e-tests-nsdeletetest-krlmt/default-token-bbjvq: secrets "default-token-bbjvq" not found
Apr 27 14:40:49 ip-10-0-154-216 hyperkube[990]: E0427 14:40:49.090840 990 secret.go:194] Couldn't get secret e2e-tests-nsdeletetest-krlmt/default-token-bbjvq: secrets "default-token-bbjvq" not found
Apr 27 14:41:48 ip-10-0-154-216 hyperkube[990]: E0427 14:41:48.111090 990 kubelet.go:1687] Unable to mount volumes for pod "test-pod-uninitialized_e2e-tests-nsdeletetest-krlmt(48403558-68fa-11e9-b949-0acc20960aea)": timeout expired waiting for volumes to attach or mount for pod "e2e-tests-nsdeletetest-krlmt"/"test-pod-uninitialized". list of unmounted volumes=[default-token-bbjvq]. list of unattached volumes=[default-token-bbjvq]; skipping pod
Apr 27 14:41:48 ip-10-0-154-216 hyperkube[990]: E0427 14:41:48.111173 990 pod_workers.go:190] Error syncing pod 48403558-68fa-11e9-b949-0acc20960aea ("test-pod-uninitialized_e2e-tests-nsdeletetest-krlmt(48403558-68fa-11e9-b949-0acc20960aea)"), skipping: timeout expired waiting for volumes to attach or mount for pod "e2e-tests-nsdeletetest-krlmt"/"test-pod-uninitialized". list of unmounted volumes=[default-token-bbjvq]. list of unattached volumes=[default-token-bbjvq]
Apr 27 14:41:48 ip-10-0-154-216 crio[922]: 2019-04-27T14:41:48Z [verbose] Del: e2e-tests-nsdeletetest-krlmt:test-pod-uninitialized:openshift-sdn:eth0 {"cniVersion":"0.3.1","name":"openshift-sdn","type":"openshift-sdn"}
Apr 27 14:41:50 ip-10-0-154-216 hyperkube[990]: E0427 14:41:50.391419 990 kubelet_pods.go:1088] Failed killing the pod "test-pod-uninitialized": failed to "KillContainer" for "nginx" with KillContainerError: "rpc error: code = Unknown desc = container with ID starting with 3d27d978c889a8b86c77e6c6661d8f65848ebcccca08fd4fab3f90e2e4ab30cc not found: ID does not exist"
```
kubelet's syncPod periodically tries to mount/remount volumes by calling volumeManager.WaitForAttachAndMount(pod) unless the pod is terminated: https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet.go#L1682

"Terminated" in this case we expect to mean "(pod.DeletionTimestamp != nil && notRunning(status.ContainerStatuses))": https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet_pods.go#L864

But if volumeManager.WaitForAttachAndMount(pod) fails, syncPod returns an error and containerRuntime.SyncPod never gets called, which I think means the container status never gets updated (someone please confirm), and so we will repeatedly try and fail to call volumeManager.WaitForAttachAndMount(pod).

I think the solution is to loosen the condition guarding volumeManager.WaitForAttachAndMount(pod) from (pod.DeletionTimestamp != nil && notRunning(status.ContainerStatuses)) to just pod.DeletionTimestamp != nil. Even if the pod is still running, if the intention is to delete it, there is no reason to ensure volumes are mounted.
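For reference, here is the condition being discussed, paraphrased as a minimal, self-contained Go sketch. This is not the actual kubelet source: the types and the waitForAttachAndMount placeholder are local stand-ins, and only podIsTerminated/notRunning mirror the condition quoted above.

```go
package main

import (
	"fmt"
	"time"
)

// Stand-ins for the real kubelet/pod types; illustrative only.
type containerStatus struct {
	Name    string
	Running bool
}

type pod struct {
	DeletionTimestamp *time.Time
	ContainerStatuses []containerStatus
}

// notRunning mirrors the helper named above: true only when no container is
// still running.
func notRunning(statuses []containerStatus) bool {
	for _, s := range statuses {
		if s.Running {
			return false
		}
	}
	return true
}

// podIsTerminated approximates the condition quoted above: the pod counts as
// terminated only if it has a deletionTimestamp AND no container is running.
func podIsTerminated(p *pod) bool {
	return p.DeletionTimestamp != nil && notRunning(p.ContainerStatuses)
}

// waitForAttachAndMount stands in for volumeManager.WaitForAttachAndMount.
func waitForAttachAndMount(p *pod) error { return nil }

// syncPodSketch shows the shape of the flow being discussed: if the pod is
// not considered terminated, kubelet waits for volumes, and a failure there
// makes syncPod return before the container runtime sync happens.
func syncPodSketch(p *pod) error {
	if !podIsTerminated(p) {
		if err := waitForAttachAndMount(p); err != nil {
			return fmt.Errorf("timeout expired waiting for volumes to attach or mount: %w", err)
		}
	}
	// ... containerRuntime.SyncPod would run here ...
	return nil
}

func main() {
	now := time.Now()
	p := &pod{
		DeletionTimestamp: &now,
		ContainerStatuses: []containerStatus{{Name: "nginx", Running: true}},
	}
	// With a running container, the pod is not "terminated" even though it
	// has a deletionTimestamp, so the volume wait still runs.
	fmt.Println("terminated?", podIsTerminated(p))
	fmt.Println("syncPodSketch err:", syncPodSketch(p))
}
```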
NVM all that, the case where deletionTimestamp isn't nil should already be caught a few lines above: https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet.go#L1577, so the WaitForAttachAndMount failures are not blocking anything. From kubelet's perspective, the pod has no deletionTimestamp on it, so it never tries to kill the pod and instead keeps syncing it, ensuring volumes are mounted, etc.

In https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.1/327#openshift-tests-sig-cli-kubectl-client-k8sio-simple-pod-should-contain-last-line-of-the-log-suiteopenshiftconformanceparallel-suitek8s, there aren't any storage-related errors. There is this crio message just before/around when the namespace is deleted, and then there aren't any messages about this pod for the next ~10m:

```
Apr 27 22:55:48 ip-10-0-143-119 crio[921]: 2019-04-27T22:55:48Z [verbose] Del: e2e-tests-kubectl-ntk9f:run-log-test:openshift-sdn:eth0 {"cniVersion":"0.3.1","name":"openshift-sdn","type":"openshift-sdn"}
```

Then unmount is finally attempted:

```
Apr 27 23:06:29 ip-10-0-143-119 hyperkube[987]: I0427 23:06:29.505175 987 reconciler.go:181] operationExecutor.UnmountVolume started for volume "default-token-qrgkn" (UniqueName: "kubernetes.io/secret/0bab8a36-6941-11e9-a0db-0add3357e498-default-token-qrgkn") pod "0bab8a36-6941-11e9-a0db-0add3357e498" (UID: "0bab8a36-6941-11e9-a0db-0add3357e498")
```
Created attachment 1561998 [details]
v=4 kubelet logs

Reproduced the e2e-aws-serial case; attached are v=4 kubelet logs. I think the issue is in kubelet when the pod is deleted right as its containers are starting. The e2e-aws-serial test is prone to this because it starts the pod basically in parallel with deleting the namespace: https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/test/e2e/apimachinery/namespace.go#L148

kubelet sees the API pod deletion, but in the meantime the containers start, and then kubelet repeatedly refuses to delete the pod because the containers are running. The container doesn't die until 2min+ later, no idea why.

```
May 02 20:11:11 ip-10-0-131-11 hyperkube[84119]: I0502 20:11:11.314688 84119 kubelet.go:1915] SyncLoop (ADD, "api"): "test-pod-uninitialized_e2e-tests-nsdeletetest-pr6br(6df570c9-6d16-11e9-91ed-029042c002c0)"
May 02 20:11:18 ip-10-0-131-11 hyperkube[84119]: I0502 20:11:18.819251 84119 kubelet.go:1931] SyncLoop (DELETE, "api"): "test-pod-uninitialized_e2e-tests-nsdeletetest-pr6br(6df570c9-6d16-11e9-91ed-029042c002c0)"
May 02 20:11:19 ip-10-0-131-11 hyperkube[84119]: I0502 20:11:19.093283 84119 kubelet.go:1960] SyncLoop (PLEG): "test-pod-uninitialized_e2e-tests-nsdeletetest-pr6br(6df570c9-6d16-11e9-91ed-029042c002c0)", event: &pleg.PodLifecycleEvent{ID:"6df570c9-6d16-11e9-91ed-029042c002c0", Type:"ContainerStarted", Data:"25ca27efd02b23dfb16eb38293a045f336e43fb8a61453ff282479a8c0ce7c03"}
May 02 20:11:19 ip-10-0-131-11 hyperkube[84119]: I0502 20:11:19.104436 84119 kubelet_pods.go:912] Pod "test-pod-uninitialized_e2e-tests-nsdeletetest-pr6br(6df570c9-6d16-11e9-91ed-029042c002c0)" is terminated, but some containers have not been cleaned up: {ID:{Type:cri-o ID:25ca27efd02b23dfb16eb38293a045f336e43fb8a61453ff282479a8c0ce7c03} Name:nginx State:running CreatedAt:2019-05-02 20:11:18.923659089 +0000 UTC StartedAt:2019-05-02 20:11:18.950548473 +0000 UTC FinishedAt:0001-01-01 00:00:00 +0000 UTC ExitCode:0 Image:k8s.gcr.io/pause:3.1 ImageID:k8s.gcr.io/pause@sha256:59eec8837a4d942cc19a52b8c09ea75121acc38114a2c68b98983ce9356b8610 Hash:4050409653 RestartCount:0 Reason: Message:}
May 02 20:13:23 ip-10-0-131-11 hyperkube[84119]: I0502 20:13:23.357683 84119 kubelet.go:1960] SyncLoop (PLEG): "test-pod-uninitialized_e2e-tests-nsdeletetest-pr6br(6df570c9-6d16-11e9-91ed-029042c002c0)", event: &pleg.PodLifecycleEvent{ID:"6df570c9-6d16-11e9-91ed-029042c002c0", Type:"ContainerDied", Data:"25ca27efd02b23dfb16eb38293a045f336e43fb8a61453ff282479a8c0ce7c03"}
```

The namespace was deleted around 16:11:13 and the test timed out after 1m30s:

```
STEP: Creating an uninitialized pod in the namespace
May 2 16:11:11.320: INFO: error from create uninitialized namespace: <nil>
STEP: Deleting the namespace
STEP: Waiting for the namespace to be removed.
May 2 16:12:43.409: INFO: Unexpected error occurred: timed out waiting for the condition
```
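For reference, the create-then-delete timing the test exercises can be paraphrased as a minimal client-go sketch. This is not the origin e2e code itself; the namespace name is hypothetical, and the kubeconfig path and client-go version are assumptions.

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical namespace name for the reproduction; not one of the
	// e2e-generated names above.
	const ns = "nsdelete-race-repro"

	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	if _, err := client.CoreV1().Namespaces().Create(context.TODO(),
		&corev1.Namespace{ObjectMeta: metav1.ObjectMeta{Name: ns}},
		metav1.CreateOptions{}); err != nil {
		panic(err)
	}

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "test-pod"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{Name: "nginx", Image: "k8s.gcr.io/pause:3.1"}},
		},
	}

	// Create the pod and delete its namespace back-to-back, without waiting
	// for the pod to reach Running. This is the window in which kubelet sees
	// the API DELETE while the container is still being started.
	if _, err := client.CoreV1().Pods(ns).Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
	if err := client.CoreV1().Namespaces().Delete(context.TODO(), ns, metav1.DeleteOptions{}); err != nil {
		panic(err)
	}
}
```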
Created attachment 1561999 [details]
test output
Created attachment 1562000 [details]
v=6 controller-manager logs
To sum up, I still don't know what is happening, but now there are more verbose logs.

> Seems like the pod is still trying to start, but the secret it needs has been deleted.

We can rule this out, because those log messages happen after the pod's container has started, meaning the *initial* mount succeeded. The logs refer only to the periodic *re*mounts, which continue for a while but are not the reason the pod can't be killed. kubelet is refusing to kill the pod for some other reason (it doesn't see a deletionTimestamp on the pod and/or the containers are still running).
Opened https://bugzilla.redhat.com/show_bug.cgi?id=1703875 to unblock the serial job.

What I'd missed so far is that the failing pod is always the "uninitialized" one. The test seems to assume the pod won't start because it's uninitialized, so it only gives 1m30s from the moment the pod is `Created` for the pod to start and then stop, which is too optimistic.

This patch does not address the random failures in other non-serial tests; those still need digging into.
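For illustration, a sketch of the general direction of the fix: poll until the namespace is actually gone, with a more generous timeout. This is not the change in the linked PR; the 5-minute timeout, kubeconfig path, and client-go version are assumptions.

```go
package main

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// waitForNamespaceGone polls until the namespace is fully removed or the
// timeout expires. The timeout here is an illustrative value, not the one
// used in the linked PR.
func waitForNamespaceGone(client kubernetes.Interface, name string, timeout time.Duration) error {
	return wait.Poll(2*time.Second, timeout, func() (bool, error) {
		_, err := client.CoreV1().Namespaces().Get(context.TODO(), name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // namespace is gone
		}
		if err != nil {
			return false, err
		}
		return false, nil // still terminating, keep polling
	})
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	if err := waitForNamespaceGone(client, "e2e-tests-nsdeletetest-krlmt", 5*time.Minute); err != nil {
		fmt.Println("namespace still not deleted:", err)
	}
}
```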
https://github.com/openshift/origin/pull/22755

Sorry for the spam.
Checked builds 780~784 in the CI job https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.1/ and did not see the e2e test fail or flake. Moving to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758