Once, in Proxy CI [1]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1311317997826084864/build-log.txt | grep 'pod-submit-status-1-0\|Pod Container Status should never report success for a pending container'
started: (0/1745/2411) "[k8s.io] [sig-node] Pods Extended [k8s.io] Pod Container Status should never report success for a pending container [Suite:openshift/conformance/parallel] [Suite:k8s]"
Sep 30 16:20:38.587: FAIL: timed out waiting for watch events for pod-submit-status-1-0
Sep 30 16:20:38.646: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for pod-submit-status-1-0: { } Scheduled: Successfully assigned e2e-pods-3686/pod-submit-status-1-0 to ip-10-0-64-101.ec2.internal
Sep 30 16:20:38.646: INFO: At 2020-09-30 16:15:39 +0000 UTC - event for pod-submit-status-1-0: {multus } AddedInterface: Add eth0 [10.129.2.136/23]
Sep 30 16:20:38.646: INFO: At 2020-09-30 16:15:40 +0000 UTC - event for pod-submit-status-1-0: {kubelet ip-10-0-64-101.ec2.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Sep 30 16:20:38.663: INFO: pod-submit-status-1-0 ip-10-0-64-101.ec2.internal Pending 1s [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-09-30 16:15:38 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2020-09-30 16:15:38 +0000 UTC ContainersNotReady containers with unready status: [busybox]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2020-09-30 16:15:38 +0000 UTC ContainersNotReady containers with unready status: [busybox]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-09-30 16:15:38 +0000 UTC }]
fail [k8s.io/kubernetes.0/test/e2e/node/pods.go:323]: Sep 30 16:20:38.587: timed out waiting for watch events for pod-submit-status-1-0
Sep 30 16:15:38.211 I ns/e2e-pods-3686 pod/pod-submit-status-1-0 node/ reason/Created
Sep 30 16:15:38.243 I ns/e2e-pods-3686 pod/pod-submit-status-1-0 node/ip-10-0-64-101.ec2.internal reason/Scheduled
Sep 30 16:15:38.577 W ns/e2e-pods-3686 pod/pod-submit-status-1-0 node/ip-10-0-64-101.ec2.internal reason/GracefulDelete in 1s
Sep 30 16:15:39.945 I ns/e2e-pods-3686 pod/pod-submit-status-1-0 reason/AddedInterface Add eth0 [10.129.2.136/23]
Sep 30 16:15:40.473 W ns/e2e-pods-3686 pod/pod-submit-status-1-0 node/ip-10-0-64-101.ec2.internal reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Sep 30 16:16:49.853 - 224s W ns/e2e-pods-3686 pod/pod-submit-status-1-0 node/ip-10-0-64-101.ec2.internal pod has been pending longer than a minute
failed: (5m2s) 2020-09-30T16:20:38 "[k8s.io] [sig-node] Pods Extended [k8s.io] Pod Container Status should never report success for a pending container [Suite:openshift/conformance/parallel] [Suite:k8s]"
...

CI search suggests this is rare:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Failed+to+create+pod+sandbox.*json+message%3A+EOF&maxAge=48h&type=junit' | grep 'failures match'
pull-ci-openshift-openshift-apiserver-master-e2e-cmd - 5 runs, 60% failed, 33% of failures match

(dunno why it hasn't indexed my proxy job yet).

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1311317997826084864
Happened again [1]:

Oct 01 02:54:30.947 W ns/e2e-pods-1974 pod/pod-submit-status-2-14 node/ip-10-0-71-148.us-east-2.compute.internal reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF

Possibly image access or some such through the proxy is just enough slower that we're more likely to trip over a timeout?

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1311478040370352128
[1] shows this test failing in six of the past 15 proxy jobs that made it far enough to run the e2e suite, so yeah, pretty common in that environment.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy
Node journal from ip-10-0-64-101 from comment 0:

$ LOG_LINES="$(curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1311317997826084864/artifacts/e2e-aws-proxy/gather-extra/nodes/ip-10-0-64-101.ec2.internal/journal | gunzip | grep 'pod-submit-status-1-0')"
$ echo "${LOG_LINES}" | grep 'terminated, but pod cgroup sandbox has not been cleaned up' | head -n1
Sep 30 16:15:42.193301 ip-10-0-64-101 hyperkube[1485]: I0930 16:15:42.192674 1485 kubelet_pods.go:980] Pod "pod-submit-status-1-0_e2e-pods-3686(104d65f0-3047-49a9-bec6-001ac4cb0012)" is terminated, but pod cgroup sandbox has not been cleaned up
$ echo "${LOG_LINES}" | grep 'terminated, but pod cgroup sandbox has not been cleaned up' | wc -l
131

That error message is also mentioned in bug 1822872, which was closed INSUFFICIENT_DATA. Maybe the proxy CI environment will trigger this frequently enough for further debugging.
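If it helps whoever picks this up, here's a quick way to see which kubelet messages dominate for this pod. This is just a sketch against the LOG_LINES variable defined above; the sed pattern assumes the usual klog 'file.go:NNN]' prefix, and journal lines that don't match it pass through unchanged:

# Strip everything up to the klog source-location prefix, then count distinct messages.
$ echo "${LOG_LINES}" | sed 's/^.*\.go:[0-9]*] //' | sort | uniq -c | sort -rn | head -n5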
Peter, could you look at this one?
*** Bug 1887857 has been marked as a duplicate of this bug. ***
> If one follows the pod whose creation failed, you can see above that the kubelet kills the cgroup before the infra container is created for the pod. I am not sure why the API server is sending an API create and then a delete request so quickly

The e2e test does that deliberately, to ensure the kubelet behaves correctly in these sorts of race conditions (we had a number of serious consistency issues in the kubelet, and this test exposes them). So think of it as the test focusing on a scenario that often exposes race conditions or bugs in pod startup that would otherwise be much rarer (but still very bad in production).
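For anyone unfamiliar with the test, this is roughly the pattern it drives (a loose kubectl sketch, not the actual code in test/e2e/node/pods.go — the real test does this via the API in a tight loop across many pods, and pod-race-check is a made-up name): create a busybox pod, delete it almost immediately with a short grace period (matching the Created / GracefulDelete-in-1s sequence in the job events above), and watch the status updates to make sure a container that never ran is never reported as having succeeded.

# Made-up pod name; the real test uses pod-submit-status-N-M pods driven through the API.
$ kubectl run pod-race-check --image=busybox --restart=Never -- sleep 300
# Delete almost immediately with a 1s grace period, mirroring the GracefulDelete-in-1s event above.
$ kubectl delete pod pod-race-check --grace-period=1 --wait=false
# Watch the status churn while the kubelet tears the pod down; the assertion in the
# real test is that busybox never shows a terminated state with exit code 0 when it never ran.
$ kubectl get pod pod-race-check -o jsonpath='{.status.containerStatuses[0].state}{"\n"}'
$ kubectl get pod pod-race-check --watch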