Bug 1883991 - FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Summary: FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Keywords:
Status: CLOSED DUPLICATE of bug 1915085
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Harshal Patil
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Duplicates: 1887857
Depends On:
Blocks: 1896387
 
Reported: 2020-09-30 17:05 UTC by W. Trevor King
Modified: 2021-02-11 14:01 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1914022
Environment:
Last Closed: 2021-01-20 05:19:52 UTC
Target Upstream Version:
Embargoed:


Attachments: none

Description W. Trevor King 2020-09-30 17:05:31 UTC
Once, in Proxy CI [1]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1311317997826084864/build-log.txt | grep 'pod-submit-status-1-0\|Pod Container Status should never report success for a pending container'
started: (0/1745/2411) "[k8s.io] [sig-node] Pods Extended [k8s.io] Pod Container Status should never report success for a pending container [Suite:openshift/conformance/parallel] [Suite:k8s]"
Sep 30 16:20:38.587: FAIL: timed out waiting for watch events for pod-submit-status-1-0
Sep 30 16:20:38.646: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for pod-submit-status-1-0: { } Scheduled: Successfully assigned e2e-pods-3686/pod-submit-status-1-0 to ip-10-0-64-101.ec2.internal
Sep 30 16:20:38.646: INFO: At 2020-09-30 16:15:39 +0000 UTC - event for pod-submit-status-1-0: {multus } AddedInterface: Add eth0 [10.129.2.136/23]
Sep 30 16:20:38.646: INFO: At 2020-09-30 16:15:40 +0000 UTC - event for pod-submit-status-1-0: {kubelet ip-10-0-64-101.ec2.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Sep 30 16:20:38.663: INFO: pod-submit-status-1-0  ip-10-0-64-101.ec2.internal  Pending  1s     [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-09-30 16:15:38 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2020-09-30 16:15:38 +0000 UTC ContainersNotReady containers with unready status: [busybox]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2020-09-30 16:15:38 +0000 UTC ContainersNotReady containers with unready status: [busybox]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-09-30 16:15:38 +0000 UTC  }]
fail [k8s.io/kubernetes.0/test/e2e/node/pods.go:323]: Sep 30 16:20:38.587: timed out waiting for watch events for pod-submit-status-1-0
Sep 30 16:15:38.211 I ns/e2e-pods-3686 pod/pod-submit-status-1-0 node/ reason/Created
Sep 30 16:15:38.243 I ns/e2e-pods-3686 pod/pod-submit-status-1-0 node/ip-10-0-64-101.ec2.internal reason/Scheduled
Sep 30 16:15:38.577 W ns/e2e-pods-3686 pod/pod-submit-status-1-0 node/ip-10-0-64-101.ec2.internal reason/GracefulDelete in 1s
Sep 30 16:15:39.945 I ns/e2e-pods-3686 pod/pod-submit-status-1-0 reason/AddedInterface Add eth0 [10.129.2.136/23]
Sep 30 16:15:40.473 W ns/e2e-pods-3686 pod/pod-submit-status-1-0 node/ip-10-0-64-101.ec2.internal reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Sep 30 16:16:49.853 - 224s  W ns/e2e-pods-3686 pod/pod-submit-status-1-0 node/ip-10-0-64-101.ec2.internal pod has been pending longer than a minute
failed: (5m2s) 2020-09-30T16:20:38 "[k8s.io] [sig-node] Pods Extended [k8s.io] Pod Container Status should never report success for a pending container [Suite:openshift/conformance/parallel] [Suite:k8s]"
...


CI search suggests this is rare:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Failed+to+create+pod+sandbox.*json+message%3A+EOF&maxAge=48h&type=junit' | grep 'failures match'
pull-ci-openshift-openshift-apiserver-master-e2e-cmd - 5 runs, 60% failed, 33% of failures match

(dunno why it hasn't indexed my proxy job yet).

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1311317997826084864

Comment 1 W. Trevor King 2020-10-01 03:57:11 UTC
Happened again [1]:

Oct 01 02:54:30.947 W ns/e2e-pods-1974 pod/pod-submit-status-2-14 node/ip-10-0-71-148.us-east-2.compute.internal reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF

Possibly image access or some such through the proxy is just slow enough to make us more likely to trip over a timeout?
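
If pull latency through the proxy is the culprit, the kubelet's image-pull events should show it.  A rough sketch against a live cluster reproducing this (the torn-down CI cluster can't be queried; <node> is a placeholder):

  # in recent kubelets the "Successfully pulled image ... in <duration>" event messages include pull timing
  $ oc get events -A --field-selector reason=Pulled -o custom-columns=NAMESPACE:.metadata.namespace,MESSAGE:.message
  # and CRI-O's side of the failed sandbox creation on the affected node
  $ oc adm node-logs <node> -u crio | grep -i 'error reading container'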

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1311478040370352128

Comment 2 W. Trevor King 2020-10-01 04:01:51 UTC
[1] shows this test failing in six of the past 15 proxy jobs that made it far enough to run the e2e suite, so yeah, pretty common in that environment.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy

Comment 3 W. Trevor King 2020-10-01 04:55:49 UTC
Node journal from ip-10-0-64-101 from comment 0:

  $ LOG_LINES="$(curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1311317997826084864/artifacts/e2e-aws-proxy/gather-extra/nodes/ip-10-0-64-101.ec2.internal/journal | gunzip | grep 'pod-submit-status-1-0')"
  $ echo "${LOG_LINES}" | grep 'terminated, but pod cgroup sandbox has not been cleaned up' | head -n1Sep 30 16:15:42.193301 ip-10-0-64-101 hyperkube[1485]: I0930 16:15:42.192674    1485 kubelet_pods.go:980] Pod "pod-submit-status-1-0_e2e-pods-3686(104d65f0-3047-49a9-bec6-001ac4cb0012)" is terminated, but pod cgroup sandbox has not been cleaned up
  $ echo "${LOG_LINES}" | grep 'terminated, but pod cgroup sandbox has not been cleaned up' | wc -l
131

That error message is also mentioned in bug 1822872, which was closed INSUFFICIENT_DATA.  Maybe the proxy CI environment will trigger this frequently enough for further debugging.
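
The same gathered journal should also have CRI-O's side of the failed sandbox creation; something like the following might narrow it down (the crio log phrasing in the second grep is a guess, not taken from this run):

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1311317997826084864/artifacts/e2e-aws-proxy/gather-extra/nodes/ip-10-0-64-101.ec2.internal/journal | gunzip | grep 'crio\[' | grep -i 'pod-submit-status-1-0\|error reading container'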

Comment 4 Seth Jennings 2020-10-01 14:12:58 UTC
Peter, could you look at this one?

Comment 10 Peter Hunt 2020-11-10 16:55:42 UTC
*** Bug 1887857 has been marked as a duplicate of this bug. ***

Comment 14 Clayton Coleman 2021-01-12 15:13:44 UTC
> If one follows the pod whose creation failed, you can see above that the kubelet kills the cgroup before the infra container is created for the pod. I am not sure why api server is sending a api create and then delete request so quickly

The e2e test specifically runs this to ensure the kubelet functions correctly in these sorts of race conditions (we had a number of serious consistency issues in the kubelet; the test exposes them).  So think of this as the test focusing on a scenario that often exposes race conditions or bugs in startup that would otherwise be much rarer (but still very bad in production).
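
For reference, a rough manual approximation of the race the test exercises (this is not the actual e2e code in test/e2e/node/pods.go; race-test is just a throwaway pod name):

  # create a pod, then delete it about a second later with a short grace period,
  # mirroring the GracefulDelete-in-1s seen in comment 0, while watching status
  $ oc run race-test --image=busybox --restart=Never -- sleep 3600
  $ oc get pod race-test -w &
  $ sleep 1 && oc delete pod race-test --grace-period=1
  # the container should never report a successful (Completed) state, since it never ran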

