Bug 1883991

Summary: FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: Node
Assignee: Harshal Patil <harpatil>
Node sub component: Kubelet
QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: medium
Priority: medium
CC: aos-bugs, ccoleman, dwalsh, fdeutsch, harpatil, jokerman, nagrawal, tsweeney
Version: 4.6
Keywords: UpcomingSprint
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1914022 (view as bug list)
Environment:
Last Closed: 2021-01-20 05:19:52 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1896387    

Description W. Trevor King 2020-09-30 17:05:31 UTC
Once, in Proxy CI [1]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1311317997826084864/build-log.txt | grep 'pod-submit-status-1-0\|Pod Container Status should never report success for a pending container'
started: (0/1745/2411) "[k8s.io] [sig-node] Pods Extended [k8s.io] Pod Container Status should never report success for a pending container [Suite:openshift/conformance/parallel] [Suite:k8s]"
Sep 30 16:20:38.587: FAIL: timed out waiting for watch events for pod-submit-status-1-0
Sep 30 16:20:38.646: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for pod-submit-status-1-0: { } Scheduled: Successfully assigned e2e-pods-3686/pod-submit-status-1-0 to ip-10-0-64-101.ec2.internal
Sep 30 16:20:38.646: INFO: At 2020-09-30 16:15:39 +0000 UTC - event for pod-submit-status-1-0: {multus } AddedInterface: Add eth0 [10.129.2.136/23]
Sep 30 16:20:38.646: INFO: At 2020-09-30 16:15:40 +0000 UTC - event for pod-submit-status-1-0: {kubelet ip-10-0-64-101.ec2.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Sep 30 16:20:38.663: INFO: pod-submit-status-1-0  ip-10-0-64-101.ec2.internal  Pending  1s     [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-09-30 16:15:38 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2020-09-30 16:15:38 +0000 UTC ContainersNotReady containers with unready status: [busybox]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2020-09-30 16:15:38 +0000 UTC ContainersNotReady containers with unready status: [busybox]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-09-30 16:15:38 +0000 UTC  }]
fail [k8s.io/kubernetes.0/test/e2e/node/pods.go:323]: Sep 30 16:20:38.587: timed out waiting for watch events for pod-submit-status-1-0
Sep 30 16:15:38.211 I ns/e2e-pods-3686 pod/pod-submit-status-1-0 node/ reason/Created
Sep 30 16:15:38.243 I ns/e2e-pods-3686 pod/pod-submit-status-1-0 node/ip-10-0-64-101.ec2.internal reason/Scheduled
Sep 30 16:15:38.577 W ns/e2e-pods-3686 pod/pod-submit-status-1-0 node/ip-10-0-64-101.ec2.internal reason/GracefulDelete in 1s
Sep 30 16:15:39.945 I ns/e2e-pods-3686 pod/pod-submit-status-1-0 reason/AddedInterface Add eth0 [10.129.2.136/23]
Sep 30 16:15:40.473 W ns/e2e-pods-3686 pod/pod-submit-status-1-0 node/ip-10-0-64-101.ec2.internal reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Sep 30 16:16:49.853 - 224s  W ns/e2e-pods-3686 pod/pod-submit-status-1-0 node/ip-10-0-64-101.ec2.internal pod has been pending longer than a minute
failed: (5m2s) 2020-09-30T16:20:38 "[k8s.io] [sig-node] Pods Extended [k8s.io] Pod Container Status should never report success for a pending container [Suite:openshift/conformance/parallel] [Suite:k8s]"
...


CI search suggests this is rare:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Failed+to+create+pod+sandbox.*json+message%3A+EOF&maxAge=48h&type=junit' | grep 'failures match'
pull-ci-openshift-openshift-apiserver-master-e2e-cmd - 5 runs, 60% failed, 33% of failures match

(dunno why it hasn't indexed my proxy job yet).

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1311317997826084864

Comment 1 W. Trevor King 2020-10-01 03:57:11 UTC
Happened again [1]:

Oct 01 02:54:30.947 W ns/e2e-pods-1974 pod/pod-submit-status-2-14 node/ip-10-0-71-148.us-east-2.compute.internal reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF

Possibly image access or some such through the proxy is just slow enough to make us more likely to trip over a timeout?

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1311478040370352128

Comment 2 W. Trevor King 2020-10-01 04:01:51 UTC
[1] shows this test failing in six of the past 15 proxy jobs that made it far enough to run the e2e suite, so yeah, pretty common in that environment.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy

Comment 3 W. Trevor King 2020-10-01 04:55:49 UTC
Node journal from ip-10-0-64-101 from comment 0:

  $ LOG_LINES="$(curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1311317997826084864/artifacts/e2e-aws-proxy/gather-extra/nodes/ip-10-0-64-101.ec2.internal/journal | gunzip | grep 'pod-submit-status-1-0')"
  $ echo "${LOG_LINES}" | grep 'terminated, but pod cgroup sandbox has not been cleaned up' | head -n1Sep 30 16:15:42.193301 ip-10-0-64-101 hyperkube[1485]: I0930 16:15:42.192674    1485 kubelet_pods.go:980] Pod "pod-submit-status-1-0_e2e-pods-3686(104d65f0-3047-49a9-bec6-001ac4cb0012)" is terminated, but pod cgroup sandbox has not been cleaned up
  $ echo "${LOG_LINES}" | grep 'terminated, but pod cgroup sandbox has not been cleaned up' | wc -l
131

That error message is also mentioned in bug 1822872, which was closed INSUFFICIENT_DATA.  Maybe the proxy CI environment will trigger this frequently enough for further debugging.
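As a small follow-up sketch (reusing the LOG_LINES variable above, and assuming the journal timestamp format shown in the output), printing the first and last occurrence of that message shows how long the kubelet kept reporting the stale pod cgroup:

  $ echo "${LOG_LINES}" \
      | grep 'terminated, but pod cgroup sandbox has not been cleaned up' \
      | awk '{print $1, $2, $3}' \
      | sed -n '1p;$p'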

Comment 4 Seth Jennings 2020-10-01 14:12:58 UTC
Peter, could you look at this one?

Comment 10 Peter Hunt 2020-11-10 16:55:42 UTC
*** Bug 1887857 has been marked as a duplicate of this bug. ***

Comment 14 Clayton Coleman 2021-01-12 15:13:44 UTC
> If one follows the pod whose creation failed, you can see above that the kubelet kills the cgroup before the infra container is created for the pod. I am not sure why the API server is sending an API create and then a delete request so quickly

The e2e test specifically runs this to ensure the kubelet functions correctly under these sorts of race conditions (we had a number of serious consistency issues in the kubelet, and the test exposes them).  So think of this as the test focusing on a scenario that often exposes race conditions or startup bugs that would otherwise be much rarer (but still very bad in production).
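
For reference, the shape of that race can be approximated from the command line (a rough sketch only, not the upstream test code in test/e2e/node/pods.go; the pod name here is made up and the real test uses its own image, timings, and a watch client): create a pod, delete it almost immediately with a short grace period, then watch the status transitions to confirm a still-pending container is never reported as having terminated successfully.

  # Illustrative approximation of the create-then-quick-delete race the test drives.
  $ kubectl run pod-submit-status-demo --image=busybox --restart=Never \
      --command -- /bin/false
  $ kubectl delete pod pod-submit-status-demo --grace-period=1 --wait=false
  # A pending container should never show a terminated state with exit code 0
  # (reason Completed) in the watched status updates.
  $ kubectl get pod pod-submit-status-demo --watch -o wide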