Bug 1883991

Summary: FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: Node
Assignee: Harshal Patil <harpatil>
Node sub component: Kubelet
QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: medium
Priority: medium
CC: aos-bugs, ccoleman, dwalsh, fdeutsch, harpatil, jokerman, nagrawal, tsweeney
Version: 4.6
Keywords: UpcomingSprint
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1914022 (view as bug list)
Environment:
Last Closed: 2021-01-20 05:19:52 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1896387    

Description W. Trevor King 2020-09-30 17:05:31 UTC
Once, in Proxy CI [1]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1311317997826084864/build-log.txt | grep 'pod-submit-status-1-0\|Pod Container Status should never report success for a pending container'
started: (0/1745/2411) "[k8s.io] [sig-node] Pods Extended [k8s.io] Pod Container Status should never report success for a pending container [Suite:openshift/conformance/parallel] [Suite:k8s]"
Sep 30 16:20:38.587: FAIL: timed out waiting for watch events for pod-submit-status-1-0
Sep 30 16:20:38.646: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for pod-submit-status-1-0: { } Scheduled: Successfully assigned e2e-pods-3686/pod-submit-status-1-0 to ip-10-0-64-101.ec2.internal
Sep 30 16:20:38.646: INFO: At 2020-09-30 16:15:39 +0000 UTC - event for pod-submit-status-1-0: {multus } AddedInterface: Add eth0 [10.129.2.136/23]
Sep 30 16:20:38.646: INFO: At 2020-09-30 16:15:40 +0000 UTC - event for pod-submit-status-1-0: {kubelet ip-10-0-64-101.ec2.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Sep 30 16:20:38.663: INFO: pod-submit-status-1-0  ip-10-0-64-101.ec2.internal  Pending  1s     [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-09-30 16:15:38 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2020-09-30 16:15:38 +0000 UTC ContainersNotReady containers with unready status: [busybox]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2020-09-30 16:15:38 +0000 UTC ContainersNotReady containers with unready status: [busybox]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-09-30 16:15:38 +0000 UTC  }]
fail [k8s.io/kubernetes.0/test/e2e/node/pods.go:323]: Sep 30 16:20:38.587: timed out waiting for watch events for pod-submit-status-1-0
Sep 30 16:15:38.211 I ns/e2e-pods-3686 pod/pod-submit-status-1-0 node/ reason/Created
Sep 30 16:15:38.243 I ns/e2e-pods-3686 pod/pod-submit-status-1-0 node/ip-10-0-64-101.ec2.internal reason/Scheduled
Sep 30 16:15:38.577 W ns/e2e-pods-3686 pod/pod-submit-status-1-0 node/ip-10-0-64-101.ec2.internal reason/GracefulDelete in 1s
Sep 30 16:15:39.945 I ns/e2e-pods-3686 pod/pod-submit-status-1-0 reason/AddedInterface Add eth0 [10.129.2.136/23]
Sep 30 16:15:40.473 W ns/e2e-pods-3686 pod/pod-submit-status-1-0 node/ip-10-0-64-101.ec2.internal reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Sep 30 16:16:49.853 - 224s  W ns/e2e-pods-3686 pod/pod-submit-status-1-0 node/ip-10-0-64-101.ec2.internal pod has been pending longer than a minute
failed: (5m2s) 2020-09-30T16:20:38 "[k8s.io] [sig-node] Pods Extended [k8s.io] Pod Container Status should never report success for a pending container [Suite:openshift/conformance/parallel] [Suite:k8s]"
...


CI search suggests this is rare:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Failed+to+create+pod+sandbox.*json+message%3A+EOF&maxAge=48h&type=junit' | grep 'failures match'
pull-ci-openshift-openshift-apiserver-master-e2e-cmd - 5 runs, 60% failed, 33% of failures match

(dunno why it hasn't indexed my proxy job yet).

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1311317997826084864

Comment 1 W. Trevor King 2020-10-01 03:57:11 UTC
Happened again [1]:

Oct 01 02:54:30.947 W ns/e2e-pods-1974 pod/pod-submit-status-2-14 node/ip-10-0-71-148.us-east-2.compute.internal reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF

Possibly image access or some such through the proxy is just slow enough to make us more likely to trip over a timeout?

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1311478040370352128

Comment 2 W. Trevor King 2020-10-01 04:01:51 UTC
[1] shows this test failing in six of the past 15 proxy jobs that made it far enough to run the e2e suite, so yeah, pretty common in that environment.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy

Comment 3 W. Trevor King 2020-10-01 04:55:49 UTC
Node journal from ip-10-0-64-101 from comment 0:

  $ LOG_LINES="$(curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1311317997826084864/artifacts/e2e-aws-proxy/gather-extra/nodes/ip-10-0-64-101.ec2.internal/journal | gunzip | grep 'pod-submit-status-1-0')"
  $ echo "${LOG_LINES}" | grep 'terminated, but pod cgroup sandbox has not been cleaned up' | head -n1Sep 30 16:15:42.193301 ip-10-0-64-101 hyperkube[1485]: I0930 16:15:42.192674    1485 kubelet_pods.go:980] Pod "pod-submit-status-1-0_e2e-pods-3686(104d65f0-3047-49a9-bec6-001ac4cb0012)" is terminated, but pod cgroup sandbox has not been cleaned up
  $ echo "${LOG_LINES}" | grep 'terminated, but pod cgroup sandbox has not been cleaned up' | wc -l
131

That error message is also mentioned in bug 1822872, which was closed INSUFFICIENT_DATA.  Maybe the proxy CI environment will trigger this frequently enough for further debugging.
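As a small follow-up sketch (reusing the LOG_LINES variable above, and assuming the journal timestamp format shown in the output), printing the first and last occurrence of that message shows how long the kubelet kept reporting the stale pod cgroup:

  $ echo "${LOG_LINES}" \
      | grep 'terminated, but pod cgroup sandbox has not been cleaned up' \
      | awk '{print $1, $2, $3}' \
      | sed -n '1p;$p'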

Comment 4 Seth Jennings 2020-10-01 14:12:58 UTC
Peter, could you look at this one?

Comment 10 Peter Hunt 2020-11-10 16:55:42 UTC
*** Bug 1887857 has been marked as a duplicate of this bug. ***

Comment 14 Clayton Coleman 2021-01-12 15:13:44 UTC
> If one follows the pod whose creation failed, you can see above that the kubelet kills the cgroup before the infra container is created for the pod. I am not sure why the API server is sending an API create and then a delete request so quickly

The e2e test specifically runs this to ensure the kubelet functions correctly under these sorts of race conditions (we had a number of serious consistency issues in the kubelet, and the test exposes them).  So think of this as the test focusing on a scenario that often exposes race conditions or startup bugs that would otherwise be much rarer (but still very bad in production).
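
For reference, the shape of that race can be approximated from the command line (a rough sketch only, not the upstream test code in test/e2e/node/pods.go; the pod name here is made up and the real test uses its own image, timings, and a watch client): create a pod, delete it almost immediately with a short grace period, then watch the status transitions to confirm a still-pending container is never reported as having terminated successfully.

  # Illustrative approximation of the create-then-quick-delete race the test drives.
  $ kubectl run pod-submit-status-demo --image=busybox --restart=Never \
      --command -- /bin/false
  $ kubectl delete pod pod-submit-status-demo --grace-period=1 --wait=false
  # A pending container should never show a terminated state with exit code 0
  # (reason Completed) in the watched status updates.
  $ kubectl get pod pod-submit-status-demo --watch -o wide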