Bug 1883991
| Summary: | FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Node | Assignee: | Harshal Patil <harpatil> |
| Node sub component: | Kubelet | QA Contact: | Sunil Choudhary <schoudha> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | aos-bugs, ccoleman, dwalsh, fdeutsch, harpatil, jokerman, nagrawal, tsweeney |
| Version: | 4.6 | Keywords: | UpcomingSprint |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1914022 (view as bug list) | Environment: | |
| Last Closed: | 2021-01-20 05:19:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1896387 | | |
Description
W. Trevor King
2020-09-30 17:05:31 UTC
Happened again [1]:

    Oct 01 02:54:30.947 W ns/e2e-pods-1974 pod/pod-submit-status-2-14 node/ip-10-0-71-148.us-east-2.compute.internal reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF

Possibly image access or some such through the proxy is just slow enough to make us more likely to trip over a timeout?

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1311478040370352128

[1] shows this test failing in six of the past 15 proxy jobs that made it far enough to run the e2e suite, so yeah, pretty common in that environment.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy

Node journal from ip-10-0-64-101 from comment 0:

$ LOG_LINES="$(curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1311317997826084864/artifacts/e2e-aws-proxy/gather-extra/nodes/ip-10-0-64-101.ec2.internal/journal | gunzip | grep 'pod-submit-status-1-0')"
$ echo "${LOG_LINES}" | grep 'terminated, but pod cgroup sandbox has not been cleaned up' | head -n1
Sep 30 16:15:42.193301 ip-10-0-64-101 hyperkube[1485]: I0930 16:15:42.192674 1485 kubelet_pods.go:980] Pod "pod-submit-status-1-0_e2e-pods-3686(104d65f0-3047-49a9-bec6-001ac4cb0012)" is terminated, but pod cgroup sandbox has not been cleaned up
$ echo "${LOG_LINES}" | grep 'terminated, but pod cgroup sandbox has not been cleaned up' | wc -l
131

That error message is also mentioned in bug 1822872, which was closed INSUFFICIENT_DATA. Maybe the proxy CI environment will trigger this frequently enough for further debugging. Peter, could you look at this one?

*** Bug 1887857 has been marked as a duplicate of this bug. ***

> If one follows the pod whose creation failed, you can see above that the kubelet kills the cgroup before the infra container is created for the pod. I am not sure why the API server is sending a create and then a delete request so quickly.
The e2e test specifically runs this to ensure the kubelet functions correctly under these sorts of race conditions (we had a number of serious consistency issues in the kubelet; the test exposes them). So think of this as the test focusing on a scenario that often exposes race conditions or bugs in startup that would otherwise be much rarer (but still very bad in production).
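For anyone trying to reproduce outside the e2e suite, here is a minimal sketch of the same shape of churn the test generates: create pods and delete them almost immediately so the kubelet may tear down the pod cgroup while sandbox creation is still in flight. The namespace, pod names, image, and iteration count below are illustrative, not the test's actual values.

```bash
# Rapid create-then-delete loop, mimicking the pod-submit-status pattern.
# NOTE: namespace, pod name, image, and count are illustrative placeholders.
NS=sandbox-race-test
oc create namespace "${NS}"
for i in $(seq 1 50); do
  cat <<EOF | oc apply -n "${NS}" -f -
apiVersion: v1
kind: Pod
metadata:
  name: race-test-${i}
spec:
  restartPolicy: Never
  containers:
  - name: busybox
    image: registry.access.redhat.com/ubi8/ubi-minimal
    command: ["/bin/sh", "-c", "sleep 1"]
EOF
  # Delete without waiting, so the kubelet sees a delete while the
  # sandbox for the pod may still be coming up.
  oc delete pod -n "${NS}" "race-test-${i}" --wait=false
done
# Look for the sandbox-creation failure in the namespace events.
oc get events -n "${NS}" | grep FailedCreatePodSandBox
```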
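To gauge how widespread this is in a given CI run, the comment-0 journal grep can be extended across all node journals in a job's gather-extra artifacts. A rough sketch, assuming the same artifacts layout as comment 0; the artifacts URL and node list are placeholders to be filled in from the job being examined:

```bash
# Count the sandbox EOF error per node in a CI run's gathered journals.
# NOTE: ARTIFACTS and the node list are placeholders for a real job's paths.
ARTIFACTS="https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/<job>/<build>/artifacts/e2e-aws-proxy/gather-extra/nodes"
for node in ip-10-0-64-101.ec2.internal; do   # list the run's node names here
  echo "== ${node} =="
  curl -s "${ARTIFACTS}/${node}/journal" | gunzip | \
    grep -c 'error reading container (probably exited) json message: EOF' || true
done
```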