Bug 1904558
| Summary: | Random init-p error when trying to start pod | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | huizenga |
| Component: | Node | Assignee: | Kir Kolyshkin <kir> |
| Node sub component: | CRI-O | QA Contact: | MinLi <minmli> |
| Status: | CLOSED ERRATA | Severity: | medium |
| Priority: | unspecified | CC: | achernet, agurenko, aos-bugs, dwalsh, jokerman, kir, nagrawal, pehunt, redhat-info, sperezto, tsweeney |
| Version: | 4.6.z | Keywords: | AutomationBlocker |
| Target Release: | 4.8.0 | Hardware: | x86_64 |
| OS: | Linux | Type: | Bug |
| Last Closed: | 2021-07-27 22:34:40 UTC | | |
As this came in at the very end of the current sprint, this will be looked into in the next sprint.

The behavior greatly depends on the kernel (and I believe the "randomness factor" comes from the kernel; it might also come from the Go runtime). Anyway, I have made some improvements in runc to handle cgroup limit failures and OOM kills better and to report a more sensible status:

- https://github.com/opencontainers/runc/pull/2812
- https://github.com/opencontainers/runc/pull/2814

Both are already merged and will be available in rc94. I would not say this fixes everything, but it is a definite improvement. I am still looking into runc memory usage during init.

*** Bug 1959322 has been marked as a duplicate of this bug. ***

> Customer is asking why it did not end with an OOMKill. If there is a limit blocking the creation, it should be an OOMKill, right?
Silvio,
This is an OOM kill, but because it happens at a very early stage of container start, it was not caught the usual way. Recent changes in runc (see my comment above) make the error message less confusing: it now reports the OOM kill if one occurred during that early stage.
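An early OOM kill of this kind still shows up in the cgroup's memory event counters even when the runtime error message is confusing. A minimal sketch of checking for it, using a fabricated sample file under `/tmp` for illustration (on a real node you would read the pod cgroup's `memory.events` under `/sys/fs/cgroup` instead):

```shell
# Sketch: detect recorded OOM kills from a cgroup v2 memory.events file.
# The file below is a fabricated sample for illustration; on a real node,
# read /sys/fs/cgroup/<pod-cgroup>/memory.events instead.
printf 'low 0\nhigh 0\nmax 3\noom 1\noom_kill 1\n' > /tmp/memory.events

# A non-zero oom_kill counter means the kernel OOM-killed a process in
# this cgroup, even if the runtime only reported an init-p error.
awk '$1 == "oom_kill" && $2 > 0 { print "OOM kills recorded:", $2 }' /tmp/memory.events
```

With the sample file above, the `awk` line prints `OOM kills recorded: 1`.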
Should be fixed by:

- https://github.com/kubernetes/kubernetes/pull/102196 (Kubernetes 1.21)
- https://github.com/kubernetes/kubernetes/pull/102147 (Kubernetes master)

Tried several times and can't reproduce on version 4.8.0-0.nightly-2021-06-03-221810. Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
Created attachment 1736511 [details]
kube manifests to deploy operator

Description of problem:

```
Error: container create failed: time="2020-12-04T18:15:36Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: read init-p: connection reset by peer"
```

I have a set of kube manifest files that deploy an operator. I see random behavior when applying these files to my OCP 4.6.6 cluster: it will work one time and fail the next. Sometimes it fixes itself after several attempts. This is the only workload running in the entire AWS OCP 4.6.6 cluster, which was built using the tryopenshift process.

Version-Release number of selected component (if applicable):

OCP 4.6.6 built using the tryopenshift process, running on AWS

How reproducible:

Random. Applying and then deleting the attached file, repeatedly, works sometimes and fails other times.

Steps to Reproduce:
1. oc apply -f <attached file>
2. oc delete -f <attached file>
3. Repeat the process several times and describe the pod in the "ibm-sample-panamax-operator-system" namespace.

Actual results:

```
Events:
  Type     Reason          Age  From               Message
  ----     ------          ---  ----               -------
  Normal   Scheduled       13m  default-scheduler  Successfully assigned ibm-sample-panamax-operator-system/ibm-sample-panamax-operator-controller-manager-6747f68844-4v424 to ip-10-0-189-181.us-east-2.compute.internal
  Normal   AddedInterface  13m  multus             Add eth0 [10.128.3.171/23]
  Normal   Pulled          13m  kubelet            Successfully pulled image "docker.io/ibmcom/ibm-sample-panamax-operator@sha256:ff2491c84644620be0e1cef23d26b67387ae4e3da404f0ee1178f9f507021b0b" in 1.476447204s
  Warning  Failed          13m  kubelet            Error: container create failed: time="2020-12-04T18:15:36Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: read init-p: connection reset by peer"
  Normal   Pulled          13m  kubelet            Successfully pulled image "docker.io/ibmcom/ibm-sample-panamax-operator@sha256:ff2491c84644620be0e1cef23d26b67387ae4e3da404f0ee1178f9f507021b0b" in 958.432307ms
  Warning  Failed          13m  kubelet            Error: container create failed: time="2020-12-04T18:15:43Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: read init-p: connection reset by peer"
  Normal   Pulled          13m  kubelet            Successfully pulled image "docker.io/ibmcom/ibm-sample-panamax-operator@sha256:ff2491c84644620be0e1cef23d26b67387ae4e3da404f0ee1178f9f507021b0b" in 1.022095502s
  Warning  Failed          13m  kubelet            Error: container create failed: time="2020-12-04T18:16:04Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: read init-p: connection reset by peer"
  Normal   Pulled          13m  kubelet            Successfully pulled image "docker.io/ibmcom/ibm-sample-panamax-operator@sha256:ff2491c84644620be0e1cef23d26b67387ae4e3da404f0ee1178f9f507021b0b" in 888.78649ms
  Warning  Failed          12m  kubelet            Error: container create failed: time="2020-12-04T18:16:22Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: read init-p: connection reset by peer"
  Normal   Pulled          12m  kubelet            Successfully pulled image "docker.io/ibmcom/ibm-sample-panamax-operator@sha256:ff2491c84644620be0e1cef23d26b67387ae4e3da404f0ee1178f9f507021b0b" in 976.252614ms
  Warning  Failed          12m  kubelet            Error: container create failed: time="2020-12-04T18:16:42Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: read init-p: connection reset by peer"
```

Expected results:

Pod starts up consistently without multiple errors and retries.

Additional info:
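The reproduction steps above amount to a simple apply/delete loop against a live cluster. A sketch, assuming `oc` is logged in to a cluster with the attached manifests at hand (the manifest filename and iteration count here are illustrative placeholders, not from the report):

```shell
# Sketch of the repro loop; requires a live OCP cluster and the attached
# kube manifests. Filename and iteration count are illustrative.
MANIFEST=operator-manifests.yaml   # stand-in for the attached file
NS=ibm-sample-panamax-operator-system

for i in $(seq 1 10); do
  oc apply -f "$MANIFEST"
  oc delete -f "$MANIFEST" --wait=true
done

# After a failing apply, look for the init-p error in the pod events:
oc apply -f "$MANIFEST"
oc describe pods -n "$NS" | grep 'container create failed' || true
```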