Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1904558

Summary: Random init-p error when trying to start pod
Product: OpenShift Container Platform
Reporter: huizenga
Component: Node
Sub component: CRI-O
Assignee: Kir Kolyshkin <kir>
QA Contact: MinLi <minmli>
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
CC: achernet, agurenko, aos-bugs, dwalsh, jokerman, kir, nagrawal, pehunt, redhat-info, sperezto, tsweeney
Version: 4.6.z
Keywords: AutomationBlocker
Target Release: 4.8.0
Hardware: x86_64
OS: Linux
Last Closed: 2021-07-27 22:34:40 UTC
Type: Bug
Attachments:
kube manifests to deploy operator (Flags: none)

Description huizenga 2020-12-04 18:32:34 UTC
Created attachment 1736511 [details]
kube manifests to deploy operator

Description of problem:

Error: container create failed: time="2020-12-04T18:15:36Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: read init-p: connection reset by peer"


I have a set of kube manifest files that deploy an operator. I see random behavior when applying these files to my OCP 4.6.6 cluster: it will work one time and fail the next. Sometimes it fixes itself after several attempts. This is the only workload running in the entire AWS OCP 4.6.6 cluster, which was built using the tryopenshift process.


Version-Release number of selected component (if applicable):
OCP 4.6.6 built using the tryopenshift process running on AWS


How reproducible:
Random
oc apply -f <attached file>
oc delete -f <attached file>
repeat the process; it works sometimes and fails at other times


Steps to Reproduce:
1. oc apply -f <attached file>
2. oc delete -f <attached file>
3. Repeat the process several times and describe the pod in the "ibm-sample-panamax-operator-system" namespace

Actual results:

Events:
  Type     Reason          Age                From               Message
  ----     ------          ----               ----               -------
  Normal   Scheduled       13m                default-scheduler  Successfully assigned ibm-sample-panamax-operator-system/ibm-sample-panamax-operator-controller-manager-6747f68844-4v424 to ip-10-0-189-181.us-east-2.compute.internal
  Normal   AddedInterface  13m                multus             Add eth0 [10.128.3.171/23]
  Normal   Pulled          13m                kubelet            Successfully pulled image "docker.io/ibmcom/ibm-sample-panamax-operator@sha256:ff2491c84644620be0e1cef23d26b67387ae4e3da404f0ee1178f9f507021b0b" in 1.476447204s
  Warning  Failed          13m                kubelet            Error: container create failed: time="2020-12-04T18:15:36Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: read init-p: connection reset by peer"
  Normal   Pulled          13m                kubelet            Successfully pulled image "docker.io/ibmcom/ibm-sample-panamax-operator@sha256:ff2491c84644620be0e1cef23d26b67387ae4e3da404f0ee1178f9f507021b0b" in 958.432307ms
  Warning  Failed          13m                kubelet            Error: container create failed: time="2020-12-04T18:15:43Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: read init-p: connection reset by peer"
  Normal   Pulled          13m                kubelet            Successfully pulled image "docker.io/ibmcom/ibm-sample-panamax-operator@sha256:ff2491c84644620be0e1cef23d26b67387ae4e3da404f0ee1178f9f507021b0b" in 1.022095502s
  Warning  Failed          13m                kubelet            Error: container create failed: time="2020-12-04T18:16:04Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: read init-p: connection reset by peer"
  Normal   Pulled          13m                kubelet            Successfully pulled image "docker.io/ibmcom/ibm-sample-panamax-operator@sha256:ff2491c84644620be0e1cef23d26b67387ae4e3da404f0ee1178f9f507021b0b" in 888.78649ms
  Warning  Failed          12m                kubelet            Error: container create failed: time="2020-12-04T18:16:22Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: read init-p: connection reset by peer"
  Normal   Pulled          12m                kubelet            Successfully pulled image "docker.io/ibmcom/ibm-sample-panamax-operator@sha256:ff2491c84644620be0e1cef23d26b67387ae4e3da404f0ee1178f9f507021b0b" in 976.252614ms
  Warning  Failed          12m                kubelet            Error: container create failed: time="2020-12-04T18:16:42Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: read init-p: connection reset by peer"


Expected results:

The pod starts up consistently, without multiple errors and retries.


Additional info:

Comment 1 Tom Sweeney 2020-12-04 20:31:31 UTC
As this came in at the very end of the current sprint, this will be looked into in the next sprint.

Comment 6 Kir Kolyshkin 2021-03-01 03:08:22 UTC
The behavior greatly depends on the kernel (and I believe the "randomness factor" comes from the kernel; it might also come from the Go runtime). In any case, I have made some improvements in runc to handle cgroup limit failures and OOM kills better and to report a more sensible status:

- https://github.com/opencontainers/runc/pull/2812
- https://github.com/opencontainers/runc/pull/2814

(both already merged and will be available in rc94)

I would not say it fixes everything, but it is a definite improvement.

Still looking into runc memory usage during init.

Comment 8 Harshal Patil 2021-05-17 09:45:53 UTC
*** Bug 1959322 has been marked as a duplicate of this bug. ***

Comment 9 Kir Kolyshkin 2021-05-21 17:44:36 UTC
> The customer is asking why this did not end with an OOMKill. If there is a limit blocking the creation, it should be an OOMKill, right?

Silvio,

This is an OOM kill, but because it happened at a very early stage of container start, it was not caught in the usual way. Recent changes in runc (see my comment above) made the error message less confusing (i.e. it now reports an OOM if one occurred during that early stage).
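[Editor's note: to illustrate the mechanism described above: with cgroup v2, the kernel records OOM kills in the cgroup's memory.events file, and a runtime can consult its oom_kill counter after a failed container init to report an OOM instead of the opaque "read init-p" error. A minimal Python sketch of that kind of check; the oom_killed helper is hypothetical and is not runc's actual (Go) code.]

```python
def oom_killed(memory_events: str) -> bool:
    """Return True if the 'oom_kill' counter in the contents of a
    cgroup v2 memory.events file is non-zero, i.e. the kernel OOM
    killer fired inside this cgroup."""
    for line in memory_events.splitlines():
        key, _, value = line.partition(" ")
        if key == "oom_kill" and value.strip().isdigit() and int(value) > 0:
            return True
    return False

# Example contents as they might appear in
# /sys/fs/cgroup/<container-path>/memory.events
sample = "low 0\nhigh 0\nmax 12\noom 3\noom_kill 3\n"
print(oom_killed(sample))  # True: init was OOM-killed before the
                           # "read init-p" error surfaced
```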

Comment 10 Kir Kolyshkin 2021-05-21 17:48:35 UTC
Should be fixed by 
* https://github.com/kubernetes/kubernetes/pull/102196 (Kubernetes 1.21)
* https://github.com/kubernetes/kubernetes/pull/102147 (Kubernetes master)

Comment 13 MinLi 2021-06-04 10:34:22 UTC
Tried several times and could not reproduce on version 4.8.0-0.nightly-2021-06-03-221810.
Verified.

Comment 16 errata-xmlrpc 2021-07-27 22:34:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438