Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1904558

Summary: Random init-p error when trying to start pod
Product: OpenShift Container Platform
Reporter: huizenga
Component: Node
Sub component: CRI-O
Assignee: Kir Kolyshkin <kir>
QA Contact: MinLi <minmli>
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
CC: achernet, agurenko, aos-bugs, dwalsh, jokerman, kir, nagrawal, pehunt, redhat-info, sperezto, tsweeney
Version: 4.6.z
Keywords: AutomationBlocker
Target Release: 4.8.0
Hardware: x86_64
OS: Linux
Last Closed: 2021-07-27 22:34:40 UTC
Type: Bug
Attachments:
kube manifests to deploy operator (Flags: none)

Description huizenga 2020-12-04 18:32:34 UTC
Created attachment 1736511 [details]
kube manifests to deploy operator

Description of problem:

Error: container create failed: time="2020-12-04T18:15:36Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: read init-p: connection reset by peer"


I have a set of kube manifest files that deploy an operator. I see random behavior when applying these files to my OCP 4.6.6 cluster: it will work one time and fail the next. Sometimes it fixes itself after several attempts. This is the only workload running in the entire AWS OCP 4.6.6 cluster, which was built using the tryopenshift process.


Version-Release number of selected component (if applicable):
OCP 4.6.6 built using the tryopenshift process running on AWS


How reproducible:
Random
oc apply -f <attached file>
oc delete -f <attached file>
repeat the process; it works sometimes and fails at other times


Steps to Reproduce:
1. oc apply -f <attached file>
2. oc delete -f <attached file>
3. Repeat the process several times and describe the pod in the "ibm-sample-panamax-operator-system" namespace

Actual results:

Events:
  Type     Reason          Age                From               Message
  ----     ------          ----               ----               -------
  Normal   Scheduled       13m                default-scheduler  Successfully assigned ibm-sample-panamax-operator-system/ibm-sample-panamax-operator-controller-manager-6747f68844-4v424 to ip-10-0-189-181.us-east-2.compute.internal
  Normal   AddedInterface  13m                multus             Add eth0 [10.128.3.171/23]
  Normal   Pulled          13m                kubelet            Successfully pulled image "docker.io/ibmcom/ibm-sample-panamax-operator@sha256:ff2491c84644620be0e1cef23d26b67387ae4e3da404f0ee1178f9f507021b0b" in 1.476447204s
  Warning  Failed          13m                kubelet            Error: container create failed: time="2020-12-04T18:15:36Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: read init-p: connection reset by peer"
  Normal   Pulled          13m                kubelet            Successfully pulled image "docker.io/ibmcom/ibm-sample-panamax-operator@sha256:ff2491c84644620be0e1cef23d26b67387ae4e3da404f0ee1178f9f507021b0b" in 958.432307ms
  Warning  Failed          13m                kubelet            Error: container create failed: time="2020-12-04T18:15:43Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: read init-p: connection reset by peer"
  Normal   Pulled          13m                kubelet            Successfully pulled image "docker.io/ibmcom/ibm-sample-panamax-operator@sha256:ff2491c84644620be0e1cef23d26b67387ae4e3da404f0ee1178f9f507021b0b" in 1.022095502s
  Warning  Failed          13m                kubelet            Error: container create failed: time="2020-12-04T18:16:04Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: read init-p: connection reset by peer"
  Normal   Pulled          13m                kubelet            Successfully pulled image "docker.io/ibmcom/ibm-sample-panamax-operator@sha256:ff2491c84644620be0e1cef23d26b67387ae4e3da404f0ee1178f9f507021b0b" in 888.78649ms
  Warning  Failed          12m                kubelet            Error: container create failed: time="2020-12-04T18:16:22Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: read init-p: connection reset by peer"
  Normal   Pulled          12m                kubelet            Successfully pulled image "docker.io/ibmcom/ibm-sample-panamax-operator@sha256:ff2491c84644620be0e1cef23d26b67387ae4e3da404f0ee1178f9f507021b0b" in 976.252614ms
  Warning  Failed          12m                kubelet            Error: container create failed: time="2020-12-04T18:16:42Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: read init-p: connection reset by peer"


Expected results:

The pod starts up consistently, without multiple errors and retries.


Additional info:

Comment 1 Tom Sweeney 2020-12-04 20:31:31 UTC
As this came in at the very end of the current sprint, this will be looked into in the next sprint.

Comment 6 Kir Kolyshkin 2021-03-01 03:08:22 UTC
The behavior greatly depends on the kernel (and I believe the "randomness factor" comes from the kernel; it might also come from the Go runtime). In any case, I have made some improvements in runc to handle cgroup limit failures and OOM kills better and to report a more sensible status:

- https://github.com/opencontainers/runc/pull/2812
- https://github.com/opencontainers/runc/pull/2814

(both already merged and will be available in rc94)

I would not say it fixes everything, but it is a definite improvement.

Still looking into runc memory usage during init.

Comment 8 Harshal Patil 2021-05-17 09:45:53 UTC
*** Bug 1959322 has been marked as a duplicate of this bug. ***

Comment 9 Kir Kolyshkin 2021-05-21 17:44:36 UTC
> The customer is asking why this did not end with an OOMKill. If there is a limit blocking the creation, it should be an OOMKill, right?

Silvio,

This is an OOM kill, but because it happened at a very early stage of container start, it was not caught in the usual way. Recent changes in runc (see my comment above) made the error message less confusing (i.e. it now reports an OOM if one occurred during that early stage).
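[Editor's note: to illustrate the mechanism described above: with cgroup v2, the kernel records OOM kills in the cgroup's memory.events file, and a runtime can consult its oom_kill counter after a failed container init to report an OOM instead of the opaque "read init-p" error. A minimal Python sketch of that kind of check; the oom_killed helper is hypothetical and is not runc's actual (Go) code.]

```python
def oom_killed(memory_events: str) -> bool:
    """Return True if the 'oom_kill' counter in the contents of a
    cgroup v2 memory.events file is non-zero, i.e. the kernel OOM
    killer fired inside this cgroup."""
    for line in memory_events.splitlines():
        key, _, value = line.partition(" ")
        if key == "oom_kill" and value.strip().isdigit() and int(value) > 0:
            return True
    return False

# Example contents as they might appear in
# /sys/fs/cgroup/<container-path>/memory.events
sample = "low 0\nhigh 0\nmax 12\noom 3\noom_kill 3\n"
print(oom_killed(sample))  # True: init was OOM-killed before the
                           # "read init-p" error surfaced
```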

Comment 10 Kir Kolyshkin 2021-05-21 17:48:35 UTC
Should be fixed by 
* https://github.com/kubernetes/kubernetes/pull/102196 (Kubernetes 1.21)
* https://github.com/kubernetes/kubernetes/pull/102147 (Kubernetes master)

Comment 13 MinLi 2021-06-04 10:34:22 UTC
Tried several times and could not reproduce on version 4.8.0-0.nightly-2021-06-03-221810.
Verified.

Comment 16 errata-xmlrpc 2021-07-27 22:34:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438