Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1959322

Summary: Cannot start container with CreateContainerError for the cnf-app-mac-operator
Product: OpenShift Container Platform
Component: Node
Sub Component: Kubelet
Version: 4.7
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Reporter: Gurenko Alex <agurenko>
Assignee: Harshal Patil <harpatil>
QA Contact: Sunil Choudhary <schoudha>
CC: aadam, aos-bugs, ehashman, fidencio, harpatil, mcornea
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
Type: Bug
Last Closed: 2021-05-17 09:45:57 UTC

Description Gurenko Alex 2021-05-11 09:34:31 UTC
Description of problem: I'm trying to run an operator on OCP 4.7 that works fine on OCP 4.6, and I am getting a CreateContainerError for its container.


Version-Release number of selected component (if applicable):

The issue presents itself on fresh installations of 4.7.9 and 4.7.10 on virt, and of 4.7.9 on bare metal.


How reproducible: 100%


Steps to Reproduce:
1. Deploy the operator by creating a Subscription (a minimal sketch follows below)
2. Check the CSV status
3. Check the pod logs
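
A minimal Subscription sketch for step 1, assuming the operator is published through a custom CatalogSource; the channel, package name, and catalog names below are placeholders, not the exact ones from this report (only the example-cnf namespace is confirmed by the events further down):

$ cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: cnf-app-mac-operator
  namespace: example-cnf
spec:
  channel: alpha                          # placeholder channel
  name: cnf-app-mac-operator              # placeholder package name
  source: my-catalog                      # placeholder CatalogSource
  sourceNamespace: openshift-marketplace
EOF

$ oc get csv -n example-cnf                                              # step 2
$ oc logs -n example-cnf deploy/cnf-app-mac-operator-controller-manager  # step 3 (deployment name inferred from the pod name below)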

Actual results:

[kni@provisionhost-0-0 ~]$ oc get pods
NAME                                                       READY   STATUS                 RESTARTS   AGE
cnf-app-mac-operator-controller-manager-5b7fc47489-xjqkm   0/1     CreateContainerError   4          55m

[kni@provisionhost-0-0 ~]$ oc describe pod cnf-app-mac-operator-controller-manager-5b7fc47489-xjqkm

Events:
Type     Reason                  Age   From               Message
----     ------                  ----  ----               -------
Normal   Scheduled               2m9s  default-scheduler  Successfully assigned example-cnf/cnf-app-mac-operator-controller-manager-5b7fc47489-xjqkm to worker-0-0
Normal   AddedInterface          2m7s  multus             Add eth0 [10.128.2.21/23]
Warning  FailedCreatePodSandBox  2m2s  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = `/usr/bin/runc --root /run/runc start 95568f5f9e92e4fa294ba0ab382ac5c8dec19baa502b22e8b03b272423efc94c` failed: time="2021-05-11T08:25:07Z" level=error msg="cannot start a container that has stopped"
(exit status 1)
Normal   AddedInterface          119s                multus   Add eth0 [10.128.2.21/23]
Normal   Pulled                  113s                kubelet  Successfully pulled image "registry.agurenko-cluster-0.lab.eng.tlv2.redhat.com:5000/rh-nfv-int/cnf-app-mac-operator:v0.2.4" in 2.0681892s
Warning  Failed                  112s                kubelet  Error: container create failed: time="2021-05-11T08:25:17Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: read init-p: connection reset by peer"
Normal   Pulled                  111s                kubelet  Successfully pulled image "registry.agurenko-cluster-0.lab.eng.tlv2.redhat.com:5000/rh-nfv-int/cnf-app-mac-operator:v0.2.4" in 24.072625ms
Warning  Failed                  110s                kubelet  Error: container create failed: time="2021-05-11T08:25:19Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: read init-p: connection reset by peer"
Normal   Pulled                  98s                 kubelet  Successfully pulled image "registry.agurenko-cluster-0.lab.eng.tlv2.redhat.com:5000/rh-nfv-int/cnf-app-mac-operator:v0.2.4" in 25.255614ms
Warning  Failed                  96s                 kubelet  Error: container create failed: time="2021-05-11T08:25:33Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: read init-p: connection reset by peer"
Normal   Pulling                 81s (x4 over 116s)  kubelet  Pulling image "registry.agurenko-cluster-0.lab.eng.tlv2.redhat.com:5000/rh-nfv-int/cnf-app-mac-operator:v0.2.4"
Normal   Pulled                  81s                 kubelet  Successfully pulled image "registry.agurenko-cluster-0.lab.eng.tlv2.redhat.com:5000/rh-nfv-int/cnf-app-mac-operator:v0.2.4" in 25.349043ms
Warning  Failed                  78s                 kubelet  Error: container create failed: time="2021-05-11T08:25:51Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:382: sending config to init process caused: write init-p: broken pipe"
Normal   AddedInterface          76s                 multus   Add eth0 [10.128.2.21/23]
Warning  FailedCreatePodSandBox  74s                 kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = container create failed: time="2021-05-11T08:25:55Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: "
Normal   AddedInterface          59s                 multus   Add eth0 [10.128.2.21/23]
Warning  FailedCreatePodSandBox  57s                 kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = container create failed: time="2021-05-11T08:26:12Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: "
Normal   AddedInterface          41s                 multus   Add eth0 [10.128.2.21/23]
Warning  FailedCreatePodSandBox  39s                 kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = `/usr/bin/runc --root /run/runc start e5044ea06069bd843398baf232588b227a85eedc75a8302715920c4b3244caa0` failed: time="2021-05-11T08:26:30Z" level=error msg="cannot start a container that has stopped"
(exit status 1)
Normal   AddedInterface          26s                multus   Add eth0 [10.128.2.21/23]
Warning  FailedCreatePodSandBox  24s                kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = container create failed: time="2021-05-11T08:26:45Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: process_linux.go:461: writing syncT 'resume' caused: write init-p: broken pipe"
Normal   SandboxChanged          11s (x5 over 77s)  kubelet  Pod sandbox changed, it will be killed and re-created.
Normal   AddedInterface          9s                 multus   Add eth0 [10.128.2.21/23]
Warning  FailedCreatePodSandBox  7s                 kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = container create failed: time="2021-05-11T08:27:01Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: read init-p: connection reset by peer"
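
To pull the matching CRI-O and kubelet logs straight from the node (worker-0-0, per the scheduling event above), something like this should work:

$ oc adm node-logs worker-0-0 -u crio | grep -i 'starting container process'
$ oc adm node-logs worker-0-0 -u kubelet | tail -n 200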


Expected results:

The pod starts successfully and the operator installs successfully.

Additional info:
The same operator works on various OCP 4.6 versions.

Comment 2 Gurenko Alex 2021-05-11 11:21:36 UTC
Just FYI, this might not be kata related; I'm not sure I picked the right component. It's a regular deployment of OCP.

Comment 3 Fabiano FidĂȘncio 2021-05-11 11:55:36 UTC
This is definitely not related to kata-containers at all. The container failing to start is using runc, and the operator used is not the `sandboxed-containers` one.
Let me re-assign it to what I think may be the right component.

Comment 4 Federico Paolinelli 2021-05-11 13:40:57 UTC
This is not the right component, sorry.

Comment 5 Gurenko Alex 2021-05-11 14:04:01 UTC
I'm not sure that CNF is the right group either; it looks more like a general podman/runc issue.

I've been monitoring the environment, and it looks like oom-kill is being invoked for unknown reasons:

May 11 13:59:36 worker-0-1 kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=crio-ec1a1e99d5bcbea421595974f82ade54748ce554eeff196cc50e987add54d1df.scope,mems_allowed=0,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-bursta>
May 11 13:59:36 worker-0-1 kernel: Memory cgroup out of memory: Killed process 104706 (runc:[2:INIT]) total-vm:729424kB, anon-rss:6144kB, file-rss:1200kB, shmem-rss:0kB, UID:0
May 11 13:59:36 worker-0-1 hyperkube[3522]: I0511 13:59:36.324442    3522 oom_watcher_linux.go:76] Got sys oom event: &{104706 runc:[2:INIT] 2021-05-11 13:59:35.14392774 +0000 UTC m=+4139.686261680 / / }
May 11 13:59:36 worker-0-1 kernel: oom_reaper: reaped process 104706 (runc:[2:INIT]), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
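
Note that constraint=CONSTRAINT_MEMCG means the kill is enforced by a memory cgroup limit, not by host memory pressure. A rough way to inspect the limit on the affected crio scope (cgroup v1 layout assumed; <scope-id> stands for the truncated scope id from the oom-kill line above):

$ oc debug node/worker-0-1 -- chroot /host \
    cat /sys/fs/cgroup/memory/kubepods.slice/kubepods-burstable.slice/<scope-id>/memory.limit_in_bytes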

The host has 26G of available RAM

[kni@provisionhost-0-0 ~]$ oc describe node worker-0-1

Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource           Requests      Limits
--------           --------      ------
cpu                1234m (16%)   300m (4%)
memory             4718Mi (15%)  158Mi (0%)
ephemeral-storage  0 (0%)        0 (0%)
hugepages-1Gi      0 (0%)        0 (0%)
hugepages-2Mi      0 (0%)        0 (0%)
Events:
Type     Reason     Age                   From     Message
----     ------     ----                  ----     -------
Warning  SystemOOM  19m                   kubelet  System OOM encountered, victim process: elasticsearch-o, pid: 73894
Warning  SystemOOM  19m                   kubelet  System OOM encountered, victim process: pod, pid: 73605
Warning  SystemOOM  19m                   kubelet  System OOM encountered, victim process: elasticsearch-o, pid: 74801
Warning  SystemOOM  19m                   kubelet  System OOM encountered, victim process: pod, pid: 74694
Warning  SystemOOM  18m                   kubelet  System OOM encountered, victim process: elasticsearch-o, pid: 75903
Warning  SystemOOM  18m                   kubelet  System OOM encountered, victim process: pod, pid: 75503
Warning  SystemOOM  18m                   kubelet  System OOM encountered, victim process: elasticsearch-o, pid: 77169
Warning  SystemOOM  18m                   kubelet  System OOM encountered, victim process: pod, pid: 76678
Warning  SystemOOM  17m                   kubelet  System OOM encountered, victim process: elasticsearch-o, pid: 79353
Warning  SystemOOM  2m15s (x26 over 17m)  kubelet  (combined from similar events): System OOM encountered, victim process: runc:[2:INIT], pid: 107285
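
To keep watching these OOM kills cluster-wide while reproducing, filtering events by reason should work:

$ oc get events -A --field-selector reason=SystemOOM --watch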

Comment 6 Fabiano FidĂȘncio 2021-05-11 17:34:08 UTC
Moving this to the Node team, specifically to the Memory manager subcomponent.

The subcomponent may be wrong, but I think now you're being redirected to the correct component, Alex.

Comment 7 Gurenko Alex 2021-05-11 18:23:51 UTC
(In reply to Fabiano FidĂȘncio from comment #6)
> Moving this to the Node team, specifically to the Memory manager
> subcomponent.
> 
> The subcomponent may be wrong, but I think now you're being redirected to
> the correct component, Alex.

Thanks a lot!

Comment 8 Harshal Patil 2021-05-17 09:45:57 UTC

*** This bug has been marked as a duplicate of bug 1904558 ***