Bug 1776236 - Container fails to start with "lstat /proc/63538/ns/ipc: no such file or directory"
Summary: Container fails to start with "lstat /proc/63538/ns/ipc: no such file or directory"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Peter Hunt
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks: 1771572
 
Reported: 2019-11-25 10:27 UTC by Denis Ollier
Modified: 2023-10-06 18:49 UTC (History)
CC List: 20 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-13 17:12:18 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github kubevirt cluster-network-addons-operator pull 342 0 None closed drop limitations of resources 2021-02-11 00:50:09 UTC
Github openshift machine-config-operator pull 1689 0 None closed Bug 1831866: cri-o: manage ns lifecycle, again! 2021-02-11 00:50:10 UTC
Red Hat Bugzilla 1813350 0 urgent CLOSED Workaround BZ 1776236 (lstat /proc/63538/ns/ipc) 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:12:47 UTC

Description Denis Ollier 2019-11-25 10:27:28 UTC
Description of problem
----------------------

I deployed CNV 2.2 from the marketplace on an OCP 4.3 bare metal cluster, and the kube-cni-linux-bridge-plugin pods were failing:

> Error: container create failed: container_linux.go:338: creating new parent process caused "container_linux.go:1897: running lstat on namespace path \"/proc/63538/ns/ipc\" caused \"lstat /proc/63538/ns/ipc: no such file or directory\""

Relevant traces:

> kubectl -n openshift-cnv describe pod/kube-cni-linux-bridge-plugin
> 
> Status:         Pending
> IP:             10.128.0.55
> Controlled By:  DaemonSet/kube-cni-linux-bridge-plugin
> Containers:
>   cni-plugins:
>     Container ID:  
>     Image:         registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-containernetworking-plugins:v2.2.0-2
>     Image ID:      
>     Port:          <none>
>     Host Port:     <none>
>     Command:
>       /bin/bash
>       -c
>       cp -rf /usr/src/containernetworking/plugins/bin/*bridge /opt/cni/bin/
>       cp -rf /usr/src/containernetworking/plugins/bin/*tuning /opt/cni/bin/
>       # Some projects (e.g. openshift/console) use cnv- prefix to distinguish between
>       # binaries shipped by OpenShift and those shipped by KubeVirt (D/S matters).
>       # Following two lines make sure we will provide both names when needed.
>       find /opt/cni/bin/cnv-bridge || ln -s /opt/cni/bin/bridge /opt/cni/bin/cnv-bridge
>       find /opt/cni/bin/cnv-tuning || ln -s /opt/cni/bin/tuning /opt/cni/bin/cnv-tuning
>       echo "Entering sleep... (success)"
>       sleep infinity
>       
>     State:          Waiting
>       Reason:       CreateContainerError
>     Ready:          False
>     Restart Count:  0
>     Limits:
>       cpu:     60m
>       memory:  30Mi
>     Requests:
>       cpu:        60m
>       memory:     30Mi
>     Environment:  <none>
>     Mounts:
>       /opt/cni/bin from cnibin (rw)
>       /var/run/secrets/kubernetes.io/serviceaccount from linux-bridge-token-r6r4l (ro)
> Conditions:
>   Type              Status
>   Initialized       True 
>   Ready             False 
>   ContainersReady   False 
>   PodScheduled      True 
> Volumes:
>   cnibin:
>     Type:          HostPath (bare host directory volume)
>     Path:          /var/lib/cni/bin
>     HostPathType:  
>   linux-bridge-token-r6r4l:
>     Type:        Secret (a volume populated by a Secret)
>     SecretName:  linux-bridge-token-r6r4l
>     Optional:    false
> QoS Class:       Guaranteed
> Node-Selectors:  beta.kubernetes.io/arch=amd64
> Tolerations:     node-role.kubernetes.io/master:NoSchedule
>                  node.kubernetes.io/disk-pressure:NoSchedule
>                  node.kubernetes.io/memory-pressure:NoSchedule
>                  node.kubernetes.io/not-ready:NoExecute
>                  node.kubernetes.io/pid-pressure:NoSchedule
>                  node.kubernetes.io/unreachable:NoExecute
>                  node.kubernetes.io/unschedulable:NoSchedule
> Events:
>   Type     Reason     Age                   From                                              Message
>   ----     ------     ----                  ----                                              -------
>   Normal   Scheduled  <unknown>             default-scheduler                                 Successfully assigned openshift-cnv/kube-cni-linux-bridge-plugin-p8q4j to cnv-qe-11.cnvqe.lab.eng.rdu2.redhat.com
>   Normal   Pulling    12m                   kubelet, cnv-qe-11.cnvqe.lab.eng.rdu2.redhat.com  Pulling image "registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-containernetworking-plugins:v2.2.0-2"
>   Normal   Pulled     12m                   kubelet, cnv-qe-11.cnvqe.lab.eng.rdu2.redhat.com  Successfully pulled image "registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-containernetworking-plugins:v2.2.0-2"
>   Warning  Failed     12m                   kubelet, cnv-qe-11.cnvqe.lab.eng.rdu2.redhat.com  Error: error reading container (probably exited) json message: EOF
>   Warning  Failed     9m54s (x11 over 12m)  kubelet, cnv-qe-11.cnvqe.lab.eng.rdu2.redhat.com  Error: container create failed: container_linux.go:338: creating new parent process caused "container_linux.go:1897: running lstat on namespace path \"/proc/63538/ns/ipc\" caused \"lstat /proc/63538/ns/ipc: no such file or directory\""
>   Normal   Pulled     2m50s (x39 over 12m)  kubelet, cnv-qe-11.cnvqe.lab.eng.rdu2.redhat.com  Container image "registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-containernetworking-plugins:v2.2.0-2" already present on machine

Version-Release number of selected component
--------------------------------------------

- CNV 2.2
- Deployed using HCO image registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-hyperconverged-cluster-operator:v2.2.0-6
- OCP 4.3.0-0.nightly-2019-11-12-204120

Additional info
---------------

I undeployed and redeployed CNV several times. I observed the issue most of the time but not always on the same node(s).

Comment 1 Petr Horáček 2019-11-28 11:18:37 UTC
Denis, is our container the only one that fails with this error on your cluster? I don't think this is related to anything in our container. It looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1763583. Is it possible that it is caused by a faulty environment and was fixed in newer versions of OCP?

Comment 2 Denis Ollier 2019-11-28 11:36:32 UTC
On my BM cluster, the problem occurs only with kube-cni-linux-bridge-plugin pods.

I had to undeploy/redeploy CNV 4 times before all the kube-cni-linux-bridge-plugin pods managed to become Ready.

I couldn't figure out a common factor among the failures. They occurred on different nodes each time. Both masters and workers were affected.

Comment 3 Petr Horáček 2019-12-03 13:04:21 UTC
Denis, I still want to believe it was just an issue in the environment. Could you please check it for the last time? With the latest possible OCP and CNV. If you still see it, I'd start investigating it.

Comment 4 Denis Ollier 2019-12-04 09:55:40 UTC
(In reply to Petr Horáček from comment #3)
> Denis, I still want to believe it was just an issue in the environment.
> Could you please check it for the last time? With the latest possible OCP
> and CNV. If you still see it, I'd start investigating it.

As is often the case, you were well inspired: I retried with the latest software versions and couldn't reproduce the issue.

- CNV 2.2
- Deployed using HCO image registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-hyperconverged-cluster-operator:v2.2.0-9
- OCP 4.3.0-0.nightly-2019-11-29-051144

Comment 5 Petr Horáček 2019-12-04 10:40:24 UTC
Thanks Denis! :)

Comment 6 Petr Horáček 2020-01-21 10:30:13 UTC
I noticed this issue yesterday on a bare-metal cluster. Its linux-bridge CNI and kubernetes-nmstate pods had restarted many times due to OOM. After some of these restarts, they got stuck for a while on this very same issue. Both of these pods use the host network namespace. I don't think this is a CNV bug; moving it to OCP.

OCP people, we have an issue where pods using the host network namespace sometimes fail to start after a restart with "container_linux.go:1897: running lstat on namespace path \"/proc/63538/ns/ipc\" caused \"lstat /proc/63538/ns/ipc: no such file or directory\"". Are you aware of this issue?
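
For illustration, my reading of the error: the runtime is asked to join an existing namespace via its /proc path, and the lstat failure means the process that owned that namespace is already gone. The PID 63538 is just the one from this particular message; on an affected node it would be whatever PID the error reports.

  ls -l /proc/63538/ns/ipc
  # when the bug hits, this is expected to fail with "No such file or directory"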

Comment 7 Stephen Cuppett 2020-01-21 15:08:47 UTC
Setting target release to current development branch (4.4). Fixes, if any, requested/required on previous releases will result in cloned BZs targeting the z-stream releases where appropriate.

Comment 8 Weibin Liang 2020-02-11 14:46:39 UTC
Hi Denis Ollier,

I am OpenShift QE trying to help our developers debug this bug locally. Could you guide me through deploying CNV 2.2 from the marketplace on an OCP 4.3 bare metal cluster, including kube-cni-linux-bridge-plugin? Are there any docs I can follow? Thanks!

Comment 15 Petr Horáček 2020-02-28 15:35:46 UTC
Since this bug happens for pods both in the host netns and on the SDN, it seems unlikely to be caused by networking. After a discussion with Aniket, we realized this may need help from the kubelet team.

Comment 16 Denis Ollier 2020-02-28 15:40:53 UTC
It seems this issue only happens with RHCOS workers. RHEL workers don't have this problem.

Comment 17 Denis Ollier 2020-03-02 11:50:50 UTC
I deployed an OCP 4 cluster with RHEL 7 workers.

kube-cni-linux-bridge-plugin pods are failing on masters (RHCOS) but not on workers (RHEL).

Comment 18 Peter Hunt 2020-03-03 16:37:56 UTC
Long shot, but are there any OOM kills in dmesg?

I suspect this is not strictly a problem with this particular pod, nor with RHCOS any more than with RHEL.
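
A generic way to check for OOM kills on a node (not specific to this report; e.g. from a debug shell opened with "oc debug node/<node-name>"):

  dmesg -T | grep -iE 'out of memory|oom-kill|killed process'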

Comment 20 Denis Ollier 2020-03-10 18:08:51 UTC
The issue is still present with OCP 4.4.0-0.nightly-2020-03-10-042427, which, IIUC, should include the fix for BZ#1806786.

Comment 21 Nelly Credi 2020-03-11 08:44:51 UTC
Looks like this bug got pushed out to 4.5, but it affects CNV 2.3, which should be released on the same day as OCP 4.4.
This is a blocker for our release and we must get it fixed.

Comment 22 Mark McLoughlin 2020-03-11 11:05:19 UTC
(In reply to Denis Ollier from comment #20)
> Issue is still present with OCP 4.4.0-0.nightly-2020-03-10-042427 which
> should include the fix for BZ#1806786 IIUC.

Do you have any further information on the fastest and most reliable path to reproducing this? i.e. should everybody installing CNV 2.3 on a bare metal OCP 4.4 cluster expect to immediately see this on all nodes? Or is there anything you have to do post-install to reliably trigger it? Thanks.

(It's probably useful to be explicit in your answer whether OOM is involved at all?)

Comment 23 Denis Ollier 2020-03-11 12:50:57 UTC
(In reply to Mark McLoughlin from comment #22)
> 
> Do you have any further information on the fastest and most reliable path to
> reproducing this?

I don't have any shortcut to reproduce this; I just deploy CNV 2.3 and see the kube-cni-linux-bridge-plugin pods failing.

> i.e. should everybody installing CNV 2.3 on a bare metal
> OCP 4.4 cluster expect to immediately see this on all nodes?

Since I don't do anything fancy on my cluster, I would say yes.

> Or is there anything you have to do post-install to reliably trigger it? Thanks.

No trick, just deploy CNV 2.3 and see those pods failing.

> (It's probably useful to be explicit in your answer whether OOM is involved
> at all?)

I don't think so: the pods' state is not OOMKilled but CreateContainerError. Moreover, my nodes have 190 GB of memory, which should be sufficient.

Comment 27 Yuval Kashtan 2020-03-11 20:54:19 UTC
This might be relevant too: https://github.com/cri-o/cri-o/issues/1927, especially the comment about low limits (this pod's limits are really low, e.g. cpu: "60m").

So maybe try with no limits to see whether that solves the issue, then try setting a higher number.

Comment 28 Petr Horáček 2020-03-12 10:20:10 UTC
It indeed seems to be caused by the low limit. I dropped all the limits (kept requests only) and the cluster seems to be healthy. We need to see if it stays that way. In the meantime, I'm working on a PR to cluster-network-addons-operator that drops limits from our components.

Comment 29 Petr Horáček 2020-03-12 11:07:49 UTC
Collecting information about the bug, hopefully to help OCP/cri-o folks to resolve it.

In CNV networking components, we set resource limits and requests to the same, quite low, value. AFAIK, other OCP components usually set a higher limit or no limit at all. I wasn't aware of any issues connected to this while we tested on our Kubernetes setups. However, when our QE started testing on OCP, pods started failing with "lstat /proc/63538/ns/ipc: no such file or directory".

As Yuval suggested, we tried increasing the limits. After we did, the issue did not recur. We set it back to the low limit and it manifested again shortly after.

In CNV, we will work around this issue simply by dropping the limit setting (as other components do). However, this bug remains a threat in case a node is low on resources and a pod is pushed all the way down to its minimal requests.
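
For reference, a sketch of what the workaround amounts to (a hypothetical one-off shown here for clarity only; the real change is in cluster-network-addons-operator, which manages this DaemonSet and would reconcile a manual patch away): remove the limits while keeping the requests, e.g.

  kubectl -n openshift-cnv patch daemonset kube-cni-linux-bridge-plugin --type=json \
    -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/resources/limits"}]'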

Comment 30 Petr Horáček 2020-03-12 11:09:46 UTC
Nelly, I think we need one bug to track the CNV workaround and one to track the proper solution in OCP. What do you suggest we do here? Should we keep this instance for us and clone it for OCP?

Comment 33 Petr Horáček 2020-03-13 14:49:42 UTC
Since this is an OCP issue, I'm moving it away from CNV. I will open a bug tracking our workaround.

See https://bugzilla.redhat.com/show_bug.cgi?id=1776236#c29 for a tl;dr of the issue.

Comment 44 Peter Hunt 2020-05-13 19:37:25 UTC
The PR is merged; moving this to MODIFIED.
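
For anyone verifying the change on a node: per the linked machine-config-operator PR title, the fix concerns cri-o namespace lifecycle management. A rough check, assuming the setting surfaces in crio.conf as manage_ns_lifecycle (an assumption inferred from the PR title, not confirmed here):

  grep -ri manage_ns_lifecycle /etc/crio/crio.conf /etc/crio/crio.conf.d/ 2>/dev/null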

Comment 51 Oscar Casal Sanchez 2020-06-08 13:32:09 UTC
Hello,

Is this going to be backported to 4.2?

Regards,
Oscar

Comment 52 Peter Hunt 2020-06-08 13:45:37 UTC
No, there is no plan to backport this to 4.2.

Comment 56 errata-xmlrpc 2020-07-13 17:12:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

