Description of problem
----------------------

I deployed CNV 2.2 from marketplace on an OCP 4.3 bare metal cluster and kube-cni-linux-bridge-plugin was failing:

> Error: container create failed: container_linux.go:338: creating new parent process caused "container_linux.go:1897: running lstat on namespace path \"/proc/63538/ns/ipc\" caused \"lstat /proc/63538/ns/ipc: no such file or directory\""

Relevant traces:

> kubectl -n openshift-cnv describe pod/kube-cni-linux-bridge-plugin
>
> Status:         Pending
> IP:             10.128.0.55
> Controlled By:  DaemonSet/kube-cni-linux-bridge-plugin
> Containers:
>   cni-plugins:
>     Container ID:
>     Image:      registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-containernetworking-plugins:v2.2.0-2
>     Image ID:
>     Port:       <none>
>     Host Port:  <none>
>     Command:
>       /bin/bash
>       -c
>       cp -rf /usr/src/containernetworking/plugins/bin/*bridge /opt/cni/bin/
>       cp -rf /usr/src/containernetworking/plugins/bin/*tuning /opt/cni/bin/
>       # Some projects (e.g. openshift/console) use cnv- prefix to distinguish between
>       # binaries shipped by OpenShift and those shipped by KubeVirt (D/S matters).
>       # Following two lines make sure we will provide both names when needed.
>       find /opt/cni/bin/cnv-bridge || ln -s /opt/cni/bin/bridge /opt/cni/bin/cnv-bridge
>       find /opt/cni/bin/cnv-tuning || ln -s /opt/cni/bin/tuning /opt/cni/bin/cnv-tuning
>       echo "Entering sleep... (success)"
>       sleep infinity
>     State:          Waiting
>       Reason:       CreateContainerError
>     Ready:          False
>     Restart Count:  0
>     Limits:
>       cpu:     60m
>       memory:  30Mi
>     Requests:
>       cpu:     60m
>       memory:  30Mi
>     Environment:  <none>
>     Mounts:
>       /opt/cni/bin from cnibin (rw)
>       /var/run/secrets/kubernetes.io/serviceaccount from linux-bridge-token-r6r4l (ro)
> Conditions:
>   Type             Status
>   Initialized      True
>   Ready            False
>   ContainersReady  False
>   PodScheduled     True
> Volumes:
>   cnibin:
>     Type:          HostPath (bare host directory volume)
>     Path:          /var/lib/cni/bin
>     HostPathType:
>   linux-bridge-token-r6r4l:
>     Type:        Secret (a volume populated by a Secret)
>     SecretName:  linux-bridge-token-r6r4l
>     Optional:    false
> QoS Class:       Guaranteed
> Node-Selectors:  beta.kubernetes.io/arch=amd64
> Tolerations:     node-role.kubernetes.io/master:NoSchedule
>                  node.kubernetes.io/disk-pressure:NoSchedule
>                  node.kubernetes.io/memory-pressure:NoSchedule
>                  node.kubernetes.io/not-ready:NoExecute
>                  node.kubernetes.io/pid-pressure:NoSchedule
>                  node.kubernetes.io/unreachable:NoExecute
>                  node.kubernetes.io/unschedulable:NoSchedule
> Events:
>   Type     Reason     Age                   From                                             Message
>   ----     ------     ----                  ----                                             -------
>   Normal   Scheduled  <unknown>             default-scheduler                                Successfully assigned openshift-cnv/kube-cni-linux-bridge-plugin-p8q4j to cnv-qe-11.cnvqe.lab.eng.rdu2.redhat.com
>   Normal   Pulling    12m                   kubelet, cnv-qe-11.cnvqe.lab.eng.rdu2.redhat.com  Pulling image "registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-containernetworking-plugins:v2.2.0-2"
>   Normal   Pulled     12m                   kubelet, cnv-qe-11.cnvqe.lab.eng.rdu2.redhat.com  Successfully pulled image "registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-containernetworking-plugins:v2.2.0-2"
>   Warning  Failed     12m                   kubelet, cnv-qe-11.cnvqe.lab.eng.rdu2.redhat.com  Error: error reading container (probably exited) json message: EOF
>   Warning  Failed     9m54s (x11 over 12m)  kubelet, cnv-qe-11.cnvqe.lab.eng.rdu2.redhat.com  Error: container create failed: container_linux.go:338: creating new parent process caused "container_linux.go:1897: running lstat on namespace path \"/proc/63538/ns/ipc\" caused \"lstat /proc/63538/ns/ipc: no such file or directory\""
>   Normal   Pulled     2m50s (x39 over 12m)  kubelet, cnv-qe-11.cnvqe.lab.eng.rdu2.redhat.com  Container image "registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-containernetworking-plugins:v2.2.0-2" already present on machine

Version-Release number of selected component
--------------------------------------------

- CNV 2.2
- Deployed using HCO image registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-hyperconverged-cluster-operator:v2.2.0-6
- OCP 4.3.0-0.nightly-2019-11-12-204120

Additional info
---------------

I undeployed and redeployed CNV several times. I observed the issue most of the time, but not always on the same node(s).
Denis, is our container the only one that fails with this error on your cluster? I don't think this is related to anything in our container. It looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1763583. Is it possible that it is caused by a faulty environment and was fixed in newer versions of OCP?
On my BM cluster, the problem occurs only with kube-cni-linux-bridge-plugin pods. I had to undeploy/redeploy CNV 4 times before all kube-cni-linux-bridge-plugin pods managed to get Ready. I couldn't figure out a common factor for all the failures. They occurred on different nodes each time. Both masters and workers were affected.
Denis, I still want to believe it was just an issue in the environment. Could you please check it for the last time? With the latest possible OCP and CNV. If you still see it, I'd start investigating it.
(In reply to Petr Horáček from comment #3)
> Denis, I still want to believe it was just an issue in the environment.
> Could you please check it for the last time? With the latest possible OCP
> and CNV. If you still see it, I'd start investigating it.

As often, you were well inspired. I retried with the latest software versions and couldn't reproduce the issue.

- CNV 2.2
- Deployed using HCO image registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-hyperconverged-cluster-operator:v2.2.0-9
- OCP 4.3.0-0.nightly-2019-11-29-051144
Thanks Denis! :)
I noticed this issue yesterday on a bare-metal cluster. Its linux bridge CNI and kubernetes-nmstate pods had restarted many times due to OOM. After some of these restarts, they got stuck for a while on the very same issue. Both of these pods use the host network namespace.

I don't think this is a CNV bug. Moving it to OCP.

OCP people, we have an issue where pods using the host network namespace sometimes fail to start after a restart with "container_linux.go:1897: running lstat on namespace path \"/proc/63538/ns/ipc\" caused \"lstat /proc/63538/ns/ipc: no such file or directory\"". Are you aware of this issue?
Setting target release to current development branch (4.4). Fixes, if any, requested/required on previous releases will result in cloned BZs targeting the z-stream releases where appropriate.
Hi Denis Ollier, I am OpenShift QE, trying to help our developers debug this bug locally. Could you guide me through deploying CNV 2.2 from the marketplace on an OCP 4.3 bare metal cluster, together with kube-cni-linux-bridge-plugin? Any docs I can follow? Thanks!
Since this bug happens for pods both in the host netns and on the SDN, it seems unlikely to be caused by networking. After discussing with Aniket, we realized this may need help from the kubelet team.
It seems this issue only happens with RHCOS workers. RHEL workers don't have this problem.
I deployed an OCP 4 cluster with RHEL 7 workers. kube-cni-linux-bridge-plugin pods are failing on masters (RHCOS) but not on workers (RHEL).
Long shot, but are there any OOM kills in dmesg? I suspect this is not strictly a problem with this particular pod, nor a problem with RHCOS more than RHEL.
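One way to answer that question is to grep the node's kernel log for OOM-killer activity. A minimal sketch, assuming shell access to the affected node (e.g. via SSH or `oc debug node/<name>`; run as root):

```shell
# Look for OOM-killer events in the kernel ring buffer.
# -T prints human-readable timestamps; the trailing `|| echo` keeps the
# exit status clean when no OOM lines are present.
dmesg -T 2>/dev/null | grep -iE 'oom|out of memory' || echo "no OOM events found"
```

The same pattern can be run against `journalctl -k` if the ring buffer has already wrapped.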
Might be a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1806786 and https://github.com/openshift/origin/pull/24611
Issue is still present with OCP 4.4.0-0.nightly-2020-03-10-042427 which should include the fix for BZ#1806786 IIUC.
Looks like this bug got pushed out to 4.5, but it affects CNV 2.3, which should be released on the same day as OCP 4.4. This is a blocker for our release and we must get it fixed.
(In reply to Denis Ollier from comment #20)
> Issue is still present with OCP 4.4.0-0.nightly-2020-03-10-042427 which
> should include the fix for BZ#1806786 IIUC.

Do you have any further information on the fastest and most reliable path to reproducing this? I.e., should everybody installing CNV 2.3 on a bare metal OCP 4.4 cluster expect to immediately see this on all nodes? Or is there anything you have to do post-install to reliably trigger it? Thanks.

(It's probably useful to be explicit in your answer whether OOM is involved at all.)
(In reply to Mark McLoughlin from comment #22)
> Do you have any further information on the fastest and most reliable path to
> reproducing this?

I don't have any shortcut to reproduce this, I just deploy CNV 2.3 and see those kube-cni-linux-bridge-plugin pods failing.

> i.e. should everybody installing CNV 2.3 on a bare metal
> OCP 4.4 cluster expect to immediately see this on all nodes?

Since I don't do anything fancy on my cluster, I would say yes.

> Or is there anything you have to do post-install to reliably trigger it? Thanks.

No trick, just deploy CNV 2.3 and see those pods failing.

> (It's probably useful to be explicit in your answer whether OOM is involved
> at all?)

I don't think so: the pods' state is not OOMKilled but CreateContainerError. Moreover, my nodes have 190 GB of memory, which should be sufficient.
This https://github.com/cri-o/cri-o/issues/1927 might be relevant too, especially the comment about low limits (this pod's limit is really low: cpu: "60m"). So maybe try with no limits to see if that solves the issue, then set a higher number.
It indeed seems to be caused by the low limit. I dropped all the limits (kept only the requests) and the cluster seems to be healthy. We need to see if it stays that way. In the meantime, I'm working on a PR to cluster-network-addons-operator dropping limits from our components.
Collecting information about the bug, hopefully to help OCP/cri-o folks resolve it.

In CNV networking components, we set resource limits and requests to the same, quite low, value. AFAIK, other OCP components usually set a higher limit or no limit at all. I'm not aware of any issues connected to it while we tested on our Kubernetes setups. However, when our QE started testing on OCP, it started failing with "lstat /proc/63538/ns/ipc: no such file or directory".

As Yuval suggested, we tried to increase the limits. After we did, the issue did not repeat. We tried to set it back to the low limit and it manifested again shortly after.

In CNV, we will work around this issue merely by dropping the limit setting (as other components do). However, this bug remains a threat in case a node is low on resources and a pod is pushed all the way down to its minimal requests.
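For illustration, the workaround amounts to a container resources stanza like the following. This is a hedged sketch, not the actual cluster-network-addons-operator manifest; the container name and values are taken from the pod description above:

```yaml
# Illustrative fragment of a DaemonSet pod spec: keep requests so the
# scheduler can place the pod, but set no limits, so the container is not
# throttled (note: the pod then drops from Guaranteed to Burstable QoS).
containers:
  - name: cni-plugins
    resources:
      requests:
        cpu: 60m
        memory: 30Mi
      # no "limits:" section -- this is the workaround
```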
Nelly, I think we need one bug to track the CNV workaround and one to track a proper solution in OCP. What do you suggest we do here? Should we keep this instance for us and clone it for OCP?
Since this is an OCP issue, I'm moving it away from CNV. I will open a bug tracking our workaround. See https://bugzilla.redhat.com/show_bug.cgi?id=1776236#c29 with tl;dr of the issue.
The PR is merged, moving this to MODIFIED.
Hello,

Is this going to be backported to 4.2?

Regards,
Oscar
No, there is no plan to backport this to 4.2.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409