Pods that are running the Multus components (e.g. pods in the openshift-multus namespace), especially the "multus" daemonset, are triggering an error from hyperkube that reads: "Unable to get Process Stats: couldn't open cpu cgroup procs file". This does not appear to be caused by the multus-admission-controller.

Example error:
-----------------
[root@ci-ln-csl69i2-f76d1-9zlx8-worker-b-5vf4g /]# journalctl | grep -i "cgroup"
Sep 11 21:33:01 ci-ln-csl69i2-f76d1-9zlx8-worker-b-5vf4g hyperkube[1528]: I0911 21:33:01.227435 1528 handler.go:181] Unable to get Process Stats: couldn't open cpu cgroup procs file /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6cffd1ff_d5ec_4db1_afd9_1af7b7662564.slice/crio-30f8e8f3a37a272774eacc7d01da848dd50eaefc5dabb423bc84a902930628a1.scope/cgroup.procs : open /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6cffd1ff_d5ec_4db1_afd9_1af7b7662564.slice/crio-30f8e8f3a37a272774eacc7d01da848dd50eaefc5dabb423bc84a902930628a1.scope/cgroup.procs: no such file or directory
-----------------

To associate these errors with a pod, first query the cluster for the CRI-O container IDs, such as:

oc get pod -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.initContainerStatuses[*].containerID}{.status.containerStatuses[*].containerID}{"\n"}{end}' -A | grep -i multus

Then grep the journal for this error together with the IDs from the above command, such as:

journalctl | grep cgroup.proc | grep -P "($id|$another_id|$etc)"

This was discovered by Cameron Meadors while investigating BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1875950
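For convenience, the two steps above can be chained into a single pass. A minimal sketch follows (it assumes GNU grep with PCRE support for -P, and that container IDs are reported with the "cri-o://" prefix, as CRI-O does); note that journalctl still has to be run on the affected node, as in the example above:

-----------------
#!/bin/bash
# Collect the CRI-O container IDs of all multus-related pods, then search the
# journal for the cgroup.procs error mentioning any of those IDs.

# 1. Extract the 64-character hex IDs, one per line, stripping the "cri-o://" prefix.
ids=$(oc get pod -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.initContainerStatuses[*].containerID}{.status.containerStatuses[*].containerID}{"\n"}{end}' -A \
      | grep -i multus \
      | grep -oP '(?<=cri-o://)[0-9a-f]{64}')

# 2. Build an alternation pattern (id1|id2|...) and grep the journal with it.
pattern=$(echo "$ids" | paste -sd'|' -)
[ -z "$pattern" ] && { echo "no multus container IDs found" >&2; exit 1; }

journalctl | grep cgroup.proc | grep -P "($pattern)"
-----------------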
Created attachment 1723858 [details]
Logs showing a container stuck in "Unable to get Process Stats"

Logs from bug 1891143, where this also turned up.
^ Those logs are from 4.6.0-rc.4, with some comments on them in [1].

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1891143#c4
Moving to the node team, since this seems like a kubelet/CRI-O fumble, and not something particular to the managed pods.
Haven't looked into this yet. Moving it to the next sprint.
The error reported in this bug is thrown by cAdvisor when it tries to fetch a cgroup path that no longer exists. How cAdvisor gets out of sync is not trivial to understand. Apart from the error message, there is no effect on the cluster: the cgroup path is simply not fetched, because it is gone. The error hints that something went wrong on the cluster, but it is neither the root cause nor directly related to it.

As a final note, we have no reports of this error on recent OCP releases (4.7 or above), so I don't think this will happen anymore. Closing for now. Please reopen if needed.
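For anyone who still hits this, a quick way to confirm that the message is only a stale path (and not a mount or permissions problem) is to check on the node whether the scope directory from the log still exists. A minimal sketch, assuming the cgroup v1 layout shown in the log excerpt above; the ID below is a placeholder, substitute the crio-*.scope ID from your own log line:

-----------------
# Run on the affected node; CRIO_ID is a placeholder taken from the log excerpt above.
CRIO_ID=30f8e8f3a37a272774eacc7d01da848dd50eaefc5dabb423bc84a902930628a1
find /sys/fs/cgroup/cpu,cpuacct/kubepods.slice -type d -name "crio-${CRIO_ID}.scope"
# No output means the cgroup is gone, which matches cAdvisor's "no such file or directory".
-----------------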