Description of problem:

Seeing errors like this while testing the crio build needed for kata containers:

Sep 04 17:07:46 cmead-kata17-4zxgx-worker-a-vf9t7.c.openshift-qe.internal hyperkube[3415]: I0904 17:07:46.830997 3415 handler.go:181] Unable to get Process Stats: couldn't open cpu cgroup procs file /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9319b182_6015_4491_9d87_cc0c988d6782.slice/crio-0be4ddcd18526e4935741148daa3281e61913251df097da4d6c0dad62b5ce17c.scope/cgroup.procs : open /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9319b182_6015_4491_9d87_cc0c988d6782.slice/crio-0be4ddcd18526e4935741148daa3281e61913251df097da4d6c0dad62b5ce17c.scope/cgroup.procs: no such file or directory

Version-Release number of selected component (if applicable):

OpenShift version: 4.6.0-0.nightly-2020-09-01-070508

Custom crio:

crio version 1.19.0-dev
Version:       1.19.0-dev
GitCommit:     6bda9b5e079bdee399262330259970f1a40c16be
GitTreeState:  clean
BuildDate:     2020-09-04T15:13:19Z
GoVersion:     go1.14.6
Compiler:      gc
Platform:      linux/amd64
Linkmode:      dynamic

How reproducible:

The errors start to happen slowly when new crio binaries are put in place and slowly get more frequent until they are a constant stream. I have created a kata pod at least once each time I have reproduced this. I don't know if that triggers it or if just having the binary in place for other pods to start using is enough. The errors continue to happen even if I delete the kata pods.

Steps to Reproduce:
1. Stop the crio service
2. Replace crio, crio-status, and pinns with the custom build
3. Start the crio service

Actual results:

I can deploy kata pods and cluster health does not seem to be affected, but the logs show the errors above.

Expected results:

No errors.

Additional info:

PR where the custom builds originated: https://github.com/cri-o/cri-o/pull/4151

I think there is also a fix for a logging problem with crio found in CI. When I find it I will add it here.
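For anyone trying to narrow this down on a node: a minimal check, assuming a cgroup v1 node (as the cpu,cpuacct path in the error suggests) and root/crictl access to the worker, is to take the container ID out of the crio-<id>.scope segment of the path and see whether CRI-O still knows about that container while the cgroup directory is gone:

# on the affected worker node (e.g. "oc debug node/<node>" then "chroot /host")
CID=0be4ddcd18526e4935741148daa3281e61913251df097da4d6c0dad62b5ce17c   # ID taken from the error message
crictl ps -a | grep "${CID:0:13}"                                      # is the container still known to CRI-O?
find /sys/fs/cgroup/cpu,cpuacct/kubepods.slice -name "crio-${CID}.scope"   # does the cpu cgroup scope still exist?

If the container is gone but the kubelet keeps logging about its cgroup, that points at stale bookkeeping rather than a broken running pod.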
must-gather logs: http://file.bos.redhat.com/cmeadors/must-gather.local.6130838836126021496.tgz
(In reply to Cameron Meadors from comment #0)
> The errors start to happen slowly when new crio binaries are put in place
> and slowly get more frequent until they are a constant stream. I have
> created a kata pod at least once each time I have reproduced this. I don't
> know if that triggers it or if just having the binary in place for other
> pods to start using is enough. The errors continue to happen even if I
> delete the kata pods.

Do you mean that when you put a new crio in place, installing it is sufficient to "reset" the frequency of the messages?

Could you describe the procedure you use to replace the crio binaries (i.e. do you shut down / restart anything else, etc.)?
Cameron,

Could you check whether the issue happens when only running "normal" containers? What I'm trying to figure out here is whether this is a general issue or whether it's specifically tied to kata-containers.

I'd recommend to:
- Spawn a new cluster;
- Create the very same containers you created when doing your tests, using the default runtime;
- Let the cluster be for a few hours;
- Check if the same errors are present.

If they are, it's not a kata issue. If they're not, then I'd like to ask you to:
- Create the very same containers you created when doing your tests, but this time using kata as the runtime (see the sketch below for one way to do that);
- Verify if the errors show up in the log.

That would help us immensely in tracking the issue down.
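For the comparison above, one way to create the "same" pod with and without kata is to toggle runtimeClassName. This is only a sketch: it assumes a RuntimeClass named "kata" has been set up on the cluster (the actual name depends on how the operator configured it) and uses a generic fedora image as a stand-in for the real test workload:

# default runtime
cat <<EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: example-fedora
spec:
  containers:
  - name: example-fedora
    image: registry.fedoraproject.org/fedora:32
    command: ["sleep", "infinity"]
EOF

# same pod, but scheduled with the kata runtime (assumes a RuntimeClass named "kata")
cat <<EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: example-fedora-kata
spec:
  runtimeClassName: kata
  containers:
  - name: example-fedora
    image: registry.fedoraproject.org/fedora:32
    command: ["sleep", "infinity"]
EOF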
(In reply to Christophe de Dinechin from comment #3)
> (In reply to Cameron Meadors from comment #0)
> Could you describe the procedure you use to replace the crio binaries (i.e.
> do you shut down / restart anything else, etc.)?

This is the step-by-step I gave to Cameron (a copy-pasteable sketch follows below):
* Get the binaries onto the node;
* Get root access to the node (`sudo su` should work once you're connected as core);
* Remount /usr as read-write (`mount -o remount,rw /usr`);
* Stop CRI-O (`systemctl stop crio`);
* Replace the binaries (if you can throw the cluster away, just `cp crio crio-status pinns /usr/bin/` should do);
* Start CRI-O (`systemctl start crio`);
* Be happy!
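The same steps as a single block, for convenience. Two assumptions here are mine, not from the comment above: the custom binaries have already been copied to /var/home/core on the node, and the cluster is disposable.

# as root on the worker node (ssh as core + "sudo su", or "oc debug node/<node>" + "chroot /host")
mount -o remount,rw /usr          # RHCOS mounts /usr read-only by default
systemctl stop crio               # stop CRI-O before swapping binaries
cp /var/home/core/crio /var/home/core/crio-status /var/home/core/pinns /usr/bin/
systemctl start crio              # start CRI-O with the custom build
crio --version                    # sanity check: confirm the custom version is the one running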
After reading https://bugzilla.redhat.com/show_bug.cgi?id=1877614 I strongly believe this is a kata-specific error, and I believe it will be fixed by the next packages we provide.

This can be moved to kata (instead of Node), and we'll need to retest once the new packages are built and a new image is provided for the operator.
Agree with Fabiano.

(In reply to Fabiano Fidêncio from comment #6)
> After reading https://bugzilla.redhat.com/show_bug.cgi?id=1877614 I strongly
> believe this is a kata-specific error, and I believe it will be fixed by the
> next packages we provide.
>
> This can be moved to kata (instead of Node), and we'll need to retest once
> the new packages are built and a new image is provided for the operator.

It looks like we simply exhaust some resources required to mount the necessary filesystems. I moved the bug to kata-containers to avoid noise on the Node side; we will have to move it back to Node if we are wrong.
Retested on OpenShift build 4.6.0-0.nightly-2020-09-10-121352. Still seeing the same errors.

# crio --version
crio version 1.19.0-11.rhaos4.6.gitf83564f.el8-rc.1
Version:    1.19.0-11.rhaos4.6.gitf83564f.el8-rc.1
GoVersion:  go1.14.7
Compiler:   gc
Platform:   linux/amd64
Linkmode:   dynamic

Based on the logs, the errors started when I deployed a kata pod. I think this confirms what we already suspected. I was hoping to see them stop when I deleted the pod, but the pod is stuck in Terminating. The logs only show this:

failed to try resolving symlinks in path "/var/log/pods/default_example-fedora_42950b48-520a-4d82-bedf-64538f19f227/example-fedora/0.log": lstat /var/log/pods/default_example-fedora_42950b48-520a-4d82-bedf-64538f19f227: no such file or directory

Events show:

Normal  Killing  12m  kubelet, cmead-kata18-xwznz-worker-c-bm4dn.c.openshift-qe.internal  Stopping container example-fedora

Going to let it sit and see if it ever finishes terminating. Is this the same as https://issues.redhat.com/browse/KATA-229?
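For a pod stuck like this, a few generic checks (nothing kata-specific; pod name and namespace taken from the log lines above, everything else is just a sketch):

oc get pod example-fedora -n default -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'   # is deletion pending, and are finalizers blocking it?
oc describe pod example-fedora -n default | tail -n 20   # recent events for the pod

# on the node, check whether CRI-O still has a sandbox or container for it
crictl pods --name example-fedora
crictl ps -a --name example-fedora

# last resort on a throwaway cluster: force deletion from the API server
oc delete pod example-fedora -n default --grace-period=0 --force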
I discovered that all the bad paths are coming from containers in multus pods. I even deleted one of the multus pods and the new one that replaced it had the same issue. Asking multus experts to take a look at it to rule out an issue with multus.
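In case it helps others reproduce the mapping: one way to tie a bad cgroup path back to its pod (a sketch, assuming the cgroup v1 paths shown in the description) is via the container ID in the crio-<id>.scope segment, or via the pod UID embedded in the kubepods-*-pod<uid>.slice segment:

CID=<container-id-from-the-error>
crictl inspect "${CID}" | grep -E '"io.kubernetes.pod.name"|"io.kubernetes.pod.namespace"'

# if the container is already gone from CRI-O, use the pod UID from the slice name
# (the slice uses underscores where the pod UID has dashes):
oc get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.uid}{" "}{.metadata.namespace}{"/"}{.metadata.name}{"\n"}{end}' | grep <pod-uid-with-dashes>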
Let me add my guess here. kata-runtime was wrongly applying the cgroups constraints inside the VM, and there were a bunch of fixes related to this recently (which are not present in the package Cameron tested). My guess is that Multus pods, for some reason, may be exposing those issues on the kata pods. Together with those issues on the runtime side, there is at least one related issue on the agent side as well.
After some investigation with Cameron, I'm led to believe this isn't kata-specific. I can reliably reproduce it on 4.6 CI clusters launched by cluster-bot. I've opened a separate BZ at https://bugzilla.redhat.com/show_bug.cgi?id=1879205
Fabiano, could you link to PRs that you made related to this bug? Trying to decide what to do with this bug.
(In reply to Cameron Meadors from comment #14)
> Fabiano, could you link to PRs that you made related to this bug? Trying to
> decide what to do with this bug.

Partially. There's a bunch of PRs on kata stepping back on the cgroups support:
- https://github.com/kata-containers/runtime/pull/2793
- https://github.com/kata-containers/runtime/pull/2817
- https://github.com/kata-containers/runtime/pull/2944

Between rebases, backports not yet released, and patches not yet merged (the last PR in the list), those are the parts touching cgroups that may have some influence on the issue you reported.

All in all, we (as kata containers) were trying to set cgroups within the VM and, due to the several issues found, we took a step back on that for now.
Do we need to keep this open as there is a more general bug filed against multus? Do we need to track this for kata?
(In reply to Cameron Meadors from comment #18)
> Do we need to keep this open as there is a more general bug filed against
> multus? Do we need to track this for kata?

Why not mark the other bug as a blocker of this one, just as a reminder that we need to test once this is fixed in Multus?
(In reply to Christophe de Dinechin from comment #19)
> (In reply to Cameron Meadors from comment #18)
> > Do we need to keep this open as there is a more general bug filed against
> > multus? Do we need to track this for kata?
>
> Why not mark the other bug as a blocker of this one, just as a reminder that
> we need to test once this is fixed in Multus?

Seems like a good idea to me: I added the bug on CRI-O as a blocker of this one.
This error is no longer being observed. Investigation on OCP (see bug #1879205) showed that the error message comes from cAdvisor, which tried to fetch metrics data from a non-existent cgroups path. I think this is solved in OCP 4.7 and later.

Closing; please reopen if needed.
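For anyone who wants to double-check this on a newer cluster, a rough sanity check (assuming a cgroup v1 node and crictl access, as in the original report) is to confirm that every running container CRI-O reports still has a matching cpu cgroup scope, i.e. that there are no stale paths for cAdvisor to trip over:

for id in $(crictl ps -q); do
  if ! find /sys/fs/cgroup/cpu,cpuacct/kubepods.slice -name "crio-${id}.scope" | grep -q .; then
    echo "no cpu cgroup scope found for running container ${id}"
  fi
done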