Bug 1875950 - [kata] Unable to get Process Stats: couldn't open cpu cgroup procs file
Summary: [kata] Unable to get Process Stats: couldn't open cpu cgroup procs file
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: sandboxed-containers
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Francesco Giudici
QA Contact: Cameron Meadors
URL:
Whiteboard:
Depends On: 1879205
Blocks:
 
Reported: 2020-09-04 17:20 UTC by Cameron Meadors
Modified: 2021-11-12 15:08 UTC (History)
7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-11-12 15:08:41 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1877614 1 None None None 2022-08-26 19:25:33 UTC

Description Cameron Meadors 2020-09-04 17:20:57 UTC
Description of problem:

Seeing errors like this while testing the custom crio build needed for kata containers:

Sep 04 17:07:46 cmead-kata17-4zxgx-worker-a-vf9t7.c.openshift-qe.internal hyperkube[3415]: I0904 17:07:46.830997    3415 handler.go:181] Unable to get Process Stats: couldn't open cpu cgroup procs file /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9319b182_6015_4491_9d87_cc0c988d6782.slice/crio-0be4ddcd18526e4935741148daa3281e61913251df097da4d6c0dad62b5ce17c.scope/cgroup.procs : open /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9319b182_6015_4491_9d87_cc0c988d6782.slice/crio-0be4ddcd18526e4935741148daa3281e61913251df097da4d6c0dad62b5ce17c.scope/cgroup.procs: no such file or directory
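
For reference, a quick way to check on the affected node whether the cgroup path from the message above actually exists (a sketch only, assuming the cgroups v1 cpu,cpuacct hierarchy shown in the log; the slice and scope names are copied from the message and will differ for other pods):

# list the pod slice and look for the crio-<container-id>.scope directory cAdvisor wants
ls /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9319b182_6015_4491_9d87_cc0c988d6782.slice/
# if the scope directory is there, its cgroup.procs file should be readable
cat /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod9319b182_6015_4491_9d87_cc0c988d6782.slice/crio-0be4ddcd18526e4935741148daa3281e61913251df097da4d6c0dad62b5ce17c.scope/cgroup.procs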


Version-Release number of selected component (if applicable):

OpenShift version: 4.6.0-0.nightly-2020-09-01-070508

Custom crio
crio version 1.19.0-dev
Version:       1.19.0-dev
GitCommit:     6bda9b5e079bdee399262330259970f1a40c16be
GitTreeState:  clean
BuildDate:     2020-09-04T15:13:19Z
GoVersion:     go1.14.6
Compiler:      gc
Platform:      linux/amd64
Linkmode:      dynamic


How reproducible:

Starts to happen slowly when new crio binaries are put in place and slowly gets more frequent until it is a constant stream.  I have created a kata pod at least once each time I have reproduced this.  I don't know if that triggers it or just having the binary in place for other pods to start using.  It continues to happen if I delete the kata pods.


Steps to Reproduce:
1. Stop the crio service
2. Replace crio, crio-status, and pinns with the custom builds
3. Start the crio service
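
To confirm the replacement took effect after step 3 (a quick sanity-check sketch; adjust as needed):

# verify the running binary version and the service state after the restart
crio --version
systemctl status crio --no-pager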

Actual results:

I can deploy kata pods and cluster health does not seem to be affected, but logs show errors.

Expected results:

No errors

Additional info:

PR where the custom builds originated: https://github.com/cri-o/cri-o/pull/4151

I think there is also a fix for a crio logging problem that was found in CI. When I find it, I will add it here.

Comment 1 Cameron Meadors 2020-09-04 17:26:44 UTC
must-gather logs:

http://file.bos.redhat.com/cmeadors/must-gather.local.6130838836126021496.tgz

Comment 3 Christophe de Dinechin 2020-09-09 13:32:04 UTC
(In reply to Cameron Meadors from comment #0)
> Starts to happen slowly when new crio binaries are put in place and slowly
> gets more frequent until it is a constant stream.  I have created a kata pod
> at least once each time I have reproduced this.  I don't know if that
> triggers it or just having the binary in place for other pods to start
> using.  It continues to happen if I delete the kata pods.

Do you mean that when you put a new crio in place, installing it is sufficient to "reset" the frequency of the messages?

Could you describe the procedure you use to replace the crio binaries (i.e. do you shut down / restart anything else, etc.)?

Comment 4 Fabiano Fidêncio 2020-09-09 13:54:46 UTC
Cameron,

Could you check whether the issue happens when running only "normal" containers? What I'm trying to figure out here is whether this is a general issue or whether it's specifically tied to kata-containers.
I'd recommend:
- Spawn a new cluster;
- Create the very same containers you created when doing your tests, using the default runtime;
- Let the cluster be for a few hours;
- Check whether the same errors are present.

If they are, it's not a kata issue. If they're not, then I'd like to ask you to:
- Create the very same containers you created when doing your tests, but this time using kata as the runtime;
- Verify whether the issues show up in the log.

That would help us immensely in tracking the issue down.
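
A minimal sketch of the runtime toggle (the pod name, image and command are placeholders, and the RuntimeClass name "kata" is assumed to be the one installed by the operator):

# default runtime:
cat <<EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: example-fedora
spec:
  containers:
  - name: example-fedora
    image: registry.fedoraproject.org/fedora:latest
    command: ["sleep", "infinity"]
EOF
# kata runtime: same manifest, with "runtimeClassName: kata" added directly under spec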

Comment 5 Fabiano Fidêncio 2020-09-09 14:03:13 UTC
(In reply to Christophe de Dinechin from comment #3)
> (In reply to Cameron Meadors from comment #0)
> Could you describe the procedure to replace the crio binaries (i.e. do you
> shutdown / restart something else, etc)?

This is the step-by-step I gave to Cameron:

What you want to do (a consolidated sketch follows after this list):
* Get the binaries onto the node;
* Get root access to the node (sudo su should work once you're connected as core);
* Remount /usr as read-write (mount -o remount,rw /usr);
* Stop CRI-O (systemctl stop crio);
* Replace the binaries (if you can throw the cluster away, a plain cp crio crio-status pinns /usr/bin/ should do);
* Start CRI-O (systemctl start crio);
* Be happy!
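
A consolidated sketch of the above (assuming the custom crio, crio-status and pinns binaries are already in the current directory on the node; throwaway clusters only):

mount -o remount,rw /usr              # /usr is mounted read-only by default
systemctl stop crio                   # stop CRI-O before swapping the binaries
cp crio crio-status pinns /usr/bin/   # drop in the custom builds
systemctl start crio                  # bring CRI-O back up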

Comment 6 Fabiano Fidêncio 2020-09-10 06:59:39 UTC
After reading https://bugzilla.redhat.com/show_bug.cgi?id=1877614 I strongly believe this is a kata-specific error, and I believe it'll be fixed in the next packages we provide.

This can be moved to kata (instead of Node), and we'll need to retest once the new packages are built and a new image is provided for the operator.

Comment 7 Christophe de Dinechin 2020-09-10 15:19:28 UTC
Agree with Fabiano.
(In reply to Fabiano Fidêncio from comment #6)
> After reading https://bugzilla.redhat.com/show_bug.cgi?id=1877614 I strongly
> believe this is a kata-specific error and I believe it'll be fixed in the next
> packages we provide.
> 
> This can be moved to kata (instead of Node), and we'll need to retest once
> the new packages are built and a new image is provided for the operator.

I agree. It looks like we simply exhaust some resource required to mount the necessary filesystems. I moved it to kata-containers to avoid noise on the Node side; we'll have to move it back to Node if we are wrong.

Comment 8 Cameron Meadors 2020-09-10 20:14:31 UTC
Retested on OpenShift build 4.6.0-0.nightly-2020-09-10-121352. Still seeing the same errors.

# crio --version
crio version 1.19.0-11.rhaos4.6.gitf83564f.el8-rc.1
Version:    1.19.0-11.rhaos4.6.gitf83564f.el8-rc.1
GoVersion:  go1.14.7
Compiler:   gc
Platform:   linux/amd64
Linkmode:   dynamic

Based on the logs, the errors started when I deployed a kata pod. I think this confirms what we already suspected. I was hoping to see them stop when I deleted the pod, but the pod got stuck in Terminating. The logs only show this:

failed to try resolving symlinks in path "/var/log/pods/default_example-fedora_42950b48-520a-4d82-bedf-64538f19f227/example-fedora/0.log": lstat /var/log/pods/default_example-fedora_42950b48-520a-4d82-bedf-64538f19f227: no such file or directory

Events show:

Normal  Killing         12m   kubelet, cmead-kata18-xwznz-worker-c-bm4dn.c.openshift-qe.internal  Stopping container example-fedora

Going to let it sit and see if it ever finishes terminating. Is this the same as https://issues.redhat.com/browse/KATA-229?

Comment 11 Cameron Meadors 2020-09-11 15:42:46 UTC
I discovered that all the bad paths are coming from containers in multus pods.  I even deleted one of the multus pods and the new one that replaced it had the same issue.  Asking multus experts to take a look at it to rule out an issue with multus.
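
One way to tie a crio-<container-id>.scope path back to its owning pod is to query CRI-O on the node (a sketch, assuming crictl is available there; <container-id> is a placeholder for the hash taken from the scope name in the error):

crictl inspect <container-id> | grep io.kubernetes.pod.name    # the owning pod shows up in the container labels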

Comment 12 Fabiano Fidêncio 2020-09-14 14:59:04 UTC
Let me add my guess here.

kata-runtime was wrongly applying the cgroups constraints inside the VM, and there were a bunch of fixes related to this recently (those are not present in the package Cameron tested). My guess is that Multus pods, for some reason, may be exposing those issues on the kata pods. Together with those issues on the runtime side, there's at least one related issue on the agent side as well.

Comment 13 Douglas Smith 2020-09-15 16:31:16 UTC
After some investigation with Cameron, I'm led to believe this isn't kata-specific. I can reliably reproduce it on 4.6 CI clusters as launched by cluster-bot.

I've opened a separate BZ @ https://bugzilla.redhat.com/show_bug.cgi?id=1879205

Comment 14 Cameron Meadors 2020-09-17 13:17:51 UTC
Fabiano, could you link to PRs that you made related to this bug?  Trying to decide what to do with this bug.

Comment 15 Fabiano Fidêncio 2020-09-17 14:04:35 UTC
(In reply to Cameron Meadors from comment #14)
> Fabiano, could you link to PRs that you made related to this bug?  Trying to
> decide what to do with this bug.

Partially. There's a bunch of PRs on kata stepping back on the cgroups support:
- https://github.com/kata-containers/runtime/pull/2793
- https://github.com/kata-containers/runtime/pull/2817
- https://github.com/kata-containers/runtime/pull/2944

Between rebases, backports not yet released, and patches not yet merged (the last PR in the list), those are the parts touching cgroups that may have some influence on the issue you've reported.

All in all, we (as kata containers) were trying to set cgroups within the VM, and due to several issues found along the way, we have taken a step back on that for now.

Comment 18 Cameron Meadors 2020-11-12 15:29:12 UTC
Do we need to keep this open as there is a more general bug filed against multus?  Do we need to track this for kata?

Comment 19 Christophe de Dinechin 2020-11-13 08:55:28 UTC
(In reply to Cameron Meadors from comment #18)
> Do we need to keep this open as there is a more general bug filed against
> multus?  Do we need to track this for kata?

Why not mark the other one as a blocker, just as a reminder that we need to test once this is fixed in Multus?

Comment 20 Francesco Giudici 2020-11-13 10:34:43 UTC
(In reply to Christophe de Dinechin from comment #19)
> (In reply to Cameron Meadors from comment #18)
> > Do we need to keep this open as there is a more general bug filed against
> > multus?  Do we need to track this for kata?
> 
> Why not mark the other one as a blocker, just as a reminder that we need to
> test once this is fixed in Multus?

Seems like a good idea to me: I added the CRI-O bug as a blocker of this one.

Comment 22 Francesco Giudici 2021-11-12 15:08:41 UTC
This error has not been observed anymore.
Investigating on OCP (see bug #1879205), we found that this error message comes from cAdvisor, which tried to fetch metrics data from a non-existent cgroups path.
I think this is solved in OCP 4.7 and later.
Closing, please reopen if needed.
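
For anyone re-checking this on a newer cluster, one way to verify is to grep the kubelet journal on a worker node for the cAdvisor message (a sketch; run on the node itself, e.g. via oc debug node/<node> followed by chroot /host):

journalctl -u kubelet --no-pager | grep "Unable to get Process Stats"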

