Bug 1806027

Summary: cAdvisor metrics from system services (crio, others) except for kubelet aren't being collected
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Node
Assignee: Joel Smith <joelsmith>
Status: CLOSED ERRATA
QA Contact: Sunil Choudhary <schoudha>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.4
CC: aos-bugs, jokerman, rphillips
Target Milestone: ---
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1811713 1811720
Environment:
Last Closed: 2020-08-04 18:01:33 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1811713

Description Clayton Coleman 2020-02-21 20:34:27 UTC
There should be container_memory_usage_bytes (and all the other cadvisor metrics) for all services in system.slice, but 4.5 clusters don't appear to be showing them. Only id="/" and id="/system.slice/kubelet.service" are showing up; crio (for instance) is not.

cadvisor should be reporting all service data, not just containers and kubelet.

Must be fixed for release and then backported as far back as it impacts.

Comment 1 Clayton Coleman 2020-02-21 20:36:47 UTC
It looks like the Kubelet's /metrics/cadvisor is only reporting kubelet and /, so likely a node bug or a misconfiguration (or just a regression upstream).
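
One way to confirm which cgroups are present is to list the distinct id label values for a single metric in the text returned by /metrics/cadvisor. The following is a minimal sketch, assuming the endpoint output has already been saved as a string; the function name cgroup_ids and the regular expressions are illustrative and not part of any kubelet or cAdvisor tooling.

```python
import re

# One Prometheus text-format sample line: name{labels} value [timestamp]
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)'
)
# The id label only (not pod_id or similar), allowing escaped characters.
ID_RE = re.compile(r'(?:^|,)id="((?:[^"\\]|\\.)*)"')


def cgroup_ids(metrics_text, metric="container_memory_usage_bytes"):
    """Return the sorted distinct id label values reported for one metric."""
    ids = set()
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP and TYPE comment lines
        sample = SAMPLE_RE.match(line)
        if not sample or sample.group("name") != metric:
            continue
        id_match = ID_RE.search(sample.group("labels"))
        if id_match:
            ids.add(id_match.group(1))
    return sorted(ids)
```

On an affected node the result would be expected to contain little more than "/" and the kubelet service, with no other entries under /system.slice.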

Comment 2 Ryan Phillips 2020-02-21 21:12:03 UTC
Peter: Can you take a look?

My quick scan of the code revealed metrics.Register() only gets called if the metrics http endpoint is enabled.

https://github.com/cri-o/cri-o/blob/6ac55ea2ee72bb6313d463ad1726bfbad966197d/server/server.go#L537

Comment 3 Peter Hunt 2020-02-26 22:28:35 UTC
I suspect, though I am not yet sure, that the changes in this PR https://github.com/cri-o/cri-o/pull/3333 will fix this.

I am pretty sure CRI-O was reporting no stats for systemd cgroups. I don't know how long this has been happening, but since we only recently moved away from cadvisor nabbing the stats on its own, I can imagine that's why we're only seeing this now.

Comment 4 Peter Hunt 2020-03-04 21:28:02 UTC
My above suspicion was very wrong, because my understanding of the bug was wrong. 

Joel, can you help me out on this one?

Comment 5 Joel Smith 2020-03-05 01:03:24 UTC
I think I figured out what is wrong. In the name of efficiency, an upstream PR removed stats for all processes except the kubelet and the runtime. Unfortunately, that PR was docker-specific, so no runtime stats show up in cadvisor when using crio. I think I have a fix for this. I've posted an upstream PR at https://github.com/kubernetes/kubernetes/pull/88823

It would probably be a good idea if we also update crio to write a PID file to /var/run/crio.pid so that the kubelet can reliably find the crio process and its cgroup info.
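
To illustrate the PID file idea, here is a rough sketch of how a process's cgroup could be looked up from such a file. It assumes crio writes its PID to /var/run/crio.pid (which is only proposed above, not something crio is known to do) and relies on the standard /proc/<pid>/cgroup format of hierarchy-ID:controller-list:cgroup-path. The function names are hypothetical and this is not the kubelet's actual lookup code.

```python
from pathlib import Path


def parse_proc_cgroup(text):
    """Map each controller list (e.g. 'cpu,cpuacct', or '' on cgroup v2)
    to the cgroup path found in /proc/<pid>/cgroup content."""
    paths = {}
    for line in text.splitlines():
        # Each line is hierarchy-ID:controller-list:cgroup-path
        parts = line.split(":", 2)
        if len(parts) == 3:
            paths[parts[1]] = parts[2]
    return paths


def cgroup_for_pid_file(pid_file="/var/run/crio.pid", proc_root="/proc"):
    """Read a PID from pid_file and return that process's cgroup paths.

    The default pid_file path is the one proposed in this comment and is
    an assumption, as is the proc_root parameter (added for testing).
    """
    pid = int(Path(pid_file).read_text().strip())
    return parse_proc_cgroup(Path(proc_root, str(pid), "cgroup").read_text())
```

With a PID file in place, the lookup no longer depends on guessing the runtime's process name or unit name.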

Comment 6 Clayton Coleman 2020-03-05 21:58:25 UTC
This was a significant regression in the observability of our platform. Until we have a replacement, we need to preserve the ability to see cAdvisor metrics for most/all system services. I can believe there are efficiency wins, but I would expect at least the core metrics (cpu, memory, io, network) for every scope/slice in /system.slice.
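
As a sketch of how that expectation could be checked, the snippet below reports which core categories are missing for each expected cgroup. The choice of one representative cAdvisor metric per category is an assumption made for illustration, as are the function name and the input shape (a dict from metric name to the set of id values seen for it); some categories, network in particular, may not be reported per service at all.

```python
# One representative metric per core category (an illustrative assumption).
CORE_METRICS = {
    "cpu": "container_cpu_usage_seconds_total",
    "memory": "container_memory_usage_bytes",
    "io": "container_fs_reads_bytes_total",
    "network": "container_network_receive_bytes_total",
}


def missing_core_metrics(ids_by_metric, expected_ids):
    """Return {cgroup id: [missing categories]} for cgroups lacking data.

    ids_by_metric maps a metric name to the set of id label values that
    have at least one sample for that metric.
    """
    missing = {}
    for cgroup_id in expected_ids:
        gaps = [
            category
            for category, metric in CORE_METRICS.items()
            if cgroup_id not in ids_by_metric.get(metric, set())
        ]
        if gaps:
            missing[cgroup_id] = gaps
    return missing
```

An empty result means every expected scope or slice has at least one sample in each category.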

Comment 7 Joel Smith 2020-03-06 19:49:24 UTC
The PR from #5 and its approach are obsolete and I have closed the PR. Instead I have a better way that hopefully gets the functionality described by #6.

https://github.com/openshift/machine-config-operator/pull/1540

Comment 8 Ryan Phillips 2020-03-06 19:53:24 UTC
Moving to 4.5.

Comment 13 errata-xmlrpc 2020-08-04 18:01:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409