1806027 – cAdvisor metrics from system services (crio, others) except for kubelet aren't being collected

Bug 1806027 - cAdvisor metrics from system services (crio, others) except for kubelet aren't being collected

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Joel Smith
QA Contact:	Sunil Choudhary
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1811713

Reported:	2020-02-21 20:34 UTC by Clayton Coleman
Modified:	2020-08-04 18:01 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1811713 1811720
Environment:
Last Closed:	2020-08-04 18:01:33 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-config-operator pull 1540	0	None	closed	Bug 1806027: Specify cgroups in kubelet.conf so cAdvisor stats will be tracked	2020-09-01 11:16:22 UTC
Red Hat Product Errata	RHBA-2020:2409	0	None	None	None	2020-08-04 18:01:36 UTC

Description Clayton Coleman 2020-02-21 20:34:27 UTC

There should be container_memory_usage_bytes (and all the other cAdvisor metrics) for all services in system.slice, but 4.5 clusters don't appear to be showing them.  Only id="/" and id="/system.slice/kubelet.service" are showing up; crio (for instance) is not.

cadvisor should be reporting all service data, not just containers and kubelet.

Must be fixed for release and then backported as far back as it impacts.

Comment 1 Clayton Coleman 2020-02-21 20:36:47 UTC

It looks like the Kubelet's /metrics/cadvisor is only reporting kubelet and /, so likely a node bug or a misconfiguration (or just a regression upstream).
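One way to confirm which cgroups the endpoint reports is to list the distinct id label values for a metric in the scraped text. The following is a minimal sketch, assuming the /metrics/cadvisor output is already available as a string in the Prometheus text format; the function name cgroup_ids and the sample data are hypothetical and only mimic the symptom described above.

```python
import re

# Matches one labelled sample line of the Prometheus text format, e.g.
#   container_memory_usage_bytes{id="/"} 8.1e+09 1582317267000
SAMPLE_RE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+\S+')
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def cgroup_ids(metrics_text, metric="container_memory_usage_bytes"):
    """Return the sorted distinct `id` label values reported for `metric`."""
    ids = set()
    for line in metrics_text.splitlines():
        if line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        match = SAMPLE_RE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        if "id" in labels:
            ids.add(labels["id"])
    return sorted(ids)


# Hypothetical scrape output (not taken from a real cluster).
sample = """\
# TYPE container_memory_usage_bytes gauge
container_memory_usage_bytes{id="/"} 8.1e+09 1582317267000
container_memory_usage_bytes{id="/system.slice/kubelet.service"} 1.5e+08 1582317267000
container_cpu_usage_seconds_total{id="/system.slice/crio.service"} 12.5 1582317267000
"""
print(cgroup_ids(sample))
```

On a healthy node the returned list would be expected to include every service under /system.slice rather than only the root cgroup and the kubelet. Fetching the text from the kubelet requires cluster credentials and is left out of the sketch.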

Comment 2 Ryan Phillips 2020-02-21 21:12:03 UTC

Peter: Can you take a look?

My quick scan of the code revealed metrics.Register() only gets called if the metrics http endpoint is enabled.

https://github.com/cri-o/cri-o/blob/6ac55ea2ee72bb6313d463ad1726bfbad966197d/server/server.go#L537

Comment 3 Peter Hunt 2020-02-26 22:28:35 UTC

I suspect, though I am not yet sure, that the changes in this PR https://github.com/cri-o/cri-o/pull/3333 will fix this.

I am pretty sure CRI-O was reporting no stats for systemd cgroups. I don't know how long this has been happening, but since we only recently moved away from cadvisor nabbing the stats on its own, I can imagine that's why we're only seeing this now.

Comment 4 Peter Hunt 2020-03-04 21:28:02 UTC

My above suspicion was very wrong, because my understanding of the bug was wrong. 

Joel, can you help me out on this one?

Comment 5 Joel Smith 2020-03-05 01:03:24 UTC

I think I figured out what is wrong. In the name of efficiency, an upstream PR removed stats for all processes except the kubelet and the runtime. Unfortunately, that PR was docker-specific and so no stats show up in cadvisor when using crio. I think I have a fix for this. I've posted an upstream PR at https://github.com/kubernetes/kubernetes/pull/88823

It would probably be a good idea if we also update crio to write a PID file to /var/run/crio.pid so that the kubelet can reliably find the crio process and its cgroup info.
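To illustrate the PID file idea, here is a minimal sketch of how a process's cgroup could be resolved from a PID file on Linux. It is not the kubelet's implementation: the helper names are hypothetical, /var/run/crio.pid is only the path proposed above, and the lookup relies on the documented "hierarchy-ID:controller-list:cgroup-path" layout of /proc/<pid>/cgroup.

```python
from pathlib import Path


def parse_proc_cgroup(text):
    """Map each controller list in /proc/<pid>/cgroup content to its cgroup path."""
    cgroups = {}
    for line in text.splitlines():
        # Each line is "hierarchy-ID:controller-list:cgroup-path".
        parts = line.strip().split(":", 2)
        if len(parts) != 3:
            continue
        _hierarchy, controllers, path = parts
        # cgroup v2 entries have an empty controller list ("0::/path");
        # the key "unified" is a label chosen for this sketch.
        cgroups[controllers or "unified"] = path
    return cgroups


def cgroups_from_pid_file(pid_file="/var/run/crio.pid", proc_root="/proc"):
    """Read a PID from `pid_file` and return that process's cgroup paths."""
    pid = int(Path(pid_file).read_text().strip())
    return parse_proc_cgroup(Path(proc_root, str(pid), "cgroup").read_text())


# Hypothetical /proc/<pid>/cgroup content for a crio process.
sample = "11:memory:/system.slice/crio.service\n4:cpu,cpuacct:/system.slice/crio.service\n"
print(parse_proc_cgroup(sample)["memory"])
```

Calling cgroups_from_pid_file() only works on a host where the PID file actually exists, which this report proposes but does not confirm.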

Comment 6 Clayton Coleman 2020-03-05 21:58:25 UTC

This was a significant regression in the observability of our platform.  Until we have a replacement, we need to preserve the ability to see cAdvisor metrics for most/all system services.  I can believe there are efficiency wins, but the core metrics at least (cpu, memory, io, network) I would expect for every scope/slice in /system.slice, at least.
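The expectation above can be written as a coverage check. The sketch below assumes samples have already been reduced to (metric name, cgroup id) pairs; the function name and the choice of one cAdvisor metric per area are illustrative assumptions, and the report does not establish whether every area (network in particular) is available for a systemd service cgroup.

```python
# One cAdvisor metric picked per "core" area; the selection is an
# assumption made for this sketch, not something the report specifies.
CORE_METRICS = {
    "cpu": "container_cpu_usage_seconds_total",
    "memory": "container_memory_usage_bytes",
    "io": "container_fs_reads_bytes_total",
    "network": "container_network_receive_bytes_total",
}


def missing_core_metrics(samples, expected_ids):
    """Return {cgroup_id: [areas with no sample]} for each expected cgroup id.

    `samples` is an iterable of (metric_name, cgroup_id) pairs.
    Ids whose core areas are all covered are omitted from the result.
    """
    seen = {cgroup_id: set() for cgroup_id in expected_ids}
    for name, cgroup_id in samples:
        if cgroup_id in seen:
            seen[cgroup_id].add(name)
    gaps = {}
    for cgroup_id, names in seen.items():
        missing = [area for area, metric in CORE_METRICS.items() if metric not in names]
        if missing:
            gaps[cgroup_id] = missing
    return gaps


# Hypothetical input: only the kubelet reports anything under /system.slice.
samples = [
    ("container_cpu_usage_seconds_total", "/system.slice/kubelet.service"),
    ("container_memory_usage_bytes", "/system.slice/kubelet.service"),
]
expected = ["/system.slice/kubelet.service", "/system.slice/crio.service"]
print(missing_core_metrics(samples, expected))
```

Passing the expected ids explicitly matters here, because a service that reports nothing at all would otherwise never appear in the result.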

Comment 7 Joel Smith 2020-03-06 19:49:24 UTC

The PR from comment 5 and its approach are obsolete, and I have closed it. Instead, I have a better approach that should provide the functionality described in comment 6.

https://github.com/openshift/machine-config-operator/pull/1540

Comment 8 Ryan Phillips 2020-03-06 19:53:24 UTC

Moving to 4.5.

Comment 13 errata-xmlrpc 2020-08-04 18:01:33 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
