Bug 1806027
| Summary: | cAdvisor metrics from system services (crio, others) except for kubelet aren't being collected | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | Node | Assignee: | Joel Smith <joelsmith> |
| Status: | CLOSED ERRATA | QA Contact: | Sunil Choudhary <schoudha> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.4 | CC: | aos-bugs, jokerman, rphillips |
| Target Milestone: | --- | | |
| Target Release: | 4.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1811713 1811720 | Environment: | |
| Last Closed: | 2020-08-04 18:01:33 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1811713 | | |
Description
Clayton Coleman
2020-02-21 20:34:27 UTC
It looks like the Kubelet's /metrics/cadvisor is only reporting kubelet and /, so likely a node bug or a misconfiguration (or just a regression upstream).

Peter: Can you take a look?

My quick scan of the code revealed metrics.Register() only gets called if the metrics http endpoint is enabled. https://github.com/cri-o/cri-o/blob/6ac55ea2ee72bb6313d463ad1726bfbad966197d/server/server.go#L537

I am suspicious, though not yet sure, that changes in this PR https://github.com/cri-o/cri-o/pull/3333 will fix this.

I am pretty sure CRI-O was reporting no stats for systemd cgroups. I don't know how long this has been happening, but since we only recently moved away from cadvisor nabbing the stats on its own, I can imagine that's why we're only seeing this now.

My above suspicion was very wrong, because my understanding of the bug was wrong. Joel, can you help me out on this one?

I think I figured out what is wrong. In the name of efficiency, an upstream PR removed stats for all processes except the kubelet and the runtime. Unfortunately, that PR was docker-specific and so no stats show up in cadvisor when using crio. I think I have a fix for this. I've posted an upstream PR at https://github.com/kubernetes/kubernetes/pull/88823

It would probably be a good idea if we also update crio to write a PID file to /var/run/crio.pid so that the kubelet can reliably find the crio process and its cgroup info.

This was a significant regression in the observability of our platform. Until we have a replacement, we need to preserve the ability to see cAdvisor metrics for most/all system services. I can believe there are efficiency wins, but the core metrics at least (cpu, memory, io, network) I would expect for every scope/slice in /system.slice, at least.

The PR from #5 and its approach are obsolete and I have closed the PR. Instead I have a better way that hopefully gets the functionality described by #6. https://github.com/openshift/machine-config-operator/pull/1540

Moving to 4.5.
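
For anyone trying to reproduce the symptom, here is a rough sketch of one way to check which cgroups show up in a saved copy of the kubelet's /metrics/cadvisor output. It assumes the output is in the Prometheus text format and that each sample carries an `id` label holding the cgroup path; the function names, the default metric name, and the sample text are illustrative and not taken from this bug. It also does not handle escaped quotes inside label values.

```python
import re

# Matches an id="..." label in a Prometheus text-format sample line.
# The \b keeps labels such as pod_id="..." from matching.
_ID_LABEL = re.compile(r'\bid="([^"]*)"')


def cgroup_ids(metrics_text, metric="container_cpu_usage_seconds_total"):
    """Return the set of cgroup ids that have at least one sample for `metric`."""
    ids = set()
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first '{' or whitespace.
        name = re.split(r"[{\s]", line, maxsplit=1)[0]
        if name != metric:
            continue
        match = _ID_LABEL.search(line)
        if match:
            ids.add(match.group(1))
    return ids


def system_services(ids, prefix="/system.slice/"):
    """Return the sorted ids that live under the given slice prefix."""
    return sorted(i for i in ids if i.startswith(prefix))


# Hypothetical sample resembling the broken state described in this bug:
# only "/" and the kubelet have CPU samples.
SAMPLE = """\
# TYPE container_cpu_usage_seconds_total counter
container_cpu_usage_seconds_total{cpu="total",id="/"} 100.5
container_cpu_usage_seconds_total{cpu="total",id="/system.slice/kubelet.service"} 42.0
"""

if __name__ == "__main__":
    print(system_services(cgroup_ids(SAMPLE)))
    # → ['/system.slice/kubelet.service']
```

On a node with the fix, the expectation stated in this report is that the list would also include other units under /system.slice, such as the crio service.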
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409