Bug 1694766

Summary: No CPU metrics for non-pod services on a node
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Monitoring
Assignee: Frederic Branczyk <fbranczy>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.1.0
CC: anpicker, aos-bugs, erooth, fbranczy, jokerman, mloibl, mmccomas, pkrupa, surbania
Target Milestone: ---
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-06-04 10:46:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments:
Description: container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"} returns datapoints
Flags: none

Description Clayton Coleman 2019-04-01 15:24:24 UTC
The query

container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"}

historically returned CPU metrics about other services on the host, like kubelet, the kernel, etc.

I don't see them on recent builds.

This is a release blocker because CPU metrics about host processes are a key part of debugging and of monitoring infrastructure cost.

Assigning to monitoring to start, might be a kubelet/cadvisor issue.
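
To illustrate what the selector in this report keeps and what it excludes: Prometheus regex matchers are fully anchored, so id!~"/kubepods.slice/.*" keeps every series whose id label does not match the whole pattern. The sketch below mimics that behaviour with Python's re.fullmatch. The cgroup paths are made-up examples, not output from a real cluster, and is_non_pod is a hypothetical helper name.

```python
import re

# Pattern taken from the selector id!~"/kubepods.slice/.*".
# Prometheus anchors regex matchers at both ends; re.fullmatch mimics that.
POD_CGROUP = re.compile(r"/kubepods.slice/.*")


def is_non_pod(cgroup_id: str) -> bool:
    """Return True when a series with this id would be kept by the !~ matcher."""
    return POD_CGROUP.fullmatch(cgroup_id) is None


# Hypothetical cgroup ids, for illustration only.
sample_ids = [
    "/",
    "/system.slice/kubelet.service",
    "/system.slice/crio.service",
    "/kubepods.slice/kubepods-burstable.slice/pod1234",
]

non_pod_ids = [cgroup_id for cgroup_id in sample_ids if is_non_pod(cgroup_id)]
print(non_pod_ids)
```

Under these assumptions, only the entry below /kubepods.slice/ is filtered out; the root cgroup and the system services are the series the report expects to see.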

Comment 1 Clayton Coleman 2019-04-01 15:26:11 UTC
Might be partially fixed by https://github.com/openshift/machine-config-operator/pull/581, but there are some services with accounting on already that aren't showing up.

Comment 2 Clayton Coleman 2019-04-01 15:26:59 UTC
As part of any fix, please add an origin e2e test, included in the conformance suite, that verifies that node-level non-pod CPU metrics (and pod CPU metrics) show up. Adding it to one of the existing "prometheus metrics should be retrieved" e2e tests would keep this from regressing in the future.
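
The kind of check requested here could look roughly like the following. This is a Python sketch, not the actual origin e2e test (the origin suite is written in Go). The names has_datapoints and missing_queries, the pod-side query, and the sample payloads are all assumptions for illustration; only the response shape (status, data.result) follows the Prometheus HTTP API for instant queries.

```python
import json
from typing import Dict, List

NON_POD_QUERY = 'container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"}'
# Assumed counterpart for pod metrics; the report does not spell this query out.
POD_QUERY = 'container_cpu_usage_seconds_total{id=~"/kubepods.slice/.*"}'


def has_datapoints(response_body: str) -> bool:
    """Return True if an instant-query response contains at least one series."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        return False
    return len(payload.get("data", {}).get("result", [])) > 0


def missing_queries(responses: Dict[str, str]) -> List[str]:
    """Return the queries whose responses carry no datapoints."""
    return [query for query, body in responses.items() if not has_datapoints(body)]


# Hand-written sample responses (assumed, not captured from a cluster).
empty = json.dumps(
    {"status": "success", "data": {"resultType": "vector", "result": []}}
)
populated = json.dumps(
    {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {
                    "metric": {"__name__": "container_cpu_usage_seconds_total"},
                    "value": [1555315200, "1234.5"],
                }
            ],
        },
    }
)

# Simulates the state described in this bug: pod metrics present, non-pod missing.
broken_cluster = {NON_POD_QUERY: empty, POD_QUERY: populated}
print(missing_queries(broken_cluster))
```

A real test would fetch each response from the cluster's Prometheus endpoint and fail when missing_queries returns a non-empty list.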

Comment 3 Andrew Pickering 2019-04-02 05:42:10 UTC
Confirmed that container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"} returns no datapoints with a 4.1 cluster launched today.

Comment 4 Andrew Pickering 2019-04-02 05:54:56 UTC
@Lucas Would you be able to take a look?

BTW, I see that https://github.com/openshift/machine-config-operator/pull/581 has now merged.

Comment 8 Frederic Branczyk 2019-04-09 18:21:24 UTC
The first PR to remove the overly aggressive dropping of metrics is out: https://github.com/coreos/prometheus-operator/pull/2545
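
For readers unfamiliar with how a drop rule can remove series like these, here is a small Python sketch of the Prometheus "drop" relabel action: the values of the source labels are joined with a separator (";" by default), missing labels count as empty strings, and the series is dropped when the anchored regex matches the joined string. The exact rule changed by the PR is not quoted in this report, so ASSUMED_RULE below is a hypothetical example, as are the drops helper and the sample series.

```python
import re
from typing import Dict, List


def drops(series: Dict[str, str], source_labels: List[str], regex: str,
          separator: str = ";") -> bool:
    """Mimic the 'drop' relabel action: True means the series is discarded."""
    # Missing labels are treated as empty strings before joining.
    joined = separator.join(series.get(name, "") for name in source_labels)
    # Relabel regexes are anchored at both ends, like re.fullmatch.
    return re.fullmatch(regex, joined) is not None


# Hypothetical rule: drop container_* series that have an empty image label.
ASSUMED_RULE = {
    "source_labels": ["__name__", "image"],
    "regex": r"container_([a-z_]+);",
}

# Made-up series: a host service without an image label, and a pod container with one.
kubelet_series = {
    "__name__": "container_cpu_usage_seconds_total",
    "id": "/system.slice/kubelet.service",
}
pod_series = {
    "__name__": "container_cpu_usage_seconds_total",
    "id": "/kubepods.slice/pod1234/abcd",
    "image": "example.invalid/app:latest",
}

print(drops(kubelet_series, **ASSUMED_RULE))
print(drops(pod_series, **ASSUMED_RULE))
```

With a rule of this shape, every series for a host service would be discarded at scrape time while pod containers survive, which is consistent with the symptom described in this bug.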

Comment 9 Frederic Branczyk 2019-04-11 11:34:28 UTC
Now having this trickle down into the cluster-monitoring-operator. Moving to POST.

Comment 10 Frederic Branczyk 2019-04-11 12:10:58 UTC
The final PR to have this trickle down into the cluster-monitoring stack has been opened: https://github.com/openshift/cluster-monitoring-operator/pull/319

Comment 11 Frederic Branczyk 2019-04-15 14:19:23 UTC
The above PR was merged, and I've verified again that it does work on new clusters. Adding an e2e test now.

Comment 12 Frederic Branczyk 2019-04-15 16:32:59 UTC
PR to add the additional test to prevent this regression in the future has been opened: https://github.com/openshift/origin/pull/22575

Comment 13 Frederic Branczyk 2019-04-18 07:12:39 UTC
Both the fix and the e2e test to catch regressions have been merged. Moving to MODIFIED.

Comment 15 Junqi Zhao 2019-04-23 08:16:20 UTC
Confirmed that container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"} returns datapoints now
payload: 4.0.0-0.nightly-2019-04-20-175518

Comment 16 Junqi Zhao 2019-04-23 08:16:46 UTC
Created attachment 1557483
container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"} returns datapoints

Comment 18 errata-xmlrpc 2019-06-04 10:46:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.