Bug 1694766 - No CPU metrics for non-pod services on a node
Summary: No CPU metrics for non-pod services on a node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.1.0
Assignee: Frederic Branczyk
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-04-01 15:24 UTC by Clayton Coleman
Modified: 2019-06-04 10:46 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:46:44 UTC
Target Upstream Version:


Attachments
container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"} returns datapoints (134.30 KB, image/png)
2019-04-23 08:16 UTC, Junqi Zhao


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 None None None 2019-06-04 10:46:53 UTC

Description Clayton Coleman 2019-04-01 15:24:24 UTC
The query

container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"} 

historically returned CPU metrics about other services on the host, like kubelet, the kernel, etc.
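
For readers less familiar with PromQL matchers: `!~` is a negative regular-expression match, and Prometheus anchors the pattern at both ends. The following is a minimal Python sketch of which series the selector keeps. The `id` values are made up for illustration (real cgroup paths vary by node), and Python's `re` stands in for Prometheus's RE2 engine, which behaves the same for a pattern this simple.

```python
import re

# Pattern taken from the query. Prometheus anchors label regexes at both
# ends, which re.fullmatch emulates here.
KUBEPODS = re.compile(r"/kubepods.slice/.*")


def is_non_pod(cgroup_id: str) -> bool:
    """Return True when the id passes the id!~"/kubepods.slice/.*" matcher."""
    return KUBEPODS.fullmatch(cgroup_id) is None


# Hypothetical id label values, for illustration only.
sample_ids = [
    "/",
    "/system.slice/kubelet.service",
    "/kubepods.slice/kubepods-burstable.slice",
    "/kubepods.slice",
]

# Keep only the ids that the negative matcher lets through.
non_pod_ids = [i for i in sample_ids if is_non_pod(i)]
print(non_pod_ids)
```

Note that the bare `/kubepods.slice` entry also passes the matcher, because the pattern requires a trailing slash after `kubepods.slice`.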

I don't see them on recent builds.

This is a release blocker because CPU metrics about host processes are a key part of debugging and monitoring infrastructure cost.

Assigning to monitoring to start, though it might be a kubelet/cadvisor issue.

Comment 1 Clayton Coleman 2019-04-01 15:26:11 UTC
Might be partially fixed by https://github.com/openshift/machine-config-operator/pull/581, but some services that already have accounting enabled aren't showing up.

Comment 2 Clayton Coleman 2019-04-01 15:26:59 UTC
As part of any fix, please add an origin e2e test that is part of conformance and verifies that node-level non-pod CPU metrics (and pod CPU metrics) show up. Adding it to one of the existing "prometheus metrics should be retrieved" e2e tests will keep this from regressing in the future.
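
As a rough illustration of the kind of check requested here, the sketch below runs an instant query against the Prometheus HTTP API (`/api/v1/query`) and reports whether any series came back. It is not the origin e2e test, which is written in Go. The base URL is a placeholder, authentication (which a real cluster would typically require) is omitted, and the pod-side query is an assumed counterpart to the non-pod query from this bug.

```python
import json
import urllib.parse
import urllib.request

# Non-pod query from this bug report.
NON_POD_QUERY = 'container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"}'
# Assumed counterpart for pod CPU metrics (not taken from the report).
POD_QUERY = 'container_cpu_usage_seconds_total{id=~"/kubepods.slice/.*"}'


def instant_query(base_url: str, query: str) -> dict:
    """Run an instant query via the Prometheus HTTP API and return the parsed JSON."""
    params = urllib.parse.urlencode({"query": query})
    url = base_url.rstrip("/") + "/api/v1/query?" + params
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def has_datapoints(response: dict) -> bool:
    """Return True when a successful query response contains at least one series."""
    if response.get("status") != "success":
        return False
    return len(response.get("data", {}).get("result", [])) > 0


if __name__ == "__main__":
    base_url = "http://localhost:9090"  # placeholder, adjust for the cluster
    for name, query in (("non-pod", NON_POD_QUERY), ("pod", POD_QUERY)):
        ok = has_datapoints(instant_query(base_url, query))
        print(f"{name} CPU metrics present: {ok}")
```

Keeping `has_datapoints` separate from the HTTP call makes the pass/fail logic easy to exercise against canned responses.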

Comment 3 Andrew Pickering 2019-04-02 05:42:10 UTC
Confirmed that container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"} returns no datapoints with a 4.1 cluster launched today.

Comment 4 Andrew Pickering 2019-04-02 05:54:56 UTC
@Lucas Would you be able to take a look?

BTW, I see that https://github.com/openshift/machine-config-operator/pull/581 has now merged.

Comment 8 Frederic Branczyk 2019-04-09 18:21:24 UTC
The first PR, which removes the overly aggressive dropping, is out: https://github.com/coreos/prometheus-operator/pull/2545
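
To illustrate how a metric relabeling rule with a `drop` action can remove more series than intended, here is a small Python sketch of the general mechanism: Prometheus joins the source label values with a separator (`;` by default) and drops the sample if the anchored regex matches the joined string. The rule and the series below are hypothetical and are not claimed to be the rule changed in the PR above.

```python
import re


def dropped(labels: dict, source_labels: list, regex: str, separator: str = ";") -> bool:
    """Emulate a relabel 'drop' action: join the source labels, match anchored regex."""
    joined = separator.join(labels.get(name, "") for name in source_labels)
    return re.fullmatch(regex, joined) is not None


# Hypothetical rule: drop container_* series whose image label is empty.
rule = {"source_labels": ["__name__", "image"], "regex": r"container_.+;"}

# Hypothetical series: a pod container with an image, and a host service without one.
pod_series = {
    "__name__": "container_cpu_usage_seconds_total",
    "image": "example/image:tag",
    "id": "/kubepods.slice/pod-x",
}
host_series = {
    "__name__": "container_cpu_usage_seconds_total",
    "image": "",
    "id": "/system.slice/kubelet.service",
}

for name, series in (("pod", pod_series), ("host", host_series)):
    is_dropped = dropped(series, rule["source_labels"], rule["regex"])
    print(f"{name} series dropped: {is_dropped}")
```

Under this hypothetical rule the host-level series has no `image` label value, so it matches and is discarded, while the pod series survives.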

Comment 9 Frederic Branczyk 2019-04-11 11:34:28 UTC
This is now trickling down into the cluster-monitoring-operator. Moving to POST.

Comment 10 Frederic Branczyk 2019-04-11 12:10:58 UTC
The final PR to have this trickle down into the cluster-monitoring stack has been opened: https://github.com/openshift/cluster-monitoring-operator/pull/319

Comment 11 Frederic Branczyk 2019-04-15 14:19:23 UTC
The above PR was merged, and I've verified again that it does work on new clusters. Adding an e2e test now.

Comment 12 Frederic Branczyk 2019-04-15 16:32:59 UTC
PR to add the additional test to prevent this regression in the future has been opened: https://github.com/openshift/origin/pull/22575

Comment 13 Frederic Branczyk 2019-04-18 07:12:39 UTC
Both the fix and the e2e test to catch regressions have been merged. Moving to MODIFIED.

Comment 15 Junqi Zhao 2019-04-23 08:16:20 UTC
Confirmed that container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"} returns datapoints now
payload: 4.0.0-0.nightly-2019-04-20-175518

Comment 16 Junqi Zhao 2019-04-23 08:16:46 UTC
Created attachment 1557483
container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"} returns datapoints

Comment 18 errata-xmlrpc 2019-06-04 10:46:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

