1694766 – No CPU metrics for non-pod services on a node

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.1.0
Assignee:	Frederic Branczyk
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-04-01 15:24 UTC by Clayton Coleman
Modified:	2019-06-04 10:46 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-06-04 10:46:44 UTC
Target Upstream Version:
Embargoed:

Attachments
container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"} returns datapoints (134.30 KB, image/png) 2019-04-23 08:16 UTC, Junqi Zhao

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:0758	0	None	None	None	2019-06-04 10:46:53 UTC

Description Clayton Coleman 2019-04-01 15:24:24 UTC

The query

container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"} 

historically returned CPU metrics about other services on the host, like kubelet, the kernel, etc.
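
For readers less familiar with the matcher: PromQL regex matchers are anchored at both ends, so `id!~"/kubepods.slice/.*"` keeps every series whose `id` label is not a path below `/kubepods.slice/`. The following is a minimal Python sketch of that selection only; the cgroup paths in `sample` are hypothetical examples, not output from a real cluster.

```python
import re

# Same pattern as in the query. PromQL anchors regex matchers at both ends,
# which re.fullmatch mimics here.
POD_CGROUP = re.compile(r"/kubepods.slice/.*")


def non_pod_ids(ids):
    """Return the cgroup ids that id!~"/kubepods.slice/.*" would keep."""
    return [i for i in ids if POD_CGROUP.fullmatch(i) is None]


# Hypothetical cgroup paths, for illustration only.
sample = [
    "/",
    "/system.slice/kubelet.service",
    "/system.slice/crio.service",
    "/kubepods.slice",
    "/kubepods.slice/kubepods-burstable.slice",
]

print(non_pod_ids(sample))
```

Note that the bare `/kubepods.slice` path is kept, because the pattern requires a trailing `/` after `slice`.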

I don't see them on recent builds.

This is a release blocker because CPU metrics about host processes are a key part of debugging and of monitoring infrastructure cost.

Assigning to Monitoring to start, though this might be a kubelet/cAdvisor issue.

Comment 1 Clayton Coleman 2019-04-01 15:26:11 UTC

Might be partially fixed by https://github.com/openshift/machine-config-operator/pull/581, but some services that already have accounting enabled aren't showing up.

Comment 2 Clayton Coleman 2019-04-01 15:26:59 UTC

As part of any fix, please add an origin e2e test, included in the conformance suite, that verifies that node-level non-pod CPU metrics (and pod CPU metrics) show up. Extending one of the existing "prometheus metrics should be retrieved" e2e tests would keep this from regressing in the future.
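
The kind of check requested here can be sketched as follows. This is not the origin e2e test (the test linked later in this bug is a separate implementation); it is a minimal Python illustration that assumes the standard Prometheus instant-query JSON response shape. `run_query` is a hypothetical callable that takes a PromQL string and returns the decoded response, and the `id=~` selector used for pod metrics is an assumption made for symmetry with the non-pod query.

```python
def has_series(response):
    """True if a Prometheus instant-query response holds at least one series."""
    if response.get("status") != "success":
        return False
    return len(response.get("data", {}).get("result", [])) > 0


def missing_cpu_metrics(run_query):
    """Return the names of the CPU metric groups that returned no series.

    run_query is a hypothetical callable: PromQL string -> decoded JSON dict.
    """
    queries = {
        # Query from this bug report: everything outside the pod cgroup tree.
        "non-pod": 'container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"}',
        # Assumed counterpart for pod-level metrics.
        "pod": 'container_cpu_usage_seconds_total{id=~"/kubepods.slice/.*"}',
    }
    return [name for name, q in queries.items() if not has_series(run_query(q))]
```

A regression test would then fail whenever `missing_cpu_metrics(...)` returns a non-empty list.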

Comment 3 Andrew Pickering 2019-04-02 05:42:10 UTC

Confirmed that container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"} returns no datapoints with a 4.1 cluster launched today.
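
One way to reproduce this check outside the web console is to call the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The sketch below only builds the request URL; the base URL is a hypothetical placeholder, and the authentication that an OpenShift Prometheus route normally requires is not shown.

```python
from urllib.parse import urlencode

QUERY = 'container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"}'


def instant_query_url(base_url, query=QUERY):
    """Build a Prometheus instant-query URL with the PromQL expression URL-encoded."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": query})


# "http://localhost:9090" is a placeholder, not an address from this bug.
print(instant_query_url("http://localhost:9090"))
```

An empty `data.result` list in the JSON response corresponds to "no datapoints".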

Comment 4 Andrew Pickering 2019-04-02 05:54:56 UTC

@Lucas Would you be able to take a look?

BTW, I see that https://github.com/openshift/machine-config-operator/pull/581 has now merged.

Comment 8 Frederic Branczyk 2019-04-09 18:21:24 UTC

The first PR to remove the overly aggressive dropping is out: https://github.com/coreos/prometheus-operator/pull/2545

Comment 9 Frederic Branczyk 2019-04-11 11:34:28 UTC

This is now trickling down into the cluster-monitoring-operator. Moving to POST.

Comment 10 Frederic Branczyk 2019-04-11 12:10:58 UTC

The final PR to have this trickle down into the cluster-monitoring stack has been opened: https://github.com/openshift/cluster-monitoring-operator/pull/319

Comment 11 Frederic Branczyk 2019-04-15 14:19:23 UTC

The above PR was merged, and I've verified again that it does work on new clusters. Adding an e2e test now.

Comment 12 Frederic Branczyk 2019-04-15 16:32:59 UTC

PR to add the additional test to prevent this regression in the future has been opened: https://github.com/openshift/origin/pull/22575

Comment 13 Frederic Branczyk 2019-04-18 07:12:39 UTC

Both the fix and the e2e test to catch regressions have been merged. Moving to MODIFIED.

Comment 15 Junqi Zhao 2019-04-23 08:16:20 UTC

Confirmed that container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"} returns datapoints now
payload: 4.0.0-0.nightly-2019-04-20-175518

Comment 16 Junqi Zhao 2019-04-23 08:16:46 UTC

Created attachment 1557483
container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"} returns datapoints

Comment 18 errata-xmlrpc 2019-06-04 10:46:44 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
