Bug 1694766

Summary:

No CPU metrics for non-pod services on a node

Product:

OpenShift Container Platform

Reporter:

Clayton Coleman <ccoleman>

Component:

Monitoring

Assignee:

Frederic Branczyk <fbranczy>

Status:

CLOSED ERRATA

QA Contact:

Junqi Zhao <juzhao>

Severity:

high

Priority:

unspecified

Version:

4.1.0

CC:

anpicker, aos-bugs, erooth, fbranczy, jokerman, mloibl, mmccomas, pkrupa, surbania

Target Release:

4.1.0

Hardware:

Unspecified

OS:

Unspecified

Last Closed:

2019-06-04 10:46:44 UTC

Type:

Bug

Attachments:

Description	Flags
container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"} returns datapoints	none

Description Clayton Coleman 2019-04-01 15:24:24 UTC

The query

container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"} 

historically returned CPU metrics about other services on the host, like kubelet, the kernel, etc.

I don't see them on recent builds.

This is a release blocker because CPU metrics for host processes are a key part of debugging and of monitoring infrastructure cost.

Assigning to monitoring to start, though it might be a kubelet/cadvisor issue.
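To make the selector in the query above concrete, here is a minimal sketch of which series `id!~"/kubepods.slice/.*"` keeps. It relies on the fact that Prometheus anchors label regexes at both ends, which `re.fullmatch` reproduces. The `id` values in `sample_ids` are hypothetical and chosen only for illustration; they were not taken from a real node.

```python
import re

# Regex copied from the query in the report. Prometheus matches label
# regexes against the whole label value, so fullmatch is the analogue.
POD_CGROUP = re.compile(r"/kubepods.slice/.*")


def is_selected(cgroup_id: str) -> bool:
    """Return True if id!~"/kubepods.slice/.*" would keep this series."""
    return POD_CGROUP.fullmatch(cgroup_id) is None


# Hypothetical id label values, for illustration only.
sample_ids = [
    "/",
    "/system.slice/kubelet.service",
    "/kubepods.slice",
    "/kubepods.slice/kubepods-burstable.slice",
]

selected = [i for i in sample_ids if is_selected(i)]
print(selected)
```

Note that the top-level `/kubepods.slice` value is still selected, because the regex requires a trailing slash after `kubepods.slice`.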

Comment 1 Clayton Coleman 2019-04-01 15:26:11 UTC

Might be partially fixed by https://github.com/openshift/machine-config-operator/pull/581, but some services that already have accounting enabled aren't showing up.

Comment 2 Clayton Coleman 2019-04-01 15:26:59 UTC

As part of any fix, please add an origin e2e test, included in conformance, that verifies that node-level non-pod CPU metrics (and pod CPU metrics) show up. Adding it to one of the existing "prometheus metrics should be retrieved" e2e tests would keep this from regressing in the future.
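The check requested above could look roughly like the following sketch. It is not the origin e2e test (that suite is written in Go); it only illustrates the assertion, assuming the standard JSON shape returned by the Prometheus `/api/v1/query` endpoint for an instant query on `container_cpu_usage_seconds_total`. The function names and the `SAMPLE` payload are hypothetical, and the payload is hand-written rather than captured from a cluster.

```python
import json


def series_ids(response_text: str) -> list:
    """Extract the id label of every series in an instant-query response."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError("query failed: %s" % body.get("error"))
    return [s["metric"].get("id", "") for s in body["data"]["result"]]


def check_cpu_metrics(response_text: str) -> dict:
    """Report whether pod and non-pod CPU series are both present."""
    ids = series_ids(response_text)
    pod = [i for i in ids if i.startswith("/kubepods.slice/")]
    non_pod = [i for i in ids if not i.startswith("/kubepods.slice/")]
    return {"pod": bool(pod), "non_pod": bool(non_pod)}


# Hand-written sample payload (an assumption, not real cluster output).
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "container_cpu_usage_seconds_total",
                        "id": "/system.slice/kubelet.service"},
             "value": [1554132264, "1234.5"]},
            {"metric": {"__name__": "container_cpu_usage_seconds_total",
                        "id": "/kubepods.slice/kubepods-burstable.slice"},
             "value": [1554132264, "42.0"]},
        ],
    },
})

print(check_cpu_metrics(SAMPLE))
```

A regression like the one in this bug would show up as `non_pod` being false while `pod` stays true.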

Comment 3 Andrew Pickering 2019-04-02 05:42:10 UTC

Confirmed that container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"} returns no datapoints with a 4.1 cluster launched today.
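For anyone reproducing this check outside the console, the query has to be URL-encoded because of its braces, quotes and slashes. The sketch below builds a request URL for the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The base URL is a placeholder, `build_query_url` is a hypothetical helper, and a real cluster will normally also require authentication, which is left out here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Query string copied from the report.
QUERY = 'container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"}'


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL with the PromQL expression encoded."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})


# Placeholder host; substitute the route of the Prometheus instance in use.
url = build_query_url("https://prometheus.example.invalid/", QUERY)

# Round-trip check: decoding the URL gives back the original expression.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(url)
print(decoded == QUERY)
```

An empty `result` array in the response corresponds to the "no datapoints" outcome described in this comment.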

Comment 4 Andrew Pickering 2019-04-02 05:54:56 UTC

@Lucas Would you be able to take a look?

BTW, I see that https://github.com/openshift/machine-config-operator/pull/581 has now merged.

Comment 8 Frederic Branczyk 2019-04-09 18:21:24 UTC

The first PR, which removes the overly aggressive dropping of metrics, is out: https://github.com/coreos/prometheus-operator/pull/2545
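As background on how a drop rule can be too aggressive, here is a small sketch of the general mechanics of a Prometheus relabel rule with `action: drop`: the values of the source labels are joined with a separator (`;` by default), and the sample is dropped when the anchored regex matches the joined string. The rule actually changed in the PR is not reproduced here; both regexes below, the label names chosen as source labels, and the sample label sets are hypothetical.

```python
import re


def dropped(labels: dict, source_labels: list, regex: str,
            separator: str = ";") -> bool:
    """Mimic a relabel rule with action: drop (anchored regex match)."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    return re.fullmatch(regex, value) is not None


# Hypothetical series: one host service (no pod label) and one pod container.
host_series = {"__name__": "container_cpu_usage_seconds_total",
               "id": "/system.slice/kubelet.service"}
pod_series = {"__name__": "container_cpu_usage_seconds_total",
              "id": "/kubepods.slice/kubepods-burstable.slice",
              "pod": "example-pod"}

SOURCE = ["__name__", "pod"]
BROAD = r"container_.*;"      # hypothetical: any container_* metric with empty pod
NARROW = r"container_fs_.*;"  # hypothetical: only container_fs_* with empty pod

print(dropped(host_series, SOURCE, BROAD))   # broad rule removes host CPU series
print(dropped(host_series, SOURCE, NARROW))  # narrow rule keeps it
print(dropped(pod_series, SOURCE, BROAD))    # pod series is kept either way
```

Under these assumptions, the broad rule silently removes every non-pod `container_*` series, which is the kind of symptom reported in this bug.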

Comment 9 Frederic Branczyk 2019-04-11 11:34:28 UTC

This is now trickling down into the cluster-monitoring-operator. Moving to POST.

Comment 10 Frederic Branczyk 2019-04-11 12:10:58 UTC

The final PR to have this trickle down into the cluster-monitoring stack has been opened: https://github.com/openshift/cluster-monitoring-operator/pull/319

Comment 11 Frederic Branczyk 2019-04-15 14:19:23 UTC

The above PR was merged, and I've verified again that it does work on new clusters. Adding an e2e test now.

Comment 12 Frederic Branczyk 2019-04-15 16:32:59 UTC

PR to add the additional test to prevent this regression in the future has been opened: https://github.com/openshift/origin/pull/22575

Comment 13 Frederic Branczyk 2019-04-18 07:12:39 UTC

Both the fix and the e2e test to catch regressions have been merged. Moving to MODIFIED.

Comment 15 Junqi Zhao 2019-04-23 08:16:20 UTC

Confirmed that container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"} returns datapoints now
payload: 4.0.0-0.nightly-2019-04-20-175518

Comment 16 Junqi Zhao 2019-04-23 08:16:46 UTC

Created attachment 1557483
container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"} returns datapoints

Comment 18 errata-xmlrpc 2019-06-04 10:46:44 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758