The query container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"} historically returned CPU metrics about other services on the host, like kubelet, the kernel, etc. I don't see them on recent builds. This is a release blocker because CPU metrics about host processes are a key part of debugging and monitoring infrastructure cost. Assigning to monitoring to start, though it might be a kubelet/cadvisor issue.
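For anyone reading along, here is a rough sketch of which series the id!~"/kubepods.slice/.*" matcher is expected to select. It relies on PromQL regex matchers being fully anchored, and it uses Python's re module as a stand-in for Prometheus's RE2 engine, which should behave the same for this simple pattern. The cgroup ids are hypothetical examples, not output from a real node.

```python
import re

# PromQL regex matchers are fully anchored, so id!~"/kubepods.slice/.*"
# keeps every series whose id does NOT match the whole pattern.
POD_CGROUP = re.compile(r"/kubepods.slice/.*")


def is_non_pod(cgroup_id: str) -> bool:
    """Return True if the series would be kept by id!~"/kubepods.slice/.*"."""
    return POD_CGROUP.fullmatch(cgroup_id) is None


# Hypothetical cgroup ids, for illustration only.
sample_ids = [
    "/",
    "/system.slice/kubelet.service",
    "/kubepods.slice",
    "/kubepods.slice/kubepods-burstable.slice",
]

# Keep only the ids the negative matcher would select.
selected = [cgroup_id for cgroup_id in sample_ids if is_non_pod(cgroup_id)]
print(selected)
```

Note that "/kubepods.slice" itself (no trailing slash) is still selected, since the pattern requires a "/" after "kubepods.slice".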
Might be partially fixed by https://github.com/openshift/machine-config-operator/pull/581, but there are some services with accounting on already that aren't showing up.
As part of any fix, please add an origin e2e test that is part of conformance and verifies that node-level non-pod CPU metrics (and pod CPU metrics) show up. Extending one of the existing "prometheus metrics should be retrieved" e2e tests would do, so this doesn't regress in the future.
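To make the requested check concrete, here is a minimal sketch of the assertion such a test could make. It is not the actual origin e2e code (that suite is written in Go), only an illustration. It assumes the standard Prometheus HTTP API instant-query response shape for container_cpu_usage_seconds_total; the function name and the sample payload are hypothetical.

```python
import json
import re

# Same pattern as in the query; PromQL matchers are fully anchored.
POD_CGROUP = re.compile(r"/kubepods.slice/.*")


def cpu_series_coverage(response_body: str) -> dict:
    """Count pod and non-pod series in a Prometheus instant-query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("query failed")
    counts = {"pod": 0, "non_pod": 0}
    for series in payload["data"]["result"]:
        # A series without an id label counts as non-pod, matching how
        # id!~"..." treats a missing label as the empty string.
        cgroup_id = series["metric"].get("id", "")
        kind = "pod" if POD_CGROUP.fullmatch(cgroup_id) else "non_pod"
        counts[kind] += 1
    return counts


# Hypothetical response body, for illustration only.
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"id": "/system.slice/kubelet.service"},
             "value": [1555800000, "12.5"]},
            {"metric": {"id": "/kubepods.slice/kubepods-burstable.slice"},
             "value": [1555800000, "3.2"]},
        ],
    },
})

counts = cpu_series_coverage(sample_body)
# The regression in this bug would show up as counts["non_pod"] == 0.
assert counts["pod"] > 0 and counts["non_pod"] > 0
```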
Confirmed that container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"} returns no datapoints with a 4.1 cluster launched today.
@Lucas Would you be able to take a look? BTW, I see that https://github.com/openshift/machine-config-operator/pull/581 has now merged.
First PR to remove the overly aggressive dropping is out: https://github.com/coreos/prometheus-operator/pull/2545
Now having this trickle down into the cluster-monitoring-operator. Moving to POST.
The final PR to have this trickle down into the cluster-monitoring stack has been opened: https://github.com/openshift/cluster-monitoring-operator/pull/319
The above PR was merged, and I've verified again that it does work on new clusters. Adding an e2e test now.
PR to add the additional test to prevent this regression in the future has been opened: https://github.com/openshift/origin/pull/22575
Both the fix and the e2e test to catch regressions have been merged. Moving to MODIFIED.
Confirmed that container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"} now returns datapoints. Payload: 4.0.0-0.nightly-2019-04-20-175518
Created attachment 1557483 [details] container_cpu_usage_seconds_total{id!~"/kubepods.slice/.*"} returns datapoints
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758