Bug 1475034 - Metrics chart reporting 74000 Millicores for an app running on a node with only 8 cores
Status: CLOSED INSUFFICIENT_DATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Metrics
Version: 3.3.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.3.1
Assigned To: Solly Ross
QA Contact: Junqi Zhao
Docs Contact:
Depends On:
Blocks:
Reported: 2017-07-25 17:51 EDT by Eric Jones
Modified: 2017-11-03 09:43 EDT
CC: 6 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-11-03 09:43:28 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Description Eric Jones 2017-07-25 17:51:43 EDT
Description of problem:
An application with several replicas that had been running fine suddenly has metrics reporting significantly more CPU than is possible: the node has 8 cores (8,000 millicores), but the app was reported at 74,000 millicores.


Version-Release number of selected component (if applicable):
OpenShift Container Platform 3.3.1.11

Additional info:
Attaching files shortly
Comment 2 Matt Wringe 2017-07-26 13:50:01 EDT
@sross: it looks like Heapster is using 15s for its interval, and I believe at this interval we can sometimes get strange CPU usage results back. Is this something we have seen before, i.e. a very large CPU spike that is clearly nonsense?
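
For reference, one way to confirm the interval Heapster is actually running with (the object name and namespace here are assumptions based on a stock openshift-infra metrics deployment, so adjust as needed):

    # Inspect the Heapster replication controller and look for the resolution
    # flag it was started with.
    oc get rc heapster -n openshift-infra -o yaml | grep -i resolution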
Comment 3 Solly Ross 2017-07-28 15:24:29 EDT
those logs do not look like a healthy Heapster :-/

I'd try switching to an interval of 30s, as well as checking what the summary endpoint says and what happens if you switch to the summary source (`--source=kubernetes.summary_api:...` instead of `--source=kubernetes:...`).
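
Roughly, those two checks would look like this (a sketch only: the node name, port, token handling, and source query parameters are assumptions and will need adjusting for this cluster):

    # Ask the kubelet's summary endpoint directly and compare its CPU numbers
    # with what Heapster/Hawkular report for the same pod.
    curl -sk -H "Authorization: Bearer $TOKEN" https://<node>:10250/stats/summary

    # Heapster pointed at the summary source with a 30s resolution.
    heapster \
      --source=kubernetes.summary_api:https://kubernetes.default.svc?useServiceAccount=true \
      --metric_resolution=30s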

We've seen spikes like that due to bad (non-monotonically increasing) CPU metrics and overflow, or occasionally due to bad metrics coming from Kubelet/cAdvisor, but I thought we'd fixed most of those issues.
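
To make the overflow point concrete, a purely hypothetical example (the numbers are illustrative, not taken from this cluster): if a cumulative CPU usage counter in nanoseconds steps backwards by 1,000,000 ns between two scrapes and the delta is computed as an unsigned 64-bit value, it wraps around to roughly 1.8e19 ns of "usage":

    # Apparent cores over a 15 s window after an unsigned 64-bit wraparound:
    # about 1.2 billion cores, i.e. nonsense rather than real usage.
    echo '(2^64 - 1000000) / (15 * 10^9)' | bc

What actually shows up in a chart then depends on how that bad sample gets aggregated downstream, but the mechanism is the same.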
