Bug 1669718 - Different metrics observed in CPU metrics.
Summary: Different metrics observed in CPU metrics.
Keywords:
Status: CLOSED DUPLICATE of bug 1669410
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.1.0
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks: 1664187
 
Reported: 2019-01-26 10:42 UTC by Saurabh Sadhale
Modified: 2019-05-06 13:00 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-04 09:12:49 UTC
Target Upstream Version:
Embargoed:


Attachments
node grafana UI (134.88 KB, image/png)
2019-01-28 09:34 UTC, Junqi Zhao
oc adm top node output (37.11 KB, image/png)
2019-01-28 09:37 UTC, Junqi Zhao

Comment 3 Junqi Zhao 2019-01-28 09:34:16 UTC
Created attachment 1524171 [details]
node grafana UI

Comment 4 Junqi Zhao 2019-01-28 09:37:06 UTC
Created attachment 1524174 [details]
oc adm top node output

There is no large CPU usage gap in my environment: Grafana reports node CPU usage of about 10%-11%, and oc adm top node reports 11%.

I think a small difference between Grafana and the oc adm top node output is acceptable, as long as there is no large CPU usage gap between the two.
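
For reference, the two readings can be cross-checked directly. A minimal sketch (assuming the cluster's metrics.k8s.io API is served by prometheus-adapter, as it is on OCP 4.x):

    # Node CPU/memory as aggregated by prometheus-adapter (what "oc adm top node" prints)
    oc adm top node

    # The same data from the underlying Metrics API, per node
    oc get --raw "/apis/metrics.k8s.io/v1beta1/nodes"

The Grafana node dashboard, by contrast, is driven by Prometheus queries over node_exporter data, so the two figures come from different sources even though they describe the same node.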

Comment 5 Frederic Branczyk 2019-02-04 16:33:31 UTC
There may be a slight difference: Grafana currently uses the node_exporter metrics, which cover the entire host, whereas the prometheus-adapter (which serves the kubectl top nodes request) currently uses the sum of all CPU used by the cgroup hierarchy, so processes outside any cgroup are not taken into account. That can cause a very slight inconsistency. I think this is fairly minor, so I'm not sure we'll get to fix it, but I'm adding it for 4.0 and we'll see if we can make it in time.
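
To illustrate the two data paths with PromQL (illustrative queries only; the actual Grafana dashboard expressions and the prometheus-adapter configuration may differ):

    # Whole-host CPU utilisation from node_exporter, roughly what the Grafana node dashboard plots
    1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

    # CPU usage of the root of the cgroup hierarchy (cAdvisor/kubelet metrics), roughly the basis
    # for what prometheus-adapter serves to "kubectl top nodes"; anything running outside the
    # cgroup hierarchy is not counted here
    sum by (instance) (rate(container_cpu_usage_seconds_total{id="/"}[5m]))

The first expression accounts for every process on the host; the second only for what cgroup accounting sees, which is where the small gap comes from.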

Comment 6 Eric Rich 2019-02-18 17:50:47 UTC
This issue is VERY similar to https://bugzilla.redhat.com/show_bug.cgi?id=1456856, which caused a lot of consternation among customers who used this product to monitor their OpenShift environment and/or applications.

For those of us who understand what is being collected and how, the difference is easy to explain away; however, if you're expecting one thing and get something different, you are dismayed by the results. You can address this with docs (by clearly detailing what people will see and why), but to date I have never seen this done well. That is not because it can't be documented, or documented well; it's that documenting it this way is complicated and hard to explain (on paper). In short, we need the tool to do what people expect (and to document that as simply as possible).

Comment 7 Junqi Zhao 2019-02-21 06:41:42 UTC
(In reply to Frederic Branczyk from comment #5)
> There may be a slight difference as Grafana currently uses the node exporter
> metrics, which are about the entire host, whereas the prometheus-adapter
> (which is what serves the kubectl top nodes request) currently uses the sum
> of all cpu used by the cgroup hierarchy, which means non-cgroup processes
> are not taken into account, which could cause a very slight inconsistency. I
> think this is fairly minor so I'm not sure we'll get to fix this, but adding
> it for 4.0 and we'll see if we can make it in time.

I suggest setting the target to 4.0.z if we won't fix it in 4.0.

Comment 8 Frederic Branczyk 2019-02-28 07:47:26 UTC
As mentioned before, this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1669410, which has been fixed by https://github.com/openshift/cluster-monitoring-operator/pull/272. Moving to MODIFIED (but feel free to mark this as a duplicate, as I just did the same for 1669410).

Comment 19 Frederic Branczyk 2019-04-04 08:35:05 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1669410 has been verified. Moving this to modified, but feel free to close as duplicate.

Comment 20 Junqi Zhao 2019-04-04 09:12:49 UTC

*** This bug has been marked as a duplicate of bug 1669410 ***

