Bug 1669718

Summary: Different metrics observed in CPU metrics.
Product: OpenShift Container Platform
Reporter: Saurabh Sadhale <ssadhale>
Component: Monitoring
Assignee: Sergiusz Urbaniak <surbania>
Status: CLOSED DUPLICATE
QA Contact: Junqi Zhao <juzhao>
Severity: high
Priority: high
Version: 4.1.0
CC: anpicker, erich, fbranczy, minden, surbania
Target Milestone: ---
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-04-04 09:12:49 UTC
Type: Bug
Bug Depends On:    
Bug Blocks: 1664187    
Attachments:
  node grafana UI (flags: none)
  oc adm top node output (flags: none)

Comment 3 Junqi Zhao 2019-01-28 09:34:16 UTC
Created attachment 1524171 [details]
node grafana UI

Comment 4 Junqi Zhao 2019-01-28 09:37:06 UTC
Created attachment 1524174 [details]
oc adm top node output

There is no large CPU usage gap in my environment: Grafana reports the node CPU usage as about 10%-11%, and oc adm top node reports 11%.

I think a difference between Grafana and the oc adm top node output is acceptable, as long as the CPU usage gap between the two is not large.
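
For reference, a rough sketch of how the two readings can be compared. The PromQL assumes the standard node_exporter metric node_cpu_seconds_total; the actual Grafana dashboard in a given release may use a different recording rule.

    # Adapter-backed view (same data path as kubectl/oc adm top node):
    oc adm top node

    # Host-level view, roughly what the Grafana node dashboard plots,
    # run against the cluster Prometheus (Graph UI or /api/v1/query):
    (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100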

Comment 5 Frederic Branczyk 2019-02-04 16:33:31 UTC
There may be a slight difference: Grafana currently uses the node exporter metrics, which cover the entire host, whereas the prometheus-adapter (which serves the kubectl top nodes request) currently sums all CPU used by the cgroup hierarchy, so non-cgroup processes are not taken into account. This can cause a very slight inconsistency. I think this is fairly minor, so I'm not sure we'll get to fix it, but I'm adding it for 4.0 and we'll see if we can make it in time.
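
For illustration, the two data paths boil down to queries of roughly the following shape. This is a sketch only; the exact expressions, recording rules, and grouping labels used by cluster-monitoring-operator and prometheus-adapter in this release may differ.

    # Whole-host CPU usage from node_exporter (what the Grafana node
    # dashboard shows), in cores, per node:
    sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))

    # CPU usage summed over the root cgroup hierarchy from cAdvisor/kubelet
    # (roughly what prometheus-adapter aggregates for oc adm top node);
    # processes outside the cgroup hierarchy are not counted here:
    sum by (instance) (rate(container_cpu_usage_seconds_total{id="/"}[5m]))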

Comment 6 Eric Rich 2019-02-18 17:50:47 UTC
This issue is VERY similar to https://bugzilla.redhat.com/show_bug.cgi?id=1456856, which caused a lot of consternation among our customers who used this product to monitor their OpenShift environments and/or applications.

For those of us who understand what is being collected and how, this is easy to explain away; however, if you're expecting one thing and get something different, you are dismayed by the results. You can fix this with docs (by clearly detailing what people will see and why), but to date I have never seen that done well. The reason is not that it can't be documented, or documented well; it's that documenting this is complicated and hard to explain (on paper). In short, we need the tool to do what people expect (and to document that as simply as possible).

Comment 7 Junqi Zhao 2019-02-21 06:41:42 UTC
(In reply to Frederic Branczyk from comment #5)
> There may be a slight difference: Grafana currently uses the node exporter
> metrics, which cover the entire host, whereas the prometheus-adapter (which
> serves the kubectl top nodes request) currently sums all CPU used by the
> cgroup hierarchy, so non-cgroup processes are not taken into account. This
> can cause a very slight inconsistency. I think this is fairly minor, so I'm
> not sure we'll get to fix it, but I'm adding it for 4.0 and we'll see if we
> can make it in time.

I suggest setting the target to 4.0.z if we won't fix it in 4.0.

Comment 8 Frederic Branczyk 2019-02-28 07:47:26 UTC
As mentioned before, this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1669410, which has been fixed by https://github.com/openshift/cluster-monitoring-operator/pull/272. Moving to MODIFIED (but also feel free to mark this as a duplicate, as I just did the same for 1669410).

Comment 19 Frederic Branczyk 2019-04-04 08:35:05 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1669410 has been verified. Moving this to modified, but feel free to close as duplicate.

Comment 20 Junqi Zhao 2019-04-04 09:12:49 UTC

*** This bug has been marked as a duplicate of bug 1669410 ***