Created attachment 1524171: node grafana UI
Created attachment 1524174: oc adm top node output

There is no large CPU usage gap in my environment: Grafana reports node CPU usage of about 10%-11%, and oc adm top node reports 11%. I think a small difference between Grafana and the oc adm top node output is acceptable, as long as there is no large gap between the two.
There may be a slight difference because Grafana currently uses the node exporter metrics, which cover the entire host, whereas the prometheus-adapter (which is what serves the kubectl top nodes request) currently uses the sum of all CPU used by the cgroup hierarchy. Non-cgroup processes are therefore not taken into account, which could cause a very slight inconsistency. I think this is fairly minor, so I'm not sure we'll get to fix it, but I'm adding it for 4.0 and we'll see if we can make it in time.
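To illustrate why the two numbers can diverge, here is a minimal sketch of the arithmetic. It is not the actual query used by Grafana or the prometheus-adapter; the function names, the cgroup names, and all of the sample values are hypothetical. It only assumes that the host-wide figure is derived from idle CPU time over a window, while the other figure is a sum of per-cgroup CPU time over the same window.

```python
def host_cpu_fraction(idle_seconds_delta, interval_seconds, cores):
    """Host-wide utilisation: all CPU time in the window that was not idle."""
    capacity = interval_seconds * cores  # total CPU-seconds available
    return 1.0 - idle_seconds_delta / capacity


def cgroup_cpu_fraction(cgroup_usage_deltas, interval_seconds, cores):
    """Utilisation seen when only CPU time charged to cgroups is summed."""
    capacity = interval_seconds * cores
    return sum(cgroup_usage_deltas.values()) / capacity


# Hypothetical samples for a 4-core node over a 60-second window.
INTERVAL = 60.0
CORES = 4
idle_delta = 213.6  # idle CPU-seconds summed over all cores (assumed value)
cgroup_deltas = {   # CPU-seconds charged to each top-level cgroup (assumed)
    "kubepods": 18.0,
    "system.slice": 6.0,
}

host = host_cpu_fraction(idle_delta, INTERVAL, CORES)
cgroups = cgroup_cpu_fraction(cgroup_deltas, INTERVAL, CORES)

# The gap is CPU time that was spent on the host but not charged to any
# of the cgroups being summed.
gap = host - cgroups

print(f"host-wide:  {host:.1%}")
print(f"cgroup sum: {cgroups:.1%}")
print(f"gap:        {gap:.1%}")
```

With these made-up inputs the host-wide figure comes out about one percentage point above the cgroup sum, which is the kind of small discrepancy described above.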
This issue is VERY similar to https://bugzilla.redhat.com/show_bug.cgi?id=1456856, which caused a lot of consternation among our customers who used this product to monitor their OpenShift environment and/or applications. For those of us who understand what is being collected and how, the difference is easy to explain away. However, if you're expecting one thing and get something different, you are dismayed by the results. You can fix this with docs (by clearly detailing what people will see and why), but to date I have never seen this done well. That is not because it can't be documented, or documented well; it's that documenting this is complicated and hard to explain (on paper). In short, we need the tool to do what people expect (and to document that as simply as possible).
(In reply to Frederic Branczyk from comment #5)
> There may be a slight difference as Grafana currently uses the node exporter
> metrics, which are about the entire host, whereas the prometheus-adapter
> (which is what serves the kubectl top nodes request) currently uses the sum
> of all cpu used by the cgroup hierarchy, which means non-cgroup processes
> are not taken into account, which could cause a very slight inconsistency. I
> think this is fairly minor so I'm not sure we'll get to fix this, but adding
> it for 4.0 and we'll see if we can make it in time.

I suggest setting the target to 4.0.z if we won't fix it in 4.0.
As mentioned before, this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1669410, which has been fixed as of https://github.com/openshift/cluster-monitoring-operator/pull/272. Moving to modified (but feel free to mark this as a duplicate, as I just did the same for 1669410).
https://bugzilla.redhat.com/show_bug.cgi?id=1669410 has been verified. Moving this to modified, but feel free to close as duplicate.
*** This bug has been marked as a duplicate of bug 1669410 ***