Bug 1508520 - CPU Utilization values are wrong [NEEDINFO]
Summary: CPU Utilization values are wrong
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: web-admin-tendrl-monitoring-integration
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Ankush Behl
QA Contact: Martin Kudlej
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-11-01 15:33 UTC by Filip Balák
Modified: 2018-08-09 18:08 UTC (History)

Fixed In Version: tendrl-monitoring-integration-1.5.4-3.el7rhgs
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-18 04:39:36 UTC
Target Upstream Version:
fbalak: needinfo? (nthomas)


Attachments
CPU Utilization panel (48.16 KB, image/png)
2017-11-01 15:33 UTC, Filip Balák
CPU Utilization panel after modification (55.85 KB, image/png)
2017-11-13 15:57 UTC, Filip Balák


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2017:3478 normal SHIPPED_LIVE RHGS Web Administration packages 2017-12-18 09:34:49 UTC
Github Tendrl monitoring-integration issues 27 None None None 2017-11-01 15:33:23 UTC
Github https://github.com/Tendrl monitoring-integration issues 229 None None None 2017-11-07 07:08:44 UTC
Red Hat Bugzilla 1358461 None None None Never
Red Hat Bugzilla 1614486 None None None Never

Internal Links: 1358461 1614486

Description Filip Balák 2017-11-01 15:33:23 UTC
Created attachment 1346601 [details]
CPU Utilization panel

Description of problem:
CPU utilization values are reported in a wrong or misleading way. The CPU Utilization panel in the Hosts dashboard contains a chart with only one measured value, labeled with the host it measures. It seems to aggregate all CPU time categories, but the value in the chart does not match any value reported by the `top` command. Each CPU time category should have its own value in the chart, and these values should be correctly labeled.

Version-Release number of selected component (if applicable):
tendrl-api-httpd-1.5.3-2.el7rhgs.noarch
tendrl-grafana-selinux-1.5.3-2.el7rhgs.noarch
tendrl-selinux-1.5.3-2.el7rhgs.noarch
tendrl-node-agent-1.5.3-3.el7rhgs.noarch
tendrl-ui-1.5.3-2.el7rhgs.noarch
tendrl-grafana-plugins-1.5.3-2.el7rhgs.noarch
tendrl-notifier-1.5.3-1.el7rhgs.noarch
tendrl-ansible-1.5.3-2.el7rhgs.noarch
tendrl-commons-1.5.3-1.el7rhgs.noarch
tendrl-api-1.5.3-2.el7rhgs.noarch
tendrl-monitoring-integration-1.5.3-2.el7rhgs.noarch
glusterfs-3.8.4-50.el7rhgs.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Import a cluster with a volume.
2. Install the `stress` tool on one of the nodes.
3. Run the following command on the monitored machine:

   stress --cpu ${N} --vm 1 --vm-bytes ${M}G

   where N is the number of CPUs on the particular machine,
   and M is roughly half of the total memory of the machine (in GB).

   The intended purpose of this command is to:

   * drive all CPUs on the machine to 100% utilization
   * while spending a significant percentage of CPU cycles in system (kernel) time

4. Wait for new data to be collected and visualized in Grafana.
5. Check CPU Utilization panel in Hosts dashboard.
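
The choice of N and M in step 3 can be scripted. The following is a minimal sketch, not part of the original report: the helper names `mem_total_gib` and `stress_command` are hypothetical, and the sample `MemTotal` line is made up for illustration. It only builds the argument list for the `stress` invocation shown above and does not run it.

```python
def mem_total_gib(meminfo_text):
    """Extract MemTotal from /proc/meminfo-style text and convert kB to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])
            return kib / (1024 * 1024)
    raise ValueError("MemTotal not found")


def stress_command(cpu_count, total_gib):
    """Build the stress command from step 3: N = CPU count, M = about half of memory."""
    half = max(1, round(total_gib / 2))
    return ["stress", "--cpu", str(cpu_count),
            "--vm", "1", "--vm-bytes", f"{half}G"]


# Made-up sample input; on a real node, read /proc/meminfo and use os.cpu_count().
sample_meminfo = "MemTotal:        8388608 kB\nMemFree:         1234567 kB\n"
command = stress_command(4, mem_total_gib(sample_meminfo))
print(" ".join(command))
```

On a real node the list could be passed to `subprocess.run`, with the CPU count taken from `os.cpu_count()` and the memory text read from `/proc/meminfo`.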

Actual results:
The chart shows only one reported value, and it does not reflect CPU utilization correctly.

My CPU utilization reported by `top`:
%Cpu(s): 66.8 us, 33.1 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st

CPU reported by Grafana: 50%
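
To make the mismatch concrete, the `top` line above can be parsed and summed. This is a minimal sketch for illustration only; the function name `parse_top_cpu_line` is hypothetical, and nothing here reflects how the monitoring integration actually computes its value.

```python
import re


def parse_top_cpu_line(line):
    """Parse a `top` summary line ('%Cpu(s): 66.8 us, 33.1 sy, ...') into a dict."""
    fields = re.findall(r"([\d.]+)\s+([a-z]{2})", line)
    return {name: float(value) for value, name in fields}


top_line = ("%Cpu(s): 66.8 us, 33.1 sy,  0.0 ni,  0.0 id,  "
            "0.0 wa,  0.0 hi,  0.2 si,  0.0 st")
cpu = parse_top_cpu_line(top_line)

busy = 100.0 - cpu["id"]                  # everything that is not idle
user_plus_system = cpu["us"] + cpu["sy"]  # the two largest categories

print(f"busy (100 - idle): {busy:.1f}%")
print(f"user + system:     {user_plus_system:.1f}%")
```

With no idle time, the non-idle share is 100% and user plus system alone is about 99.9%, so the single 50% value in Grafana matches neither the total nor any individual category from `top`.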


Expected results:
CPU utilization should be reported correctly, with a separate value for each of:

user    : time running un-niced user processes
system  : time running kernel processes
nice    : time running niced user processes
idle    : time spent in the kernel idle handler
IO-wait : time waiting for I/O completion
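
For reference, per-category percentages like these can be derived from two samples of the aggregate `cpu` line in `/proc/stat`. The sketch below assumes the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal); the two sample lines are synthetic, and the function names are hypothetical. It is not the code used by tendrl-monitoring-integration.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into named tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(v) for v in parts[1:1 + len(FIELDS)])))


def cpu_percentages(before, after):
    """Return each category's share of the ticks elapsed between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


# Synthetic samples (made-up counters), chosen to resemble the load in this report.
sample_1 = "cpu 1000 0 500 8000 100 0 0 0 0 0"
sample_2 = "cpu 1668 0 831 8000 100 0 1 0 0 0"

pct = cpu_percentages(parse_cpu_line(sample_1), parse_cpu_line(sample_2))
for name in ("user", "system", "nice", "idle", "iowait"):
    print(f"{name:7s}: {pct[name]:5.1f}%")
```

Because every category is a share of the same tick total, the values always sum to 100%, which is what makes a stacked per-category chart readable.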

Additional info:

Comment 1 Nishanth Thomas 2017-11-06 17:47:50 UTC
What is incorrect here?
Do you mean that the current calculation is wrong, or that it is not showing all the values (user, system, nice, etc.) separately on the graph?

Comment 2 Filip Balák 2017-11-07 12:07:25 UTC
All the values should be shown separately. There should not be only one aggregated value for CPU utilization.

Comment 3 Filip Balák 2017-11-13 15:48:10 UTC
I see that CPU utilization is now divided into user and system. Are these the only CPU metrics that a user needs to know about in relation to Gluster? Is it unnecessary for the Gluster use case to show the nice, idle and IO-wait values?

Comment 4 Filip Balák 2017-11-13 15:56:39 UTC
I also see that the reported metrics are not shown cumulatively (compare, for example, https://grafana.com/dashboards/203). My chart shows the values user=72% and system=27%. From the chart it is not clear that almost all CPU resources are consumed. The threshold line is also not breached this way.
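
The difference between the two presentations can be sketched as follows. The 80% threshold is a hypothetical value chosen for illustration (the report does not state the configured threshold), and `stack_series` is a made-up helper, not a Grafana API.

```python
def stack_series(series):
    """Return the cumulative upper edge of each series, as a stacked chart draws it."""
    stacked = {}
    running = 0.0
    for name, value in series:
        running += value
        stacked[name] = running
    return stacked


values = [("user", 72.0), ("system", 27.0)]  # values quoted in this comment
THRESHOLD = 80.0                             # hypothetical threshold, not from the report

stacked = stack_series(values)
unstacked_breach = any(value > THRESHOLD for _, value in values)
stacked_breach = stacked["system"] > THRESHOLD

print(f"top of stacked chart: {stacked['system']:.0f}%")
print(f"threshold crossed when drawn separately: {unstacked_breach}")
print(f"threshold crossed when stacked:          {stacked_breach}")
```

Drawn separately, neither series reaches the assumed threshold, while the stacked total of 99% does, which is the readability problem described here.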

Comment 5 Filip Balák 2017-11-13 15:57:40 UTC
Created attachment 1351625 [details]
CPU Utilization panel after modification

Comment 7 Ankush Behl 2017-11-18 06:18:15 UTC
@fbalak@redhat.com We are showing only two metrics (user and system) because they are the ones that actually matter to end users and admins.

>> Why are we not showing the cumulative?

Earlier we showed the cumulative (aggregated) metric, but we removed it. CPU time is broken down into user, system, nice, idle and IO-wait, so a "total used" series could confuse the end user about what it actually covers. We therefore removed "total used" from both the memory and CPU graphs. The threshold is still breached at the cumulative value of user and system; I have only kept that cumulative series hidden.

Also, the live example provided by Grafana does not show a cumulative value either: http://play.grafana.org/dashboard/db/prometheus-demo-dashboard?refresh=5m&orgId=1

Comment 11 errata-xmlrpc 2017-12-18 04:39:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3478

