Created attachment 1346601 [details]
CPU Utilization panel

Description of problem:
CPU utilization values are reported in a wrong or misleading way. The CPU Utilization panel in the Hosts dashboard contains a chart with only one measured value, labeled with the host it measures. It seems to aggregate all CPU time categories, but the value in this chart does not equal any value reported by the `top` command. Each CPU time category should have its own value in the chart, and these values should be correctly labeled.

Version-Release number of selected component (if applicable):
tendrl-api-httpd-1.5.3-2.el7rhgs.noarch
tendrl-grafana-selinux-1.5.3-2.el7rhgs.noarch
tendrl-selinux-1.5.3-2.el7rhgs.noarch
tendrl-node-agent-1.5.3-3.el7rhgs.noarch
tendrl-ui-1.5.3-2.el7rhgs.noarch
tendrl-grafana-plugins-1.5.3-2.el7rhgs.noarch
tendrl-notifier-1.5.3-1.el7rhgs.noarch
tendrl-ansible-1.5.3-2.el7rhgs.noarch
tendrl-commons-1.5.3-1.el7rhgs.noarch
tendrl-api-1.5.3-2.el7rhgs.noarch
tendrl-monitoring-integration-1.5.3-2.el7rhgs.noarch
glusterfs-3.8.4-50.el7rhgs.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Import a cluster with a volume.
2. On one of the nodes, install the `stress` tool.
3. Run the following command on the monitored machine:
   stress --cpu ${N} --vm 1 --vm-bytes ${M}G
   where N is the number of CPUs of the particular machine, and M is roughly half of the total memory of the machine (in GB). The intended purpose of this command is to:
   * utilize all CPUs on each machine to 100%
   * while having a significant percentage of system CPU cycles
4. Wait for new data to be collected and visualized in Grafana.
5. Check the CPU Utilization panel in the Hosts dashboard.

Actual results:
There is only one reported value in the chart, and it does not reflect CPU utilization correctly.
My CPU utilization reported by `top`:
%Cpu(s): 66.8 us, 33.1 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.2 si, 0.0 st
CPU reported by Grafana: 50%

Expected results:
CPU utilization should be reported correctly for:
user    : time running un-niced user processes
system  : time running kernel processes
nice    : time running niced user processes
idle    : time spent in the kernel idle handler
IO-wait : time waiting for I/O completion

Additional info:
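To make the comparison with the Grafana value easier to repeat, here is a minimal Python sketch that splits the `top` summary line quoted above into its per-state percentages and derives the overall busy share as 100 minus idle. The function name `parse_top_cpu_line` and the variable names are hypothetical helpers for illustration only. They are not part of Tendrl or Grafana, and the sketch assumes the English `top` output format with a dot as the decimal separator.

```python
import re

# The %Cpu(s) summary line exactly as quoted in this report.
TOP_LINE = "%Cpu(s): 66.8 us, 33.1 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.2 si, 0.0 st"


def parse_top_cpu_line(line: str) -> dict:
    """Return a mapping such as {'us': 66.8, 'sy': 33.1, ...} from a top summary line.

    Hypothetical helper: assumes '<number> <two-letter state>' pairs after the colon.
    """
    _, _, fields = line.partition(":")
    pairs = re.findall(r"([0-9]+(?:\.[0-9]+)?)\s+([a-z]{2})", fields)
    return {name: float(value) for value, name in pairs}


cpu = parse_top_cpu_line(TOP_LINE)

# Everything that is not idle counts as busy CPU time.
busy = 100.0 - cpu["id"]

for state in ("us", "sy", "ni", "id", "wa"):
    print(f"{state}: {cpu[state]:.1f}%")
print(f"busy (100 - idle): {busy:.1f}%")
```

With the line from this report, the non-idle share comes out at 100%, while the panel showed 50%.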
What is incorrect here? Do you mean that the current calculation is wrong, or that it is not showing all the values separately on the graph (user, system, nice, etc.)?
All the values should be shown separately. There shouldn't be only one aggregated value for CPU utilization.
I see that CPU utilization is divided into user and system. Are these the only CPU metrics that a user needs to know in relation to Gluster? Is it unnecessary for the Gluster use case to show the nice, idle and IO-wait values?
I also see that the reported metrics are not shown cumulatively (see, for example, https://grafana.com/dashboards/203). In my chart the values shown are user=72% and system=27%. From the chart it is not clear that almost all CPU resources are consumed. The threshold line is also not breached this way.
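The difference between separate and cumulative (stacked) series can be sketched in Python as below, using the user and system values from this comment. The threshold of 80% is an assumption chosen only for illustration, since the panel's actual threshold is not stated in this report. `stack` is a hypothetical helper that mimics what a stacked chart draws, not Grafana code.

```python
# Values read from the chart in this comment.
sample = {"user": 72.0, "system": 27.0}

# Assumed threshold for illustration; the real panel threshold is not given here.
THRESHOLD = 80.0


def stack(series: dict) -> list:
    """Return (name, lower, upper) bands as a stacked chart would draw them."""
    bands = []
    lower = 0.0
    for name, value in series.items():
        upper = lower + value
        bands.append((name, lower, upper))
        lower = upper
    return bands


bands = stack(sample)
stack_top = bands[-1][2]

# Drawn separately, no single line crosses the assumed threshold.
any_single_breach = any(value > THRESHOLD for value in sample.values())
# Drawn stacked, the top edge of the stack does cross it.
stacked_breach = stack_top > THRESHOLD

for name, lower, upper in bands:
    print(f"{name}: {lower:.0f}% to {upper:.0f}%")
print("single series breaches threshold:", any_single_breach)
print("stacked total breaches threshold:", stacked_breach)
```

Under that assumed threshold, neither series crosses the line when plotted separately, but the stacked total of 99% does, which is the visual cue this comment asks for.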
Created attachment 1351625 [details] CPU Utilization panel after modification
@fbalak We are showing only two metrics (user and system), as they are the ones that actually matter to end users or admins.

>> Why are we not showing the cumulative?

Earlier we showed the cumulative/aggregated metric, but we removed it: the CPU has only user, system, nice, idle and IO-wait values, and if we show "total used", the end user can get confused about what the total is showing. So we removed "total used" from the memory and CPU graphs. The threshold is still breached at the cumulative value of user and system, but I have kept that cumulative value hidden. Also, the live example provided by Grafana does not have a cumulative value: http://play.grafana.org/dashboard/db/prometheus-demo-dashboard?refresh=5m&orgId=1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:3478
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days