Description of problem
======================

Panel "CPU Utilization by Host" on the Cluster dashboard (a table with a single
CPU utilization value for each host) reports CPU utilization in a wrong and
misleading way. The reported value breaks common expectations and Red Hat
recommendations (see https://access.redhat.com/solutions/2908361).

Version-Release number of selected component
============================================

tendrl-monitoring-integration-1.6.3-7.el7rhgs.noarch

Steps to Reproduce
==================

1. Install RHGS WA using tendrl-ansible.
2. Import Trusted storage pool.
3. Select one storage machine from the pool and install the stress tool there.
4. Run the following command on the selected storage machine:

       stress --cpu ${N} --vm 1 --vm-bytes ${M}G

   where N is the number of CPUs of the particular machine, and M is roughly
   half of the total memory of the machine (in GB).

   The intended purpose of this command is to:

   * utilize all CPUs on the machine at 100 %
   * while spending a significant percentage of the cycles as system cpu

   Note that for this trick to work, you need swap space on the machine.

   Leave it running while you go on.

5. On the same storage machine, also run the following tools:

   * top
   * sar | tail -n2 | awk '/all/&&!/Average/ { print $1,$2, 100 - $NF}'

6. Go to the Cluster dashboard and, in the Top Consumers section, check the
   "CPU Utilization by Host" panel.

Actual results
==============

For the machine where the stress tool is running, the CPU utilization value in
the "CPU Utilization by Host" panel is significantly less than 100 %, e.g.
about 70 % in my case.

This is because only the "user" part of CPU utilization is reported, ignoring
the rest (especially the "system" part). This can be checked with the commands
from step #5:

* top reports: about 70 % us, about 30 % sy, and zeroes for the others
* sar command reports 100 %

Expected results
================

The expected CPU utilization value provided by the panel is about 100 % for the
affected storage machine.
Additional info
===============

See the Red Hat KB article on the subject: "How to measure cpu usage with a
single value or metric to be used with monitoring and alerting tools?"
Available at:

https://access.redhat.com/solutions/2908361

A related issue was found during the previous testing phase, but it was
resolved in a different way, as there was no need to limit the chart to a
single value:

https://bugzilla.redhat.com/show_bug.cgi?id=1508520
Created attachment 1474788
screenshot 1: terminal with running stress, top and sar on the affected machine next to the CPU Utilization by Host panel from the dashboard
Note that any cpu cycles spent by filesystem kernel threads are reported as system cpu utilization. For this reason, the choice to completely ignore this part of cpu utilization when monitoring storage machines looks like a bad decision, even without taking the common expectations into account.
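To make the difference between the two calculations concrete, here is a minimal sketch in Python. It is not the code used by tendrl or collectd; it only illustrates, on counters in the format of the aggregate "cpu" line of /proc/stat, how a "user only" value compares with the "100 - idle" value that the sar command from step 5 prints. The function names and the sample counter values are made up for illustration; the sample values were chosen to mirror the 70 % us / 30 % sy split seen in top.

```python
# Field order of the aggregate "cpu" line in /proc/stat (see proc(5)).
# The trailing guest fields are skipped: they are already counted in user/nice.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(v) for v in parts[1:1 + len(FIELDS)])))


def utilization(before, after):
    """Return (user_only_pct, busy_pct) for the interval between two samples.

    user_only_pct counts only the 'user' ticks (what the panel appears to show).
    busy_pct is 100 - idle, i.e. everything except idle time.
    """
    delta = {name: after[name] - before[name] for name in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    user_only_pct = 100.0 * delta["user"] / total
    busy_pct = 100.0 * (total - delta["idle"]) / total
    return user_only_pct, busy_pct


# Hypothetical samples: 700 user ticks and 300 system ticks elapsed, no idle time.
before = parse_cpu_line("cpu 1000 0 500 8000 0 0 0 0")
after = parse_cpu_line("cpu 1700 0 800 8000 0 0 0 0")
user_only_pct, busy_pct = utilization(before, after)
print(f"user only: {user_only_pct:.1f} %, 100 - idle: {busy_pct:.1f} %")
# prints "user only: 70.0 %, 100 - idle: 100.0 %"
```

On a real machine, the two samples would be read from the first line of /proc/stat a few seconds apart. Like the sar command, the "100 - idle" value here also counts iowait, irq, softirq and steal time as utilization.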
Full list of WA packages on storage machine for reference:

tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-9.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-7.el7rhgs.noarch
tendrl-node-agent-1.6.3-9.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64