Description of problem
======================

RHSC 2.0 reports cpu utilization values in a wrong/misleading way.

The cpu utilization value reported by the console comes directly from collectd's
user cpu utilization, which means that system cpu utilization is completely
omitted. Bear in mind that this means, for example, that any filesystem related
cpu utilization is not reported (because filesystem code runs in kernel space).

Version-Release
===============

On RHSC 2.0 server machine:

rhscon-ui-0.0.48-1.el7scon.noarch
rhscon-core-selinux-0.0.34-1.el7scon.noarch
rhscon-ceph-0.0.33-1.el7scon.x86_64
rhscon-core-0.0.34-1.el7scon.x86_64
ceph-installer-1.0.14-1.el7scon.noarch
ceph-ansible-1.0.5-28.el7scon.noarch

On Ceph machines:

rhscon-core-selinux-0.0.34-1.el7scon.noarch
rhscon-agent-0.0.15-1.el7scon.noarch
ceph-selinux-10.2.2-22.el7cp.x86_64
ceph-common-10.2.2-22.el7cp.x86_64
calamari-server-1.4.6-1.el7cp.x86_64

How reproducible
================

100 %

Steps to Reproduce
==================

1. Install RHSC 2.0 following the documentation.
2. Accept a few nodes for the ceph cluster.
3. Create a new ceph cluster named 'alpha'.
4. On all ceph machines (of either OSD or MON role), install the stress tool.
5. Run the following command on all ceph machines:

   stress --cpu ${N} --vm 1 --vm-bytes ${M}G

   where N is the number of cpus of the particular machine, and M is roughly
   half of the total memory of the machine (in GB).

   The intended purpose of this command is to:

   * utilize all cpus on each machine to 100%
   * while having a significant percentage of system cpu cycles

6. Wait about 15 minutes for collectd and RHSC 2.0 to process and report the
   new utilization state.
7. Check the cpu utilization reported by RHSC 2.0 (there are multiple places
   with this information).
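
For convenience, the values of N and M from step 5 can be derived on each machine instead of being filled in by hand. The following is only a minimal sketch: the helper name `build_stress_command` is hypothetical, and rounding M down to whole GB (with a minimum of 1) is an assumption about what "roughly half" means.

```python
import os


def build_stress_command(n_cpus, total_mem_bytes):
    """Return the stress invocation from step 5 as an argument list."""
    # M: roughly half of total memory in whole GB (assumption: round down,
    # but never go below 1 GB so the --vm worker still allocates something).
    half_mem_gb = max(1, total_mem_bytes // (2 * 1024 ** 3))
    return ["stress", "--cpu", str(n_cpus),
            "--vm", "1", "--vm-bytes", "{}G".format(half_mem_gb)]


if __name__ == "__main__":
    # N: number of cpus visible to this machine.
    n_cpus = os.cpu_count() or 1
    # Total physical memory in bytes (these sysconf names exist on Linux).
    total_mem = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    print(" ".join(build_stress_command(n_cpus, total_mem)))
```

The script only prints the command, so it can be reviewed before it is actually run on a node.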
Actual results
==============

We expect cpu utilization to be 100% on all machines of the cluster, which can
be checked by running the `top` tool there:

~~~
%Cpu(s): 71.7 us, 28.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
~~~

In this example, the cpu is 100% utilized (71.7% by user space and 28.3% by
system), so the assumption is correct.

But when we look at the Hosts list and check cpu utilization there, we see
different values (see screenshot #1), such as:

* 68.2 %
* 69.0 %
* 67.3 %

What is going on here? It seems that instead of total cpu utilization, only
user space utilization is reported.

The same issue applies everywhere cpu utilization is reported, e.g. on the main
dashboard, in the system performance widget, I see a total cpu utilization of
68.7 % (see screenshot #2).

Expected results
================

CPU utilization should be reported as 100% everywhere.
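
The manual check with `top` described above can also be scripted. This is a rough sketch, not part of RHSC: the function names are hypothetical, and treating everything except idle (`id`) and I/O wait (`wa`) as "utilized" is an assumption about how total utilization should be defined.

```python
import re


def parse_top_cpu_line(line):
    """Parse a '%Cpu(s):' summary line from top into {field: percent}."""
    # Each field looks like '71.7 us'; capture the number and the 2-letter tag.
    pairs = re.findall(r"([\d.]+)\s+([a-z]{2})", line)
    return {name: float(value) for value, name in pairs}


def total_utilization(fields):
    """Total busy percentage (assumption: idle and iowait are not busy)."""
    return 100.0 - fields["id"] - fields["wa"]


if __name__ == "__main__":
    sample = ("%Cpu(s): 71.7 us, 28.3 sy, 0.0 ni, 0.0 id, "
              "0.0 wa, 0.0 hi, 0.0 si, 0.0 st")
    fields = parse_top_cpu_line(sample)
    print("user only:", fields["us"])
    print("total:    ", total_utilization(fields))
```

For the `top` line quoted in this report, the user-only value is 71.7 while the total is 100.0, which is the gap this bug is about.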
Created attachment 1182211 [details] screenshot 1: host list page (see cpu utilization charts there)
Created attachment 1182212 [details] screenshot 2: cpu utilization as reported on main dashboard
If you stress the machine to 100%, other processes might not be getting cpu time, which might be the reason why pushing the stats to collectd is failing. One option to verify this is to keep the utilization below 100% and try again. Also note that the plugin which collects and reports CPU utilization is a collectd plugin and not something written by us. Could you please try the above and let us know?
(In reply to Nishanth Thomas from comment #4)
> If you stress the machine 100%, other processes might not be getting cpu
> time that might be reason why pushing the stats to collectd is failing. One
> option to verify this is to not raise the utilization to 100% keep it less
> than 100% and try.

This is not the case. All data are properly logged and aggregated by collectd,
as can be seen on screenshot 3 (graphite generated chart of user and system cpu
utilization on one host during my stress testing).

I would suggest rereading the whole description of this BZ; I tried to be very
clear in pointing out where the issue is.

> Also note that the plugin which collects and reports CPU
> utilization is a collectd plugin and not something written by us.

I agree that the component which monitors and collects cpu utilization was not
written by you, since you are using collectd for this. I also know that
collectd logs multiple cpu utilization values (percent-idle, percent-interrupt,
percent-nice, percent-softirq, percent-steal, percent-system, percent-user,
percent-wait).

The problem I see is different: RHSC 2.0 uses just the `percent-user` values
and ignores `percent-system`, and I'm afraid that this could be misleading.
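
To illustrate the difference, here is a small sketch of how a total could be derived from the collectd values listed above. It is not RHSC code: the function name is hypothetical, the sample numbers are made up for illustration, and leaving out `percent-wait` along with `percent-idle` is an assumption about what should count as busy time.

```python
# collectd cpu values that count as busy time in this sketch
# (assumption: idle and wait are excluded).
BUSY_METRICS = (
    "percent-interrupt",
    "percent-nice",
    "percent-softirq",
    "percent-steal",
    "percent-system",
    "percent-user",
)


def total_cpu_utilization(sample):
    """Sum all busy percent-* values; missing metrics count as 0."""
    return sum(sample.get(name, 0.0) for name in BUSY_METRICS)


if __name__ == "__main__":
    # Made-up illustrative sample, not measured data.
    sample = {
        "percent-user": 70.0,
        "percent-system": 30.0,
        "percent-idle": 0.0,
        "percent-wait": 0.0,
    }
    print("percent-user only:", sample["percent-user"])
    print("total utilization:", total_cpu_utilization(sample))
```

With these illustrative numbers, reporting `percent-user` alone shows 70 while the host is in fact fully busy.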
Created attachment 1182548 [details] screenshot 3: graphite generated chart of user and system cpu utilization during stress testing Attaching screenshot 3 referenced in a previous comment.
As per the program meeting, it was decided to move this to 3.0.
Tested with

Server:
ceph-ansible-1.0.5-32.el7scon.noarch
ceph-installer-1.0.15-2.el7scon.noarch
graphite-web-0.9.12-8.1.el7.noarch
rhscon-ceph-0.0.41-1.el7scon.x86_64
rhscon-core-selinux-0.0.42-1.el7scon.noarch
rhscon-core-0.0.42-1.el7scon.x86_64
rhscon-ui-0.0.55-1.el7scon.noarch

Node:
calamari-server-1.4.8-1.el7cp.x86_64
ceph-base-10.2.2-41.el7cp.x86_64
ceph-common-10.2.2-41.el7cp.x86_64
ceph-mon-10.2.2-41.el7cp.x86_64
ceph-osd-10.2.2-41.el7cp.x86_64
ceph-selinux-10.2.2-41.el7cp.x86_64
libcephfs1-10.2.2-41.el7cp.x86_64
python-cephfs-10.2.2-41.el7cp.x86_64
rhscon-agent-0.0.19-1.el7scon.noarch
rhscon-core-selinux-0.0.42-1.el7scon.noarch

and it works as expected.
--> Verified
Looks good to me
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2016:2082