Bug 1358461

Summary: cpu utilization values reported by RHSC 2.0 are wrong
Product: [Red Hat Storage] Red Hat Storage Console
Reporter: Martin Bukatovic <mbukatov>
Component: core
Assignee: Nishanth Thomas <nthomas>
core sub component: monitoring
QA Contact: sds-qe-bugs
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: unspecified
CC: anbabu, fbalak, kchidamb, mbukatov, mkudlej, rghatvis, vsarmila
Version: 2
Target Milestone: ---
Target Release: 2
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: rhscon-core-0.0.42-1.el7scon.x86_64, rhscon-ui-0.0.55-1.el7scon.noarch
Doc Type: Bug Fix
Doc Text:
Previously, the CPU utilization chart displayed only the user processes CPU utilization and omitted system CPU utilization. With this update, the CPU utilization chart displays the combined user and system CPU utilization percentage.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-10-19 15:20:40 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1357777
Attachments:
screenshot 1: host list page (see cpu utilization charts there) (flags: none)
screenshot 2: cpu utilization as reported on main dashboard (flags: none)
screenshot 3: graphite generated chart of user and system cpu utilization during stress testing (flags: none)

Description Martin Bukatovic 2016-07-20 18:21:49 UTC
Description of problem
======================

RHSC 2.0 reports cpu utilization values in a wrong/misleading way.

The CPU utilization value reported by the console comes directly from collectd's
user cpu utilization, which means that system cpu utilization is completely
omitted.

Bear in mind that this means, for example, that any filesystem-related cpu
utilization is not reported (because filesystem code runs in kernel
space).

Version-Release
===============

On RHSC 2.0 server machine:

rhscon-ui-0.0.48-1.el7scon.noarch
rhscon-core-selinux-0.0.34-1.el7scon.noarch
rhscon-ceph-0.0.33-1.el7scon.x86_64
rhscon-core-0.0.34-1.el7scon.x86_64
ceph-installer-1.0.14-1.el7scon.noarch
ceph-ansible-1.0.5-28.el7scon.noarch

On Ceph machines:

rhscon-core-selinux-0.0.34-1.el7scon.noarch
rhscon-agent-0.0.15-1.el7scon.noarch
ceph-selinux-10.2.2-22.el7cp.x86_64
ceph-common-10.2.2-22.el7cp.x86_64
calamari-server-1.4.6-1.el7cp.x86_64

How reproducible
================

100 %

Steps to Reproduce
==================

1. Install RHSC 2.0 following the documentation.
2. Accept a few nodes for the ceph cluster.
3. Create new ceph cluster named 'alpha'.
4. On all ceph machines (of either OSD or MON roles), install stress tool.
5. Run the following command on all ceph machines:

   stress --cpu ${N} --vm 1 --vm-bytes ${M}G

   where N is the number of cpus of the particular machine,
   and M is roughly half of the total memory of the machine (in GB).

   The intended purpose of this command is to:

   * utilize all cpus on each machine to 100%
   * while having a significant percentage of system cpu cycles

6. Wait about 15 minutes for collectd and RHSC 2.0 to process and report the new
   utilization state.
7. Check cpu utilization reported by RHSC 2.0 (there are multiple places
   with this information).
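
The values of N and M from step 5 can be derived from the machine itself. The following is a minimal Python sketch, assuming a Linux host where `/proc/meminfo` is available; the helper names `read_mem_total_kib` and `build_stress_command` are invented for this illustration and are not part of RHSC or the stress tool.

```python
def read_mem_total_kib(path="/proc/meminfo"):
    """Return MemTotal in KiB as reported by the Linux kernel."""
    with open(path) as meminfo:
        for line in meminfo:
            if line.startswith("MemTotal:"):
                return int(line.split()[1])
    raise RuntimeError("MemTotal not found in " + path)


def build_stress_command(cpu_count, mem_total_kib):
    """Build the stress command line from step 5.

    N is the number of cpus; M is roughly half of the total memory in GB
    (rounded down, but never less than 1).
    """
    half_mem_gib = max(1, mem_total_kib // (1024 * 1024) // 2)
    return "stress --cpu {} --vm 1 --vm-bytes {}G".format(cpu_count, half_mem_gib)


# Example for a hypothetical machine with 4 cpus and 16 GiB of memory.
print(build_stress_command(4, 16 * 1024 * 1024))
```

On a real machine, the arguments could come from `os.cpu_count()` and `read_mem_total_kib()`.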

Actual results
==============

We expect cpu utilization to be 100% on all machines of the cluster, which
can be checked by running the `top` tool there:

~~~
%Cpu(s): 71.7 us, 28.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
~~~

In this example, cpu is 100% utilized (71.7 by user space, and 28.3 by system).

So the assumption is correct.
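
The arithmetic above can be reproduced with a short Python sketch. It assumes the `%Cpu(s)` line has the same layout as the sample shown here; the names `parse_top_cpu_line` and `busy_percent` are invented for this illustration.

```python
import re

# Matches pairs such as "71.7 us" in the %Cpu(s) summary line of top.
FIELD_RE = re.compile(r"(\d+(?:\.\d+)?)\s+([a-z]{2})\b")


def parse_top_cpu_line(line):
    """Map each two-letter state in a top %Cpu(s) line to its percentage."""
    return {state: float(value) for value, state in FIELD_RE.findall(line)}


def busy_percent(states):
    """Combined user and system utilization, the figure this report expects."""
    return states["us"] + states["sy"]


line = "%Cpu(s): 71.7 us, 28.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st"
states = parse_top_cpu_line(line)
print("user only:", states["us"])
print("user + system:", round(busy_percent(states), 1))
```

Reading only the `us` field gives a value close to what the console shows, while adding `sy` gives the total seen in `top`.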

But when we look at the Hosts list and check cpu utilization there, we see
different values (see screenshot #1), such as:

 * 68.2 %
 * 69.0 %
 * 67.3 %

What is going on here? It seems that instead of total cpu utilization, only
user space utilization is reported.

The same issue applies everywhere cpu utilization is reported, e.g. on the main
dashboard, in the system performance widget, I see total cpu utilization of
68.7 % (see screenshot #2).

Expected results
================

CPU utilization should be reported as 100% everywhere.

Comment 1 Martin Bukatovic 2016-07-20 18:23:52 UTC
Created attachment 1182211 [details]
screenshot 1: host list page (see cpu utilization charts there)

Comment 2 Martin Bukatovic 2016-07-20 18:24:17 UTC
Created attachment 1182212 [details]
screenshot 2: cpu utilization as reported on main dashboard

Comment 4 Nishanth Thomas 2016-07-21 11:58:50 UTC
If you stress the machine to 100%, other processes might not get cpu time, which might be the reason why pushing the stats to collectd is failing. One option to verify this is to keep the utilization below 100% and try again. Also note that the plugin which collects and reports CPU utilization is a collectd plugin and not something written by us. Could you please try the above and let us know?

Comment 5 Martin Bukatovic 2016-07-21 14:46:11 UTC
(In reply to Nishanth Thomas from comment #4)
> If you stress the machine to 100%, other processes might not get cpu
> time, which might be the reason why pushing the stats to collectd is failing.
> One option to verify this is to keep the utilization below 100% and try
> again.

This is not the case, all data are properly logged and aggregated by collectd,
as can be seen on screenshot 3 (graphite generated chart for user and system
cpu utilization on one host during my stress testing).

I would suggest rereading the whole description of this BZ; I tried
to be very clear in pointing out where the issue is.

> Also note that the plugin which collects and reports CPU
> utilization is a collectd plugin and not something written by us.

I agree that the component which monitors and collects cpu utilization was not
written by you - you are using collectd for this. I also know that collectd
logs multiple cpu utilization values (percent-idle, percent-interrupt,
percent-nice, percent-softirq, percent-steal, percent-system, percent-user,
percent-wait).

The problem I see is different: RHSC 2.0 uses only `percent-user` values, ignoring
`percent-system`, and I'm afraid that this could be misleading.
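
To illustrate the difference, here is a minimal Python sketch of combining the two series. It is an illustration only and not RHSC code: it assumes both series are available as lists of `[value, timestamp]` pairs (the shape Graphite uses for datapoints in its JSON render output), and the function name `combine_user_and_system` as well as the sample numbers are invented for this example.

```python
def combine_user_and_system(user_points, system_points):
    """Sum percent-user and percent-system datapoints that share a timestamp.

    Each input is a list of [value, timestamp] pairs; a value of None
    (a gap in the series) makes the combined value None as well.
    """
    system_by_ts = {ts: value for value, ts in system_points}
    combined = []
    for value, ts in user_points:
        other = system_by_ts.get(ts)
        total = None if value is None or other is None else value + other
        combined.append([total, ts])
    return combined


# Invented sample data: three readings, 60 seconds apart.
percent_user = [[71.5, 1469025600], [68.0, 1469025660], [None, 1469025720]]
percent_system = [[28.5, 1469025600], [32.0, 1469025660], [30.0, 1469025720]]

for total, ts in combine_user_and_system(percent_user, percent_system):
    print(ts, total)
```

With data like this, a chart based on `percent-user` alone would stay around 70 %, while the combined series shows the machine as fully utilized.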

Comment 6 Martin Bukatovic 2016-07-21 14:47:04 UTC
Created attachment 1182548 [details]
screenshot 3: graphite generated chart of user and system cpu utilization during stress testing

Attaching screenshot 3 referenced in a previous comment.

Comment 7 Nishanth Thomas 2016-07-22 12:09:02 UTC
As per the program meeting, it was decided to move this to 3.0.

Comment 10 Filip Balák 2016-09-21 15:22:24 UTC
Tested with
Server:
ceph-ansible-1.0.5-32.el7scon.noarch
ceph-installer-1.0.15-2.el7scon.noarch
graphite-web-0.9.12-8.1.el7.noarch
rhscon-ceph-0.0.41-1.el7scon.x86_64
rhscon-core-selinux-0.0.42-1.el7scon.noarch
rhscon-core-0.0.42-1.el7scon.x86_64
rhscon-ui-0.0.55-1.el7scon.noarch

Node:
calamari-server-1.4.8-1.el7cp.x86_64
ceph-base-10.2.2-41.el7cp.x86_64
ceph-common-10.2.2-41.el7cp.x86_64
ceph-mon-10.2.2-41.el7cp.x86_64
ceph-osd-10.2.2-41.el7cp.x86_64
ceph-selinux-10.2.2-41.el7cp.x86_64
libcephfs1-10.2.2-41.el7cp.x86_64
python-cephfs-10.2.2-41.el7cp.x86_64
rhscon-agent-0.0.19-1.el7scon.noarch
rhscon-core-selinux-0.0.42-1.el7scon.noarch

and it works as expected. --> Verified

Comment 12 anmol babu 2016-10-17 10:48:48 UTC
Looks good to me

Comment 13 errata-xmlrpc 2016-10-19 15:20:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:2082