1358461 – cpu utilization values reported by RHSC 2.0 are wrong

Summary: cpu utilization values reported by RHSC 2.0 are wrong

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Storage Console
Classification:	Red Hat Storage
Component:	core
Sub Component:
Version:	2
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	2
Assignee:	Nishanth Thomas
QA Contact:	sds-qe-bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	Console-2-Async

Reported:	2016-07-20 18:21 UTC by Martin Bukatovic
Modified:	2017-11-01 15:34 UTC
CC List:	7 users
Fixed In Version:	rhscon-core-0.0.42-1.el7scon.x86_64, rhscon-ui-0.0.55-1.el7scon.noarch
Doc Type:	Bug Fix
Doc Text:	Previously, the CPU utilization chart displayed only the user processes CPU utilization and omitted system CPU utilization. With this update, the CPU utilization chart displays the combined user and system CPU utilization percentage.
Clone Of:
Environment:
Last Closed:	2016-10-19 15:20:40 UTC
Embargoed:

Attachments
screenshot 1: host list page (see cpu utilization charts there) (98.46 KB, image/png) 2016-07-20 18:23 UTC, Martin Bukatovic
screenshot 2: cpu utilization as reported on main dashboard (21.26 KB, image/png) 2016-07-20 18:24 UTC, Martin Bukatovic
screenshot 3: graphite generated chart of user and system cpu utilization during stress testing (95.28 KB, image/png) 2016-07-21 14:47 UTC, Martin Bukatovic

Links
System	ID	Priority	Status	Summary	Last Updated
Gerrithub.io	288860	None	None	None	2016-09-08 07:36:02 UTC
Red Hat Bugzilla	1508520	unspecified	CLOSED	CPU Utilization values are wrong	2023-09-14 04:11:07 UTC
Red Hat Product Errata	RHSA-2016:2082	normal	SHIPPED_LIVE	Moderate: Red Hat Storage Console 2 security and bug fix update	2017-04-18 19:29:02 UTC

Internal Links: 1508520

Description Martin Bukatovic 2016-07-20 18:21:49 UTC

Description of problem
======================

RHSC 2.0 reports cpu utilization values in a wrong/misleading way.

The CPU utilization value reported by the console comes directly from
collectd's cpu user utilization, which means that system cpu utilization is
completely omitted.

Bear in mind that this means, for example, that any filesystem-related cpu
utilization would not be reported (because filesystem code runs in kernel
space).

Version-Release
===============

On RHSC 2.0 server machine:

rhscon-ui-0.0.48-1.el7scon.noarch
rhscon-core-selinux-0.0.34-1.el7scon.noarch
rhscon-ceph-0.0.33-1.el7scon.x86_64
rhscon-core-0.0.34-1.el7scon.x86_64
ceph-installer-1.0.14-1.el7scon.noarch
ceph-ansible-1.0.5-28.el7scon.noarch

On Ceph machines:

rhscon-core-selinux-0.0.34-1.el7scon.noarch
rhscon-agent-0.0.15-1.el7scon.noarch
ceph-selinux-10.2.2-22.el7cp.x86_64
ceph-common-10.2.2-22.el7cp.x86_64
calamari-server-1.4.6-1.el7cp.x86_64

How reproducible
================

100 %

Steps to Reproduce
==================

1. Install RHSC 2.0 following the documentation.
2. Accept a few nodes for the ceph cluster.
3. Create a new ceph cluster named 'alpha'.
4. On all ceph machines (of either OSD or MON roles), install the stress tool.
5. Run the following command on all ceph machines:

   stress --cpu ${N} --vm 1 --vm-bytes ${M}G

   where N is the number of cpus of the particular machine,
   and M is roughly half of the total memory of the machine (in GB).

   The intended purpose of this command is to:

   * utilize all cpus on each machine to 100%
   * while having a significant percentage of system cpu cycles

6. Wait about 15 minutes for collectd and RHSC 2.0 to process and report the
   new utilization state.
7. Check cpu utilization reported by RHSC 2.0 (there are multiple places
   with this information).
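
The values of N and M in step 5 can be derived from the machine itself. The
following is only a minimal sketch, assuming a Linux host with Python 3; the
helper name `build_stress_command` is hypothetical and is not part of RHSC
or of the stress tool:

```python
import os


def build_stress_command(n_cpus, total_mem_bytes):
    """Return the stress invocation from step 5 as an argument list.

    Hypothetical helper: N is the cpu count, M is roughly half of the
    total memory in whole GB (never less than 1).
    """
    half_mem_gb = max(1, total_mem_bytes // (2 * 1024 ** 3))
    return ["stress", "--cpu", str(n_cpus),
            "--vm", "1", "--vm-bytes", f"{half_mem_gb}G"]


if __name__ == "__main__":
    # SC_PAGE_SIZE and SC_PHYS_PAGES are available via sysconf on Linux.
    total = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    print(" ".join(build_stress_command(os.cpu_count(), total)))
```

The script only prints the command line; running it is left to the tester.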

Actual results
==============

We expect that cpu utilization is 100% on all machines of the cluster, which
can be checked by running the `top` tool there:

~~~
%Cpu(s): 71.7 us, 28.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
~~~

In this example, cpu is 100% utilized (71.7 by user space, and 28.3 by system).

So the assumption is correct.
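
The same check can be done programmatically. This is just a sketch that
parses the `top` summary line quoted above; the function name
`parse_top_cpu_line` is hypothetical, and the regular expression assumes the
English field labels and dot decimal separators shown in that line:

```python
import re


def parse_top_cpu_line(line):
    """Parse a top '%Cpu(s):' summary line into {field: percent}."""
    # Each field looks like "71.7 us": a number followed by a 2-letter label.
    pairs = re.findall(r"([\d.]+)\s+([a-z]{2})", line)
    return {name: float(value) for value, name in pairs}


line = ("%Cpu(s): 71.7 us, 28.3 sy,  0.0 ni,  0.0 id,"
        "  0.0 wa,  0.0 hi,  0.0 si,  0.0 st")
cpu = parse_top_cpu_line(line)

user_only = cpu["us"]                    # user space alone
user_plus_system = cpu["us"] + cpu["sy"]  # user space plus kernel
non_idle = 100.0 - cpu["id"]             # everything that is not idle

print(f"user only:     {user_only:.1f} %")
print(f"user + system: {user_plus_system:.1f} %")
print(f"non-idle:      {non_idle:.1f} %")
```

Only the user-only figure falls short of 100 %, which is the kind of gap
described in this report.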

But when we look at the Hosts list and check cpu utilization there, we see
different values (see screenshot #1), such as:

 * 68.2 %
 * 69.0 %
 * 67.3 %

What is going on here? It seems that instead of total cpu utilization, only
user space utilization is reported.

The same issue applies everywhere cpu utilization is reported. For example,
the system performance widget on the main dashboard shows a total cpu
utilization of 68.7 % (see screenshot #2).

Expected results
================

CPU utilization should be reported as 100% everywhere.

Comment 1 Martin Bukatovic 2016-07-20 18:23:52 UTC

Created attachment 1182211
screenshot 1: host list page (see cpu utilization charts there)

Comment 2 Martin Bukatovic 2016-07-20 18:24:17 UTC

Created attachment 1182212
screenshot 2: cpu utilization as reported on main dashboard

Comment 4 Nishanth Thomas 2016-07-21 11:58:50 UTC

If you stress the machine to 100%, other processes might not be getting cpu time, which might be the reason why pushing the stats to collectd is failing. One option to verify this is to keep the utilization below 100% and try again. Also note that the plugin which collects and reports CPU utilization is a collectd plugin and not something written by us. Could you please try the above and let us know?

Comment 5 Martin Bukatovic 2016-07-21 14:46:11 UTC

(In reply to Nishanth Thomas from comment #4)
> If you stress the machine to 100%, other processes might not be getting cpu
> time, which might be the reason why pushing the stats to collectd is failing.
> One option to verify this is to keep the utilization below 100% and try
> again.

This is not the case: all data are properly logged and aggregated by collectd,
as can be seen in screenshot 3 (graphite generated chart for user and system
cpu utilization on one host during my stress testing).

I would suggest rereading the whole description of this BZ; I tried
to be very clear in pointing out where the issue is.

> Also note that the plugin which collects and reports CPU
> utilization is a collectd plugin and not something written by us.

I agree that the cpu utilization monitoring and collecting component was not
written by you - you are using collectd for this. I also know that collectd
logs multiple cpu utilization values (percent-idle, percent-interrupt,
percent-nice, percent-softirq, percent-steal, percent-system, percent-user,
percent-wait).

The problem I see is different: RHSC 2.0 uses only the `percent-user` values
and ignores `percent-system`, and I'm afraid that this could be misleading.
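
To make the difference concrete, here is a minimal sketch of the two ways of
deriving a single utilization figure from the collectd values listed above.
The `sample` dictionary and both function names are hypothetical (this is not
RHSC code), and the numbers are borrowed from the `top` output in the
description:

```python
# Hypothetical sample keyed by the collectd value names listed above.
sample = {
    "percent-idle": 0.0,
    "percent-interrupt": 0.0,
    "percent-nice": 0.0,
    "percent-softirq": 0.0,
    "percent-steal": 0.0,
    "percent-system": 28.3,
    "percent-user": 71.7,
    "percent-wait": 0.0,
}


def user_only(values):
    """Utilization as described in this report: user space alone."""
    return values["percent-user"]


def user_plus_system(values):
    """Utilization that also counts time spent in the kernel."""
    return values["percent-user"] + values["percent-system"]


print(f"user only:     {user_only(sample):.1f} %")
print(f"user + system: {user_plus_system(sample):.1f} %")
```

Whether further values such as `percent-nice` or `percent-wait` should also
be counted is a separate design decision that this sketch does not settle.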

Comment 6 Martin Bukatovic 2016-07-21 14:47:04 UTC

Created attachment 1182548
screenshot 3: graphite generated chart of user and system cpu utilization during stress testing

Attaching screenshot 3 referenced in a previous comment.

Comment 7 Nishanth Thomas 2016-07-22 12:09:02 UTC

As per the program meeting, it was decided to move this to 3.0.

Comment 10 Filip Balák 2016-09-21 15:22:24 UTC

Tested with
Server:
ceph-ansible-1.0.5-32.el7scon.noarch
ceph-installer-1.0.15-2.el7scon.noarch
graphite-web-0.9.12-8.1.el7.noarch
rhscon-ceph-0.0.41-1.el7scon.x86_64
rhscon-core-selinux-0.0.42-1.el7scon.noarch
rhscon-core-0.0.42-1.el7scon.x86_64
rhscon-ui-0.0.55-1.el7scon.noarch

Node:
calamari-server-1.4.8-1.el7cp.x86_64
ceph-base-10.2.2-41.el7cp.x86_64
ceph-common-10.2.2-41.el7cp.x86_64
ceph-mon-10.2.2-41.el7cp.x86_64
ceph-osd-10.2.2-41.el7cp.x86_64
ceph-selinux-10.2.2-41.el7cp.x86_64
libcephfs1-10.2.2-41.el7cp.x86_64
python-cephfs-10.2.2-41.el7cp.x86_64
rhscon-agent-0.0.19-1.el7scon.noarch
rhscon-core-selinux-0.0.42-1.el7scon.noarch

and it works as expected. --> Verified

Comment 12 anmol babu 2016-10-17 10:48:48 UTC

Looks good to me

Comment 13 errata-xmlrpc 2016-10-19 15:20:40 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:2082
