1614486 – CPU utilization values provided in "CPU Utilization by Host" panel in Cluster dashboard are wrong

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	web-admin-tendrl-monitoring-integration
Sub Component:
Version:	rhgs-3.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	gowtham
QA Contact:	sds-qe-bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-08-09 17:54 UTC by Martin Bukatovic
Modified:	2019-05-08 18:09 UTC
CC List:	2 users
Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-05-08 16:05:48 UTC
Embargoed:
Dependent Products:

Attachments
screenshot 1: terminal with running stress, top and sar on the affected machine next to the CPU Utilization by Host panel from the dashboard (60.49 KB, image/png) 2018-08-09 18:06 UTC, Martin Bukatovic

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1508520	0	unspecified	CLOSED	CPU Utilization values are wrong	2023-09-14 04:11:07 UTC

Internal Links: 1508520

Description Martin Bukatovic 2018-08-09 17:54:17 UTC

Description of problem
======================

The "CPU Utilization by Host" panel on the Cluster dashboard (a table with a
single CPU utilization value for each host) reports CPU utilization in a
wrong/misleading way.

The reported value contradicts common expectations and Red Hat recommendations
(see https://access.redhat.com/solutions/2908361).

Version-Release number of selected component
============================================

tendrl-monitoring-integration-1.6.3-7.el7rhgs.noarch

Steps to Reproduce
==================

1. Install RHGS WA using tendrl-ansible
2. Import Trusted storage pool
3. Select one storage machine from the pool and install stress tool there
4. Run the following command on the selected storage machine:

   stress --cpu ${N} --vm 1 --vm-bytes ${M}G

   where N is the number of CPUs on the selected machine,
   and M is roughly half of the total memory of the machine (in GB).

   The intended purpose of this command is to:

   * utilize all CPUs on the machine to 100 %
   * while spending a significant percentage of CPU cycles in system mode

   Note that for this trick to work, you need swap space on the machine.

   Leave it running while you continue with the next steps.

5. On the same storage machine, also run the following tools:

   * top
   * sar | tail -n2 | awk '/all/&&!/Average/ { print $1,$2, 100 - $NF}'

6. Go to the Cluster dashboard and, in the Top Consumers section, check the
   "CPU Utilization by Host" panel.

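For reference, the values of N and M in step 4 could be derived as in the
sketch below. It assumes that nproc and /proc/meminfo are available on the
storage machine. The helper names half_mem_gb and stress_cmd are hypothetical
and are not part of the stress tool.

```shell
#!/bin/sh
# Hypothetical helper: take MemTotal in kB and print roughly half of it
# in whole GB (never less than 1).
half_mem_gb() {
    gb=$(( $1 / 1024 / 1024 / 2 ))
    if [ "$gb" -lt 1 ]; then
        gb=1
    fi
    echo "$gb"
}

# Hypothetical helper: print the stress command line from step 4
# for a given CPU count ($1) and MemTotal in kB ($2).
stress_cmd() {
    echo "stress --cpu $1 --vm 1 --vm-bytes $(half_mem_gb "$2")G"
}

# Possible usage on the storage machine (prints the command, does not run it):
#   stress_cmd "$(nproc)" "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)"
```

Printing the command instead of running it makes it possible to review the
values before putting load on the machine.
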
Actual results
==============

For the machine where the stress tool is running, the CPU utilization value
in the "CPU Utilization by Host" panel is significantly less than 100 %, e.g.
about 70 % in my case.

This is because only the "user" part of CPU utilization is reported, ignoring
the rest (especially the "system" part). This can be verified with the commands
from step 5:

 * top reports: about 70 % us, about 30 % sy, and zeroes for the others
 * sar command reports 100 %
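
The difference between the two readings can be illustrated with a small
sketch that follows the same "100 - idle" idea as the sar command from
step 5, computed from two samples of the aggregate "cpu" line in /proc/stat.
This is only an illustration under stated assumptions, not the way tendrl or
collectd computes the value; the function name cpu_busy_pct is hypothetical,
and counting fields 2 to 9 (user through steal) as total time is an
assumption of this sketch.

```shell
#!/bin/sh
# Hypothetical helper: cpu_busy_pct "<earlier cpu line>" "<later cpu line>"
# Each argument is the aggregate "cpu ..." line from /proc/stat.
# Prints 100 * (1 - idle_delta / total_delta) with one decimal place.
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        # Fields: $2 user, $3 nice, $4 system, $5 idle, $6 iowait,
        #         $7 irq, $8 softirq, $9 steal (assumed to form the total).
        { total = 0; for (i = 2; i <= 9; i++) total += $i; idle = $5 }
        NR == 1 { t0 = total; idle0 = idle }
        NR == 2 {
            dt = total - t0
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (dt - (idle - idle0)) / dt
        }'
}

# Possible usage on the storage machine:
#   a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
#   cpu_busy_pct "$a" "$b"
```

With two samples that differ by 700 ticks of user time, 300 ticks of system
time and no idle time, this prints 100.0, while a calculation based on user
time alone would give 70 %.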

Expected results
================

The expected CPU utilization value provided by the panel is about 100 % for
the affected storage machine.

Additional info
===============

See the Red Hat KB article on the subject: "How to measure cpu usage with a
single value or metric to be used with monitoring and alerting tools?"
Available at:

https://access.redhat.com/solutions/2908361

A related issue was found during a previous testing phase, but it was resolved
in a different way, as there was no need to limit the chart to a single value:

https://bugzilla.redhat.com/show_bug.cgi?id=1508520

Comment 1 Martin Bukatovic 2018-08-09 18:06:37 UTC

Created attachment 1474788
screenshot 1: terminal with running stress, top and sar on the affected machine next to the CPU Utilization by Host panel from the dashboard

Comment 2 Martin Bukatovic 2018-08-09 18:17:53 UTC

Note that any CPU cycles spent by filesystem kernel threads are reported as
system CPU utilization. For this reason, the choice to ignore this part of CPU
utilization completely when monitoring storage machines looks like a bad
decision, even without taking the common expectations into account.

Comment 3 Martin Bukatovic 2018-08-09 18:27:11 UTC

Full list of WA packages on storage machine for reference:

tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-9.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-7.el7rhgs.noarch
tendrl-node-agent-1.6.3-9.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
