1508520 – CPU Utilization values are wrong

Summary: CPU Utilization values are wrong

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	web-admin-tendrl-monitoring-integration
Sub Component:
Version:	rhgs-3.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	---
Assignee:	Ankush Behl
QA Contact:	Martin Kudlej
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-11-01 15:33 UTC by Filip Balák
Modified:	2023-09-14 04:11 UTC
CC List:	5 users
Fixed In Version:	tendrl-monitoring-integration-1.5.4-3.el7rhgs
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-12-18 04:39:36 UTC
Embargoed:
Dependent Products:

Attachments
CPU Utilization panel (48.16 KB, image/png) 2017-11-01 15:33 UTC, Filip Balák	no flags
CPU Utilization panel after modification (55.85 KB, image/png) 2017-11-13 15:57 UTC, Filip Balák	no flags

Links
System	ID	Priority	Status	Summary	Last Updated
Github	Tendrl monitoring-integration issues 27	None	closed	Status for 1st milestone	2020-01-31 14:11:54 UTC
Github	https://github.com/Tendrl monitoring-integration issues 229	None	None	None	2020-01-31 14:11:54 UTC
Red Hat Bugzilla	1358461	unspecified	CLOSED	cpu utilization values reported by RHSC 2.0 are wrong	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1614486	unspecified	CLOSED	CPU utilization values provided in "CPU Utilization by Host" panel in Cluster dashboard are wrong	2021-02-22 00:41:40 UTC
Red Hat Product Errata	RHEA-2017:3478	normal	SHIPPED_LIVE	RHGS Web Administration packages	2017-12-18 09:34:49 UTC

Description Filip Balák 2017-11-01 15:33:23 UTC

Created attachment 1346601
CPU Utilization panel

Description of problem:
CPU utilization values are reported in a wrong or misleading way. The CPU Utilization panel in the Hosts dashboard contains a chart with only one measured value, labeled by the host it measures. It seems to aggregate all CPU time categories, but the value in this chart does not equal any value reported by the `top` command. Each CPU time category should have its own value in the chart, and these values should be correctly labeled.

Version-Release number of selected component (if applicable):
tendrl-api-httpd-1.5.3-2.el7rhgs.noarch
tendrl-grafana-selinux-1.5.3-2.el7rhgs.noarch
tendrl-selinux-1.5.3-2.el7rhgs.noarch
tendrl-node-agent-1.5.3-3.el7rhgs.noarch
tendrl-ui-1.5.3-2.el7rhgs.noarch
tendrl-grafana-plugins-1.5.3-2.el7rhgs.noarch
tendrl-notifier-1.5.3-1.el7rhgs.noarch
tendrl-ansible-1.5.3-2.el7rhgs.noarch
tendrl-commons-1.5.3-1.el7rhgs.noarch
tendrl-api-1.5.3-2.el7rhgs.noarch
tendrl-monitoring-integration-1.5.3-2.el7rhgs.noarch
glusterfs-3.8.4-50.el7rhgs.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Import a cluster with a volume.
2. Install the `stress` tool on one of the nodes.
3. Run the following command on the monitored machine:

   stress --cpu ${N} --vm 1 --vm-bytes ${M}G

   where N is the number of CPUs on that machine,
   and M is roughly half of the total memory of the machine (in GB).

   The intended purpose of this command is to:

   * utilize all CPUs on the machine to 100%
   * while spending a significant percentage of CPU cycles in system (kernel) time

4. Wait for new data to be collected and visualized in Grafana.
5. Check CPU Utilization panel in Hosts dashboard.
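The values of N and M in step 3 can be derived from the machine itself. The following is a minimal sketch, not part of the original report: the helper names `build_stress_command` and `read_mem_total_kib` are hypothetical, and it assumes a Linux host where `/proc/meminfo` reports `MemTotal` in kB.

```python
def build_stress_command(n_cpus: int, mem_total_kib: int) -> str:
    """Build the stress command line from step 3.

    n_cpus: number of CPUs on the machine (N).
    mem_total_kib: total memory in KiB, as given by MemTotal in /proc/meminfo.
    """
    # M is roughly half of total memory in whole GB, but never less than 1.
    half_gb = max(1, round(mem_total_kib / 1024 / 1024 / 2))
    return f"stress --cpu {n_cpus} --vm 1 --vm-bytes {half_gb}G"


def read_mem_total_kib(path: str = "/proc/meminfo") -> int:
    """Read MemTotal (in KiB) from /proc/meminfo on Linux."""
    with open(path) as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1])
    raise RuntimeError("MemTotal not found in " + path)
```

On the monitored node, `build_stress_command(os.cpu_count(), read_mem_total_kib())` (after `import os`) would return the command string to run.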

Actual results:
The chart shows only one reported value, and it does not reflect CPU utilization correctly.

CPU utilization reported by `top`:
%Cpu(s): 66.8 us, 33.1 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st

CPU reported by Grafana: 50%
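To make the mismatch concrete, the `top` summary line above can be parsed and the non-idle share computed. This is an illustrative sketch only; the helper name `parse_top_cpu_line` is hypothetical, and "busy" is defined here simply as 100 minus the idle percentage.

```python
import re


def parse_top_cpu_line(line: str) -> dict:
    """Parse a `%Cpu(s):` summary line from top into {label: percent}."""
    return {label: float(value)
            for value, label in re.findall(r"([\d.]+)\s+([a-z]{2})", line)}


line = ("%Cpu(s): 66.8 us, 33.1 sy,  0.0 ni,  0.0 id,"
        "  0.0 wa,  0.0 hi,  0.2 si,  0.0 st")
cpu = parse_top_cpu_line(line)

# Everything that is not idle time counts as busy in this sketch.
busy = 100.0 - cpu["id"]
print(f"user={cpu['us']}% system={cpu['sy']}% busy={busy}%")
```

Neither the user value, the system value, nor the non-idle total matches the 50% shown in the panel.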


Expected results:
CPU utilization should be reported correctly for each of the following:

user    : time running un-niced user processes
system  : time running kernel processes
nice    : time running niced user processes
idle    : time spent in the kernel idle handler
IO-wait : time waiting for I/O completion
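For reference, the categories listed above can be derived on Linux from two samples of the aggregate `cpu` line in `/proc/stat`. The sketch below is an assumption about how such percentages are commonly computed, not a description of how Tendrl or its collectors do it; the function names are hypothetical and the sample lines in the usage note are made up for illustration.

```python
# First eight counters of the aggregate "cpu" line in /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> dict:
    """Parse the aggregate `cpu` line of /proc/stat into {field: ticks}."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, map(int, parts[1:1 + len(FIELDS)])))


def utilization(before: dict, after: dict) -> dict:
    """Return the percentage of elapsed ticks spent in each category."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}
```

With the made-up samples `cpu 100 0 50 850 0 0 0 0 0 0` and `cpu 768 0 381 850 0 0 1 0 0 0`, `utilization` yields about 66.8% user and 33.1% system, the same shape as the `top` output in this report.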

Additional info:

Comment 1 Nishanth Thomas 2017-11-06 17:47:50 UTC

What is incorrect here?
Do you mean that the current calculation is wrong, or that it's not showing all the values (user, system, nice, etc.) separately on the graph?

Comment 2 Filip Balák 2017-11-07 12:07:25 UTC

All the values should be shown separately. There shouldn't be only one aggregated value for CPU utilization.

Comment 3 Filip Balák 2017-11-13 15:48:10 UTC

I see that CPU utilization is now divided into user and system time. Are these the only CPU metrics that a user needs to know in relation to Gluster? Is it unnecessary for the Gluster use case to show nice, idle and IO-wait values?

Comment 4 Filip Balák 2017-11-13 15:56:39 UTC

I also see that the reported metrics are not shown cumulatively (compare for example https://grafana.com/dashboards/203). My chart shows the values user=72% and system=27%. From the chart it is not clear that almost all CPU resources are consumed. The threshold line is also not visibly breached this way.
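The difference between plotting the series separately and stacking them can be sketched as follows. The user and system values are the ones quoted above; the threshold of 80% is a hypothetical value chosen only for illustration, since the actual panel threshold is not stated here.

```python
def stack_series(series: dict) -> dict:
    """Return the cumulative (stacked) upper edge of each series, in order."""
    stacked, running = {}, 0.0
    for name, value in series.items():
        running += value
        stacked[name] = running
    return stacked


THRESHOLD = 80.0  # hypothetical threshold, not taken from the dashboard

values = {"user": 72.0, "system": 27.0}
stacked = stack_series(values)

# Plotted separately, no single line crosses the threshold.
breached_separately = any(v > THRESHOLD for v in values.values())
# Stacked, the top edge of the chart (user + system) does cross it.
breached_stacked = stacked["system"] > THRESHOLD

print(stacked, breached_separately, breached_stacked)
```

Under these assumptions the stacked top edge sits at 99%, which makes it visible that the CPU is almost fully consumed.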

Comment 5 Filip Balák 2017-11-13 15:57:40 UTC

Created attachment 1351625
CPU Utilization panel after modification

Comment 7 Ankush Behl 2017-11-18 06:18:15 UTC

@fbalak we are only showing two metrics (user and system), as they are the ones that actually matter to end users and admins.

>> Why are we not showing the cumulative?

Earlier we showed the cumulative (aggregated) metric, but we removed it. CPU only has user, system, nice, idle and IO-wait values, and if we show "total used", the end user can get confused about what that total is showing. So we removed "total used" from the memory and CPU graphs. Also, the threshold is still breached at the cumulative value of user and system, but I have kept that cumulative value hidden.

Also, if you look at the live example provided by Grafana, it doesn't have a cumulative value either. http://play.grafana.org/dashboard/db/prometheus-demo-dashboard?refresh=5m&orgId=1

Comment 11 errata-xmlrpc 2017-12-18 04:39:36 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3478

Comment 12 Red Hat Bugzilla 2023-09-14 04:11:07 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days
