Description of problem:

IOPS panel is not intuitive.

1. The IOPS metric panel has a graph but no scale.
2. Performance engineering indicated that we may be able to reach 40K IOPS per their testing; however, in my testing with one client driving IO, this panel shows 2 billion IOPS during the test, which seems to be wrong.
3. The intent of the IOPS count in the middle of the panel is to act like an IOPS meter, but since we can only poll at a 5-second interval at a minimum, the smooth meter effect is missing.

Version-Release number of selected component (if applicable):
3.4

How reproducible:
Generate traffic in an RHGS cluster and view the Cluster dashboard in WA.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Anand, RHGS-WA uses the command `gluster volume profile {volname} info --xml` to get the IOPS values. For each brick, the fields `intervalStats/totalRead` and `intervalStats/totalWrite` determine the values populated under `clusters.{int-id}.volumes.{volid}.nodes.{fqdn}.bricks.{brickname}.iops.gauge-read` and `clusters.{int-id}.volumes.{volid}.nodes.{fqdn}.bricks.{brickname}.iops.gauge-write` respectively in graphite. These two values are aggregated and shown in the grafana dashboard.

Wanted to know:
- How are the values shown in grafana being compared?
- What values do we check in the underlying cluster before comparing?
- Are we running the command `gluster v profile info` and then comparing?

@Ankush, is this analysis correct about the data shown in grafana? Do you need any more details to check this issue?
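To make the data path above concrete, here is a minimal sketch of how the `intervalStats/totalRead` and `intervalStats/totalWrite` fields could be pulled per brick from the `--xml` output. The exact XML layout below is an assumption for illustration (based only on the field paths named above), not the real gluster schema, and this is not the actual collectd plugin code.

```python
import xml.etree.ElementTree as ET

# Simplified, assumed XML shape -- NOT the exact gluster --xml schema.
SAMPLE_XML = """
<cliOutput>
  <volProfile>
    <brick>
      <brickName>gl1:/bricks/beta_arbiter_3</brickName>
      <intervalStats>
        <totalRead>0</totalRead>
        <totalWrite>8589965312</totalWrite>
      </intervalStats>
    </brick>
  </volProfile>
</cliOutput>
"""

def interval_totals(xml_text):
    """Return {brick_name: (total_read, total_write)} from profile XML."""
    root = ET.fromstring(xml_text)
    totals = {}
    for brick in root.iter("brick"):
        name = brick.findtext("brickName")
        stats = brick.find("intervalStats")
        totals[name] = (int(stats.findtext("totalRead")),
                        int(stats.findtext("totalWrite")))
    return totals

print(interval_totals(SAMPLE_XML))
```

Note that these two fields are byte counters, which matters for the discussion below about why dividing them by the interval duration yields throughput rather than IOPS.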
Based on discussion with Manoj from the perf eng team, it looks like the logic to calculate the read IOPS and write IOPS in the collectd plugins of RHGS-WA is wrong and needs to be corrected. As the command `gluster v profile {volname} info` reports values in bytes, the formula to be used for calculating read IOPS and write IOPS should be as below:

```
read-iops = summation_of(no. of reads of different sizes)/duration
write-iops = summation_of(no. of writes of different sizes)/duration
```

and then

```
total iops = read-iops + write-iops
```

Manoj, kindly ack.

>> 3. The intent of the IOPS count in the middle of the panel is like an IOPS
>> meter, but since we can only poll at a 5 sec interval at a minimum the smooth
>> meter effect is missing

Ju, I feel your comments are required on this.
Here's a sample gluster volume profile output interval:

<quote>
Interval 5 Stats:
   Block Size:              2048b+    8192b+
 No. of Reads:                   0         0
No. of Writes:                  10   1048576

 %-latency  Avg-latency  Min-Latency  Max-Latency  No. of calls  Fop
 ---------  -----------  -----------  -----------  ------------  ----
      0.66     37.05 us     14.00 us   2268.00 us         21825  FINODELK
      0.94    125.47 us     36.00 us   5246.00 us          9097  FSYNC
      2.92    163.25 us     99.00 us   4638.00 us         21812  FXATTROP
     95.48    111.08 us     78.00 us   6597.00 us       1048586  WRITE

    Duration: 178 seconds
   Data Read: 0 bytes
Data Written: 8589965312 bytes
</quote>

What I gathered from my discussion with Shubhendu is that RHGS-WA is currently reporting the last two lines as IOPS, which would explain why Anand saw numbers in the range of billions.

(Data_Read + Data_Written)/Duration would give you throughput (bytes/s).

For IOPS, the right calculation would be (total no. of reads and writes)/Duration. In this case that comes to about 6k, which is what the benchmark was reporting.
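Plugging the numbers from the sample interval above into both formulas makes the discrepancy obvious. This is just an illustrative sketch using the quoted values, not plugin code:

```python
# Values taken directly from the "Interval 5 Stats" sample above.
reads = [0, 0]            # "No. of Reads" for 2048b+ and 8192b+
writes = [10, 1048576]    # "No. of Writes" for 2048b+ and 8192b+
duration = 178            # seconds
data_read = 0             # "Data Read" (bytes)
data_written = 8589965312 # "Data Written" (bytes)

# Correct calculation: operation counts per second.
read_iops = sum(reads) / duration
write_iops = sum(writes) / duration
total_iops = read_iops + write_iops
print(round(total_iops))  # roughly 5891, i.e. the ~6k the benchmark reported

# What was being reported instead: byte counters per second,
# which is throughput (bytes/s), not IOPS.
throughput = (data_read + data_written) / duration
print(round(throughput))  # in the tens of millions -- hence the huge panel values
```

The ratio between the two numbers is roughly the average IO size, which is why the panel value was off by several orders of magnitude.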
Thanks Manoj for the detailed sample. Could you also add one `--xml` sample to help us?
@shubhendu

> (Data_Read+Data_Written)/Duration would give you throughput (bytes/s)
> For IOPS, the right calculation would be (total no. of reads and writes)/Duration.

Looks good to me.
For the IOPS dashboard panel, since we're using a sparkline with no drilldown capabilities, we're not able to see the values with the crosshair. My recommendation is that we switch this to a line chart like the Capacity Utilization Trend dashboard panel.
Affects cluster, volume, and brick dashboards.
@shtripat, if we are planning to put in the trending charts then we can't show "120 IOPS": 120 will be a value in the graph, and IOPS will be on the Y-axis of the graph.
It turned out that (see BZ 1587804):

> When one will enable/disable profiling over period of time on multiple
> volumes, as the IOPS number is not clearly defined or meaningful.
> For example we have 4 volumes, but only 2 have profiling enabled, then we
> enable profiling on 3rd volume as well, which will have an impact on
> aggregated IOPS numbers reported just by the sheer fact that the profiling
> is enabled (before that, numbers were not included because the profiling
> was disabled)

Should we create a doc BZ for this? Should this be addressed as part of this BZ?
*** Bug 1588699 has been marked as a duplicate of this bug. ***
@mbukatov The dev team has decided to go with line/graph charts instead of the current sparkline charts. The following panels will be changed:

* Cluster dashboard: IOPS dashboard panel
* Volume dashboard: Throughput dashboard panel
* Brick dashboard: IOPS dashboard panel

Name change: change the name of the "Throughput" panel to "IOPS" in the volume dashboard, as the panel is showing (read + write), which is IOPS.
Created attachment 1452990 [details]
Screenshot of IOPS dashboard on cluster dashboard (as shown when the BZ was created)

Adding this for future reference, because this BZ is missing both a screenshot of the IOPS chart and the versions of all tendrl rpm packages (or at least of tendrl-monitoring-integration, which is the component that implements the dashboards).

The screenshot shows the IOPS panel on the cluster dashboard. As you can see, it's not possible to see the scale or units of measure, nor to check a particular datapoint to get more information about the value at that time.

Version of the package: tendrl-monitoring-integration-1.6.3-4.el7rhgs.noarch
Ok, so I think that the work to get this BZ fixed includes:

* Change the way WA calculates IOPS from the output of `gluster v profile {volname} info` (explained in comment 3 and comment 4):

```
read-iops = summation_of(no. of reads of different sizes)/duration
write-iops = summation_of(no. of writes of different sizes)/duration
```

* Update the IOPS chart from a sparkline to a line chart, so that it's possible to use the crosshair to inspect a particular datapoint.
* Update the IOPS chart to include a scale with units of measure.
* Remove the current IOPS value from the middle of the IOPS chart.
* These changes should be done for every IOPS chart we have:
  * Cluster dashboard: IOPS dashboard panel
  * Volume dashboard: Throughput dashboard panel
  * Brick dashboard: IOPS dashboard panel
  * Host dashboard: IOPS
* Change the name of the "Throughput" panel to "IOPS" in the volume dashboard, as the panel is showing (read + write), which is IOPS.

The problem described in comment 16 is not part of this BZ and has a low priority, as nobody was interested in commenting on it.
Updated list of IOPS charts:

* Cluster Dashboard: At Glance: IOPS
* Volume Dashboard: Performance: IOPS
* Host Dashboard: At Glance: IOPS
* Brick Dashboard: At Glance: IOPS

Skipped for the purpose of this BZ:

* Brick Dashboard: Disk Load: IOPS (see BZ 1593852 for details)
Created attachment 1454428 [details]
IOPS dashboard

Testing in "RFE mode", as noted in comment 23, verifying items as described in comment 21 and clarified in comment 22. This means I have checked that the changes described above were made, but I'm not claiming that there are no bugs in the IOPS charts feature (both related and unrelated to this change); see the list of BZs below.

version
-------

tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch

[root@mbukatov-usm1-server ~]# rpm -qa | grep tendrl | sort
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-4.el7rhgs.noarch

calculating values
------------------

The IOPS values are really calculated as described in comment 21 (rechecked manually on a single value only, but given the order-of-magnitude difference, this seems enough). Checking brick beta_arbiter_3 on machine gl1, the output of the "gluster v profile VOLUME info" command contains the following entries for this brick:

```
      0.96     162.83 us     84.00 us   2809.00 us          2232  WRITE
    Duration: 3102 seconds
      0.77     151.55 us     97.00 us    283.00 us            53  WRITE
    Duration: 16 seconds
```

Which gives us (2232/3102 + 53/16) IOPS =~ 4.03 IOPS, which matches what is reported as IOPS for this brick at the time when this command was issued.

design of iops charts
---------------------

Has been updated as described in comment 21.

known problems
--------------

During testing of this BZ, the following bugs were found and reported:

* BZ 1594899
* BZ 1593852
* BZ 1593912
* BZ 1594383
* BZ 1594342
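For anyone re-running the manual verification above, the arithmetic is just per-interval WRITE call counts divided by their interval durations, summed across the two intervals quoted from the profile output:

```python
# WRITE call counts and durations from the two profile entries above.
iops = 2232 / 3102 + 53 / 16
print(round(iops, 2))  # 4.03, matching the value shown for this brick
```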
Fixing assignee (changed by mistake when the BZ was verified).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2616
For some reason this is still in need info on my name. Hoping that this comment will take it out of that state.