Bug 1594899 - Most IOPS charts in the At a Glance section of Brick Dashboards show no data for short or light workloads
Summary: Most IOPS charts in the At a Glance section of Brick Dashboards show no data for...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: web-admin-tendrl-monitoring-integration
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Shubhendu Tripathi
QA Contact: Martin Bukatovic
URL:
Whiteboard:
Depends On:
Blocks: 1503137
 
Reported: 2018-06-25 16:09 UTC by Martin Bukatovic
Modified: 2018-09-04 07:08 UTC
CC List: 4 users

Fixed In Version: tendrl-monitoring-integration-1.6.3-6.el7rhgs
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-04 07:07:57 UTC
Embargoed:


Attachments
 * wiki export split script (7.34 KB, text/x-python), 2018-06-25 16:12 UTC, Martin Bukatovic
 * screenshot 1 (this looks ok) (163.44 KB, image/png), 2018-06-25 16:22 UTC, Martin Bukatovic
 * screenshot 2: zeroes reported (162.08 KB, image/png), 2018-06-25 16:22 UTC, Martin Bukatovic
 * screenshot 3: no data reported at all (161.47 KB, image/png), 2018-06-25 16:23 UTC, Martin Bukatovic
 * screenshot 4: short term vs long term workload (213.76 KB, image/png), 2018-06-26 08:57 UTC, Martin Bukatovic
 * IOPS shooting up while writing small no of smaller files to the volume mount (19.75 KB, image/png), 2018-06-28 07:44 UTC, Shubhendu Tripathi
 * iops_while_no_of_small_file_being_written (90.79 KB, image/png), 2018-07-03 13:33 UTC, Shubhendu Tripathi


Links
 * Github Tendrl monitoring-integration issue 504, last updated 2018-07-03 13:51:26 UTC
 * Github Tendrl monitoring-integration issue 505, last updated 2018-07-03 13:52:11 UTC
 * Red Hat Bugzilla 1581736 (CLOSED): IOPS metric is not intuitive enough, last updated 2021-02-22 00:41:40 UTC
 * Red Hat Product Errata RHSA-2018:2616, last updated 2018-09-04 07:08:46 UTC

Internal Links: 1581736

Description Martin Bukatovic 2018-06-25 16:09:13 UTC
Description of problem
======================

Most IOPS charts in the At a Glance section of Brick Dashboards show no data
when I extract about 10000 files from a tarball on an arbiter 2 plus 1x2
volume[1].

Version-Release number
======================

tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch

[root@mbukatov-usm1-server ~]# rpm -qa | grep tendrl | sort
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-4.el7rhgs.noarch

[root@mbukatov-usm1-gl1 ~]#  rpm -qa | grep tendrl | sort
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep gluster | sort
glusterfs-3.12.2-12.el7rhgs.x86_64
glusterfs-api-3.12.2-12.el7rhgs.x86_64
glusterfs-cli-3.12.2-12.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-12.el7rhgs.x86_64
glusterfs-events-3.12.2-12.el7rhgs.x86_64
glusterfs-fuse-3.12.2-12.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-12.el7rhgs.x86_64
glusterfs-libs-3.12.2-12.el7rhgs.x86_64
glusterfs-rdma-3.12.2-12.el7rhgs.x86_64
glusterfs-server-3.12.2-12.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.5.x86_64
python2-gluster-3.12.2-12.el7rhgs.x86_64
tendrl-gluster-integration-1.6.3-5.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

How reproducible
================

100% (the actual number of affected IOPS charts may differ)

Steps to Reproduce
==================

1. Prepare a gluster trusted storage pool with an arbiter 2 plus 1x2 volume[1]
2. Install WA using tendrl-ansible
3. Mount the volume on a dedicated client machine[2]
4. On the client, download the enwiki-latest-pages-articles.xml.bz2 file[3]
5. On the client, extract the tarball into the volume using the attached
   wiki-export-split.py script:
   
~~~
# cd /mnt/volume
# bzcat /tmp/enwiki-latest-pages-articles.xml.bz2 | wiki-export-split.py --noredir --filenames=sha1 --sha1sum=wikipages.sha1 --max-files=10000
~~~

Note: This will extract about 10000 files with filenames based on the sha1
checksum of their content (e.g. tw5d0lzutwvkjgp77kh7vdbg8h29a8p.wikitext),
so that every brick of the volume is expected to host some files in the
end.

Note: You can also download the dump on the fly like this if you don't
have enough space on the client machine:

~~~
# curl http://mirror.example.com/enwiki-latest-pages-articles.xml.bz2 | bzcat | ./wiki-export-split.py --noredir --filenames=sha1 --sha1sum=wikipages.sha1 --max-files=10000
~~~
  
6. Wait for the extraction to complete.
7. Check the IOPS dashboards for all bricks of the volume (e.g. for 6 machines
   with an arbiter 2 plus 1x2 volume, there are 18 bricks). An optional
   cross-check that the workload reached every brick is sketched after the
   references below.

[1] https://github.com/usmqe/usmqe-setup/blob/12fd9d3ea172cdbf56a808aa5161eefc3346ec1a/gdeploy_config/volume_beta_arbiter_2_plus_1x2.create.conf
[2] https://github.com/usmqe/usmqe-setup/blob/12fd9d3ea172cdbf56a808aa5161eefc3346ec1a/gdeploy_config/volume_beta_arbiter_2_plus_1x2.mount.conf
[3] use internal mirror linked in a private comment below, original source: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
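
As an optional cross-check (not part of the original reproducer), gluster's
per-brick profiling can confirm that the extraction generated I/O on every
brick; the volume name below is an assumption based on the gdeploy config
referenced in [1], adjust it to the actual volume name:

~~~
# gluster volume profile volume_beta_arbiter_2_plus_1x2 start   # before step 5
# gluster volume profile volume_beta_arbiter_2_plus_1x2 info    # after step 6
~~~

The `info` output lists per-brick FOP counters, which should all be non-zero
once the extraction has finished.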

Actual results
==============

In my case, I see IOPS data in the brick charts for only 4 bricks out of 18,
with one of those charts reporting only part of the datapoints.

See:

 * screenshot 1 with IOPS chart showing data (note: this is an arbiter brick)
 * screenshot 2 with IOPS chart reporting zeroes while non-zero values
   are expected (compare the IOPS chart with capacity utilization, which
   reports an increase)
 * screenshot 3 with IOPS chart without any data for the given time frame
   (while non-zero values are expected, compare with the increase of capacity
   utilization)
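
To distinguish the "zeroes reported" case (screenshot 2) from the "no data at
all" case (screenshot 3) independently of the Grafana panels, the backing
Graphite instance can be queried directly via its render API; the host, port
and series name below are placeholders, as the actual Tendrl metric paths are
not shown in this report:

~~~
# curl -s 'http://GRAPHITE_HOST:PORT/render?target=TENDRL_BRICK_IOPS_SERIES&from=-1h&format=json'
~~~

No matching series (or only null datapoints) would correspond to screenshot 3,
while actual zero values in the datapoints would correspond to screenshot 2.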

Expected results
================

All IOPS charts for bricks of the volume report some IOPS data for the whole
time (as the files are named in a way that every brick hosts some files), so
that the IOPS spike is visible for all affected bricks.

Comment 1 Martin Bukatovic 2018-06-25 16:10:01 UTC
Reported while testing BZ 1581736.

Comment 3 Martin Bukatovic 2018-06-25 16:12:54 UTC
Created attachment 1454405 [details]
wiki export split script

Comment 4 Martin Bukatovic 2018-06-25 16:22:20 UTC
Created attachment 1454408 [details]
screenshot 1 (this looks ok)

Comment 5 Martin Bukatovic 2018-06-25 16:22:49 UTC
Created attachment 1454409 [details]
screenshot 2: zeroes reported

Comment 6 Martin Bukatovic 2018-06-25 16:23:12 UTC
Created attachment 1454411 [details]
screenshot 3: no data reported at all

Comment 7 Martin Bukatovic 2018-06-25 16:25:27 UTC
I verified that data are present (as expected) on the bricks for which
no IOPS are reported, e.g. on the brick from screenshot 2:

```
[root@mbukatov-usm1-gl1 ~]# ls /mnt/brick_beta_arbiter_1/1 | wc -l
3005
```
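
A hypothetical way to run the same check across all storage machines and brick
mount points in one go (the hostnames and the brick path pattern below are
assumptions extrapolated from the naming used in this report):

```
for host in mbukatov-usm1-gl{1..6}; do
    echo "== $host =="
    # count files in the data directory of every brick on that host
    ssh "$host" 'for b in /mnt/brick_beta_arbiter_*/1; do echo "$b: $(ls "$b" | wc -l)"; done'
done
```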

Comment 9 gowtham 2018-06-26 06:30:58 UTC
@Martin can you please check whether the extracted files are present on the other machines? It might be the case that all the files were extracted on the same machine.

Comment 10 Martin Bukatovic 2018-06-26 07:55:39 UTC
(In reply to gowtham from comment #9)
> @Martin can you please check whether the extracted files are present on the
> other machines? It might be the case that all the files were extracted on the
> same machine.

I'm quite sure that all bricks were utilized. I checked this via ssh on a few
machines (as noted in comment 7 for the gl1 machine) and checked the
utilization charts of all bricks both in WA (Brick Details page of the volume)
and in Grafana in all dashboards.

The point of extracting 10 000 files named using the sha1 of their content is
to achieve a uniform allocation of files across all bricks.
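
For an individual file, the brick(s) holding it can also be confirmed from the
client mount via gluster's pathinfo xattr (the mount point and file name below
are just the ones used earlier in this report; the output format depends on
the gluster version):

```
getfattr -n trusted.glusterfs.pathinfo /mnt/volume/tw5d0lzutwvkjgp77kh7vdbg8h29a8p.wikitext
```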

Comment 11 Martin Bukatovic 2018-06-26 08:35:52 UTC
Additional Information
======================

I scheduled the same workload to be run overnight without limiting the number
of extracted files:


```
[root@mbukatov-usm1-client volume_beta_arbiter_2_plus_1x2]# bzcat /tmp/enwiki-latest-pages-articles.xml.bz2 | wiki-export-split.py --noredir --filenames=sha1 --sha1sum=wikipages.sha1
```

This means that the same rate of new files and data was stored on the volume
as described in the reproducer of this BZ, but for a much longer time period
(over 12 hours).

And I noticed that *all IOPS charts* in the At a Glance section of Brick
Dashboards report data as expected.

Comment 12 Martin Bukatovic 2018-06-26 08:57:00 UTC
Created attachment 1454596 [details]
screenshot 4: short term vs long term workload

Attaching screenshot 4 providing evidence for comment 11.

This means that the problem is with reporting IOPS for too light or too short
workloads, which are reported only sometimes.

Long-running, high-IOPS workloads seem to be reported fine (after a while).

Comment 15 Shubhendu Tripathi 2018-06-28 07:44:28 UTC
Created attachment 1455203 [details]
IOPS shooting up while writing small no of smaller files to the volume mount

Comment 18 Shubhendu Tripathi 2018-07-03 13:33:28 UTC
Created attachment 1456241 [details]
iops_while_no_of_small_file_being_written

Comment 19 Martin Bukatovic 2018-07-04 13:56:39 UTC
This will be verified with the following limitation in mind:

> So mismatch between the starting points of trends coming up in grafana
> dashboard is in-evitable I feel due to these technical limitations.

That said, I expect to see some improvement here as well. We will need to
write down a known issue/limitation notice based on the results of the
testing.

Comment 23 Martin Bukatovic 2018-08-15 17:26:05 UTC
Testing with
============

[root@mbukatov-usm1-server ~]# rpm -qa | grep tendrl | sort
tendrl-ansible-1.6.3-6.el7rhgs.noarch
tendrl-api-1.6.3-5.el7rhgs.noarch
tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-10.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-10.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-10.el7rhgs.noarch

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep tendrl | sort
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-9.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep gluster | sort
glusterfs-3.12.2-16.el7rhgs.x86_64
glusterfs-api-3.12.2-16.el7rhgs.x86_64
glusterfs-cli-3.12.2-16.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-16.el7rhgs.x86_64
glusterfs-events-3.12.2-16.el7rhgs.x86_64
glusterfs-fuse-3.12.2-16.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-16.el7rhgs.x86_64
glusterfs-libs-3.12.2-16.el7rhgs.x86_64
glusterfs-rdma-3.12.2-16.el7rhgs.x86_64
glusterfs-server-3.12.2-16.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
python2-gluster-3.12.2-16.el7rhgs.x86_64
tendrl-gluster-integration-1.6.3-9.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

Results
=======

When I perform the steps to reproduce, I see IOPS data reported on the Brick
dashboards for all bricks of the beta volume for this light workload (I checked
the IOPS charts for all 18 bricks of the "beta" volume, which is arbiter 2 plus
1x2).

Comment 25 errata-xmlrpc 2018-09-04 07:07:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616

