Bug 1594899
| Summary: | Most IOPS charts in the At a Glance section of Brick Dashboards show no data for short or light workloads | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Martin Bukatovic <mbukatov> |
| Component: | web-admin-tendrl-monitoring-integration | Assignee: | Shubhendu Tripathi <shtripat> |
| Status: | CLOSED ERRATA | QA Contact: | Martin Bukatovic <mbukatov> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | rhgs-3.4 | CC: | nthomas, rhs-bugs, sankarshan, shtripat |
| Target Milestone: | --- | ||
| Target Release: | RHGS 3.4.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | tendrl-monitoring-integration-1.6.3-6.el7rhgs | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2018-09-04 07:07:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1503137 | ||
| Attachments: | |||
Description
Martin Bukatovic
2018-06-25 16:09:13 UTC
Reported while testing BZ 1581736.

Created attachment 1454405 [details]
wiki export split script
Created attachment 1454408 [details]
screenshot 1 (this looks ok)
Created attachment 1454409 [details]
screenshot 2: zeroes reported
Created attachment 1454411 [details]
screenshot 3: no data reported at all
I verified that data are present (as expected) on the bricks for which no IOPS are reported, e.g. the brick from screenshot 2:

```
[root@mbukatov-usm1-gl1 ~]# ls /mnt/brick_beta_arbiter_1/1 | wc -l
3005
```

@Martin, can you please check whether the extracted files are present on the other machines? It might be the case that all the files were extracted on the same machine.

(In reply to gowtham from comment #9)
> @Martin, can you please check whether the extracted files are present on the
> other machines? It might be the case that all the files were extracted on
> the same machine.

I'm quite sure that all bricks were utilized: I checked this via ssh on a few machines (as noted in comment 7 for the gl1 machine) and checked the utilization charts of all bricks both in WA (Brick Details page of the volume) and in Grafana in all dashboards. The point of extracting 10 000 files named using the sha1 of their content is to achieve uniform allocation of files across all bricks (a sketch of this approach follows the attachment list below).

Additional Information
======================

I scheduled the same workload to run overnight without limiting the number of extracted files:

```
[root@mbukatov-usm1-client volume_beta_arbiter_2_plus_1x2]# bzcat /tmp/enwiki-latest-pages-articles.xml.bz2 | wiki-export-split.py --noredir --filenames=sha1 --sha1sum=wikipages.sha1
```

This means that new files and data were stored on the volume at the same rate as described in the reproducer of this BZ, but over a much longer time period (more than 12 hours). And I noticed that *all IOPS charts* in the At a Glance section of Brick Dashboards report data as expected.

Created attachment 1454596 [details]
screenshot 4: short term vs long term workload

Attaching screenshot 4 as evidence for comment 11. This means that the problem is with reporting IOPS for workloads that are too light or too short, which are reported only sometimes. Long-running, high-IOPS workloads seem to be reported fine (after a while).

Created attachment 1455203 [details]
IOPS shooting up while writing a small number of small files to the volume mount
Created attachment 1456241 [details]
iops_while_no_of_small_file_being_written
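
To make the reproducer concrete: the workload writes each extracted page to the FUSE mount under a file name equal to the sha1 of its content, so Gluster's distribute (DHT) hashing spreads the files roughly evenly across all bricks. The following is a minimal, hypothetical sketch of that approach, not the attached wiki-export-split.py script; the mount path and the synthetic payloads are assumptions used only for illustration.

```python
#!/usr/bin/env python
# Minimal sketch (not the attached wiki-export-split.py): write files whose
# names are the sha1 hex digest of their content, so that Gluster DHT places
# them roughly uniformly across all bricks of the volume.
# Assumptions: /mnt/volume_beta is a FUSE mount of the tested volume, and the
# payloads below are synthetic stand-ins for the extracted wiki pages.
import hashlib
import os

MOUNT_POINT = "/mnt/volume_beta"   # placeholder mount path (assumption)
FILE_COUNT = 10000                 # matches the 10 000 files in the reproducer


def write_sha1_named_file(directory, payload):
    """Store payload in a file named by the sha1 of its content."""
    name = hashlib.sha1(payload).hexdigest()
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(payload)
    return path


if __name__ == "__main__":
    for i in range(FILE_COUNT):
        # Synthetic payload; the real workload writes extracted wiki pages.
        payload = ("page number %d\n" % i).encode("utf-8")
        write_sha1_named_file(MOUNT_POINT, payload)
```

Because sha1 digests are effectively uniformly distributed, every brick receives a comparable share of the writes, which is why missing IOPS data for an individual brick points at the monitoring stack rather than at the workload.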
Will be verified with this limitation in mind:
> So a mismatch between the starting points of trends coming up in the grafana
> dashboard is inevitable, I feel, due to these technical limitations.
That said, I expect to see some improvement here as well. We will need to
write down a known issue/limitation notice based on the results of the
testing.
Testing with
============

```
[root@mbukatov-usm1-server ~]# rpm -qa | grep tendrl | sort
tendrl-ansible-1.6.3-6.el7rhgs.noarch
tendrl-api-1.6.3-5.el7rhgs.noarch
tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-10.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-10.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-10.el7rhgs.noarch

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep tendrl | sort
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-9.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep gluster | sort
glusterfs-3.12.2-16.el7rhgs.x86_64
glusterfs-api-3.12.2-16.el7rhgs.x86_64
glusterfs-cli-3.12.2-16.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-16.el7rhgs.x86_64
glusterfs-events-3.12.2-16.el7rhgs.x86_64
glusterfs-fuse-3.12.2-16.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-16.el7rhgs.x86_64
glusterfs-libs-3.12.2-16.el7rhgs.x86_64
glusterfs-rdma-3.12.2-16.el7rhgs.x86_64
glusterfs-server-3.12.2-16.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
python2-gluster-3.12.2-16.el7rhgs.x86_64
tendrl-gluster-integration-1.6.3-9.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
```

Results
=======

When I perform the steps to reproduce, I see IOPS data reported on the Brick dashboards for all bricks of the beta volume for this light workload (I checked the IOPS charts for all 18 bricks of the "beta" volume, which is arbiter 2+1x2). A supplementary backend-level check is sketched at the end of this report.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616
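
As a supplementary, backend-level sanity check (not part of the verification steps above), one could query the Graphite render API behind the Grafana dashboards to confirm that a brick IOPS series contains recent data points. This is only a sketch: the host, port, and metric path below are placeholders/assumptions, not values taken from this bug report.

```python
#!/usr/bin/env python
# Hypothetical check: ask the Graphite render API (the data source behind the
# Grafana dashboards used by tendrl-monitoring-integration) whether a brick
# IOPS series has any non-null data points in the last hour.
# The host, port, and metric path are placeholders, not values from this BZ.
import json
import urllib2  # Python 2, matching the RHEL 7 environment in this report

GRAPHITE_URL = "http://mbukatov-usm1-server:10080/render"       # placeholder
METRIC = "tendrl.clusters.*.nodes.*.bricks.*.iops"              # placeholder


def has_recent_datapoints(metric, window="-1h"):
    """Return True if any series matching the metric has non-null points."""
    url = "%s?target=%s&from=%s&format=json" % (GRAPHITE_URL, metric, window)
    series = json.load(urllib2.urlopen(url))
    for s in series:
        # Each series carries a list of [value, timestamp] pairs.
        if any(value is not None for value, _timestamp in s["datapoints"]):
            return True
    return False


if __name__ == "__main__":
    print("IOPS data present: %s" % has_recent_datapoints(METRIC))
```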