Bug 1594383 - IOPS chart shows no data points after writing of a file fails on unavailable free space on a brick
Summary: IOPS chart shows no data points after writing of a file fails on unavailable free space on a brick
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: web-admin-tendrl-monitoring-integration
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Shubhendu Tripathi
QA Contact: Martin Bukatovic
URL:
Whiteboard:
Depends On:
Blocks: 1503137
 
Reported: 2018-06-22 19:19 UTC by Martin Bukatovic
Modified: 2018-09-18 08:02 UTC
7 users

Fixed In Version: tendrl-monitoring-integration-1.6.3-6.el7rhgs
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-08-23 11:58:11 UTC
Embargoed:


Attachments
screenshot 1: cluster dashboard with IOPS chart without new data points (168.76 KB, image/png)
2018-06-22 19:28 UTC, Martin Bukatovic
tarball with output of gluster volume profile info command before, during and after the problem (19.78 KB, application/x-gzip)
2018-06-22 19:40 UTC, Martin Bukatovic
zero_iops_after_failed_data_on_volume_mount (32.20 KB, image/png)
2018-07-03 12:17 UTC, Shubhendu Tripathi
screenshot of cluster dashboard from verification (see comment 13) (228.57 KB, image/png)
2018-08-22 08:36 UTC, Martin Bukatovic


Links
System ID Private Priority Status Summary Last Updated
Github Tendrl monitoring-integration issues 504 0 None None None 2018-07-03 13:50:02 UTC
Github Tendrl monitoring-integration issues 505 0 None None None 2018-07-03 13:50:24 UTC
Red Hat Bugzilla 1581736 0 unspecified CLOSED IOPS metric is not intuitive enough 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1594342 0 unspecified CLOSED growth of client connections when writing of a file fails on unavailable free space on a brick, file remains on the bric... 2021-02-22 00:41:40 UTC

Internal Links: 1581736 1594342

Description Martin Bukatovic 2018-06-22 19:19:50 UTC
Description of problem
======================

I noticed this problem when reporting BZ 1594342 and testing BZ 1581736:

When I copy a file to a gluster volume that is so large it doesn't fit on its
brick, the IOPS chart on the cluster dashboard shows no new data points between
the moment the file copy fails and the moment the file is deleted from the
volume.

Version-Release number
=======================

tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch

[root@mbukatov-usm1-server ~]# rpm -qa | grep tendrl | sort
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-4.el7rhgs.noarch

[root@mbukatov-usm1-gl1 ~]#  rpm -qa | grep tendrl | sort
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep gluster | sort
glusterfs-3.12.2-12.el7rhgs.x86_64
glusterfs-api-3.12.2-12.el7rhgs.x86_64
glusterfs-cli-3.12.2-12.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-12.el7rhgs.x86_64
glusterfs-events-3.12.2-12.el7rhgs.x86_64
glusterfs-fuse-3.12.2-12.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-12.el7rhgs.x86_64
glusterfs-libs-3.12.2-12.el7rhgs.x86_64
glusterfs-rdma-3.12.2-12.el7rhgs.x86_64
glusterfs-server-3.12.2-12.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.5.x86_64
python2-gluster-3.12.2-12.el7rhgs.x86_64
tendrl-gluster-integration-1.6.3-5.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

How reproducible
================

100%

Steps to Reproduce
==================

1. prepare gluster trusted storage pool with arbiter 2 plus 1x2 volume[1]
2. install WA using tendrl-ansible
3. mount the volume on a dedicated client machine[2]
4. on the client, copy a large tarball (one that doesn't fit on any brick)
   into the volume while observing the IOPS chart on the default Cluster
   Dashboard (a rough shell sketch of steps 4-8 follows the footnote links
   below)
5. wait for the copy operation to fail because of insufficient space
   on the brick
6. wait about half an hour, observing the dashboard
7. on the client, remove the large tarball from the volume
8. wait about 15 minutes, observing the dashboard

[1] https://github.com/usmqe/usmqe-setup/blob/12fd9d3ea172cdbf56a808aa5161eefc3346ec1a/gdeploy_config/volume_beta_arbiter_2_plus_1x2.create.conf
[2] https://github.com/usmqe/usmqe-setup/blob/12fd9d3ea172cdbf56a808aa5161eefc3346ec1a/gdeploy_config/volume_beta_arbiter_2_plus_1x2.mount.conf
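
For illustration, a rough shell sketch of steps 4-8 as run on the client
(the mount path and the dd size are assumptions; any file larger than the
free space of the smallest data brick will do):

```
# run on the client; the mount point below is an assumed path
cd /mnt/volume_beta_arbiter_2_plus_1x2

# steps 4-5: write a file larger than any data brick and wait for the
# copy to fail once the brick runs out of free space
dd if=/dev/zero of=zero90GBfile bs=100M count=900

# step 6: keep the IOPS chart on the Cluster Dashboard open for ~30 minutes

# steps 7-8: remove the file and keep watching for ~15 more minutes
rm -f zero90GBfile
```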

Actual results
==============

When the file copy operation fails (step 5), data points in the IOPS chart on
the dashboard stop appearing, while the other charts are still updated.

When the file is removed from the volume, data points are updated again (zero
values are expected after that, if you don't do anything else with the storage
machines).

Expected results
================

IOPS data points are reported the whole time.

Additional info
===============

It's possible that this problem is caused by BZ 1594342, but since WA
gets information about IOPS from the profiling feature, which still works
fine, I don't see any reason for WA not to report any values.

I'm going to attach outputs from `gluster v profile volume_beta_arbiter_2_plus_1x2 info` command.
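
As a rough sketch (the grep pattern is only an assumption about the profile
output layout), this is the kind of check showing that profiling still
reports fop counts while the IOPS chart stays flat:

```
# requires volume profiling to be enabled (WA relies on it for IOPS)
gluster volume profile volume_beta_arbiter_2_plus_1x2 info \
    | grep -E 'Brick:|Interval [0-9]+ Stats|No\. of calls|WRITE'
```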

Comment 1 Martin Bukatovic 2018-06-22 19:28:22 UTC
Created attachment 1453842 [details]
screenshot 1: cluster dashboard with IOPS chart without new data points

Note that there is a spike in the IOPS chart (this is the copying of the large
file), which suddenly stops (when the operation fails), while the other charts
are still updated.

Comment 2 Martin Bukatovic 2018-06-22 19:40:06 UTC
Created attachment 1453844 [details]
tarball with output of gluster volume profile info command before, during and after the problem

Comment 5 Shubhendu Tripathi 2018-07-03 12:17:04 UTC
Created attachment 1456229 [details]
zero_iops_after_failed_data_on_volume_mount

Comment 10 Shubhendu Tripathi 2018-07-09 12:53:42 UTC
With the latest changes I see both brick utilization and IOPS graphs in Grafana showing an increase. Once copying of the huge file fails on the brick, the utilization remains the same and IOPS falls back to zero and stays at zero from then onwards. There are no longer any gaps with no data reported in Grafana.

Comment 12 Martin Bukatovic 2018-08-15 16:13:07 UTC
Testing with
============

[root@mbukatov-usm1-server ~]# rpm -qa | grep tendrl | sort
tendrl-ansible-1.6.3-6.el7rhgs.noarch
tendrl-api-1.6.3-5.el7rhgs.noarch
tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-10.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-10.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-10.el7rhgs.noarch

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep tendrl | sort
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-9.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep gluster | sort
glusterfs-3.12.2-16.el7rhgs.x86_64
glusterfs-api-3.12.2-16.el7rhgs.x86_64
glusterfs-cli-3.12.2-16.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-16.el7rhgs.x86_64
glusterfs-events-3.12.2-16.el7rhgs.x86_64
glusterfs-fuse-3.12.2-16.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-16.el7rhgs.x86_64
glusterfs-libs-3.12.2-16.el7rhgs.x86_64
glusterfs-rdma-3.12.2-16.el7rhgs.x86_64
glusterfs-server-3.12.2-16.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
python2-gluster-3.12.2-16.el7rhgs.x86_64
tendrl-gluster-integration-1.6.3-9.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

Results
=======

I no longer see the original problem on the gluster side, so this can't be
reproduced with the latest RHGS 3.4 builds.

See details (and a screenshot) in a comment under the gluster BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1594342#c16

Comment 13 Martin Bukatovic 2018-08-22 08:33:53 UTC
Testing with latest WA and a bit older GlusterFS (affected by BZ 1594342)
=======================================================================

Testing with GlusterFS builds affected by BZ 1594342 is necessary to understand
whether the problem has actually been fixed on the WA side, as this BZ claims.

[root@mbukatov-usm1-server ~]# rpm -qa | grep tendrl
tendrl-api-1.6.3-5.el7rhgs.noarch
tendrl-ansible-1.6.3-7.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-ui-1.6.3-11.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-10.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-10.el7rhgs.noarch

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep tendrl | sort
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-10.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep gluster | sort
glusterfs-3.12.2-12.el7rhgs.x86_64
glusterfs-api-3.12.2-12.el7rhgs.x86_64
glusterfs-cli-3.12.2-12.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-12.el7rhgs.x86_64
glusterfs-events-3.12.2-12.el7rhgs.x86_64
glusterfs-fuse-3.12.2-12.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-12.el7rhgs.x86_64
glusterfs-libs-3.12.2-12.el7rhgs.x86_64
glusterfs-rdma-3.12.2-12.el7rhgs.x86_64
glusterfs-server-3.12.2-12.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.7.x86_64
python2-gluster-3.12.2-12.el7rhgs.x86_64
tendrl-gluster-integration-1.6.3-10.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

Results
=======

I followed the steps to reproduce, running into the gluster bug (BZ 1594342) as
expected:

```
[root@mbukatov-usm1-client volume_beta_arbiter_2_plus_1x2]# dd if=/dev/zero of=zero90GBfile bs=100M count=900                                                 
dd: error writing ‘zero90GBfile’: Transport endpoint is not connected
dd: closing output file ‘zero90GBfile’: Transport endpoint is not connected
[root@mbukatov-usm1-client volume_beta_arbiter_2_plus_1x2]# ls -lh
total 30G
-rw-r--r--. 1 root root 30G Aug 21 21:13 zero90GBfile
[root@mbukatov-usm1-client volume_beta_arbiter_2_plus_1x2]# rm zero90GBfile 
rm: remove regular file ‘zero90GBfile’? y
[root@mbukatov-usm1-client volume_beta_arbiter_2_plus_1x2]#
```

```
[root@mbukatov-usm1-gl4 ~]# find /mnt/brick_* -name unlink | xargs tree -h
/mnt/brick_beta_arbiter_1/1/.glusterfs/unlink
`-- [ 30G]  768748bd-f970-48cb-99a1-32a7bb0c3213
/mnt/brick_beta_arbiter_2/2/.glusterfs/unlink
/mnt/brick_beta_arbiter_3/3/.glusterfs/unlink
/mnt/brick_gama_disperse_1/1/.glusterfs/unlink
/mnt/brick_gama_disperse_2/2/.glusterfs/unlink

0 directories, 1 file
```

And the WA dashboard stops reporting IOPS values when the data transfer fails
due to insufficient space on a brick.

The expected result here is that IOPS values are reported during the whole time.

Comment 15 Martin Bukatovic 2018-08-22 08:36:11 UTC
Created attachment 1477816 [details]
screenshot of cluster dashboard from verification (see comment 13)

