Description of problem
======================

I noticed this problem when reporting BZ 1594342 and testing BZ 1581736:

When I copy a file which is so large that it doesn't fit into its brick to a
gluster volume, the IOPS chart on the cluster dashboard shows no new data
points between the moment the file copy fails and the moment the file is
deleted from the volume.

Version-Release number
======================

tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch

[root@mbukatov-usm1-server ~]# rpm -qa | grep tendrl | sort
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-4.el7rhgs.noarch

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep tendrl | sort
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep gluster | sort
glusterfs-3.12.2-12.el7rhgs.x86_64
glusterfs-api-3.12.2-12.el7rhgs.x86_64
glusterfs-cli-3.12.2-12.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-12.el7rhgs.x86_64
glusterfs-events-3.12.2-12.el7rhgs.x86_64
glusterfs-fuse-3.12.2-12.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-12.el7rhgs.x86_64
glusterfs-libs-3.12.2-12.el7rhgs.x86_64
glusterfs-rdma-3.12.2-12.el7rhgs.x86_64
glusterfs-server-3.12.2-12.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.5.x86_64
python2-gluster-3.12.2-12.el7rhgs.x86_64
tendrl-gluster-integration-1.6.3-5.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

How reproducible
================

100%

Steps to Reproduce
==================

1. prepare gluster trusted storage pool with arbiter 2 plus 1x2 volume[1]
2. install WA using tendrl-ansible
3. mount the volume on a dedicated client machine[2]
4. on the client, copy a large tarball (so large that it doesn't fit on any
   brick) into the volume while observing the IOPS chart on the default
   Cluster Dashboard (a command-level sketch of steps 4-7 follows below,
   after the expected results)
5. wait for the copy operation to fail because of insufficient space on the
   brick
6. wait about half an hour, observing the dashboard
7. on the client, remove the large tarball from the volume
8. wait about 15 minutes, observing the dashboard

[1] https://github.com/usmqe/usmqe-setup/blob/12fd9d3ea172cdbf56a808aa5161eefc3346ec1a/gdeploy_config/volume_beta_arbiter_2_plus_1x2.create.conf
[2] https://github.com/usmqe/usmqe-setup/blob/12fd9d3ea172cdbf56a808aa5161eefc3346ec1a/gdeploy_config/volume_beta_arbiter_2_plus_1x2.mount.conf

Actual results
==============

When the file copy operation fails (step 5), data points in the IOPS chart on
the dashboard stop appearing, while the other charts are still updated. When
the file is removed from the volume, data points are updated again (zero
values are expected after that, if you don't do anything else with the
storage machines).

Expected results
================

IOPS data points are reported the whole time.
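For reference, a minimal command-level sketch of steps 4-7 as run from the
client. The mount point path is an assumption based on the directory name
visible in the later transcripts, and the file size is just an example; it
only needs to be larger than the smallest brick of the volume:

```
# on the client: write a file larger than the smallest brick of the volume
# (mount point path and size are assumptions from my setup; adjust as needed)
cd /mnt/volume_beta_arbiter_2_plus_1x2
dd if=/dev/zero of=zero90GBfile bs=100M count=900

# dd is expected to fail once a brick runs out of space; keep the Cluster
# Dashboard open for ~30 minutes, then remove the file and keep watching
# for another ~15 minutes
rm -f zero90GBfile
```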
Additional info
===============

It's possible that this problem is caused by BZ 1594342, but since WA gets
the IOPS information from the profiling feature, which still works fine, I
don't see any reason for WA not to report any values. I'm going to attach
the output of the `gluster v profile volume_beta_arbiter_2_plus_1x2 info`
command.
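For completeness, a minimal sketch of how such profile output can be
collected on one of the storage servers (the volume name comes from the
setup above; profiling is assumed to be already enabled by WA, otherwise
the start command is needed first):

```
# make sure profiling is enabled on the volume (WA normally enables it itself)
gluster volume profile volume_beta_arbiter_2_plus_1x2 start

# capture the per-brick FOP statistics into a timestamped file
gluster volume profile volume_beta_arbiter_2_plus_1x2 info \
    > profile_info_$(date +%s).log
```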
Created attachment 1453842 [details]
screenshot 1: cluster dashboard with IOPS chart without new data points

Note that there is a spike in the IOPS chart (this is the copying of the
large file), which suddenly stops (when the operation fails), while the
other charts are still updated.
Created attachment 1453844 [details]
tarball with output of the gluster volume profile info command before, during and after the problem
Created attachment 1456229 [details]
zero_iops_after_failed_data_on_volume_mount
With the latest changes I see both brick utilization and IOPS increasing in
the Grafana graphs. Once copying of the huge file fails on the brick, the
utilization stays at the same level and IOPS falls back to zero and remains
zero from then onwards. There are no longer any gaps (periods with no data
reported) in Grafana.
Testing with
============

[root@mbukatov-usm1-server ~]# rpm -qa | grep tendrl | sort
tendrl-ansible-1.6.3-6.el7rhgs.noarch
tendrl-api-1.6.3-5.el7rhgs.noarch
tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-10.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-10.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-10.el7rhgs.noarch

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep tendrl | sort
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-9.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep gluster | sort
glusterfs-3.12.2-16.el7rhgs.x86_64
glusterfs-api-3.12.2-16.el7rhgs.x86_64
glusterfs-cli-3.12.2-16.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-16.el7rhgs.x86_64
glusterfs-events-3.12.2-16.el7rhgs.x86_64
glusterfs-fuse-3.12.2-16.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-16.el7rhgs.x86_64
glusterfs-libs-3.12.2-16.el7rhgs.x86_64
glusterfs-rdma-3.12.2-16.el7rhgs.x86_64
glusterfs-server-3.12.2-16.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
python2-gluster-3.12.2-16.el7rhgs.x86_64
tendrl-gluster-integration-1.6.3-9.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

Results
=======

I no longer see the original problem on the gluster side, so this can't be
reproduced with the latest RHGS 3.4 builds. See details (and a screenshot)
in a comment under the gluster BZ:

https://bugzilla.redhat.com/show_bug.cgi?id=1594342#c16
Testing with latest WA and a bit older GlusterFS (affected by BZ 1594342)
=========================================================================

Testing with GlusterFS builds affected by BZ 1594342 is necessary to
understand whether the problem has actually been fixed on the WA side, as
this BZ claims.

[root@mbukatov-usm1-server ~]# rpm -qa | grep tendrl
tendrl-api-1.6.3-5.el7rhgs.noarch
tendrl-ansible-1.6.3-7.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-ui-1.6.3-11.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-10.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-10.el7rhgs.noarch

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep tendrl | sort
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-10.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep gluster | sort
glusterfs-3.12.2-12.el7rhgs.x86_64
glusterfs-api-3.12.2-12.el7rhgs.x86_64
glusterfs-cli-3.12.2-12.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-12.el7rhgs.x86_64
glusterfs-events-3.12.2-12.el7rhgs.x86_64
glusterfs-fuse-3.12.2-12.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-12.el7rhgs.x86_64
glusterfs-libs-3.12.2-12.el7rhgs.x86_64
glusterfs-rdma-3.12.2-12.el7rhgs.x86_64
glusterfs-server-3.12.2-12.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.7.x86_64
python2-gluster-3.12.2-12.el7rhgs.x86_64
tendrl-gluster-integration-1.6.3-10.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

Results
=======

I followed the steps to reproduce, running into the gluster bug (BZ 1594342)
as expected:

```
[root@mbukatov-usm1-client volume_beta_arbiter_2_plus_1x2]# dd if=/dev/zero of=zero90GBfile bs=100M count=900
dd: error writing ‘zero90GBfile’: Transport endpoint is not connected
dd: closing output file ‘zero90GBfile’: Transport endpoint is not connected
[root@mbukatov-usm1-client volume_beta_arbiter_2_plus_1x2]# ls -lh
total 30G
-rw-r--r--. 1 root root 30G Aug 21 21:13 zero90GBfile
[root@mbukatov-usm1-client volume_beta_arbiter_2_plus_1x2]# rm zero90GBfile
rm: remove regular file ‘zero90GBfile’? y
[root@mbukatov-usm1-client volume_beta_arbiter_2_plus_1x2]#
```

```
[root@mbukatov-usm1-gl4 ~]# find /mnt/brick_* -name unlink | xargs tree -h
/mnt/brick_beta_arbiter_1/1/.glusterfs/unlink
`-- [ 30G]  768748bd-f970-48cb-99a1-32a7bb0c3213
/mnt/brick_beta_arbiter_2/2/.glusterfs/unlink
/mnt/brick_beta_arbiter_3/3/.glusterfs/unlink
/mnt/brick_gama_disperse_1/1/.glusterfs/unlink
/mnt/brick_gama_disperse_2/2/.glusterfs/unlink

0 directories, 1 file
```

And the WA dashboard stops reporting IOPS values when the data transfer
fails due to insufficient space on a brick. The expected result here is
that IOPS values are reported the whole time.
Created attachment 1477816 [details]
screenshot of cluster dashboard from verification (see comment 13)