Description of problem
======================

I noticed this problem when reporting BZ 1594342 and testing BZ 1581736:

When I copy a file which is so large that it doesn't fit into its brick to a
gluster volume, the IOPS chart on the cluster dashboard shows no new data
points between the moment the file copy fails and the moment the file is
deleted from the volume.

Version-Release number
======================

tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch

[root@mbukatov-usm1-server ~]# rpm -qa | grep tendrl | sort
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-4.el7rhgs.noarch

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep tendrl | sort
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep gluster | sort
glusterfs-3.12.2-12.el7rhgs.x86_64
glusterfs-api-3.12.2-12.el7rhgs.x86_64
glusterfs-cli-3.12.2-12.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-12.el7rhgs.x86_64
glusterfs-events-3.12.2-12.el7rhgs.x86_64
glusterfs-fuse-3.12.2-12.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-12.el7rhgs.x86_64
glusterfs-libs-3.12.2-12.el7rhgs.x86_64
glusterfs-rdma-3.12.2-12.el7rhgs.x86_64
glusterfs-server-3.12.2-12.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.5.x86_64
python2-gluster-3.12.2-12.el7rhgs.x86_64
tendrl-gluster-integration-1.6.3-5.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

How reproducible
================

100%

Steps to Reproduce
==================

1. prepare gluster trusted storage pool with arbiter 2 plus 1x2 volume[1]
2. install WA using tendrl-ansible
3. mount the volume on a dedicated client machine[2]
4. on the client, copy a large tarball (so large that it doesn't fit on any
   brick) into the volume while observing the IOPS chart on the default
   Cluster Dashboard (a command-level sketch of steps 4-7 follows below,
   after the expected results)
5. wait for the copy operation to fail because of insufficient space on the
   brick
6. wait about half an hour, observing the dashboard
7. on the client, remove the large tarball from the volume
8. wait about 15 minutes, observing the dashboard

[1] https://github.com/usmqe/usmqe-setup/blob/12fd9d3ea172cdbf56a808aa5161eefc3346ec1a/gdeploy_config/volume_beta_arbiter_2_plus_1x2.create.conf
[2] https://github.com/usmqe/usmqe-setup/blob/12fd9d3ea172cdbf56a808aa5161eefc3346ec1a/gdeploy_config/volume_beta_arbiter_2_plus_1x2.mount.conf

Actual results
==============

When the file copy operation fails (step 5), data points in the IOPS chart on
the dashboard stop appearing, while the other charts are still updated. When
the file is removed from the volume, data points are updated again (zero
values are expected after that, if you don't do anything else with the
storage machines).

Expected results
================

IOPS data points are reported the whole time.
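For reference, a minimal command-level sketch of steps 4-7 as run from the
client. The mount point path is an assumption based on the directory name
visible in the later transcripts, and the file size is just an example; it
only needs to be larger than the smallest brick of the volume:

```
# on the client: write a file larger than the smallest brick of the volume
# (mount point path and size are assumptions from my setup; adjust as needed)
cd /mnt/volume_beta_arbiter_2_plus_1x2
dd if=/dev/zero of=zero90GBfile bs=100M count=900

# dd is expected to fail once a brick runs out of space; keep the Cluster
# Dashboard open for ~30 minutes, then remove the file and keep watching
# for another ~15 minutes
rm -f zero90GBfile
```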
Additional info
===============

It's possible that this problem is caused by BZ 1594342, but since WA gets
the IOPS information from the profiling feature, which still works fine, I
don't see any reason for WA not to report any values. I'm going to attach
the output of the `gluster v profile volume_beta_arbiter_2_plus_1x2 info`
command.
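For completeness, a minimal sketch of how such profile output can be
collected on one of the storage servers (the volume name comes from the
setup above; profiling is assumed to be already enabled by WA, otherwise
the start command is needed first):

```
# make sure profiling is enabled on the volume (WA normally enables it itself)
gluster volume profile volume_beta_arbiter_2_plus_1x2 start

# capture the per-brick FOP statistics into a timestamped file
gluster volume profile volume_beta_arbiter_2_plus_1x2 info \
    > profile_info_$(date +%s).log
```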
Created attachment 1453842 [details]
screenshot 1: cluster dashboard with IOPS chart without new data points

Note that there is a spike in the IOPS chart (this is the copying of the
large file), which suddenly stops (when the operation fails), while the
other charts are still updated.
Created attachment 1453844 [details]
tarball with output of the gluster volume profile info command before, during and after the problem
Created attachment 1456229 [details]
zero_iops_after_failed_data_on_volume_mount
With the latest changes I see both brick utilization and IOPS increasing in
the Grafana graphs. Once copying of the huge file fails on the brick, the
utilization stays at the same level and IOPS falls back to zero and remains
zero from then onwards. There are no longer any gaps (periods with no data
reported) in Grafana.
Testing with
============

[root@mbukatov-usm1-server ~]# rpm -qa | grep tendrl | sort
tendrl-ansible-1.6.3-6.el7rhgs.noarch
tendrl-api-1.6.3-5.el7rhgs.noarch
tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-10.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-10.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-10.el7rhgs.noarch

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep tendrl | sort
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-9.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep gluster | sort
glusterfs-3.12.2-16.el7rhgs.x86_64
glusterfs-api-3.12.2-16.el7rhgs.x86_64
glusterfs-cli-3.12.2-16.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-16.el7rhgs.x86_64
glusterfs-events-3.12.2-16.el7rhgs.x86_64
glusterfs-fuse-3.12.2-16.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-16.el7rhgs.x86_64
glusterfs-libs-3.12.2-16.el7rhgs.x86_64
glusterfs-rdma-3.12.2-16.el7rhgs.x86_64
glusterfs-server-3.12.2-16.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
python2-gluster-3.12.2-16.el7rhgs.x86_64
tendrl-gluster-integration-1.6.3-9.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

Results
=======

I no longer see the original problem on the gluster side, so this can't be
reproduced with the latest RHGS 3.4 builds. See details (and a screenshot)
in a comment under the gluster BZ:

https://bugzilla.redhat.com/show_bug.cgi?id=1594342#c16
Testing with latest WA and a bit older GlusterFS (affected by BZ 1594342)
=========================================================================

Testing with GlusterFS builds affected by BZ 1594342 is necessary to
understand whether the problem has actually been fixed on the WA side, as
this BZ claims.

[root@mbukatov-usm1-server ~]# rpm -qa | grep tendrl
tendrl-api-1.6.3-5.el7rhgs.noarch
tendrl-ansible-1.6.3-7.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-ui-1.6.3-11.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-10.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-10.el7rhgs.noarch

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep tendrl | sort
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-10.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep gluster | sort
glusterfs-3.12.2-12.el7rhgs.x86_64
glusterfs-api-3.12.2-12.el7rhgs.x86_64
glusterfs-cli-3.12.2-12.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-12.el7rhgs.x86_64
glusterfs-events-3.12.2-12.el7rhgs.x86_64
glusterfs-fuse-3.12.2-12.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-12.el7rhgs.x86_64
glusterfs-libs-3.12.2-12.el7rhgs.x86_64
glusterfs-rdma-3.12.2-12.el7rhgs.x86_64
glusterfs-server-3.12.2-12.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.7.x86_64
python2-gluster-3.12.2-12.el7rhgs.x86_64
tendrl-gluster-integration-1.6.3-10.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

Results
=======

I followed the steps to reproduce, running into the gluster bug (BZ 1594342)
as expected:

```
[root@mbukatov-usm1-client volume_beta_arbiter_2_plus_1x2]# dd if=/dev/zero of=zero90GBfile bs=100M count=900
dd: error writing ‘zero90GBfile’: Transport endpoint is not connected
dd: closing output file ‘zero90GBfile’: Transport endpoint is not connected
[root@mbukatov-usm1-client volume_beta_arbiter_2_plus_1x2]# ls -lh
total 30G
-rw-r--r--. 1 root root 30G Aug 21 21:13 zero90GBfile
[root@mbukatov-usm1-client volume_beta_arbiter_2_plus_1x2]# rm zero90GBfile
rm: remove regular file ‘zero90GBfile’? y
[root@mbukatov-usm1-client volume_beta_arbiter_2_plus_1x2]#
```

```
[root@mbukatov-usm1-gl4 ~]# find /mnt/brick_* -name unlink | xargs tree -h
/mnt/brick_beta_arbiter_1/1/.glusterfs/unlink
`-- [ 30G]  768748bd-f970-48cb-99a1-32a7bb0c3213
/mnt/brick_beta_arbiter_2/2/.glusterfs/unlink
/mnt/brick_beta_arbiter_3/3/.glusterfs/unlink
/mnt/brick_gama_disperse_1/1/.glusterfs/unlink
/mnt/brick_gama_disperse_2/2/.glusterfs/unlink

0 directories, 1 file
```

And the WA dashboard stops reporting IOPS values when the data transfer
fails due to insufficient space on a brick. The expected result here is
that IOPS values are reported the whole time.
Created attachment 1477816 [details]
screenshot of cluster dashboard from verification (see comment 13)