Description of problem
======================
Most IOPS charts in the At a Glance section of Brick Dashboards show no data
when I extract about 10000 files from a Wikipedia dump onto an arbiter 2 plus
1x2 volume[1].

Version-Release number
======================
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch

[root@mbukatov-usm1-server ~]# rpm -qa | grep tendrl | sort
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-4.el7rhgs.noarch

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep tendrl | sort
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep gluster | sort
glusterfs-3.12.2-12.el7rhgs.x86_64
glusterfs-api-3.12.2-12.el7rhgs.x86_64
glusterfs-cli-3.12.2-12.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-12.el7rhgs.x86_64
glusterfs-events-3.12.2-12.el7rhgs.x86_64
glusterfs-fuse-3.12.2-12.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-12.el7rhgs.x86_64
glusterfs-libs-3.12.2-12.el7rhgs.x86_64
glusterfs-rdma-3.12.2-12.el7rhgs.x86_64
glusterfs-server-3.12.2-12.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.5.x86_64
python2-gluster-3.12.2-12.el7rhgs.x86_64
tendrl-gluster-integration-1.6.3-5.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

How reproducible
================
100% (the actual number of affected IOPS charts may differ)

Steps to Reproduce
==================
1. Prepare a gluster trusted storage pool with an arbiter 2 plus 1x2 volume[1]
2. Install WA using tendrl-ansible
3. Mount the volume on a dedicated client machine[2]
4. On the client, download the enwiki-latest-pages-articles.xml.bz2 file[3]
5. On the client, extract the dump into the volume using the
   wiki-export-split.py script (attached):

~~~
# cd /mnt/volume
# bzcat /tmp/enwiki-latest-pages-articles.xml.bz2 | wiki-export-split.py --noredir --filenames=sha1 --sha1sum=wikipages.sha1 --max-files=10000
~~~

Note: This will extract about 10000 files with filenames based on the sha1
checksum of their content (eg. tw5d0lzutwvkjgp77kh7vdbg8h29a8p.wikitext), so
that every brick of the volume is expected to host some files in the end
(see the naming sketch after these steps).

Note: you can also download the dump on the fly like this if you don't have
enough space on the client machine:

~~~
# curl http://mirror.example.com/enwiki-latest-pages-articles.xml.bz2 | bzcat | ./wiki-export-split.py --noredir --filenames=sha1 --sha1sum=wikipages.sha1 --max-files=10000
~~~

6. Wait for the extraction to complete.
7. Check IOPS dashboards for all bricks of the volume (eg. for 6 machines with
   an arbiter 2 plus 1x2 volume, there are 18 bricks).
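For illustration only, here is a minimal sketch of the content-based naming
idea behind the attached wiki-export-split.py script. The exact encoding used
by the script may differ (the hex digest and the .wikitext suffix below are
assumptions); the point is that names derived from the sha1 of each page's
content are effectively random, so gluster's DHT spreads the files roughly
evenly across all bricks:

```
# Sketch of content-addressed file naming (assumed encoding, for illustration).
import hashlib

def filename_for(page_text):
    """Derive a filename from the sha1 checksum of the page content."""
    digest = hashlib.sha1(page_text.encode("utf-8")).hexdigest()
    return digest + ".wikitext"

if __name__ == "__main__":
    # Names depend only on the content hash, so they distribute uniformly
    # over the DHT hash ranges of the volume's bricks.
    print(filename_for("== Example wiki page =="))
```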
[1] https://github.com/usmqe/usmqe-setup/blob/12fd9d3ea172cdbf56a808aa5161eefc3346ec1a/gdeploy_config/volume_beta_arbiter_2_plus_1x2.create.conf
[2] https://github.com/usmqe/usmqe-setup/blob/12fd9d3ea172cdbf56a808aa5161eefc3346ec1a/gdeploy_config/volume_beta_arbiter_2_plus_1x2.mount.conf
[3] use the internal mirror linked in a private comment below, original source:
    https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Actual results
==============
In my case, I see IOPS data for only 4 bricks out of 18, with one of those
charts reporting only part of the datapoints. See:

* screenshot 1 with an IOPS chart showing data (note: this is an arbiter brick)
* screenshot 2 with an IOPS chart reporting zeroes while non-zero values are
  expected (compare the IOPS chart with capacity utilization, which reports an
  increase)
* screenshot 3 with an IOPS chart without any data for the given time frame
  (while non-zero values are expected; compare with the increase of capacity
  utilization)

Expected results
================
All IOPS charts for bricks of the volume report some IOPS data for the whole
time (the files are named so that every brick hosts some files), so that the
IOPS spike is visible for all affected bricks.
Reported during testing of BZ 1581736.
Created attachment 1454405 [details] wiki export split script
Created attachment 1454408 [details] screenshot 1 (this looks ok)
Created attachment 1454409 [details] screenshot 2: zeroes reported
Created attachment 1454411 [details] screenshot 3: no data reported at all
I verified that there are indeed data on the bricks for which no IOPS are
reported (as expected), eg. on the brick from screenshot 2:

```
[root@mbukatov-usm1-gl1 ~]# ls /mnt/brick_beta_arbiter_1/1 | wc -l
3005
```
@Martin, can you please check if the extracted files are present on the other
machines? It might be the case that all the files were extracted on the same
machine.
(In reply to gowtham from comment #9)
> @Martin can you please check if the extracted files are present in the other
> machines. Might be the case where the all the files are extracted in same
> machine.

I'm quite sure that all bricks were utilized: I checked via ssh on a few
machines (as noted in comment 7 for the gl1 machine) and I checked the
utilization charts of all bricks, both in WA (Brick Details page of the
volume) and in Grafana in all dashboards. The point of extracting 10 000
files named using the sha1 of their content is to achieve a uniform
allocation of files across all bricks.
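For reference, the per-brick check from comment 7 could be repeated on every
storage node with a small helper like the sketch below. The node names and
brick mount paths are assumptions based on this particular setup (only gl1 and
/mnt/brick_beta_arbiter_1/1 appear above), and passwordless ssh from the
machine running the script is assumed:

```
# Hypothetical helper to count files on every brick of the volume over ssh.
# Hostnames and brick paths are assumptions for this setup; adjust as needed.
import subprocess

HOSTS = ["mbukatov-usm1-gl%d" % i for i in range(1, 7)]   # assumed node names
BRICK_DIRS = ["/mnt/brick_beta_arbiter_1/1",              # assumed brick paths
              "/mnt/brick_beta_arbiter_2/2",
              "/mnt/brick_beta_arbiter_3/3"]

for host in HOSTS:
    for brick in BRICK_DIRS:
        # Run the same "ls | wc -l" check as in comment 7, but remotely.
        cmd = ["ssh", host, "ls %s | wc -l" % brick]
        count = subprocess.check_output(cmd).decode().strip()
        print("%s:%s %s files" % (host, brick, count))
```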
Additional Information
======================
I scheduled the same workload to run overnight, without limiting the number of
extracted files:

```
[root@mbukatov-usm1-client volume_beta_arbiter_2_plus_1x2]# bzcat /tmp/enwiki-latest-pages-articles.xml.bz2 | wiki-export-split.py --noredir --filenames=sha1 --sha1sum=wikipages.sha1
```

This means that new files and data were stored on the volume at the same rate
as described in the reproducer of this BZ, but for a much longer time period
(over 12 hours). And I noticed that *all IOPS charts* in the At a Glance
section of Brick Dashboards report data as expected.
Created attachment 1454596 [details] screenshot 4: short term vs long term workload

Attaching screenshot 4 as evidence for comment 11. This means that the problem
is with reporting IOPS for too light or too short workloads, which are reported
only sometimes. Long-running and high-IOPS workloads seem to be reported fine
(after a while).
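One way to confirm whether such gaps come from missing datapoints rather than
from Grafana rendering would be to query the Graphite render API that backs the
dashboards directly. This is only a sketch: the host, port and metric path
below are assumptions/placeholders, and the real Tendrl metric name for brick
IOPS has to be looked up in the Graphite metric tree first:

```
# Hypothetical check for gaps in a brick IOPS series via the Graphite render
# API (Python 3). Host, port and metric path are placeholders, not the real
# Tendrl names.
import json
from urllib.request import urlopen

GRAPHITE = "http://mbukatov-usm1-server:10080"   # assumed Graphite host:port
TARGET = "REPLACE.with.real.brick.iops.metric"   # placeholder metric path

url = "%s/render?target=%s&from=-2h&format=json" % (GRAPHITE, TARGET)
series = json.loads(urlopen(url).read().decode("utf-8"))
for metric in series:
    # Graphite returns [value, timestamp] pairs; value is None for gaps.
    missing = sum(1 for value, _ts in metric["datapoints"] if value is None)
    print("%s: %d of %d datapoints missing"
          % (metric["target"], missing, len(metric["datapoints"])))
```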
Created attachment 1455203 [details] IOPS shooting up while writing a small number of small files to the volume mount
Created attachment 1456241 [details] iops_while_no_of_small_file_being_written
Will be verified with this limitation in mind:

> So mismatch between the starting points of trends coming up in grafana
> dashboard is in-evitable I feel due to these technical limitations.

That said, I expect to see some improvement here as well. We will need to write
down a known issue/limitation notice based on the results of the testing.
Testing with
============
[root@mbukatov-usm1-server ~]# rpm -qa | grep tendrl | sort
tendrl-ansible-1.6.3-6.el7rhgs.noarch
tendrl-api-1.6.3-5.el7rhgs.noarch
tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-10.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-10.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-10.el7rhgs.noarch

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep tendrl | sort
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-9.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep gluster | sort
glusterfs-3.12.2-16.el7rhgs.x86_64
glusterfs-api-3.12.2-16.el7rhgs.x86_64
glusterfs-cli-3.12.2-16.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-16.el7rhgs.x86_64
glusterfs-events-3.12.2-16.el7rhgs.x86_64
glusterfs-fuse-3.12.2-16.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-16.el7rhgs.x86_64
glusterfs-libs-3.12.2-16.el7rhgs.x86_64
glusterfs-rdma-3.12.2-16.el7rhgs.x86_64
glusterfs-server-3.12.2-16.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
python2-gluster-3.12.2-16.el7rhgs.x86_64
tendrl-gluster-integration-1.6.3-9.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

Results
=======
When I perform the steps to reproduce, I see IOPS data reported on Brick
dashboards for all bricks of the beta volume for this light workload (I checked
IOPS charts for all 18 bricks of the "beta" volume, which is arbiter 2+1x2).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2616