Description of problem
======================

When I copy a file that is so large it doesn't fit into its brick to a gluster volume, I observe a few problems:

* when the copy operation fails due to insufficient space, the number of gluster connections starts to go up (as reported by `gluster volume status <volname> clients`)
* the growth stops when the file in question is deleted, but the number of connections then stays at that level (too high compared to the original state)
* when the file is deleted from the volume, the free space on the brick is not reclaimed, as the file is still available in the <brickdir>/.glusterfs/unlink/ directory

Version-Release number of selected component
============================================

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep gluster | sort
glusterfs-3.12.2-12.el7rhgs.x86_64
glusterfs-api-3.12.2-12.el7rhgs.x86_64
glusterfs-cli-3.12.2-12.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-12.el7rhgs.x86_64
glusterfs-events-3.12.2-12.el7rhgs.x86_64
glusterfs-fuse-3.12.2-12.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-12.el7rhgs.x86_64
glusterfs-libs-3.12.2-12.el7rhgs.x86_64
glusterfs-rdma-3.12.2-12.el7rhgs.x86_64
glusterfs-server-3.12.2-12.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.5.x86_64
python2-gluster-3.12.2-12.el7rhgs.x86_64
tendrl-gluster-integration-1.6.3-5.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

How reproducible
================

100%

Steps to Reproduce
==================

1. prepare 6 machines with at least 3 storage devices each for future bricks
2. prepare a dedicated client machine
3. prepare a dedicated machine for WA
4. install gluster, create a trusted storage pool with an arbiter 2 plus 1x2 volume [1]
5. mount the volume on the client [2]
6. install RHGS WA on the dedicated machine
7. import the trusted storage pool into WA for monitoring purposes
8. on the client, copy a large file into the volume; the file should be so large that it doesn't fit into the bricks you have in the volume
9. while the copying is in progress, monitor (see the monitoring sketch after this list):
   * output of `gluster volume status <volname> clients`
   * state dump from `gluster get-state detail`
   * number of connections on the cluster dashboard of WA
10. when the copy operation fails, wait about one hour, monitoring the same details as in the previous step
11. on the client, run `rm /mnt/<volname>/<bigfile>` to remove the big file from the volume
12. wait at least 15 minutes, monitoring the same details as in step 9

[1] https://github.com/usmqe/usmqe-setup/blob/12fd9d3ea172cdbf56a808aa5161eefc3346ec1a/gdeploy_config/volume_beta_arbiter_2_plus_1x2.create.conf
[2] https://github.com/usmqe/usmqe-setup/blob/12fd9d3ea172cdbf56a808aa5161eefc3346ec1a/gdeploy_config/volume_beta_arbiter_2_plus_1x2.mount.conf
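For reference, a minimal sketch of one way to capture the details monitored in steps 9-12 from a storage node (the volume name, output directory, and 60-second interval below are placeholders specific to this setup):

```
#!/bin/bash
# Periodically snapshot the client list and glusterd state while the copy runs.
VOLNAME=volume_beta_arbiter_2_plus_1x2   # placeholder, adjust to your volume
OUTDIR=/tmp/connection-monitor           # placeholder output directory
mkdir -p "$OUTDIR"
while true; do
    ts=$(date +%Y%m%d-%H%M%S)
    # per-brick list of connected clients, one snapshot per file
    gluster volume status "$VOLNAME" clients > "$OUTDIR/clients-$ts.txt"
    # get-state prints the path of the state dump it writes (typically under /var/run/gluster/)
    gluster get-state detail >> "$OUTDIR/get-state.log"
    sleep 60
done
```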
Actual results
==============

When copying of the large file fails (step #10) as expected:

```
[root@mbukatov-usm1-client tmp]# cp /tmp/enwiki-latest-pages-articles.xml.bz2 /mnt/volume_beta_arbiter_2_plus_1x2/
cp: error writing ‘/mnt/volume_beta_arbiter_2_plus_1x2/enwiki-latest-pages-articles.xml.bz2’: Transport endpoint is not connected
cp: failed to extend ‘/mnt/volume_beta_arbiter_2_plus_1x2/enwiki-latest-pages-articles.xml.bz2’: Transport endpoint is not connected
cp: failed to close ‘/mnt/volume_beta_arbiter_2_plus_1x2/enwiki-latest-pages-articles.xml.bz2’: Transport endpoint is not connected
[root@mbukatov-usm1-client tmp]# echo $?
1
```

Then the number of gluster client connections starts to grow (see screenshot 1 from the cluster dashboard of RHGS WA).

When the file is removed from the volume, the number of gluster client connections stops growing, but it remains at its current level (see attached screenshot 1). Moreover, the removed file is still available on the bricks, blocking free space there:

```
[root@mbukatov-usm1-gl1 ~]# ls -lh /mnt/brick_beta_arbiter_3/3/.glusterfs/unlink/
total 9.9G
-rw-r--r--. 1 root root 9.9G Jun 22 15:09 efcc3fbc-57a7-40fa-8a2f-e3932509338b
```

While on the client, there is no such file:

```
[root@mbukatov-usm1-client ~]# ls -l /mnt/volume_beta_arbiter_2_plus_1x2/
total 0
```

See attached outputs from the `gluster volume status <volname> clients` and `gluster get-state detail` commands:

* during copying
* after the copying fails
* after the file is removed

Expected results
================

The number of gluster client connections doesn't start to grow when the file fails to be copied to the volume because of insufficient free space on a brick.

I would expect the file to be actually removed when I delete it from the volume, but I may be wrong here as I'm not fully familiar with the unlink feature.

Additional info
===============

I have tried this a few times; once I let the client connections grow over night (see screenshot 2 for details).
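For anyone reproducing this, a rough sketch of how to check whether deleted files are still being retained under the bricks' .glusterfs/unlink directories (the /mnt/brick_* mount layout below is specific to this setup):

```
# Run on each storage node; any entries listed here are deleted files that are
# still holding space on the brick.
for d in /mnt/brick_*/*/.glusterfs/unlink; do
    echo "== $d"
    ls -lh "$d"
done
```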
Created attachment 1453784 [details] screenshot 1: Default dashboard with "Connection" chart
Created attachment 1453785 [details] screenshot 2: Connection chart (when running over night)
Screenshot 2 notes: the growth stops there when the file is removed.
Created attachment 1453787 [details] tarball with the output files from gluster commands
Thank you, Martin! I observe a lot of client connections coming in from 10.37.169.62. Is 10.37.169.62 a client machine? Would it be possible to attach sosreport from that machine?
10.37.169.62 is the storage machine mbukatov-usm1-gl3, which hosts the arbiter bricks for the volume:

```
[root@mbukatov-usm1-gl1 ~]# gluster volume info volume_beta_arbiter_2_plus_1x2 | grep gl3
Brick3: mbukatov-usm1-gl3:/mnt/brick_beta_arbiter_1/1 (arbiter)
Brick9: mbukatov-usm1-gl3:/mnt/brick_beta_arbiter_2/2 (arbiter)
Brick15: mbukatov-usm1-gl3:/mnt/brick_beta_arbiter_3/3 (arbiter)
```
Would it be possible to collect sosreport from the arbiter node? Thx!
Created attachment 1453838 [details]
screenshot 3: host dashboard for gl3 machine

While I'm waiting for the sosreport to finish, I noticed high memory usage on that node, which is likely caused by the large number of glfsheal processes:

```
# ps aux | grep glfsheal | wc -l
121
```

The memory spike clearly correlates with the failed upload of the big file, as can be seen in screenshot 3 from the WA host dashboard attached.
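In case it is useful, a quick sketch for listing the leftover glfsheal processes together with their start time and resident memory (standard ps fields; nothing here is gluster-specific):

```
# List glfsheal processes with start time and resident memory (KiB), oldest first;
# the [g] in the pattern keeps the grep process itself out of the output.
ps -eo pid,lstart,rss,cmd --sort=start_time | grep '[g]lfsheal'
```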
@Ravi, can you please check and comment, as the number of glfsheal processes is going up in this case?
glfsheal is a binary that is launched when you run 'gluster volume heal $VOLNAME info'. It is basically a client process (AFR + protocol/client translator) that connects to the bricks of $VOLNAME, gets the list of files that need heal, prints it to stdout and then terminates (thus closing the connections it established with the bricks).

comment #8 shows 121 glfsheal processes running, which probably means the heal-info command was run multiple (121) times and each time the process did not terminate successfully. If this is not a transient state, we would need to find out if and why 'gluster volume heal $VOLNAME info' is not completing successfully.

Note: I'm currently engaged in RHGS stop gap related work and can take a look at the setup next week (not clearing the need-info until then). In the meantime, if anyone wants to take a stab at this, you can run 'glfsheal $VOLNAME' and see if and why it's hung using gdb, looking at the log file (/var/log/glusterfs/glfsheal-$VOLNAME.log), etc.
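A minimal sketch of the debugging steps suggested above, assuming glfsheal can be invoked as shown in the comment (the volume name and the 300-second timeout are placeholders):

```
VOLNAME=volume_beta_arbiter_2_plus_1x2
# run the helper binary directly and see whether it terminates on its own
timeout 300 glfsheal "$VOLNAME"; echo "exit status: $?"
# if stuck processes remain, grab thread backtraces from the oldest one
PID=$(pgrep -o -f "glfsheal $VOLNAME")
gdb -p "$PID" -batch -ex 'thread apply all bt'
# and check the client-side heal log for errors
tail -n 100 "/var/log/glusterfs/glfsheal-$VOLNAME.log"
```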
Created attachment 1476204 [details]
screenshot 4: retrying with recent gluster build, with fix for heal-info hang

(In reply to Ravishankar N from comment #14)
> There was a recent fix for heal-info hang
> https://bugzilla.redhat.com/show_bug.cgi?id=1597654#c3 which should be
> available in the next downstream build of rhgs-3.4.0 (v3.12.2-14?). It might
> be worth seeing if the issue is reproducible with that build.

I retried the scenario during testing of the related RHGS WA BZ 1594383 and I no longer see the problem, so it seems that this problem has been addressed by the recent heal-info fix as well. See screenshot 4.

Version of gluster used:

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep gluster | sort
glusterfs-3.12.2-16.el7rhgs.x86_64
glusterfs-api-3.12.2-16.el7rhgs.x86_64
glusterfs-cli-3.12.2-16.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-16.el7rhgs.x86_64
glusterfs-events-3.12.2-16.el7rhgs.x86_64
glusterfs-fuse-3.12.2-16.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-16.el7rhgs.x86_64
glusterfs-libs-3.12.2-16.el7rhgs.x86_64
glusterfs-rdma-3.12.2-16.el7rhgs.x86_64
glusterfs-server-3.12.2-16.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
python2-gluster-3.12.2-16.el7rhgs.x86_64
tendrl-gluster-integration-1.6.3-9.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
Additional details
==================

This is more of a note about RHGS WA and the data reported on the Brick Dashboard for this use case; nothing described here is a bug.

While the file is no longer stuck in the .glusterfs/unlink directory:

```
[root@mbukatov-usm1-gl1 ~]# find /mnt/brick_* -name unlink | xargs tree
/mnt/brick_beta_arbiter_1/1/.glusterfs/unlink
/mnt/brick_beta_arbiter_2/2/.glusterfs/unlink
/mnt/brick_beta_arbiter_3/3/.glusterfs/unlink
/mnt/brick_gama_disperse_1/1/.glusterfs/unlink
/mnt/brick_gama_disperse_2/2/.glusterfs/unlink

0 directories, 0 files
```

The underlying thin volume reports 99% utilization:

```
[root@mbukatov-usm1-gl1 ~]# lvs
  LV                   VG                 Attr       LSize  Pool                 Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lv_beta_arbiter_1    vg_beta_arbiter_1  Vwi-aot--- <9.95g pool_beta_arbiter_1          0.72
  pool_beta_arbiter_1  vg_beta_arbiter_1  twi-aot--- <9.95g                              0.72   0.11
  lv_beta_arbiter_2    vg_beta_arbiter_2  Vwi-aot--- <9.95g pool_beta_arbiter_2          0.74
  pool_beta_arbiter_2  vg_beta_arbiter_2  twi-aot--- <9.95g                              0.74   0.11
  lv_beta_arbiter_3    vg_beta_arbiter_3  Vwi-aot--- <9.95g pool_beta_arbiter_3          98.76
  pool_beta_arbiter_3  vg_beta_arbiter_3  twi-aot--- <9.95g                              98.76  2.43
  lv_gama_disperse_1   vg_gama_disperse_1 Vwi-aot--- <9.95g pool_gama_disperse_1         0.17
  pool_gama_disperse_1 vg_gama_disperse_1 twi-aot--- <9.95g                              0.17   0.09
  lv_gama_disperse_2   vg_gama_disperse_2 Vwi-aot--- 19.89g pool_gama_disperse_2         0.09
  pool_gama_disperse_2 vg_gama_disperse_2 twi-aot--- 19.89g                              0.09   0.06
```

But when I run `fstrim /mnt/brick_beta_arbiter_3`, the space is reclaimed:

```
[root@mbukatov-usm1-gl1 ~]# lvs
  LV                   VG                 Attr       LSize  Pool                 Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lv_beta_arbiter_1    vg_beta_arbiter_1  Vwi-aot--- <9.95g pool_beta_arbiter_1          0.72
  pool_beta_arbiter_1  vg_beta_arbiter_1  twi-aot--- <9.95g                              0.72   0.11
  lv_beta_arbiter_2    vg_beta_arbiter_2  Vwi-aot--- <9.95g pool_beta_arbiter_2          0.74
  pool_beta_arbiter_2  vg_beta_arbiter_2  twi-aot--- <9.95g                              0.74   0.11
  lv_beta_arbiter_3    vg_beta_arbiter_3  Vwi-aot--- <9.95g pool_beta_arbiter_3          0.70
  pool_beta_arbiter_3  vg_beta_arbiter_3  twi-aot--- <9.95g                              0.70   0.11
  lv_gama_disperse_1   vg_gama_disperse_1 Vwi-aot--- <9.95g pool_gama_disperse_1         0.17
  pool_gama_disperse_1 vg_gama_disperse_1 twi-aot--- <9.95g                              0.17   0.09
  lv_gama_disperse_2   vg_gama_disperse_2 Vwi-aot--- 19.89g pool_gama_disperse_2         0.09
  pool_gama_disperse_2 vg_gama_disperse_2 twi-aot--- 19.89g                              0.09   0.06
```

This is expected behavior, but I note it here because people could be concerned by RHGS WA reporting 99% utilization for the thin pool (panel "LVM Thin Pool Data Usage").
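For completeness, the reclamation step boils down to something like the following sketch (the /mnt/brick_* mount points reflect this particular setup):

```
# Discard unused blocks on every brick filesystem so the thin pool can reclaim them.
for mnt in /mnt/brick_*; do
    fstrim -v "$mnt"
done
# Verify that the thin pool data usage (Data%) dropped back down.
lvs -o lv_name,pool_lv,data_percent
```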