Bug 1594342

Summary: growth of client connections when writing a file fails due to insufficient free space on a brick; the file remains on the brick even after removal from the volume
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Martin Bukatovic <mbukatov>
Component: glusterfs
Assignee: Amar Tumballi <atumball>
Status: CLOSED CURRENTRELEASE
QA Contact: Bala Konda Reddy M <bmekala>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: rhgs-3.4
CC: amukherj, ravishankar, rhinduja, rhs-bugs, sankarshan, shtripat, vbellur
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-10 07:51:40 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:

 * screenshot 1: Default dashboard with "Connection" chart
 * screenshot 2: Connection chart (when running over night)
 * tarball with the output files from gluster commands
 * screenshot 3: host dashboard for gl3 machine
 * screenshot 4: retrying with recent gluster build, with the fix for heal-info hang

Description Martin Bukatovic 2018-06-22 16:51:05 UTC
Description of problem
======================

When I copy a file, which is so large that it doesn't fit into its brick,
to a gluster volume, I observe a few problems:

 * when the copy operation fails due to insufficient space, the number of gluster
   client connections starts to go up (as reported by `gluster volume status
   <volname> clients`)

 * the growth stops when the file in question is deleted, but the number of
   connections then stays at that level (too high compared to the original state)

 * when the file is deleted from the volume, the free space on the brick is not
   reclaimed, as the file is still present in the <brickdir>/.glusterfs/unlink/
   directory (a quick check for this is sketched after this list)
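
A minimal sketch of such a check, run on the brick server (the brick path is an
example taken from this setup; adjust it to your layout):

```
# Example brick path; substitute the actual brick directory on the server.
BRICK=/mnt/brick_beta_arbiter_3/3

# Files removed from the volume but still held by gluster end up here.
ls -lh "$BRICK/.glusterfs/unlink/"

# Total space still pinned on this brick by such files.
du -sh "$BRICK/.glusterfs/unlink/"
```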

Version-Release number of selected component
============================================

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep gluster | sort
glusterfs-3.12.2-12.el7rhgs.x86_64
glusterfs-api-3.12.2-12.el7rhgs.x86_64
glusterfs-cli-3.12.2-12.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-12.el7rhgs.x86_64
glusterfs-events-3.12.2-12.el7rhgs.x86_64
glusterfs-fuse-3.12.2-12.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-12.el7rhgs.x86_64
glusterfs-libs-3.12.2-12.el7rhgs.x86_64
glusterfs-rdma-3.12.2-12.el7rhgs.x86_64
glusterfs-server-3.12.2-12.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.5.x86_64
python2-gluster-3.12.2-12.el7rhgs.x86_64
tendrl-gluster-integration-1.6.3-5.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

How reproducible
================

100%

Steps to Reproduce
==================

1. prepare 6 machines with at least 3 storage devices on each for future bricks
2. prepare dedicated client machine
3. prepare dedicated machine for WA
4. install gluster, create trusted storage pool with arbiter 2 plus 1x2 volume[1]
5. mount the volume on the client[2]
6. install RHGS WA on the dedicated machine
7. import trusted storage pool into WA for monitoring purposes

8. on the client, copy a large file into the volume; the file should be so large
   that it doesn't fit into the bricks you have in the volume

9. while the copying is in progress, monitor (a monitoring sketch is given below):

    * output of `gluster volume status <volname> clients`
    * state dump from `gluster get-state detail`
    * number of connections on the cluster dashboard of WA

10. when the copy operation fails, wait about one hour, monitoring
    the same details as in the previous step

11. on the client, run rm /mnt/<volname>/<bigfile> to remove the big file
    from the volume

12. wait at least 15 minutes, monitoring the same details as in step 9

[1] https://github.com/usmqe/usmqe-setup/blob/12fd9d3ea172cdbf56a808aa5161eefc3346ec1a/gdeploy_config/volume_beta_arbiter_2_plus_1x2.create.conf
[2] https://github.com/usmqe/usmqe-setup/blob/12fd9d3ea172cdbf56a808aa5161eefc3346ec1a/gdeploy_config/volume_beta_arbiter_2_plus_1x2.mount.conf
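
For steps 9-12, a minimal sketch of how the CLI side of the monitoring can be
captured (the volume name is from this setup; the /tmp paths and the 60 second
interval are arbitrary, and the WA dashboard provides the connection chart itself):

```
# Example volume name; replace with the volume under test.
VOLNAME=volume_beta_arbiter_2_plus_1x2

# Snapshot the client list and the glusterd state dump once a minute.
while true; do
    ts=$(date +%Y%m%d-%H%M%S)
    gluster volume status "$VOLNAME" clients > "/tmp/clients-$ts.txt"
    # get-state prints the path of the state dump file it writes
    gluster get-state detail > "/tmp/get-state-path-$ts.txt"
    sleep 60
done
```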

Actual results
==============

When the copy of the large file fails (step #10), as expected:

```
[root@mbukatov-usm1-client tmp]# cp /tmp/enwiki-latest-pages-articles.xml.bz2 /mnt/volume_beta_arbiter_2_plus_1x2/
cp: error writing ‘/mnt/volume_beta_arbiter_2_plus_1x2/enwiki-latest-pages-articles.xml.bz2’: Transport endpoint is not connected
cp: failed to extend ‘/mnt/volume_beta_arbiter_2_plus_1x2/enwiki-latest-pages-articles.xml.bz2’: Transport endpoint is not connected
cp: failed to close ‘/mnt/volume_beta_arbiter_2_plus_1x2/enwiki-latest-pages-articles.xml.bz2’: Transport endpoint is not connected
[root@mbukatov-usm1-client tmp]# echo $?                                        
1
```

Then the number of gluster client connections starts to grow
(see screenshot 1 from the cluster dashboard of RHGS WA).

When the file is removed from the volume, the number of gluster client
connections stops growing, but it remains at its current level
(see attached screenshot 1).

Moreover, the removed file is still available on the bricks, blocking free
space there:

```
[root@mbukatov-usm1-gl1 ~]# ls -lh /mnt/brick_beta_arbiter_3/3/.glusterfs/unlink/
total 9.9G
-rw-r--r--. 1 root root 9.9G Jun 22 15:09 efcc3fbc-57a7-40fa-8a2f-e3932509338b
```

While on the client, there is no such file:

```
[root@mbukatov-usm1-client ~]# ls -l /mnt/volume_beta_arbiter_2_plus_1x2/
total 0
```

See the attached outputs of the `gluster volume status <volname> clients`
and `gluster get-state detail` commands:

 * during copying
 * after the copying fails
 * after the file is removed

Expected results
================

The number of gluster client connections doesn't start to grow when the
file fails to be copied to the volume because of insufficient free space
on a brick.

I would expect the file to be actually removed when I delete it from the
volume, but I may be wrong here as I'm not fully familiar with the unlink
feature.

Additional info
===============

I have tried this a few times; once I let the client connections grow over
night (see screenshot 2 for details).

Comment 1 Martin Bukatovic 2018-06-22 16:54:16 UTC
Created attachment 1453784 [details]
screenshot 1: Default dashboard with "Connection" chart

Comment 2 Martin Bukatovic 2018-06-22 16:56:32 UTC
Created attachment 1453785 [details]
screenshot 2: Connection chart (when running over night)

Comment 3 Martin Bukatovic 2018-06-22 17:02:00 UTC
Screenshot 2 notes: the growth stops there when the file is removed.

Comment 4 Martin Bukatovic 2018-06-22 17:10:33 UTC
Created attachment 1453787 [details]
tarball with the output files from gluster commands

Comment 5 Vijay Bellur 2018-06-22 17:57:53 UTC
Thank you, Martin!

I observe a lot of client connections coming in from 10.37.169.62. Is 10.37.169.62 a client machine? Would it be possible to attach sosreport from that machine?

Comment 6 Martin Bukatovic 2018-06-22 18:20:13 UTC
10.37.169.62 is the storage machine mbukatov-usm1-gl3, which seems to host
the arbiter bricks for the volume:

```
[root@mbukatov-usm1-gl1 ~]# gluster volume info volume_beta_arbiter_2_plus_1x2 | grep gl3
Brick3: mbukatov-usm1-gl3:/mnt/brick_beta_arbiter_1/1 (arbiter)
Brick9: mbukatov-usm1-gl3:/mnt/brick_beta_arbiter_2/2 (arbiter)
Brick15: mbukatov-usm1-gl3:/mnt/brick_beta_arbiter_3/3 (arbiter)
```

Comment 7 Vijay Bellur 2018-06-22 18:46:46 UTC
Would it be possible to collect sosreport from the arbiter node? Thx!

Comment 8 Martin Bukatovic 2018-06-22 18:56:39 UTC
Created attachment 1453838 [details]
screenshot 3: host dashboard for gl3 machine

While waiting for the sosreport to finish, I noticed high memory usage
on that node, which is likely caused by the large number of glfsheal processes:

```
# ps aux | grep glfsheal | wc -l
121
```

The memory spike clearly correlates with the failed upload of the big file,
as can be seen in the attached screenshot 3 from the WA host dashboard.

Comment 10 Shubhendu Tripathi 2018-06-26 03:35:15 UTC
@Ravi, can you please check and comment, as the number of glfsheal processes is going up in this case?

Comment 12 Ravishankar N 2018-06-26 04:37:08 UTC
glfsheal is a binary that is launched when you run 'gluster volume heal $VOLNAME info'. It is basically a client process (AFR + protocol/client translator) that connects to the bricks of $VOLNAME, gets the list of files that need heal, prints it to stdout and then terminates (thus closing the connections it established with the bricks).

comment #8 shows 121 glfsheal processes running, which probably means we ran the heal-info command multiple (121) times and each time the process did not terminate successfully. If this is not a transient state, we would need to find out if and why 'gluster volume heal $VOLNAME info' is not completing successfully.

Note: I'm currently engaged in RHGS stop-gap related work and can take a look at the setup next week (not clearing the need-info until then). In the meantime, if anyone wants to take a stab at this, you can run 'glfsheal $VOLNAME' and see if and why it's hung using gdb, looking at the log file (/var/log/glusterfs/glfsheal-$VOLNAME.log), etc.
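
For anyone picking this up in the meantime, a minimal sketch of the checks
suggested above, assuming the log location quoted in this comment (the volume
name is an example from this setup):

```
# Example volume name; replace with the affected volume.
VOLNAME=volume_beta_arbiter_2_plus_1x2

# Count leftover glfsheal processes (pgrep avoids counting the grep itself).
pgrep -c glfsheal

# Grab a backtrace of the oldest stuck instance to see where it hangs.
pid=$(pgrep -o glfsheal)
gdb -batch -p "$pid" -ex 'thread apply all bt'

# Check the heal-info client log for errors.
tail -n 100 "/var/log/glusterfs/glfsheal-$VOLNAME.log"
```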

Comment 16 Martin Bukatovic 2018-08-15 16:07:15 UTC
Created attachment 1476204 [details]
screenshot 4: retrying with recent gluster build, with the fix for heal-info hang

(In reply to Ravishankar N from comment #14)
> There was a recent fix for heal-info hang
> https://bugzilla.redhat.com/show_bug.cgi?id=1597654#c3 which should be
> available in the next downstream build of rhgs-3.4.0 (v3.12.2-14?). It might
> be worth seeing if the issue is reproducible with that build.

I retried the scenario during testing of the related RHGS WA BZ 1594383 and
I no longer see the problem, so it seems that this issue has been addressed
by the recent heal-info fix as well.

See screenshot 4.

Version of gluster used:

[root@mbukatov-usm1-gl1 ~]# rpm -qa | grep gluster | sort
glusterfs-3.12.2-16.el7rhgs.x86_64
glusterfs-api-3.12.2-16.el7rhgs.x86_64
glusterfs-cli-3.12.2-16.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-16.el7rhgs.x86_64
glusterfs-events-3.12.2-16.el7rhgs.x86_64
glusterfs-fuse-3.12.2-16.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-16.el7rhgs.x86_64
glusterfs-libs-3.12.2-16.el7rhgs.x86_64
glusterfs-rdma-3.12.2-16.el7rhgs.x86_64
glusterfs-server-3.12.2-16.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
python2-gluster-3.12.2-16.el7rhgs.x86_64
tendrl-gluster-integration-1.6.3-9.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

Comment 18 Martin Bukatovic 2018-08-15 17:41:06 UTC
Additional details
==================

This is more of a note about RHGS WA and the data reported on the Brick Dashboard
for this use case. Nothing described here is a bug.

While the file is no longer stuck in the .glusterfs/unlink directory:

```
[root@mbukatov-usm1-gl1 ~]# find /mnt/brick_* -name unlink | xargs tree
/mnt/brick_beta_arbiter_1/1/.glusterfs/unlink
/mnt/brick_beta_arbiter_2/2/.glusterfs/unlink
/mnt/brick_beta_arbiter_3/3/.glusterfs/unlink
/mnt/brick_gama_disperse_1/1/.glusterfs/unlink
/mnt/brick_gama_disperse_2/2/.glusterfs/unlink

0 directories, 0 files
```

The underlying thin volume still reports nearly 99% data utilization:

```
[root@mbukatov-usm1-gl1 ~]# lvs
  LV                   VG                 Attr       LSize  Pool                 Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lv_beta_arbiter_1    vg_beta_arbiter_1  Vwi-aot--- <9.95g pool_beta_arbiter_1         0.72                                   
  pool_beta_arbiter_1  vg_beta_arbiter_1  twi-aot--- <9.95g                             0.72   0.11                            
  lv_beta_arbiter_2    vg_beta_arbiter_2  Vwi-aot--- <9.95g pool_beta_arbiter_2         0.74                                   
  pool_beta_arbiter_2  vg_beta_arbiter_2  twi-aot--- <9.95g                             0.74   0.11                            
  lv_beta_arbiter_3    vg_beta_arbiter_3  Vwi-aot--- <9.95g pool_beta_arbiter_3         98.76                                  
  pool_beta_arbiter_3  vg_beta_arbiter_3  twi-aot--- <9.95g                             98.76  2.43                            
  lv_gama_disperse_1   vg_gama_disperse_1 Vwi-aot--- <9.95g pool_gama_disperse_1        0.17                                   
  pool_gama_disperse_1 vg_gama_disperse_1 twi-aot--- <9.95g                             0.17   0.09                            
  lv_gama_disperse_2   vg_gama_disperse_2 Vwi-aot--- 19.89g pool_gama_disperse_2        0.09                                   
  pool_gama_disperse_2 vg_gama_disperse_2 twi-aot--- 19.89g                             0.09   0.06  
```

But when I run `fstrim /mnt/brick_beta_arbiter_3`, the space is reclaimed:

```
[root@mbukatov-usm1-gl1 ~]# lvs
  LV                   VG                 Attr       LSize  Pool                 Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lv_beta_arbiter_1    vg_beta_arbiter_1  Vwi-aot--- <9.95g pool_beta_arbiter_1         0.72                                   
  pool_beta_arbiter_1  vg_beta_arbiter_1  twi-aot--- <9.95g                             0.72   0.11                            
  lv_beta_arbiter_2    vg_beta_arbiter_2  Vwi-aot--- <9.95g pool_beta_arbiter_2         0.74                                   
  pool_beta_arbiter_2  vg_beta_arbiter_2  twi-aot--- <9.95g                             0.74   0.11                            
  lv_beta_arbiter_3    vg_beta_arbiter_3  Vwi-aot--- <9.95g pool_beta_arbiter_3         0.70                                   
  pool_beta_arbiter_3  vg_beta_arbiter_3  twi-aot--- <9.95g                             0.70   0.11                            
  lv_gama_disperse_1   vg_gama_disperse_1 Vwi-aot--- <9.95g pool_gama_disperse_1        0.17                                   
  pool_gama_disperse_1 vg_gama_disperse_1 twi-aot--- <9.95g                             0.17   0.09                            
  lv_gama_disperse_2   vg_gama_disperse_2 Vwi-aot--- 19.89g pool_gama_disperse_2        0.09                                   
  pool_gama_disperse_2 vg_gama_disperse_2 twi-aot--- 19.89g                             0.09   0.06   
```

This is expected behavior, but I note it here as people could be concerned by RHGS WA reporting 99 %
utilization for the thin pool (panel LVM Thin Pool Data Usage).
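
For reference, a minimal sketch of the check-and-reclaim procedure described
above (the /mnt/brick_* mount points are from this setup; fstrim requires the
underlying device to support discard, which LVM thin volumes do):

```
# Show data usage of the thin pools backing the bricks.
lvs -o lv_name,vg_name,data_percent

# Trim every brick mount so the thin pool reclaims blocks freed by deleted files.
for mnt in /mnt/brick_*; do
    fstrim -v "$mnt"
done
```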