Bug 1724754

Summary: fallocate of a file larger than brick size leads to increased brick usage despite failure
Product: [Community] GlusterFS
Component: posix
Version: mainline
Status: CLOSED UPSTREAM
Severity: unspecified
Priority: unspecified
Reporter: Raghavendra Bhat <rabhat>
Assignee: bugs <bugs>
CC: bugs
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: ---
Type: Bug
Regression: ---
Mount Type: ---
Last Closed: 2020-03-12 13:22:45 UTC

Description Raghavendra Bhat 2019-06-27 17:35:24 UTC
Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Raghavendra Bhat 2019-06-27 20:37:11 UTC
The fallocate -l <size> <file> command fails when the requested size is bigger than the size of the brick the fallocate is directed to. However, by the time the command fails the file has already consumed a non-zero number of blocks, so brick usage increases even though fallocate itself failed.

It would be better to ensure that the file used in the fallocate is truncated back if fallocate fails.

Comment 2 Raghavendra Bhat 2019-06-27 20:41:26 UTC
Description of problem:
======================
When fallocate is used to create a file whose size is greater than or equal to the maximum disk capacity, the CLI reports the error "fallocate: fallocate failed: No space left on device".

However, the file still gets created. The file size shown on the mount is zero, but checking the volume space on the client (df -h) shows that the file is occupying significant space. That is because, on the backend bricks, the file is allocated up to about 90% of the disk size (possibly because of the storage reserve space).


How reproducible:
===================
always


Steps to Reproduce:
1. Create a 1x3 volume and fuse mount it.
2. Use fallocate to create a file whose size is >= the size of the brick.
3. The command fails with "fallocate: fallocate failed: No space left on device".

Actual results:
============
However, the file is created and shows up as a zero-size file from the mount point, while on the backend it occupies about 90% of the brick size, and that usage is reflected in df -h on the mount point.




From client:

[root@hostname2]# pwd
/mnt/nfnas/falloc-test
[root@hostname2]# df -h
Filesystem                                Size  Used Avail Use% Mounted on
/dev/mapper/rhel_dhcp42--60-root           44G  1.6G   43G   4% /
devtmpfs                                  3.9G     0  3.9G   0% /dev
tmpfs                                     3.9G     0  3.9G   0% /dev/shm
tmpfs                                     3.9G  8.5M  3.9G   1% /run
tmpfs                                     3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/sda1                                1014M  188M  827M  19% /boot
tmpfs                                     783M     0  783M   0% /run/user/0
hostname1:nfnas  2.2T  453G  1.8T  21% /mnt/nfnas ====> NOTICE THE USED SIZE OF Storage space 
[root@hostname2]# fallocate test -l 600GB
fallocate: fallocate failed: No space left on device
[root@hostname2]# df -h
Filesystem                                Size  Used Avail Use% Mounted on
/dev/mapper/rhel_dhcp42--60-root           44G  1.6G   43G   4% /
devtmpfs                                  3.9G     0  3.9G   0% /dev
tmpfs                                     3.9G     0  3.9G   0% /dev/shm
tmpfs                                     3.9G  8.5M  3.9G   1% /run
tmpfs                                     3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/sda1                                1014M  188M  827M  19% /boot
tmpfs                                     783M     0  783M   0% /run/user/0
hostname1:nfnas  2.2T  925G  1.3T  43% /mnt/nfnas ===>NOTICE THE INCREASE IN USED SPACE
[root@hostname2]# ls
test
[root@dhcp42-60 falloc-test]# du -sh test
0       test
[root@hostname2]# stat test
  File: ‘test’
  Size: 0               Blocks: 0          IO Block: 131072 regular empty file
Device: 26h/38d Inode: 13717993992350864287  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:fusefs_t:s0
Access: 2019-05-08 18:24:52.342403239 +0530
Modify: 2019-05-08 18:24:52.342403239 +0530
Change: 2019-05-08 18:24:52.342403239 +0530
 Birth: -
[root@hostname2]# 



from server:
[root@hostname1]# ls /gluster/brick1
nfnas
[root@hostname1]# ls /gluster/brick1
brick1/  brick10/ brick11/ 
[root@hostname1]# ls /gluster/brick1/nfnas/
falloc-test  IOs  logs
[root@hostname1]# ls /gluster/brick1/nfnas/falloc-test/
test
[root@hostname1]# ls /gluster/brick1/nfnas/falloc-test/test 
/gluster/brick1/nfnas/falloc-test/test
[root@hostname1]# du -sh /gluster/brick1/nfnas/falloc-test/test
473G	/gluster/brick1/nfnas/falloc-test/test
[root@hostname1]# stat /gluster/brick1/nfnas/falloc-test/test
  File: ‘/gluster/brick1/nfnas/falloc-test/test’
  Size: 0         	Blocks: 990030216  IO Block: 4096   regular empty file
Device: fd17h/64791d	Inode: 1749722171  Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:glusterd_brick_t:s0
Access: 2019-05-08 18:24:52.343774310 +0530
Modify: 2019-05-08 18:24:52.343774310 +0530
Change: 2019-05-08 18:24:52.366773892 +0530
 Birth: -
[root@hostname1]# df -h /gluster/brick1/
Filesystem                           Size  Used Avail Use% Mounted on
/dev/mapper/GLUSTER_vg1-GLUSTER_lv1  547G  541G  6.9G  99% /gluster/brick1

Comment 3 Raghavendra Bhat 2019-06-27 20:48:07 UTC
I think the problem is that du -sh (or stat) on the fallocated file reports zero usage on a glusterfs client.

1) Volume info

1x3 replicate volume

Volume Name: mirror
Type: Replicate
Volume ID: 68535a1f-48c3-4e7b-86fc-ecc0143c2cfe
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: server1:/export1/tmp/mirror
Brick2: server2:/export1/tmp/mirror
Brick3: server3:/export1/tmp/mirror
Options Reconfigured:
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off


2) Bricks

 df -h
Filesystem                              Size  Used Avail Use% Mounted on
devtmpfs                                7.8G     0  7.8G   0% /dev
tmpfs                                   7.8G     0  7.8G   0% /dev/shm
tmpfs                                   7.8G  9.1M  7.8G   1% /run
tmpfs                                   7.8G     0  7.8G   0% /sys/fs/cgroup
/dev/mapper/server1-root                 50G   22G   29G  44% /
/dev/sda1                              1014M  149M  866M  15% /boot
/dev/mapper/server1-root                500G   33M  500G   1% /home
tmpfs                                   1.6G     0  1.6G   0% /run/user/0
/dev/mapper/group-thin_vol              9.0G   34M  9.0G   1% /export1/tmp =======> Used as brick for the volume 
/dev/mapper/new-thin_vol                9.0G   33M  9.0G   1% /export2/tmp

i.e. /export1/tmp is used as the brick on all 3 nodes (same size as seen in the df output above)

3) mounted the client

df -h
Filesystem                                     Size  Used Avail Use% Mounted on
devtmpfs                                       7.8G     0  7.8G   0% /dev
tmpfs                                          7.8G     0  7.8G   0% /dev/shm
tmpfs                                          7.8G  9.1M  7.8G   1% /run
tmpfs                                          7.8G     0  7.8G   0% /sys/fs/cgroup
/dev/mapper/server3-root                       50G   22G   29G  44% /
/dev/mapper/server3-root                       1.8T   33M  1.8T   1% /home
/dev/sda1                                     1014M  157M  858M  16% /boot
tmpfs                                          1.6G     0  1.6G   0% /run/user/0
/dev/mapper/group-thin_vol                     9.0G   34M  9.0G   1% /export1/tmp
/dev/mapper/new-thin_vol                       9.0G   33M  9.0G   1% /export2/tmp
dell-per320-12.gsslab.rdu2.redhat.com:/mirror  9.0G  126M  8.9G   2% /mnt/glusterfs ======> freshly mounted client

4) Ran the TEST

[root@server3 glusterfs]# fallocate -l 22GB repro
fallocate: fallocate failed: No space left on device
[root@server3 glusterfs]# du -sh repro
0       repro     ============================================================================> du -sh says 0 file size

[root@server3 glusterfs]# stat repro
  File: ‘repro’
  Size: 0               Blocks: 0          IO Block: 131072 regular empty file   =====> stat showing 0 size and 0 blocks
Device: 28h/40d Inode: 12956667450403493410  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:fusefs_t:s0
Access: 2019-06-19 15:29:25.712546158 -0400
Modify: 2019-06-19 15:29:25.712546158 -0400
Change: 2019-06-19 15:29:25.712546158 -0400
 Birth: -


[root@server3 glusterfs]# df -h
Filesystem                                     Size  Used Avail Use% Mounted on
devtmpfs                                       7.8G     0  7.8G   0% /dev
tmpfs                                          7.8G     0  7.8G   0% /dev/shm
tmpfs                                          7.8G  9.1M  7.8G   1% /run
tmpfs                                          7.8G     0  7.8G   0% /sys/fs/cgroup
/dev/mapper/server3-root                       50G   22G   29G  44% /
/dev/mapper/server3-home                       1.8T   33M  1.8T   1% /home
/dev/sda1                                     1014M  157M  858M  16% /boot
tmpfs                                          1.6G     0  1.6G   0% /run/user/0
/dev/mapper/group-thin_vol                     9.0G  1.2G  7.9G  13% /export1/tmp
/dev/mapper/new-thin_vol                       9.0G   33M  9.0G   1% /export2/tmp
server1:/mirror                                9.0G  1.3G  7.8G  14% /mnt/glusterfs =========> Increased consumption 


5) Ran a similar test on an XFS filesystem directly (i.e. no glusterfs, just a plain XFS filesystem)

df -h
Filesystem                              Size  Used Avail Use% Mounted on
devtmpfs                                7.8G     0  7.8G   0% /dev
tmpfs                                   7.8G     0  7.8G   0% /dev/shm
tmpfs                                   7.8G  9.1M  7.8G   1% /run
tmpfs                                   7.8G     0  7.8G   0% /sys/fs/cgroup
/dev/mapper/server1-root                50G   22G   29G  44% /
/dev/sda1                              1014M  149M  866M  15% /boot
/dev/mapper/server1-home                500G   33M  500G   1% /home
tmpfs                                   1.6G     0  1.6G   0% /run/user/0
/dev/mapper/group-thin_vol              9.0G  1.2G  7.9G  13% /export1/tmp
/dev/mapper/new-thin_vol                9.0G   33M  9.0G   1% /export2/tmp   ===========> a separate XFS filesystem used in this test.

[root@server1 dir]# pwd
/export2/tmp/dir

[root@server1 dir]# fallocate -l 22GB repro
fallocate: fallocate failed: No space left on device
[root@server1 dir]# du -sh  repro
1.2G    repro

df -h
Filesystem                              Size  Used Avail Use% Mounted on
devtmpfs                                7.8G     0  7.8G   0% /dev
tmpfs                                   7.8G     0  7.8G   0% /dev/shm
tmpfs                                   7.8G  9.1M  7.8G   1% /run
tmpfs                                   7.8G     0  7.8G   0% /sys/fs/cgroup
/dev/mapper/server1-root                50G   22G   29G  44% /
/dev/sda1                              1014M  149M  866M  15% /boot
/dev/mapper/server1-home                500G   33M  500G   1% /home
tmpfs                                   1.6G     0  1.6G   0% /run/user/0
/dev/mapper/group-thin_vol              9.0G  1.2G  7.9G  13% /export1/tmp
/dev/mapper/new-thin_vol                9.0G  1.2G  7.9G  13% /export2/tmp ==================> Increased usage after the fallocate test

stat repro
  File: ‘repro’
  Size: 0               Blocks: 2359088    IO Block: 4096   regular empty file ======> zero size but non-zero blocks
Device: fd0ch/64780d    Inode: 260         Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Context: unconfined_u:object_r:unlabeled_t:s0
Access: 2019-06-19 16:15:57.072885431 -0400
Modify: 2019-06-19 16:15:57.072885431 -0400
Change: 2019-06-19 16:15:57.072885431 -0400
 Birth: -
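
For reference, the plain-XFS behaviour above can be reproduced with a small standalone program. This is a hypothetical illustration using the raw fallocate(2)/fstat(2) calls; the path and the 22 GB size are just examples, and whether blocks remain allocated after a failed fallocate depends on the underlying filesystem (the test above shows it on XFS):

/* repro.c - illustrate the XFS-only test above.
 * Hypothetical standalone example; not part of the original test setup.
 *
 * Build: gcc -o repro repro.c
 * Run:   ./repro /export2/tmp/dir/repro-file
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <path>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_CREAT | O_RDWR, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Ask for far more space than the filesystem has (22 GB here). */
    off_t len = 22LL * 1000 * 1000 * 1000;
    if (fallocate(fd, 0, 0, len) < 0)
        fprintf(stderr, "fallocate failed: %s\n", strerror(errno));

    /* On the failed run, check what the filesystem now reports. */
    struct stat st;
    if (fstat(fd, &st) == 0)
        printf("st_size = %lld, st_blocks = %lld\n",
               (long long)st.st_size, (long long)st.st_blocks);

    close(fd);
    return 0;
}

On the XFS mount used in this run, the result matches the stat output above: st_size stays 0 while st_blocks is non-zero.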



CONCLUSION:
============

* So, from the above tests, the XFS filesystem leaving the file with non-zero block usage is not the problem (IIUC). The problem is GlusterFS reporting zero for du -sh <file> and zero blocks in the stat output.

* What happens is: as part of the operation (stat, du etc. issue a stat() system call), the request reaches the backend filesystem, the on-disk stat() is performed, and the response is handed back to the gluster brick process. This is the stat response received just after the gluster brick does the on-disk stat() (captured with gdb attached):

 p lstatbuf
$31 = {st_dev = 64775, st_ino = 260, st_nlink = 2, st_mode = 33188, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0,
  st_size = 0, st_blksize = 4096, st_blocks = 2358824, st_atim = {tv_sec = 1560972565, tv_nsec = 713592657},
  st_mtim = {tv_sec = 1560972565, tv_nsec = 713592657}, st_ctim = {tv_sec = 1560972565, tv_nsec = 716592631},
  __unused = {0, 0, 0}}

NOTE the non-zero st_blocks in the response just received.

* The gluster brick process then converts the 'struct stat' structure (which holds this stat information) into its own internal 'struct iatt' structure by calling the iatt_from_stat() function.

* In iatt_from_stat(), the block count is clamped based on the file size, to compensate for filesystems that allocate blocks beyond EOF:

    iatt->ia_size = stat->st_size;
    iatt->ia_blksize = stat->st_blksize;
    iatt->ia_blocks = stat->st_blocks;

    /* There is a possibility that the backend FS (like XFS) can
       allocate blocks beyond EOF for better performance reasons, which
       results in 'st_blocks' with higher values than what is consumed by
       the file descriptor. This would break few logic inside GlusterFS,
       like quota behavior etc, thus we need the exact number of blocks
       which are consumed by the file to the higher layers inside GlusterFS.
       Currently, this logic won't work for sparse files (ie, file with
       holes)
    */
    {
        uint64_t maxblocks;

        maxblocks = (iatt->ia_size + 511) / 512;

        if (iatt->ia_blocks > maxblocks)
            iatt->ia_blocks = maxblocks;
    }

For the fallocated file, stat->st_size (and hence iatt->ia_size) is zero, so maxblocks = (0 + 511) / 512 = 0 and ia_blocks gets clamped from its real value (2358824 in the gdb capture above) down to zero.
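
To make the clamp concrete, here is a tiny standalone sketch (illustration only, not GlusterFS code; the variable names simply mirror the snippet above) that plugs in the values from the gdb capture:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Values observed in the lstatbuf captured in gdb above. */
    uint64_t ia_size = 0;          /* st_size of the fallocated file */
    uint64_t ia_blocks = 2358824;  /* st_blocks actually allocated   */

    /* Same clamp as iatt_from_stat(): maximum number of 512-byte
       blocks the reported size could occupy. */
    uint64_t maxblocks = (ia_size + 511) / 512;   /* = 0 */

    if (ia_blocks > maxblocks)
        ia_blocks = maxblocks;

    /* Prints "ia_blocks = 0": the real usage is hidden from clients. */
    printf("ia_blocks = %llu\n", (unsigned long long)ia_blocks);
    return 0;
}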

* This same block count is what the du command uses to compute the reported usage, which is why du -sh shows 0 for the file.

As mentioned in comment 1, one way to handle this would be to ensure that, in posix, the file is truncated back to its last known size if fallocate fails.
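
A minimal sketch of that idea, using plain syscalls purely for illustration (the helper name and error handling here are hypothetical; the actual change in the posix translator is the review linked in the next comment and may differ in detail):

/* Sketch only: remember the file's size before the allocation attempt
 * and, if fallocate fails, truncate back so the partially allocated
 * blocks are not left behind on the brick. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int fallocate_with_rollback(int fd, int mode, off_t offset, off_t len)
{
    struct stat st;

    if (fstat(fd, &st) < 0)
        return -1;

    if (fallocate(fd, mode, offset, len) == 0)
        return 0;                      /* success, nothing to undo */

    /* Allocation failed (e.g. ENOSPC): roll back to the last known
     * size so the brick does not keep the partially allocated blocks. */
    int saved_errno = errno;
    ftruncate(fd, st.st_size);
    errno = saved_errno;
    return -1;
}

int main(int argc, char *argv[])
{
    if (argc != 2)
        return 1;

    int fd = open(argv[1], O_CREAT | O_RDWR, 0644);
    if (fd < 0)
        return 1;

    /* Try to allocate 22 GB; on failure the file is rolled back. */
    if (fallocate_with_rollback(fd, 0, 0, 22LL * 1000 * 1000 * 1000) < 0)
        perror("fallocate");

    close(fd);
    return 0;
}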

Comment 4 Worker Ant 2019-06-27 20:59:19 UTC
REVIEW: https://review.gluster.org/22969 (storage/posix: truncate the file to zero if fallocate fails) posted (#1) for review on master by Raghavendra Bhat

Comment 5 Worker Ant 2020-03-12 13:22:45 UTC
This bug is moved to https://github.com/gluster/glusterfs/issues/1003, and will be tracked there from now on. Visit the GitHub issue URL for further details.