Disperse volume 'df' usage is extremely incorrect after replace-brick.

Disperse volume 'df' usage statistics are extremely incorrect after a replace-brick where the source brick is down. On a 3-brick, redundancy-1 disperse volume, the available space is reduced by 50% and the used inode count goes up by 50%, even on empty volumes. The 'df' usage numbers are wrong on both FUSE and NFS v3 mounts. Stopping/starting the disperse volume and remounting the client does not correct the 'df' usage numbers. When the replace-brick is done while the source brick is running, the 'df' usage statistics afterwards appear to be OK.

It looks as though only the statfs() numbers that 'df' uses are incorrect; the actual disperse volume space and inode usage look OK. In some ways that makes the issue cosmetic, except for any applications or features that use and believe these numbers.

Test plan:

# gluster --version
glusterfs 3.12.14

##### Start with empty bricks, on separate file-systems.
# df -h /exports/brick-*
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdd        100G   33M  100G   1% /exports/brick-1
/dev/sde        100G   33M  100G   1% /exports/brick-2
/dev/sdf        100G   33M  100G   1% /exports/brick-3
/dev/sdg        100G   33M  100G   1% /exports/brick-4
/dev/sdh        100G   33M  100G   1% /exports/brick-5
/dev/sdi        100G   33M  100G   1% /exports/brick-6

# df -h -i /exports/brick-*
Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/sdd          50M     3   50M    1% /exports/brick-1
/dev/sde          50M     3   50M    1% /exports/brick-2
/dev/sdf          50M     3   50M    1% /exports/brick-3
/dev/sdg          50M     3   50M    1% /exports/brick-4
/dev/sdh          50M     3   50M    1% /exports/brick-5
/dev/sdi          50M     3   50M    1% /exports/brick-6

##### Create the disperse volume:
# mkdir /exports/brick-1/disp-vol /exports/brick-2/disp-vol /exports/brick-3/disp-vol /exports/brick-4/disp-vol /exports/brick-5/disp-vol /exports/brick-6/disp-vol
# gluster volume create disp-vol disperse-data 2 redundancy 1 transport tcp 10.0.0.28:/exports/brick-1/disp-vol/ 10.0.0.28:/exports/brick-2/disp-vol/ 10.0.0.28:/exports/brick-3/disp-vol/ force
volume create: disp-vol: success: please start the volume to access data
# gluster volume start disp-vol
volume start: disp-vol: success

##### Mount the disperse volume using both FUSE and NFS v3:
# mkdir /mnt/disp-vol-fuse
# mkdir /mnt/disp-vol-nfs
# mount -t glusterfs -o acl,log-level=WARNING,fuse-mountopts=noatime 127.0.0.1:/disp-vol /mnt/disp-vol-fuse/
# gluster volume set disp-vol nfs.disable off
Gluster NFS is being deprecated in favor of NFS-Ganesha Enter "yes" to continue using Gluster NFS (y/n) yes
volume set: success
# mount 127.0.0.1:/disp-vol /mnt/disp-vol-nfs/

##### Initially, the space and inode usage numbers are correct:
# df -h /mnt/disp-vol-*
Filesystem           Size  Used Avail Use% Mounted on
127.0.0.1:/disp-vol  200G   65M  200G   1% /mnt/disp-vol-fuse
127.0.0.1:/disp-vol  200G   64M  200G   1% /mnt/disp-vol-nfs

# df -h -i /mnt/disp-vol-*
Filesystem          Inodes IUsed IFree IUse% Mounted on
127.0.0.1:/disp-vol    50M    22   50M    1% /mnt/disp-vol-fuse
127.0.0.1:/disp-vol    50M    22   50M    1% /mnt/disp-vol-nfs

# df -h -i /exports/brick-*
Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/sdd          50M    22   50M    1% /exports/brick-1
/dev/sde          50M    20   50M    1% /exports/brick-2
/dev/sdf          50M    20   50M    1% /exports/brick-3

##### Create a file to use up some space:
# fallocate -l 25G /mnt/disp-vol-fuse/file.1
# df -h /mnt/disp-vol-*
Filesystem           Size  Used Avail Use% Mounted on
127.0.0.1:/disp-vol  200G   26G  175G  13% /mnt/disp-vol-fuse
127.0.0.1:/disp-vol  200G   26G  175G  13% /mnt/disp-vol-nfs

# df -h -i /mnt/disp-vol-*
Filesystem          Inodes IUsed IFree IUse% Mounted on
127.0.0.1:/disp-vol    50M    26   50M    1% /mnt/disp-vol-fuse
127.0.0.1:/disp-vol    50M    26   50M    1% /mnt/disp-vol-nfs

# df -h /exports/brick-*
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdd        100G   13G   88G  13% /exports/brick-1
/dev/sde        100G   13G   88G  13% /exports/brick-2
/dev/sdf        100G   13G   88G  13% /exports/brick-3

# df -h -i /exports/brick-*
Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/sdd          50M    26   50M    1% /exports/brick-1
/dev/sde          50M    24   50M    1% /exports/brick-2
/dev/sdf          50M    24   50M    1% /exports/brick-3

##### Perform the first replace-brick with the source brick being up:
# gluster volume replace-brick disp-vol 10.0.0.28:/exports/brick-1/disp-vol/ 10.0.0.28:/exports/brick-4/disp-vol/ commit force
volume replace-brick: success: replace-brick commit force operation successful
# gluster volume heal disp-vol info
Brick 10.0.0.28:/exports/brick-4/disp-vol
Status: Connected
Number of entries: 0

Brick 10.0.0.28:/exports/brick-2/disp-vol
/file.1
Status: Connected
Number of entries: 1

Brick 10.0.0.28:/exports/brick-3/disp-vol
/file.1
Status: Connected
Number of entries: 1

##### After the first replace-brick with an up source brick, the space and inode usage numbers are correct:
# df -h /mnt/disp-vol-*
Filesystem           Size  Used Avail Use% Mounted on
127.0.0.1:/disp-vol  200G   26G  175G  13% /mnt/disp-vol-fuse
127.0.0.1:/disp-vol  200G   26G  175G  13% /mnt/disp-vol-nfs

# df -h -i /mnt/disp-vol-*
Filesystem          Inodes IUsed IFree IUse% Mounted on
127.0.0.1:/disp-vol    50M    24   50M    1% /mnt/disp-vol-fuse
127.0.0.1:/disp-vol    50M    24   50M    1% /mnt/disp-vol-nfs

# df -h /exports/brick-*
Filesystem      Size  Used Avail Use% Mounted on
/dev/sde        100G   13G   88G  13% /exports/brick-2
/dev/sdf        100G   13G   88G  13% /exports/brick-3
/dev/sdg        100G  8.1G   92G   9% /exports/brick-4

# gluster volume heal disp-vol info
Brick 10.0.0.28:/exports/brick-4/disp-vol
Status: Connected
Number of entries: 0

Brick 10.0.0.28:/exports/brick-2/disp-vol
Status: Connected
Number of entries: 0

Brick 10.0.0.28:/exports/brick-3/disp-vol
Status: Connected
Number of entries: 0

##### Still good after healing is done:
# df -h /exports/brick-*
Filesystem      Size  Used Avail Use% Mounted on
/dev/sde        100G   13G   88G  13% /exports/brick-2
/dev/sdf        100G   13G   88G  13% /exports/brick-3
/dev/sdg        100G   13G   88G  13% /exports/brick-4

# df -h -i /exports/brick-*
Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/sde          50M    24   50M    1% /exports/brick-2
/dev/sdf          50M    24   50M    1% /exports/brick-3
/dev/sdg          50M    24   50M    1% /exports/brick-4

# df -h /mnt/disp-vol-*
Filesystem           Size  Used Avail Use% Mounted on
127.0.0.1:/disp-vol  200G   26G  175G  13% /mnt/disp-vol-fuse
127.0.0.1:/disp-vol  200G   26G  175G  13% /mnt/disp-vol-nfs

# df -h -i /mnt/disp-vol-*
Filesystem          Inodes IUsed IFree IUse% Mounted on
127.0.0.1:/disp-vol    50M    24   50M    1% /mnt/disp-vol-fuse
127.0.0.1:/disp-vol    50M    24   50M    1% /mnt/disp-vol-nfs

##### Kill the brick-2 process to simulate failure:
# gluster volume status disp-vol
Status of volume: disp-vol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.0.0.28:/exports/brick-4/disp-vol   62003     0          Y       110996
Brick 10.0.0.28:/exports/brick-2/disp-vol   62001     0          Y       107148
Brick 10.0.0.28:/exports/brick-3/disp-vol   62002     0          Y       107179
NFS Server on localhost                     2049      0          Y       111004
Self-heal Daemon on localhost               N/A       N/A        Y       111015

Task Status of Volume disp-vol
------------------------------------------------------------------------------
There are no active volume tasks

# kill 107148

##### Before the replace-brick, the 'df' numbers are still good:
# df -h /mnt/disp-vol-*
Filesystem           Size  Used Avail Use% Mounted on
127.0.0.1:/disp-vol  200G   26G  175G  13% /mnt/disp-vol-fuse
127.0.0.1:/disp-vol  200G   26G  175G  13% /mnt/disp-vol-nfs

# df -h -i /mnt/disp-vol-*
Filesystem          Inodes IUsed IFree IUse% Mounted on
127.0.0.1:/disp-vol    50M    24   50M    1% /mnt/disp-vol-fuse
127.0.0.1:/disp-vol    50M    24   50M    1% /mnt/disp-vol-nfs

##### After the replace-brick with a down source brick, the 'df' numbers are badly wrong: the volume size is reduced by 50%, and inode use went from 1% to 51%:
# gluster volume replace-brick disp-vol 10.0.0.28:/exports/brick-2/disp-vol/ 10.0.0.28:/exports/brick-5/disp-vol/ commit force
volume replace-brick: success: replace-brick commit force operation successful
# df -h /mnt/disp-vol-*
Filesystem           Size  Used Avail Use% Mounted on
127.0.0.1:/disp-vol  100G   13G   88G  13% /mnt/disp-vol-fuse
127.0.0.1:/disp-vol  100G   13G   88G  13% /mnt/disp-vol-nfs

# df -h -i /mnt/disp-vol-*
Filesystem          Inodes IUsed IFree IUse% Mounted on
127.0.0.1:/disp-vol    50M   26M   25M   51% /mnt/disp-vol-fuse
127.0.0.1:/disp-vol    50M   26M   25M   51% /mnt/disp-vol-nfs

# gluster volume heal disp-vol info
Brick 10.0.0.28:/exports/brick-4/disp-vol
/file.1
Status: Connected
Number of entries: 1

Brick 10.0.0.28:/exports/brick-5/disp-vol
Status: Connected
Number of entries: 0

Brick 10.0.0.28:/exports/brick-3/disp-vol
/file.1
Status: Connected
Number of entries: 1

# df -h /exports/brick-*
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdf        100G   13G   88G  13% /exports/brick-3
/dev/sdg        100G   13G   88G  13% /exports/brick-4
/dev/sdh        100G  2.1G   98G   3% /exports/brick-5

# df -h -i /exports/brick-*
Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/sdf          50M    24   50M    1% /exports/brick-3
/dev/sdg          50M    24   50M    1% /exports/brick-4
/dev/sdh          50M    24   50M    1% /exports/brick-5

##### 'df' numbers are no better after healing is done:
# gluster volume heal disp-vol info
Brick 10.0.0.28:/exports/brick-4/disp-vol
Status: Connected
Number of entries: 0

Brick 10.0.0.28:/exports/brick-5/disp-vol
Status: Connected
Number of entries: 0

Brick 10.0.0.28:/exports/brick-3/disp-vol
Status: Connected
Number of entries: 0

# df -h /mnt/disp-vol-*
Filesystem           Size  Used Avail Use% Mounted on
127.0.0.1:/disp-vol  100G   13G   88G  13% /mnt/disp-vol-fuse
127.0.0.1:/disp-vol  100G   13G   88G  13% /mnt/disp-vol-nfs

# df -h -i /mnt/disp-vol-*
Filesystem          Inodes IUsed IFree IUse% Mounted on
127.0.0.1:/disp-vol    50M   26M   25M   51% /mnt/disp-vol-fuse
127.0.0.1:/disp-vol    50M   26M   25M   51% /mnt/disp-vol-nfs

# df -h /exports/brick-*
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdf        100G   13G   88G  13% /exports/brick-3
/dev/sdg        100G   13G   88G  13% /exports/brick-4
/dev/sdh        100G   13G   88G  13% /exports/brick-5

# df -h -i /exports/brick-*
Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/sdf          50M    24   50M    1% /exports/brick-3
/dev/sdg          50M    24   50M    1% /exports/brick-4
/dev/sdh          50M    24   50M    1% /exports/brick-5

##### Stopping/starting the disperse volume, and remounting clients, does not help:
# gluster volume stop disp-vol
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: disp-vol: success
# gluster volume start disp-vol
volume start: disp-vol: success
# mount -t glusterfs -o acl,log-level=WARNING,fuse-mountopts=noatime 127.0.0.1:/disp-vol /mnt/disp-vol-fuse/
# mount 127.0.0.1:/disp-vol /mnt/disp-vol-nfs/
# df -h /mnt/disp-vol-*
Filesystem           Size  Used Avail Use% Mounted on
127.0.0.1:/disp-vol  100G   13G   88G  13% /mnt/disp-vol-fuse
127.0.0.1:/disp-vol  100G   13G   88G  13% /mnt/disp-vol-nfs

# df -h -i /mnt/disp-vol-*
Filesystem          Inodes IUsed IFree IUse% Mounted on
127.0.0.1:/disp-vol    50M   26M   25M   51% /mnt/disp-vol-fuse
127.0.0.1:/disp-vol    50M   26M   25M   51% /mnt/disp-vol-nfs

##### Simulate a second brick failure, and replacement:
# gluster volume status disp-vol
Status of volume: disp-vol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.0.0.28:/exports/brick-4/disp-vol   62001     0          Y       121258
Brick 10.0.0.28:/exports/brick-5/disp-vol   62004     0          Y       121278
Brick 10.0.0.28:/exports/brick-3/disp-vol   62005     0          Y       121298
NFS Server on localhost                     2049      0          Y       121319
Self-heal Daemon on localhost               N/A       N/A        Y       121328

Task Status of Volume disp-vol
------------------------------------------------------------------------------
There are no active volume tasks

# kill 121298
# df -h /mnt/disp-vol-*
Filesystem           Size  Used Avail Use% Mounted on
127.0.0.1:/disp-vol  100G   13G   88G  13% /mnt/disp-vol-fuse
127.0.0.1:/disp-vol  100G   13G   88G  13% /mnt/disp-vol-nfs

# df -h -i /mnt/disp-vol-*
Filesystem          Inodes IUsed IFree IUse% Mounted on
127.0.0.1:/disp-vol    25M    12   25M    1% /mnt/disp-vol-fuse
127.0.0.1:/disp-vol    25M    12   25M    1% /mnt/disp-vol-nfs

# gluster volume replace-brick disp-vol 10.0.0.28:/exports/brick-3/disp-vol/ 10.0.0.28:/exports/brick-6/disp-vol/ commit force
volume replace-brick: success: replace-brick commit force operation successful

##### After the second replace-brick with a down source brick, the volume size reported by 'df' goes down by another 33%. The inode usage went back down from 51%, but it is now less than the number the volume started with, which is suspicious, and the total number of inodes has gone from a starting value of 50M down to 17M!
# df -h /mnt/disp-vol-*
Filesystem           Size  Used Avail Use% Mounted on
127.0.0.1:/disp-vol   67G  8.4G   59G  13% /mnt/disp-vol-fuse
127.0.0.1:/disp-vol   67G  8.4G   59G  13% /mnt/disp-vol-nfs

# df -h -i /mnt/disp-vol-*
Filesystem          Inodes IUsed IFree IUse% Mounted on
127.0.0.1:/disp-vol    17M     8   17M    1% /mnt/disp-vol-fuse
127.0.0.1:/disp-vol    17M     8   17M    1% /mnt/disp-vol-nfs

# df -h /exports/brick-*
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdg        100G   13G   88G  13% /exports/brick-4
/dev/sdh        100G   13G   88G  13% /exports/brick-5
/dev/sdi        100G  2.1G   98G   3% /exports/brick-6

# df -h -i /exports/brick-*
Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/sdg          50M    24   50M    1% /exports/brick-4
/dev/sdh          50M    24   50M    1% /exports/brick-5
/dev/sdi          50M    24   50M    1% /exports/brick-6

# gluster volume heal disp-vol info
Brick 10.0.0.28:/exports/brick-4/disp-vol
Status: Connected
Number of entries: 0

Brick 10.0.0.28:/exports/brick-5/disp-vol
Status: Connected
Number of entries: 0

Brick 10.0.0.28:/exports/brick-6/disp-vol
Status: Connected
Number of entries: 0

##### 'df' numbers are no better after healing is done:
# df -h /mnt/disp-vol-*
Filesystem           Size  Used Avail Use% Mounted on
127.0.0.1:/disp-vol   67G  8.4G   59G  13% /mnt/disp-vol-fuse
127.0.0.1:/disp-vol   67G  8.4G   59G  13% /mnt/disp-vol-nfs

# df -h -i /mnt/disp-vol-*
Filesystem          Inodes IUsed IFree IUse% Mounted on
127.0.0.1:/disp-vol    17M     8   17M    1% /mnt/disp-vol-fuse
127.0.0.1:/disp-vol    17M     8   17M    1% /mnt/disp-vol-nfs

# df -h /exports/brick-*
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdg        100G   13G   88G  13% /exports/brick-4
/dev/sdh        100G   13G   88G  13% /exports/brick-5
/dev/sdi        100G   13G   88G  13% /exports/brick-6

# df -h -i /exports/brick-*
Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/sdg          50M    24   50M    1% /exports/brick-4
/dev/sdh          50M    24   50M    1% /exports/brick-5
/dev/sdi          50M    24   50M    1% /exports/brick-6

# gluster volume info disp-vol
Volume Name: disp-vol
Type: Disperse
Volume ID: fb9cccb8-311f-49ac-948d-60e4894da0b6
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.0.0.28:/exports/brick-4/disp-vol
Brick2: 10.0.0.28:/exports/brick-5/disp-vol
Brick3: 10.0.0.28:/exports/brick-6/disp-vol
Options Reconfigured:
transport.address-family: inet
nfs.disable: off

##### Note that although 'df' says the disperse volume is only 67G, it really still does have 200GB of space:
# df -h /mnt/disp-vol-*
Filesystem           Size  Used Avail Use% Mounted on
127.0.0.1:/disp-vol   67G  8.4G   59G  13% /mnt/disp-vol-fuse
127.0.0.1:/disp-vol   67G  8.4G   59G  13% /mnt/disp-vol-nfs

# fallocate -l 25G /mnt/disp-vol-fuse/file.2
# fallocate -l 25G /mnt/disp-vol-fuse/file.3
# fallocate -l 25G /mnt/disp-vol-fuse/file.4
# fallocate -l 25G /mnt/disp-vol-fuse/file.5
# fallocate -l 25G /mnt/disp-vol-fuse/file.6
# fallocate -l 25G /mnt/disp-vol-fuse/file.7
# fallocate -l 25G /mnt/disp-vol-fuse/file.8
fallocate: /mnt/disp-vol-fuse/file.8: fallocate failed: No space left on device

# df -h /mnt/disp-vol-*
Filesystem           Size  Used Avail Use% Mounted on
127.0.0.1:/disp-vol   67G   62G  5.4G  93% /mnt/disp-vol-fuse
127.0.0.1:/disp-vol   67G   62G  5.4G  93% /mnt/disp-vol-nfs

# du -sh /mnt/disp-vol-fuse/
176G    /mnt/disp-vol-fuse/
# du -sh /mnt/disp-vol-nfs/
176G    /mnt/disp-vol-nfs/

# df -h -i /mnt/disp-vol-*
Filesystem          Inodes IUsed IFree IUse% Mounted on
127.0.0.1:/disp-vol   5.4M    15  5.4M    1% /mnt/disp-vol-fuse
127.0.0.1:/disp-vol   5.4M    15  5.4M    1% /mnt/disp-vol-nfs
Note that with a cursory test, this issue does not appear to occur on an older GlusterFS version:

# gluster --version
glusterfs 3.7.18 built on May 25 2018 16:07:41
The 'df' usage is wrong because the GlusterFS 'shared-brick-count' for the bricks is being incremented when it should always have been 1. 'shared-brick-count' exists for the case where a volume has multiple bricks on the same file-system: because such bricks share the file-system and cannot each use its full space, the space is counted for only one of them. In this case, however, no bricks were sharing a file-system.

GlusterFS uses the file-system ID from the f_fsid field of statvfs() to determine when multiple bricks are on the same file-system. Unfortunately, 'replace-brick' was not reading the sys_statvfs() 'f_fsid' value for the new brick, so 'brick-fsid' in the brick spec file was being set to 0. After the first 'replace-brick' this is harmless, but once another brick is replaced, also with a 'brick-fsid' of 0, there are then multiple bricks whose 'statfs_fsid' value is zero, so 'shared-brick-count' gets incremented and their space is subtracted from the volume.
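This can be confirmed on a live system by inspecting glusterd's on-disk state. The sketch below assumes a default install layout under /var/lib/glusterd and the 'disp-vol' volume from the test plan above; the exact file names under vols/ may differ. Since the generated brick volfiles divide each brick's statvfs() results by 'shared-brick-count', any value above 1 shows up directly as a shrunken 'df':

##### Show the fsid glusterd recorded for each brick; on affected versions,
##### bricks added via replace-brick show brick-fsid=0:
# grep brick-fsid /var/lib/glusterd/vols/disp-vol/bricks/*

##### Show the shared-brick-count written into the generated brick volfiles;
##### it should be 1 when every brick is on its own file-system:
# grep shared-brick-count /var/lib/glusterd/vols/disp-vol/*.vol

##### Compare against the real file-system IDs of the brick mounts:
# stat -f -c '%i  %n' /exports/brick-4 /exports/brick-5 /exports/brick-6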
Release 3.12 has been EOL'd and this bug was still found to be in the NEW state, hence moving the version to mainline to triage it and take appropriate action.
Sanju,

Can you take a look at this?

Thanks,
Nithya
Moving this to 'distribute' component.
This is not a dht issue - moving this to glusterd.
REVIEW: https://review.gluster.org/21513 (glusterd: set fsid while performing replace brick) posted (#1) for review on master by Sanju Rakonde
Updated reproducer (a condensed script form follows below):
1. Create any type of volume that supports the replace-brick operation, with at least two bricks (B1, B2, ...).
2. Start the volume.
3. Mount the volume and check the volume size using df.
4. Perform a replace-brick operation on B1.
5. Check the size at the mount point using df; it should be the same as in step 3.
6. Perform a replace-brick operation on B2.
7. Check the size at the mount point using df; it will be reduced by half.

RCA: While performing the replace-brick operation we are not setting the fsid for the new brick, so the new brick has an fsid of 0. When we perform a 2nd replace-brick operation, the 2nd new brick also gets an fsid of 0, so there are then two bricks with fsid 0. While calculating shared-brick-count we compare the fsid values of the bricks: bricks with the same fsid are taken to be sharing the same file system, and shared-brick-count is the number of bricks sharing that file system. Here shared-brick-count becomes 2 (as both new bricks have fsid 0), so after the 2nd replace-brick operation the volume size at the mount point is reduced by half.

Thanks,
Sanju
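A minimal script form of the reproducer (a sketch: the volume name 'testvol', host address, and brick paths are illustrative, and a plain two-brick distribute volume is used since any volume type that supports replace-brick will do):

# HOST=10.0.0.28
# gluster volume create testvol $HOST:/exports/brick-1/testvol $HOST:/exports/brick-2/testvol force
# gluster volume start testvol
# mkdir -p /mnt/testvol
# mount -t glusterfs $HOST:/testvol /mnt/testvol
# df -h /mnt/testvol    ##### step 3: record the baseline size
# gluster volume replace-brick testvol $HOST:/exports/brick-1/testvol $HOST:/exports/brick-3/testvol commit force
# df -h /mnt/testvol    ##### step 5: still matches the baseline
# gluster volume replace-brick testvol $HOST:/exports/brick-2/testvol $HOST:/exports/brick-4/testvol commit force
# df -h /mnt/testvol    ##### step 7: on affected versions, the size is halved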
REVIEW: https://review.gluster.org/21513 (glusterd: set fsid while performing replace brick) posted (#3) for review on master by Atin Mukherjee
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-6.0, please open a new bug report.

glusterfs-6.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2019-March/000120.html
[2] https://www.gluster.org/pipermail/gluster-users/