Description of problem:

This is similar to bug 1190633. We have a 3-way replica volume that is used to store sparse oVirt disk images. The same servers are also the oVirt nodes. When a machine is taken down for maintenance, the healing process on the third machine "inflates" the sparse images to their full allocated size.

In our scenario, we have 3 servers; each is an oVirt node and hosts a brick for the replica. In oVirt, one node is the SPM (Storage Pool Manager) and the other two are "normal." We leave the SPM alone and put one of the "normal" machines into maintenance. We remove the brick from the machine being rebooted, reboot the machine, and then add the brick back to the volume. Oddly, it's the *third* server that re-inflates the sparse images.

Version-Release number of selected component (if applicable):
glusterfs-server-3.6.2-1.el6.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Set up 3-node replication with sparse images
2. Set one non-SPM oVirt node to maintenance
3. Remove the brick from the gluster volume of the server in maintenance
4. Reboot
5. Add the brick back to the volume
6. The SPM and the recently-rebooted server behave correctly. The third, untouched server inflates the sparse images.

Actual results:

Here's a "df" of the machines after rebooting.

SPM machine (not rebooted):
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-ovirtlv  500G   12G  488G   3% /gluster/ovirt

"Normal" machine, the one that was rebooted:
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-ovirtlv  500G   12G  489G   3% /gluster/ovirt

The third, untouched, "normal" machine:
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-ovirtlv  500G   64G  437G  13% /gluster/ovirt

Expected results:

I expect the heal not to inflate sparse images on the third machine (or any of them, actually). This is definitely an issue since VMs tend to overallocate disk space. The only workaround is to move the disk images to another storage domain and then back to the original domain.

Additional info:

Here's the volume info:

gluster> volume info ovirt

Volume Name: ovirt
Type: Replicate
Volume ID: b39ed03e-0d03-40a2-acad-8384cf0c5cb4
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: server1:/gluster/ovirt/brick
Brick2: server2:/gluster/ovirt/brick
Brick3: server3:/gluster/ovirt/brick
Options Reconfigured:
cluster.data-self-heal-algorithm: diff
server.allow-insecure: on
storage.owner-gid: 36
storage.owner-uid: 36
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
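For anyone trying to spot the inflation at the file level rather than from df: a minimal sketch, assuming GNU findutils on the brick hosts (the size threshold is just an example, not from the report). GNU find's %S directive prints sparseness (allocated blocks divided by apparent size), so values near 1.0 indicate images that have been fully allocated.

    # Run on each brick host; lists the largest files with their sparseness ratio,
    # apparent size in bytes, and path. Inflated images show a ratio close to 1.0.
    find /gluster/ovirt/brick -type f -size +1G -printf '%S\t%s\t%p\n' | sort -n | tail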
Hi Matt,

Bug 1190633 happens only when the data-self-heal-algorithm is set to 'full'. The issue shouldn't occur with it set to 'diff', which is what your volinfo shows.

When you say:
3. Remove brick from gluster volume of the server in maintenance
5. Add the brick back to the volume

what exactly do you mean? Unmounting and remounting the brick, or reducing the replica count via the remove-brick command and then adding it back again using the add-brick command?

Could you also check the specific file in question on all the bricks, as opposed to the df of the file system that you provided? i.e. `ls -lh /gluster/ovirt/file_name` and `du -h /gluster/ovirt/file_name` on all 3 bricks.
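A small sketch of the per-brick check being asked for here, assuming passwordless ssh between the hosts and that the image sits directly under the brick path; <image-file> is a placeholder, not a name from the report:

    # Compare the same image file on all three bricks in one pass
    for host in server1 server2 server3; do
        echo "== $host =="
        ssh "$host" "ls -lh /gluster/ovirt/brick/<image-file>; du -h /gluster/ovirt/brick/<image-file>"
    done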
Hi Ravishankar,

I haven't forgotten about this; I've been swamped. I hope to have this information to you by the end of this week.
(In reply to Matt R from comment #2)
> Hi Ravishankar,
>
> I haven't forgotten about this, I've been swamped. I hope to have this
> information to you by the end of this week.

Sure Matt.
Hi Ravishankar,

Here's more detailed information. For points 3 & 5, I removed the brick and reduced the replica count. Then, after rebooting, I formatted the file system and re-added the brick, increasing the replica count. Below is a detailed list of the steps I performed to create this issue. Let me know if you need any more info.

Thanks,
Matt

Before:

Server1:
[~]{575}# df -h /gluster/ovirt/
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-ovirtlv  500G   25G  475G   5% /gluster/ovirt
[~]{577}# du -h -s /gluster/ovirt/
25G    /gluster/ovirt/

Server2:
[~]{484}# df -h /gluster/ovirt/
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-ovirtlv  500G   25G  475G   5% /gluster/ovirt
[~]{486}# du -hs /gluster/ovirt/
25G    /gluster/ovirt/

Server3:
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-ovirtlv  500G   25G  475G   5% /gluster/ovirt
[~]{402}# du -hs /gluster/ovirt/
25G    /gluster/ovirt/

On oVirt: put Server1 into maintenance mode.

Then remove the brick:

gluster> volume remove-brick ovirt replica 2 server1.umaryland.edu:/gluster/ovirt/brick force
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit force: success

Reboot Server1.

Before re-adding the brick:

[~]{588}# gluster volume status ovirt
Status of volume: ovirt
Gluster process                                    Port    Online  Pid
------------------------------------------------------------------------------
Brick server3.umaryland.edu:/gluster/ovirt/brick   49154   Y       3187
Brick server2.umaryland.edu:/gluster/ovirt/brick   49154   Y       8894
NFS Server on localhost                            2049    Y       3200
Self-heal Daemon on localhost                      N/A     Y       3207
NFS Server on server2.umaryland.edu                2049    Y       3399
Self-heal Daemon on server2.umaryland.edu          N/A     Y       3413
NFS Server on server3.umaryland.edu                2049    Y       27589
Self-heal Daemon on server3.umaryland.edu          N/A     Y       27599

Task Status of Volume ovirt
------------------------------------------------------------------------------
There are no active volume tasks

Server1:
[~]{592}# mkfs.xfs -f -i size=512 /dev/rootvg/ovirtlv
meta-data=/dev/rootvg/ovirtlv    isize=512    agcount=16, agsize=8191984 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=131071744, imaxpct=25
         =                       sunit=16     swidth=16 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=64000, version=2
         =                       sectsz=512   sunit=16 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Add back the brick:

gluster> volume add-brick ovirt replica 3 server1.umaryland.edu:/gluster/ovirt/brick force
volume add-brick: success

After adding the brick:

gluster> volume status ovirt
Status of volume: ovirt
Gluster process                                    Port    Online  Pid
------------------------------------------------------------------------------
Brick server3.umaryland.edu:/gluster/ovirt/brick   49154   Y       3187
Brick server2.umaryland.edu:/gluster/ovirt/brick   49154   Y       8894
Brick server1.umaryland.edu:/gluster/ovirt/brick   49155   Y       10213
NFS Server on localhost                            2049    Y       8804
Self-heal Daemon on localhost                      N/A     Y       8814
NFS Server on server1                              2049    Y       10226
Self-heal Daemon on server1                        N/A     Y       10234
NFS Server on server3.umaryland.edu                2049    Y       31960
Self-heal Daemon on server3.umaryland.edu          N/A     Y       31971

Task Status of Volume ovirt
------------------------------------------------------------------------------
There are no active volume tasks

After the heal starts:

Server1:
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-ovirtlv  500G  7.9G  492G   2% /gluster/ovirt
[~]{614}# du -hs /gluster/ovirt/
7.9G   /gluster/ovirt/
(This will continue to grow until 25G, which is the actual amount of disk space used)

Server2 (oVirt SPM):
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-ovirtlv  500G   25G  475G   6% /gluster/ovirt
[~]{510}# du -hs /gluster/ovirt/
25G    /gluster/ovirt/
(This will stay the same)

Server3:
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-ovirtlv  500G   44G  457G   9% /gluster/ovirt
[~]{426}# du -hs /gluster/ovirt/
44G    /gluster/ovirt/
(This will continue to grow to the allocated, rather than sparse, size)

And for one particular disk image directory:

Server1:
[./e545dbec-c16c-4b25-8d8e-6bcae0f925d1]{645}# du -m .
5898    .
(Staying the same, oddly ~1GB smaller than the original)

Server2:
[./e545dbec-c16c-4b25-8d8e-6bcae0f925d1]{539}# du -m .
6874    .
(Staying the same)

Server3:
[./e545dbec-c16c-4b25-8d8e-6bcae0f925d1]{454}# du -m .
23382   .
(Growing)
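To watch this while the heal is running, a small sketch; the directory name is the one from this comment, but its exact location under the brick is an assumption on my part:

    # On any cluster node: entries still pending heal for the volume
    gluster volume heal ovirt info

    # On each brick host: allocated vs. apparent size of the image directory
    du -m /gluster/ovirt/brick/e545dbec-c16c-4b25-8d8e-6bcae0f925d1                    # allocated MB
    du -m --apparent-size /gluster/ovirt/brick/e545dbec-c16c-4b25-8d8e-6bcae0f925d1    # logical MB

If the heal is sparse-aware, the allocated figure on every brick should converge on the source's value rather than on the apparent size.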
I have found the same behaviour with the latest glusterfs-3.7 nightly build, without involving oVirt in the picture.

These are the steps that I used to reproduce this issue (a consolidated sketch follows the list):

1. Installed 3 RHEL servers with the latest glusterfs-3.7 nightly build
2. Created a gluster cluster (Trusted Storage Pool)
3. Created a replica 3 volume
4. Fuse-mounted the volume on another RHEL 6.6 server
5. Created a sparse file, i.e. dd if=/dev/urandom of=vm.img bs=1024 count=0 seek=24M
6. Performed fallocate on that file for 5G, i.e. fallocate -l5G vm.img
7. Reduced the replica count to 2, i.e. gluster volume remove-brick <vol-name> replica 2 <brick3>
8. Added a new brick, i.e. gluster volume add-brick <vol-name> replica 3 <new-brick>
9. Triggered self-heal, i.e. gluster volume heal <vol-name> full
10. Waited till the self-heal completed
11. Checked the file size across all the bricks:
    du -sh <file>                  - gives the disk space used
    du -sh <file> --apparent-size  - gives the sparse (logical) file size

From the above test, I could note that on one of the servers the file size has become 24G, which confirms the existence of this problem.
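The same steps rolled into one sketch, assuming three hosts named node1/node2/node3 with brick paths /bricks/b1 and /bricks/b2 (all names and paths are placeholders, not from the original comment):

    # On node1: build the trusted pool and a replica 3 volume
    gluster peer probe node2
    gluster peer probe node3
    gluster volume create repvol replica 3 node1:/bricks/b1 node2:/bricks/b1 node3:/bricks/b1 force
    gluster volume start repvol

    # On the client: fuse-mount and create a sparse, partially allocated image
    mount -t glusterfs node1:/repvol /mnt/repvol
    dd if=/dev/urandom of=/mnt/repvol/vm.img bs=1024 count=0 seek=24M   # 24G apparent, 0 allocated
    fallocate -l 5G /mnt/repvol/vm.img                                  # allocate the first 5G

    # On node1: drop the third brick, add a fresh one, and trigger a full heal
    gluster volume remove-brick repvol replica 2 node3:/bricks/b1 force
    gluster volume add-brick repvol replica 3 node3:/bricks/b2 force
    gluster volume heal repvol full

    # On every node, once healing finishes: allocated vs. apparent size per brick
    du -sh /bricks/b*/vm.img
    du -sh --apparent-size /bricks/b*/vm.img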
Raising priority and cc'ing people who are working with hyperconverged Gluster storage.

So if sparse VM images are being used with Gluster, then this will matter, because space will get used up fast by a self-heal. Not only that, but performance will suffer because self-heal will have to do a lot more I/O.
Hi Sas, I'm still unable to re-create the issue on release-3.7 branch with the steps mentioned in comment #5. I see that in the replaced brick, the file is still sparse. Can you try to recreate this and share the setup with me? Thanks!
(In reply to Ben England from comment #6)
> So if sparse VM images are being used with Gluster, then
> this will matter because space will get used up fast by a self-heal. Not
> only that but performance will suffer because self-heal will have to do a
> lot more I/O.

Hi Ben,

We have logic in AFR self-heal which will not write to the sink if the file is sparse *and* the checksums of the data blocks (as it loops through all the chunks of the file) are the same on the source and sink(s). There is also a possibility that the disk usage is different because of the way XFS preallocation works (see BZ 1277992). In any case, I'm going to work with Sas to see if comment #5 is repeatable and is related to gluster.
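A quick shell illustration of why that check lets the heal skip holes (a sketch only; the 1 MiB chunk size is an arbitrary example, not AFR's actual heal block size): reading a hole in a sparse file returns zeros, so its checksum matches a zero-filled source chunk and nothing needs to be written to the sink.

    truncate -s 1G sparse.img                               # file that is entirely holes
    dd if=sparse.img bs=1M count=1 2>/dev/null | md5sum     # checksum of a 1 MiB hole
    dd if=/dev/zero  bs=1M count=1 2>/dev/null | md5sum     # identical checksum for real zeros
    du -h --apparent-size sparse.img; du -h sparse.img      # 1.0G apparent, ~0 allocated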
(In reply to Ravishankar N from comment #7)
> Hi Sas, I'm still unable to re-create the issue on release-3.7 branch with
> the steps mentioned in comment #5. I see that in the replaced brick, the
> file is still sparse. Can you try to recreate this and share the setup with
> me? Thanks!

Hi Ravi,

I have tested with glusterfs 3.7.6 on RHEL-7 (http://download.gluster.org/pub/gluster/glusterfs/3.7/3.7.6/EPEL.repo/epel-7Server/x86_64/) and I am not facing this issue.

Providing the test steps below:

1. Created a replica 3 volume and optimized it for the virt-store use case
2. Started the volume and used it as a RHEV Data Domain (RHEV 3.5.6)
3. Killed one of the bricks of the volume
4. Created 3 AppVMs (images of size 15G, 20G and 25G respectively) and installed RHEL 7.2 on the AppVMs
5. Brought back the killed brick (by starting the volume with the force option)
6. Triggered self-heal
7. After self-heal completed, checked the image size on all the bricks

Observation: None of the image files on the 3 bricks has lost its sparseness.
Thanks Sas. I'm closing the bug. Matt, if you are still able to hit the issue in the latest bits, do feel free to raise a bug.