Description of problem:
-----------------------
The size of the VM image file as reported from the FUSE mount is incorrect. For a file of size 1 TB, the size of the file on disk is reported as 8 ZB.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHHI-V 1.6 - RHV 4.2.8 & RHGS 3.4.3 ( glusterfs-3.12.2-38.el7rhgs )

How reproducible:
------------------
Always

Steps to Reproduce:
-------------------
1. On the Gluster storage domain, create a preallocated disk image of size 1TB
2. Check the size of the file after its creation has succeeded

Actual results:
---------------
The size of the file on disk is reported as 8 ZB, though the size of the file is 1TB

Expected results:
-----------------
The size of the file should be the same as the size created by the user

Additional info:
----------------
The volume in question is replica 3 sharded:

[root@rhsqa-grafton10 ~]# gluster volume info data

Volume Name: data
Type: Replicate
Volume ID: 7eb49e90-e2b6-4f8f-856e-7108212dbb72
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: rhsqa-grafton10.lab.eng.blr.redhat.com:/gluster_bricks/data/data
Brick2: rhsqa-grafton11.lab.eng.blr.redhat.com:/gluster_bricks/data/data
Brick3: rhsqa-grafton12.lab.eng.blr.redhat.com:/gluster_bricks/data/data (arbiter)
Options Reconfigured:
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
cluster.choose-local: off
client.event-threads: 4
server.event-threads: 4
storage.owner-uid: 36
storage.owner-gid: 36
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.granular-entry-heal: enable
cluster.enable-shared-storage: enable
Size of the file as reported from the fuse mount:

[root@ ~]# ls -lsah /rhev/data-center/mnt/glusterSD/rhsqa-grafton10.lab.eng.blr.redhat.com\:_data/bbeee86f-f174-4ec7-9ea3-a0df28709e64/images/0206953c-4850-4969-9dad-15140579d354/eaa5e81d-103c-4ce6-947e-8946806cca1b
8.0Z -rw-rw----. 1 vdsm kvm 1.1T Jan 21 17:14 /rhev/data-center/mnt/glusterSD/rhsqa-grafton10.lab.eng.blr.redhat.com:_data/bbeee86f-f174-4ec7-9ea3-a0df28709e64/images/0206953c-4850-4969-9dad-15140579d354/eaa5e81d-103c-4ce6-947e-8946806cca1b

[root@ ~]# du -shc /rhev/data-center/mnt/glusterSD/rhsqa-grafton10.lab.eng.blr.redhat.com\:_data/bbeee86f-f174-4ec7-9ea3-a0df28709e64/images/0206953c-4850-4969-9dad-15140579d354/eaa5e81d-103c-4ce6-947e-8946806cca1b
16E /rhev/data-center/mnt/glusterSD/rhsqa-grafton10.lab.eng.blr.redhat.com:_data/bbeee86f-f174-4ec7-9ea3-a0df28709e64/images/0206953c-4850-4969-9dad-15140579d354/eaa5e81d-103c-4ce6-947e-8946806cca1b
16E total

Note that the disk image is preallocated with 1072GB of space
(In reply to SATHEESARAN from comment #1)
> Size of the file as reported from the fuse mount:
>
> [root@ ~]# ls -lsah /rhev/data-center/mnt/glusterSD/rhsqa-grafton10.lab.eng.blr.redhat.com\:_data/bbeee86f-f174-4ec7-9ea3-a0df28709e64/images/0206953c-4850-4969-9dad-15140579d354/eaa5e81d-103c-4ce6-947e-8946806cca1b
> 8.0Z -rw-rw----. 1 vdsm kvm 1.1T Jan 21 17:14 /rhev/data-center/mnt/glusterSD/rhsqa-grafton10.lab.eng.blr.redhat.com:_data/bbeee86f-f174-4ec7-9ea3-a0df28709e64/images/0206953c-4850-4969-9dad-15140579d354/eaa5e81d-103c-4ce6-947e-8946806cca1b
>
> [root@ ~]# du -shc /rhev/data-center/mnt/glusterSD/rhsqa-grafton10.lab.eng.blr.redhat.com\:_data/bbeee86f-f174-4ec7-9ea3-a0df28709e64/images/0206953c-4850-4969-9dad-15140579d354/eaa5e81d-103c-4ce6-947e-8946806cca1b
> 16E /rhev/data-center/mnt/glusterSD/rhsqa-grafton10.lab.eng.blr.redhat.com:_data/bbeee86f-f174-4ec7-9ea3-a0df28709e64/images/0206953c-4850-4969-9dad-15140579d354/eaa5e81d-103c-4ce6-947e-8946806cca1b
> 16E total
>
> Note that the disk image is preallocated with 1072GB of space

I wonder if this happens specifically when fallocate is used to preallocate the image, as opposed to the earlier implementation where writes were used. Is this something you're seeing only after RHV changed the preallocation implementation to use fallocate? Or did you see this issue in previous versions of RHHI as well?

-Krutika
Also, do you still have the setup in this state? If so, I'd like to take a look.

-Krutika
ping Sas - can you respond to the needinfo?
(In reply to Krutika Dhananjay from comment #3)
> Also, do you still have the setup in this state? If so, I'd like to take a look.
>
> -Krutika

Hi Krutika,

The setup is no longer available. Let me recreate the issue and provide you with the setup.
(In reply to SATHEESARAN from comment #5)
> (In reply to Krutika Dhananjay from comment #3)
> > Also, do you still have the setup in this state? If so, I'd like to take a look.
> >
> > -Krutika
>
> Hi Krutika,
>
> The setup is no longer available. Let me recreate the issue and provide you with the setup.

This issue is very easily reproducible: create a preallocated image on a replicate volume with sharding enabled, using 'qemu-img' to create the VM image. See the following test:

[root@ ~]# qemu-img create -f raw -o preallocation=falloc /mnt/test/vm1.img 1T
Formatting '/mnt/test/vm1.img', fmt=raw size=1099511627776 preallocation='falloc'

[root@ ]# ls /mnt/test
vm1.img

[root@ ]# ls -lsah vm1.img
8.0Z -rw-r--r--. 1 root root 1.0T Apr 2 00:45 vm1.img
So I tried this locally and I am not hitting the issue -

[root@dhcpxxxxx ~]# qemu-img create -f raw -o preallocation=falloc /mnt/vm1.img 10G
Formatting '/mnt/vm1.img', fmt=raw size=10737418240 preallocation=falloc
[root@dhcpxxxxx ~]# ls -lsah /mnt/vm1.img
10G -rw-r--r--. 1 root root 10G Apr 11 11:26 /mnt/vm1.img

[root@dhcpxxxxx ~]# qemu-img create -f raw -o preallocation=falloc /mnt/vm1.img 30G
Formatting '/mnt/vm1.img', fmt=raw size=32212254720 preallocation=falloc
[root@dhcpxxxxx ~]# ls -lsah /mnt/vm1.img
30G -rw-r--r--. 1 root root 30G Apr 11 11:32 /mnt/vm1.img

Of course, I didn't go beyond 30G due to space constraints on my laptop.

If you could share your setup where you're hitting this bug, I'll take a look.

-Krutika
(In reply to Krutika Dhananjay from comment #7)
> So I tried this locally and I am not hitting the issue -
>
> [root@dhcpxxxxx ~]# qemu-img create -f raw -o preallocation=falloc /mnt/vm1.img 10G
> Formatting '/mnt/vm1.img', fmt=raw size=10737418240 preallocation=falloc
> [root@dhcpxxxxx ~]# ls -lsah /mnt/vm1.img
> 10G -rw-r--r--. 1 root root 10G Apr 11 11:26 /mnt/vm1.img
>
> [root@dhcpxxxxx ~]# qemu-img create -f raw -o preallocation=falloc /mnt/vm1.img 30G
> Formatting '/mnt/vm1.img', fmt=raw size=32212254720 preallocation=falloc
> [root@dhcpxxxxx ~]# ls -lsah /mnt/vm1.img
> 30G -rw-r--r--. 1 root root 30G Apr 11 11:32 /mnt/vm1.img
>
> Of course, I didn't go beyond 30G due to space constraints on my laptop.
>
> If you could share your setup where you're hitting this bug, I'll take a look.
>
> -Krutika

I can see this very consistently in two ways:

1. Create a VM image >= 1TB
---------------------------
[root@rhsqa-grafton7 test]# qemu-img create -f raw -o preallocation=falloc vm1.img 10G
Formatting 'vm1.img', fmt=raw size=10737418240 preallocation=falloc
[root@ ]# ls -lsah vm1.img
10G -rw-r--r--. 1 root root 10G May 2 10:30 vm1.img

[root@ ]# qemu-img create -f raw -o preallocation=falloc vm2.img 50G
Formatting 'vm2.img', fmt=raw size=53687091200 preallocation=falloc
[root@ ]# ls -lsah vm2.img
50G -rw-r--r--. 1 root root 50G May 2 10:30 vm2.img

[root@ ]# qemu-img create -f raw -o preallocation=falloc vm3.img 100G
Formatting 'vm3.img', fmt=raw size=107374182400 preallocation=falloc
[root@ ]# ls -lsah vm3.img
100G -rw-r--r--. 1 root root 100G May 2 10:33 vm3.img

[root@ ]# qemu-img create -f raw -o preallocation=falloc vm4.img 500G
Formatting 'vm4.img', fmt=raw size=536870912000 preallocation=falloc
[root@ ]# ls -lsah vm4.img
500G -rw-r--r--. 1 root root 500G May 2 10:33 vm4.img

Once the size reaches 1TB, you will see the issue:

[root@ ]# qemu-img create -f raw -o preallocation=falloc vm6.img 1T
Formatting 'vm6.img', fmt=raw size=1099511627776 preallocation=falloc
[root@ ]# ls -lsah vm6.img
8.0Z -rw-r--r--. 1 root root 1.0T May 2 10:35 vm6.img <-------- size on disk is far larger than expected

2. Recreate the image with the same name
-----------------------------------------
Observe what happens when, the second time, the image is created with the same name:

[root@ ]# qemu-img create -f raw -o preallocation=falloc vm1.img 10G
Formatting 'vm1.img', fmt=raw size=10737418240 preallocation=falloc
[root@ ]# ls -lsah vm1.img
10G -rw-r--r--. 1 root root 10G May 2 10:40 vm1.img

[root@ ]# qemu-img create -f raw -o preallocation=falloc vm1.img 20G <-------- the same file name vm1.img is used
Formatting 'vm1.img', fmt=raw size=21474836480 preallocation=falloc
[root@ ]# ls -lsah vm1.img
30G -rw-r--r--. 1 root root 20G May 2 10:40 vm1.img <---------- size on disk is 30G, though the file was created with 20G

I will provide a setup for the investigation.
Found part of the issue.

It's just a case of integer overflow.

A 32-bit signed int is being used to store the delta between the post-stat and pre-stat block-counts. The range of a 32-bit signed int is [-2,147,483,648, 2,147,483,647], whereas the number of blocks allocated as part of creating a preallocated 1TB file is (1TB/512) = 2,147,483,648 - just 1 more than INT_MAX (2,147,483,647) - which spills over to the negative half of the scale, making it -2,147,483,648. This number, on being copied to an int64, causes the most-significant 32 bits to be filled with 1s, making the block-count equal 554050781183 (or 0xffffffff80000000) in magnitude. That's the block-count that gets set on the backend in the block-count segment of the trusted.glusterfs.shard.file-size xattr -

[root@rhsqa-grafton7 data]# getfattr -d -m . -e hex /gluster_bricks/data/data/vm3.img
getfattr: Removing leading '/' from absolute path names
# file: gluster_bricks/data/data/vm3.img
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0x3faffa7142b74e739f3a82b9359d33e6
trusted.gfid2path.6356251b968111ad=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f766d332e696d67
trusted.glusterfs.shard.block-size=0x0000000004000000
trusted.glusterfs.shard.file-size=0x00000100000000000000000000000000ffffffff800000000000000000000000 <-- notice the "ffffffff80000000" in the block-count segment

But ..

[root@rhsqa-grafton7 test]# stat vm3.img
  File: ‘vm3.img’
  Size: 1099511627776  Blocks: 18446744071562067968  IO Block: 131072  regular file
Device: 29h/41d  Inode: 11473626732659815398  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:fusefs_t:s0
Access: 2019-05-02 14:11:11.693559069 +0530
Modify: 2019-05-02 14:12:38.245068328 +0530
Change: 2019-05-02 14:15:56.190546751 +0530
 Birth: -

stat shows the block-count as 18446744071562067968, which is way bigger than (554050781183 * 512).

In the response path, it turns out, the block-count further gets assigned to a uint64 number. The same bit pattern, when interpreted as uint64, becomes 18446744071562067968. And 18446744071562067968 * 512 is a whopping 8.0 Zettabytes!

This bug wasn't seen earlier because the earlier way of preallocating files never used fallocate, so the original signed 32-bit int variable delta_blocks would never exceed 131072.

Anyway, I'll soon be sending a fix for this.

Sas,

Do you have a single node with at least 1TB of free space that you can lend me where I can test the fix? The bug will only be hit when the image size is >= 1TB.

-Krutika
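For anyone who wants to see the arithmetic in isolation, below is a minimal, self-contained C sketch of the overflow described above. It is not the shard xlator source - delta_blocks is the only name borrowed from it, the pre/post stat values are simplified to constants, and two's-complement wrap-around on the int assignment is assumed:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Blocks allocated by fallocate for a 1TB file, in 512-byte units. */
    uint64_t post_blocks = (1024ULL * 1024 * 1024 * 1024) / 512; /* 2,147,483,648 */
    uint64_t pre_blocks  = 0;

    /* Bug: the delta is held in a 32-bit signed int. 2,147,483,648 is one
     * more than INT_MAX, so on a two's-complement machine it wraps to
     * -2,147,483,648. */
    int delta_blocks = (int)(post_blocks - pre_blocks);

    /* Copying to int64 sign-extends, filling the top 32 bits with 1s... */
    int64_t xattr_blocks = delta_blocks;            /* 0xffffffff80000000 */

    /* ...and reinterpreting that bit pattern as uint64 in the response
     * path yields the absurd block count stat reports. */
    uint64_t reported = (uint64_t)xattr_blocks;     /* 18446744071562067968 */

    printf("delta_blocks (int32): %d\n", delta_blocks);
    printf("xattr (int64 bits)  : 0x%016llx\n", (unsigned long long)xattr_blocks);
    printf("reported blocks     : %llu (* 512 bytes = ~8 ZB)\n",
           (unsigned long long)reported);
    return 0;
}

Multiplied by the 512-byte block size, the reported count matches the 8.0Z that ls -s shows on the mount.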
(In reply to Krutika Dhananjay from comment #10)
> Found part of the issue.

Sorry, this is not part of the issue but THE issue in its entirety. (That line is from an older draft I'd composed, which I forgot to change after root-causing the bug.)
(In reply to Krutika Dhananjay from comment #10)
> Sas,
>
> Do you have a single node with at least 1TB of free space that you can lend me where I can test the fix? The bug will only be hit when the image size is >= 1TB.
>
> -Krutika

Hi Krutika,

I will create a VM for you and provide the details for the same.
https://review.gluster.org/22655

With the fix:

[root@dhcp35-114 rpms-with-fix]# cd /mnt
[root@dhcp35-114 mnt]# qemu-img create -f raw -o preallocation=falloc vm6.img 1T
Formatting 'vm6.img', fmt=raw size=1099511627776 preallocation=falloc
[root@dhcp35-114 mnt]# ls -lsah
total 1.1T
4.0K drwxr-xr-x.  4 root root 4.0K May  3 11:34 .
   0 dr-xr-xr-x. 18 root root  239 May  2 17:26 ..
1.0T -rw-r--r--.  1 root root 1.0T May  3 11:34 vm6.img

Note that this patch doesn't fix the other issue, where re-creating a 10G image as a 20G image shows up as 30G in size. That's something I will investigate now.

-Krutika
OK, found the cause of the other issue.

Turns out the second call to qemu-img does an ftruncate of the original 10G file to size 0 and then does a fallocate -

<strace-output>
...
...
fstat(10, {st_mode=S_IFREG|0644, st_size=10737418240, ...}) = 0 <0.000044>
ftruncate(10, 0) = 0 <0.020559>
write(8, "\1\0\0\0\0\0\0\0", 8) = 8 <0.000085>
futex(0x5555fe8a4078, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {1556867505, 281258000}, ffffffff) = 0 <0.000092>
fstat(10, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 <0.000025>
fallocate(10, 0, 0, 21474836480) = 0 <0.098003>
...
...
</strace-output>

...and the TRUNCATE fop in shard is not updating the block-count correctly: it accounts for the change in block-count of only the base shard. In this example, that means that when the 10G file is truncated to 0B, the delta that never gets subtracted is ((10G - 64MB)/512) = 20971520 - 131072 = 20840448 blocks - i.e., almost the entire original 10G stays in the accounting, which is why the subsequent 20G fallocate leaves the file showing 30G.

-Krutika
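To make the accounting concrete, here is a simplified sketch in the same vein as the one above. It assumes the 64MB shard block size from this volume; truncate_delta_blocks is a hypothetical helper for illustration, not a function in the shard xlator:

#include <stdio.h>
#include <stdint.h>

#define SHARD_BLOCK_SIZE (64ULL * 1024 * 1024)  /* 64MB shards, as on this volume */

/* Delta (in 512-byte blocks) that a truncate-to-zero should apply to the
 * block-count segment of trusted.glusterfs.shard.file-size. */
static int64_t truncate_delta_blocks(uint64_t old_size, int buggy)
{
    if (buggy) {
        /* Pre-fix behaviour: only the base shard's blocks are subtracted. */
        uint64_t base = old_size < SHARD_BLOCK_SIZE ? old_size : SHARD_BLOCK_SIZE;
        return -(int64_t)(base / 512);
    }
    /* Correct behaviour: subtract the blocks of the whole file. */
    return -(int64_t)(old_size / 512);
}

int main(void)
{
    uint64_t ten_g = 10ULL * 1024 * 1024 * 1024;

    printf("buggy delta : %lld blocks\n",
           (long long)truncate_delta_blocks(ten_g, 1));   /* -131072 */
    printf("fixed delta : %lld blocks\n",
           (long long)truncate_delta_blocks(ten_g, 0));   /* -20971520 */
    /* The difference, 20840448 blocks (10G - 64MB), is what the pre-fix
     * accounting leaks on each ftruncate-to-0. */
    return 0;
}

With the leaked ~10G from the ftruncate added to the 20G fallocate that follows, the 30G figure from comment #9 falls out directly.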
In the RHHI use case, the size of image files >= 1TB is shown incorrectly because of this issue. As the issue has already been root-caused, proposing this bug for RHGS 3.5.0.
Here's the fix for the second issue - https://review.gluster.org/c/glusterfs/+/22681

Moving this bz to POST.
Fixes for both issues are merged in master - https://review.gluster.org/q/topic:%22ref-1705884%22+(status:open%20OR%20status:merged)

Providing devel_ack so it can be considered for RHGS 3.5.0.
Providing qa_ack for this bug, as it is essential, can be verified for RHGS 3.5.0, and already has devel_ack.
Found a crash with two of the tests under tests/bugs/shard while testing the backport. Holding off on porting until I figure out the cause of the crash.

-Krutika
(In reply to Krutika Dhananjay from comment #24)
> Found a crash with two of the tests under tests/bugs/shard while testing the backport. Holding off on porting until I figure out the cause of the crash.
>
> -Krutika

Sorry, false alarm.

Posted the patches downstream - https://code.engineering.redhat.com/gerrit/#/q/topic:ref-1668001+(status:open+OR+status:merged)
Tested with RHVH 4.3.5 based on RHEL 7.7, with an interim RHGS 3.5.0 build ( glusterfs-6.0-7 ), covering the following scenarios:

1. Created a preallocated image file of size 1TB or more
---------------------------------------------------------
Observed that the size of the image file is now consistent.

[root@ ]# qemu-img create -f raw -o preallocation=falloc vm2.img 1T
Formatting 'vm2.img', fmt=raw size=1099511627776 preallocation=falloc
[root@ ]# ls -lsah vm2.img
1.0T -rw-r--r--. 1 root root 1.0T Jul 3 21:26 vm2.img

[root@ ]# qemu-img create -f raw -o preallocation=falloc vm3.img 1.5T
Formatting 'vm3.img', fmt=raw size=1649267441664 preallocation=falloc
[root@ ]# ls -lsah vm3.img
1.5T -rw-r--r--. 1 root root 1.5T Jul 3 21:26 vm3.img

2. Created a preallocated image file with the same name
--------------------------------------------------------
[root@]# qemu-img create -f raw -o preallocation=falloc vm1.img 10G
Formatting 'vm1.img', fmt=raw size=10737418240 preallocation=falloc
[root@ ]# ls -lsah vm1.img
10G -rw-r--r--. 1 root root 10G Jul 3 21:25 vm1.img

[root@ ]# qemu-img create -f raw -o preallocation=falloc vm1.img 10G
Formatting 'vm1.img', fmt=raw size=10737418240 preallocation=falloc
[root@ ]# ls -lsah vm1.img
10G -rw-r--r--. 1 root root 10G Jul 3 21:26 vm1.img

In this case, too, the size of the image file is consistent.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:3249