+++ This bug was initially created as a clone of Bug #1705884 +++
+++ This bug was initially created as a clone of Bug #1668001 +++

Description of problem:
-----------------------
The size of the VM image file as reported from the fuse mount is incorrect.
For a file of size 1 TB, the size of the file on disk is reported as 8 ZB.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
upstream master

How reproducible:
------------------
Always

Steps to Reproduce:
-------------------
1. On the Gluster storage domain, create a preallocated disk image of size 1 TB
2. Check the size of the file after its creation has succeeded

Actual results:
---------------
The size of the file on disk is reported as 8 ZB, though the file was created with a size of 1 TB.

Expected results:
-----------------
The size of the file should be the same as the size created by the user.

Additional info:
----------------
The volume in question is replica 3 sharded:

[root@rhsqa-grafton10 ~]# gluster volume info data

Volume Name: data
Type: Replicate
Volume ID: 7eb49e90-e2b6-4f8f-856e-7108212dbb72
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: rhsqa-grafton10.lab.eng.blr.redhat.com:/gluster_bricks/data/data
Brick2: rhsqa-grafton11.lab.eng.blr.redhat.com:/gluster_bricks/data/data
Brick3: rhsqa-grafton12.lab.eng.blr.redhat.com:/gluster_bricks/data/data (arbiter)
Options Reconfigured:
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
cluster.choose-local: off
client.event-threads: 4
server.event-threads: 4
storage.owner-uid: 36
storage.owner-gid: 36
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.granular-entry-heal: enable
cluster.enable-shared-storage: enable

--- Additional comment from SATHEESARAN on 2019-01-21 16:32:39 UTC ---

Size of the file as reported from the fuse mount:

[root@ ~]# ls -lsah /rhev/data-center/mnt/glusterSD/rhsqa-grafton10.lab.eng.blr.redhat.com\:_data/bbeee86f-f174-4ec7-9ea3-a0df28709e64/images/0206953c-4850-4969-9dad-15140579d354/eaa5e81d-103c-4ce6-947e-8946806cca1b
8.0Z -rw-rw----. 1 vdsm kvm 1.1T Jan 21 17:14 /rhev/data-center/mnt/glusterSD/rhsqa-grafton10.lab.eng.blr.redhat.com:_data/bbeee86f-f174-4ec7-9ea3-a0df28709e64/images/0206953c-4850-4969-9dad-15140579d354/eaa5e81d-103c-4ce6-947e-8946806cca1b

[root@ ~]# du -shc /rhev/data-center/mnt/glusterSD/rhsqa-grafton10.lab.eng.blr.redhat.com\:_data/bbeee86f-f174-4ec7-9ea3-a0df28709e64/images/0206953c-4850-4969-9dad-15140579d354/eaa5e81d-103c-4ce6-947e-8946806cca1b
16E  /rhev/data-center/mnt/glusterSD/rhsqa-grafton10.lab.eng.blr.redhat.com:_data/bbeee86f-f174-4ec7-9ea3-a0df28709e64/images/0206953c-4850-4969-9dad-15140579d354/eaa5e81d-103c-4ce6-947e-8946806cca1b
16E  total

Note that the disk image is preallocated with 1072 GB of space.

--- Additional comment from SATHEESARAN on 2019-04-01 19:25:15 UTC ---

(In reply to SATHEESARAN from comment #5)
> (In reply to Krutika Dhananjay from comment #3)
> > Also, do you still have the setup in this state? If so, I'd like to take
> > a look.
> >
> > -Krutika
>
> Hi Krutika,
>
> The setup is no longer available. Let me recreate the issue and provide you
> the setup

This issue is very easily reproducible. Create a preallocated image on a replicate volume with sharding enabled. Use 'qemu-img' to create the VM image. See the following test:

[root@ ~]# qemu-img create -f raw -o preallocation=falloc /mnt/test/vm1.img 1T
Formatting '/mnt/test/vm1.img', fmt=raw size=1099511627776 preallocation='falloc'

[root@ ]# ls /mnt/test
vm1.img

[root@ ]# ls -lsah vm1.img
8.0Z -rw-r--r--. 1 root root 1.0T Apr  2 00:45 vm1.img

--- Additional comment from Krutika Dhananjay on 2019-04-11 06:07:35 UTC ---

So I tried this locally and I am not hitting the issue -

[root@dhcpxxxxx ~]# qemu-img create -f raw -o preallocation=falloc /mnt/vm1.img 10G
Formatting '/mnt/vm1.img', fmt=raw size=10737418240 preallocation=falloc
[root@dhcpxxxxx ~]# ls -lsah /mnt/vm1.img
10G -rw-r--r--. 1 root root 10G Apr 11 11:26 /mnt/vm1.img

[root@dhcpxxxxx ~]# qemu-img create -f raw -o preallocation=falloc /mnt/vm1.img 30G
Formatting '/mnt/vm1.img', fmt=raw size=32212254720 preallocation=falloc
[root@dhcpxxxxx ~]# ls -lsah /mnt/vm1.img
30G -rw-r--r--. 1 root root 30G Apr 11 11:32 /mnt/vm1.img

Of course, I didn't go beyond 30G due to space constraints on my laptop.

If you could share your setup where you're hitting this bug, I'll take a look.

-Krutika

--- Additional comment from SATHEESARAN on 2019-05-02 05:21:01 UTC ---

(In reply to Krutika Dhananjay from comment #7)
> So I tried this locally and I am not hitting the issue -
>
> [root@dhcpxxxxx ~]# qemu-img create -f raw -o preallocation=falloc
> /mnt/vm1.img 10G
> Formatting '/mnt/vm1.img', fmt=raw size=10737418240 preallocation=falloc
> [root@dhcpxxxxx ~]# ls -lsah /mnt/vm1.img
> 10G -rw-r--r--. 1 root root 10G Apr 11 11:26 /mnt/vm1.img
>
> [root@dhcpxxxxx ~]# qemu-img create -f raw -o preallocation=falloc
> /mnt/vm1.img 30G
> Formatting '/mnt/vm1.img', fmt=raw size=32212254720 preallocation=falloc
> [root@dhcpxxxxx ~]# ls -lsah /mnt/vm1.img
> 30G -rw-r--r--. 1 root root 30G Apr 11 11:32 /mnt/vm1.img
>
> Of course, I didn't go beyond 30G due to space constraints on my laptop.
>
> If you could share your setup where you're hitting this bug, I'll take a
> look.
>
> -Krutika

I can see this very consistently, in two fashions:

1. Create a VM image >= 1 TB
--------------------------
[root@rhsqa-grafton7 test]# qemu-img create -f raw -o preallocation=falloc vm1.img 10G
Formatting 'vm1.img', fmt=raw size=10737418240 preallocation=falloc
[root@ ]# ls -lsah vm1.img
10G -rw-r--r--. 1 root root 10G May  2 10:30 vm1.img

[root@ ]# qemu-img create -f raw -o preallocation=falloc vm2.img 50G
Formatting 'vm2.img', fmt=raw size=53687091200 preallocation=falloc
[root@ ]# ls -lsah vm2.img
50G -rw-r--r--. 1 root root 50G May  2 10:30 vm2.img

[root@ ]# qemu-img create -f raw -o preallocation=falloc vm3.img 100G
Formatting 'vm3.img', fmt=raw size=107374182400 preallocation=falloc
[root@ ]# ls -lsah vm3.img
100G -rw-r--r--. 1 root root 100G May  2 10:33 vm3.img

[root@ ]# qemu-img create -f raw -o preallocation=falloc vm4.img 500G
Formatting 'vm4.img', fmt=raw size=536870912000 preallocation=falloc
[root@ ]# ls -lsah vm4.img
500G -rw-r--r--. 1 root root 500G May  2 10:33 vm4.img

Once the size reaches 1 TB, you will see this issue:

[root@ ]# qemu-img create -f raw -o preallocation=falloc vm6.img 1T
Formatting 'vm6.img', fmt=raw size=1099511627776 preallocation=falloc
[root@ ]# ls -lsah vm6.img
8.0Z -rw-r--r--. 1 root root 1.0T May  2 10:35 vm6.img   <-------- size on disk is far larger than expected

2. Recreate the image with the same name
-----------------------------------------
Observe what happens when the image is created a second time with the same name:

[root@ ]# qemu-img create -f raw -o preallocation=falloc vm1.img 10G
Formatting 'vm1.img', fmt=raw size=10737418240 preallocation=falloc
[root@ ]# ls -lsah vm1.img
10G -rw-r--r--. 1 root root 10G May  2 10:40 vm1.img

[root@ ]# qemu-img create -f raw -o preallocation=falloc vm1.img 20G   <-------- the same file name vm1.img is used
Formatting 'vm1.img', fmt=raw size=21474836480 preallocation=falloc
[root@ ]# ls -lsah vm1.img
30G -rw-r--r--. 1 root root 20G May  2 10:40 vm1.img   <---------- size on disk is 30G, though the file is created with 20G

I will provide a setup for the investigation.

--- Additional comment from SATHEESARAN on 2019-05-02 05:23:07 UTC ---

The setup details:
-------------------
rhsqa-grafton7.lab.eng.blr.redhat.com ( root/redhat )
volume: data ( replica 3, sharded )
The volume is currently mounted at: /mnt/test

Note: This is the RHVH installation.

@krutika, if you need more info, just ping me on IRC / Google Chat.

--- Additional comment from Krutika Dhananjay on 2019-05-02 10:16:40 UTC ---

Found part of the issue.

It's just a case of integer overflow.
A 32-bit signed int is being used to store the delta between the post-stat and pre-stat block counts.
The range of a 32-bit signed int is [-2,147,483,648, 2,147,483,647], whereas the number of blocks allocated as part of creating a preallocated 1 TB file is (1 TB / 512) = 2,147,483,648, which is just 1 more than INT_MAX (2,147,483,647); this spills over to the negative half of the range, making it -2,147,483,648.
This number, on being copied to an int64, causes the most significant 32 bits to be filled with 1s, making the block count equal 554050781183 (or 0xffffffff80000000) in magnitude.
That's the block count that gets set on the backend in the trusted.glusterfs.shard.file-size xattr, in its block-count segment -

[root@rhsqa-grafton7 data]# getfattr -d -m . -e hex /gluster_bricks/data/data/vm3.img
getfattr: Removing leading '/' from absolute path names
# file: gluster_bricks/data/data/vm3.img
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0x3faffa7142b74e739f3a82b9359d33e6
trusted.gfid2path.6356251b968111ad=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f766d332e696d67
trusted.glusterfs.shard.block-size=0x0000000004000000
trusted.glusterfs.shard.file-size=0x00000100000000000000000000000000ffffffff800000000000000000000000   <-- notice the "ffffffff80000000" in the block-count segment

But ..

[root@rhsqa-grafton7 test]# stat vm3.img
  File: ‘vm3.img’
  Size: 1099511627776   Blocks: 18446744071562067968   IO Block: 131072   regular file
Device: 29h/41d   Inode: 11473626732659815398   Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 0/ root)   Gid: ( 0/ root)
Context: system_u:object_r:fusefs_t:s0
Access: 2019-05-02 14:11:11.693559069 +0530
Modify: 2019-05-02 14:12:38.245068328 +0530
Change: 2019-05-02 14:15:56.190546751 +0530
 Birth: -

stat shows the block count as 18446744071562067968, which is way bigger than (554050781183 * 512).

In the response path, it turns out the block count further gets assigned to a uint64 number.
The same number, when expressed as uint64, becomes 18446744071562067968.
18446744071562067968 * 512 is a whopping 8.0 Zettabytes!
This bug wasn't seen earlier because the earlier way of preallocating files never used fallocate, so the original signed 32-bit int variable delta_blocks would never exceed 131072.

Anyway, I'll soon be sending a fix for this.

Sas,

Do you have a single node with at least 1 TB of free space that you can lend me, where I can test the fix? The bug is only hit when the image size is >= 1 TB.

-Krutika

--- Additional comment from Krutika Dhananjay on 2019-05-02 10:18:26 UTC ---

(In reply to Krutika Dhananjay from comment #10)
> Found part of the issue.

Sorry, this is not part of the issue but THE issue in its entirety. (That line is from an older draft I'd composed which I forgot to change after rc'ing the bug.)

> It's just a case of integer overflow.
> 32-bit signed int is being used to store delta between post-stat and
> pre-stat block-counts.
> The range of numbers for 32-bit signed int is [-2,147,483,648,
> 2,147,483,647] whereas the number of blocks allocated
> as part of creating a preallocated 1TB file is (1TB/512) = 2,147,483,648
> which is just 1 more than INT_MAX (2,147,483,647)
> which spills over to the negative half the scale making it -2,147,483,648.
> This number, on being copied to int64 causes the most-significant 32 bits to
> be filled with 1 making the block-count equal 554050781183 (or
> 0xffffffff80000000) in magnitude.
> That's the block-count that gets set on the backend in
> trusted.glusterfs.shard.file-size xattr in the block-count segment -
>
> [root@rhsqa-grafton7 data]# getfattr -d -m . -e hex
> /gluster_bricks/data/data/vm3.img
> getfattr: Removing leading '/' from absolute path names
> # file: gluster_bricks/data/data/vm3.img
> security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
> trusted.afr.dirty=0x000000000000000000000000
> trusted.gfid=0x3faffa7142b74e739f3a82b9359d33e6
> trusted.gfid2path.6356251b968111ad=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f766d332e696d67
>
> trusted.glusterfs.shard.block-size=0x0000000004000000
> trusted.glusterfs.shard.file-size=0x00000100000000000000000000000000ffffffff800000000000000000000000 <--
> notice the "ffffffff80000000" in the block-count segment
>
> But ..
>
> [root@rhsqa-grafton7 test]# stat vm3.img
>   File: ‘vm3.img’
>   Size: 1099511627776   Blocks: 18446744071562067968   IO Block: 131072   regular file
> Device: 29h/41d   Inode: 11473626732659815398   Links: 1
> Access: (0644/-rw-r--r--)  Uid: ( 0/ root)   Gid: ( 0/ root)
> Context: system_u:object_r:fusefs_t:s0
> Access: 2019-05-02 14:11:11.693559069 +0530
> Modify: 2019-05-02 14:12:38.245068328 +0530
> Change: 2019-05-02 14:15:56.190546751 +0530
>  Birth: -
>
> stat shows block-count as 18446744071562067968 which is way bigger than
> (554050781183 * 512).
>
> In the response path, turns out the block-count further gets assigned to a
> uint64 number.
> The same number, when expressed as uint64 becomes 18446744071562067968.
> 18446744071562067968 * 512 is a whopping 8.0 Zettabytes!
>
> This bug wasn't seen earlier because the earlier way of preallocating files
> never used fallocate, so the original signed 32 int variable delta_blocks
> would never exceed 131072.
>
> Anyway, I'll be soon sending a fix for this.
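To make the arithmetic above concrete, the following minimal standalone C sketch (hypothetical variable names, not the actual shard.c code) walks through the same sign-extension path, assuming a typical two's-complement platform:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    uint64_t file_size    = 1099511627776ULL;            /* 1 TiB, preallocated via fallocate */
    int32_t  delta_blocks = (int32_t)(file_size / 512);  /* 2,147,483,648 wraps to -2,147,483,648 */
    int64_t  widened      = delta_blocks;                /* sign-extends to 0xffffffff80000000 */
    uint64_t block_count  = (uint64_t)widened;           /* 18446744071562067968 */

    printf("delta_blocks (int32)  : %" PRId32 "\n", delta_blocks);
    printf("widened      (int64)  : %" PRId64 "\n", widened);
    printf("block_count  (uint64) : %" PRIu64 "\n", block_count);
    /* Tools such as ls and stat report size-on-disk as block_count * 512,
     * which for 18446744071562067968 blocks comes to roughly 8.0 ZB. */
    return 0;
}

On a typical gcc/x86_64 build this prints -2147483648 for the 32-bit delta and 18446744071562067968 for the resulting unsigned block count, matching the Blocks value in the stat output above.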
--- Additional comment from Worker Ant on 2019-05-03 06:58:51 UTC ---

REVIEW: https://review.gluster.org/22655 (features/shard: Fix integer overflow in block count accounting) posted (#1) for review on master by Krutika Dhananjay

--- Additional comment from Worker Ant on 2019-05-06 10:49:43 UTC ---

REVIEW: https://review.gluster.org/22655 (features/shard: Fix integer overflow in block count accounting) merged (#2) on master by Xavi Hernandez

--- Additional comment from Worker Ant on 2019-05-08 08:46:18 UTC ---

REVIEW: https://review.gluster.org/22681 (features/shard: Fix block-count accounting upon truncate to lower size) posted (#1) for review on master by Krutika Dhananjay

--- Additional comment from Worker Ant on 2019-06-04 07:30:49 UTC ---

REVIEW: https://review.gluster.org/22681 (features/shard: Fix block-count accounting upon truncate to lower size) merged (#6) on master by Xavi Hernandez
REVIEW: https://review.gluster.org/22817 (features/shard: Fix integer overflow in block count accounting) posted (#1) for review on release-6 by Krutika Dhananjay
REVIEW: https://review.gluster.org/22819 (features/shard: Fix block-count accounting upon truncate to lower size) posted (#1) for review on release-6 by Krutika Dhananjay
REVIEW: https://review.gluster.org/22817 (features/shard: Fix integer overflow in block count accounting) merged (#4) on release-6 by hari gowtham
REVIEW: https://review.gluster.org/22819 (features/shard: Fix block-count accounting upon truncate to lower size) merged (#2) on release-6 by hari gowtham
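For context on the shape of the overflow fix: judging only from the patch titles above (the sketch below is hypothetical and not the actual gluster change), the essential idea is to carry the block-count delta in a 64-bit signed type instead of a 32-bit one.

#include <stdint.h>

/* Overflow-prone pattern: a 32-bit signed delta wraps once a single
 * operation allocates 2^31 or more 512-byte blocks (i.e. >= 1 TiB). */
int32_t delta_blocks_32(uint64_t post_blocks, uint64_t pre_blocks)
{
    return (int32_t)(post_blocks - pre_blocks);
}

/* Overflow-safe pattern: keep the delta in a 64-bit signed type end to end,
 * so large allocations (and truncations to a lower size, which make the
 * delta negative) are accounted correctly. */
int64_t delta_blocks_64(uint64_t post_blocks, uint64_t pre_blocks)
{
    return (int64_t)post_blocks - (int64_t)pre_blocks;
}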