Description  Krutika Dhananjay
2019-05-03 06:44:21 UTC
+++ This bug was initially created as a clone of Bug #1668001 +++
Description of problem:
-----------------------
The size of the VM image file as reported from the fuse mount is incorrect.
For a 1 TB file, the size on disk is reported as 8 ZB.
Version-Release number of selected component (if applicable):
-------------------------------------------------------------
upstream master
How reproducible:
------------------
Always
Steps to Reproduce:
-------------------
1. On the Gluster storage domain, create a preallocated disk image of size 1 TB
2. Check the size of the file after its creation has succeeded
Actual results:
---------------
The size of the file on disk is reported as 8 ZB, though the file was created with a size of 1 TB
Expected results:
-----------------
The size of the file should match the size specified by the user
Additional info:
----------------
The volume in question is a replica 3 volume with sharding enabled
[root@rhsqa-grafton10 ~]# gluster volume info data
Volume Name: data
Type: Replicate
Volume ID: 7eb49e90-e2b6-4f8f-856e-7108212dbb72
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: rhsqa-grafton10.lab.eng.blr.redhat.com:/gluster_bricks/data/data
Brick2: rhsqa-grafton11.lab.eng.blr.redhat.com:/gluster_bricks/data/data
Brick3: rhsqa-grafton12.lab.eng.blr.redhat.com:/gluster_bricks/data/data (arbiter)
Options Reconfigured:
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
cluster.choose-local: off
client.event-threads: 4
server.event-threads: 4
storage.owner-uid: 36
storage.owner-gid: 36
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.granular-entry-heal: enable
cluster.enable-shared-storage: enable
--- Additional comment from SATHEESARAN on 2019-01-21 16:32:39 UTC ---
Size of the file as reported from the fuse mount:
[root@ ~]# ls -lsah /rhev/data-center/mnt/glusterSD/rhsqa-grafton10.lab.eng.blr.redhat.com\:_data/bbeee86f-f174-4ec7-9ea3-a0df28709e64/images/0206953c-4850-4969-9dad-15140579d354/eaa5e81d-103c-4ce6-947e-8946806cca1b
8.0Z -rw-rw----. 1 vdsm kvm 1.1T Jan 21 17:14 /rhev/data-center/mnt/glusterSD/rhsqa-grafton10.lab.eng.blr.redhat.com:_data/bbeee86f-f174-4ec7-9ea3-a0df28709e64/images/0206953c-4850-4969-9dad-15140579d354/eaa5e81d-103c-4ce6-947e-8946806cca1b
[root@ ~]# du -shc /rhev/data-center/mnt/glusterSD/rhsqa-grafton10.lab.eng.blr.redhat.com\:_data/bbeee86f-f174-4ec7-9ea3-a0df28709e64/images/0206953c-4850-4969-9dad-15140579d354/eaa5e81d-103c-4ce6-947e-8946806cca1b
16E /rhev/data-center/mnt/glusterSD/rhsqa-grafton10.lab.eng.blr.redhat.com:_data/bbeee86f-f174-4ec7-9ea3-a0df28709e64/images/0206953c-4850-4969-9dad-15140579d354/eaa5e81d-103c-4ce6-947e-8946806cca1b
16E total
Note that the disk image is preallocated with 1072GB of space
--- Additional comment from SATHEESARAN on 2019-04-01 19:25:15 UTC ---
(In reply to SATHEESARAN from comment #5)
> (In reply to Krutika Dhananjay from comment #3)
> > Also, do you still have the setup in this state? If so, can I'd like to take
> > a look.
> >
> > -Krutika
>
> Hi Krutika,
>
> The setup is no longer available. Let me recreate the issue and provide you
> the setup
This issue is very easily reproducible. Create a preallocated image on the replicate volume with sharding enabled.
Use 'qemu-img' to create the VM image.
See the following test:
[root@ ~]# qemu-img create -f raw -o preallocation=falloc /mnt/test/vm1.img 1T
Formatting '/mnt/test/vm1.img', fmt=raw size=1099511627776 preallocation='falloc'
[root@ ]# ls /mnt/test
vm1.img
[root@ ]# ls -lsah vm1.img
8.0Z -rw-r--r--. 1 root root 1.0T Apr 2 00:45 vm1.img
--- Additional comment from Krutika Dhananjay on 2019-04-11 06:07:35 UTC ---
So I tried this locally and I am not hitting the issue -
[root@dhcpxxxxx ~]# qemu-img create -f raw -o preallocation=falloc /mnt/vm1.img 10G
Formatting '/mnt/vm1.img', fmt=raw size=10737418240 preallocation=falloc
[root@dhcpxxxxx ~]# ls -lsah /mnt/vm1.img
10G -rw-r--r--. 1 root root 10G Apr 11 11:26 /mnt/vm1.img
[root@dhcpxxxxx ~]# qemu-img create -f raw -o preallocation=falloc /mnt/vm1.img 30G
Formatting '/mnt/vm1.img', fmt=raw size=32212254720 preallocation=falloc
[root@dhcpxxxxx ~]# ls -lsah /mnt/vm1.img
30G -rw-r--r--. 1 root root 30G Apr 11 11:32 /mnt/vm1.img
Of course, I didn't go beyond 30G due to space constraints on my laptop.
If you could share your setup where you're hitting this bug, I'll take a look.
-Krutika
--- Additional comment from SATHEESARAN on 2019-05-02 05:21:01 UTC ---
(In reply to Krutika Dhananjay from comment #7)
> So I tried this locally and I am not hitting the issue -
I can reproduce this very consistently in two ways:
1. Create a VM image >= 1 TB
--------------------------
[root@rhsqa-grafton7 test]# qemu-img create -f raw -o preallocation=falloc vm1.img 10G
Formatting 'vm1.img', fmt=raw size=10737418240 preallocation=falloc
[root@ ]# ls -lsah vm1.img
10G -rw-r--r--. 1 root root 10G May 2 10:30 vm1.img
[root@ ]# qemu-img create -f raw -o preallocation=falloc vm2.img 50G
Formatting 'vm2.img', fmt=raw size=53687091200 preallocation=falloc
[root@ ]# ls -lsah vm2.img
50G -rw-r--r--. 1 root root 50G May 2 10:30 vm2.img
[root@ ]# qemu-img create -f raw -o preallocation=falloc vm3.img 100G
Formatting 'vm3.img', fmt=raw size=107374182400 preallocation=falloc
[root@ ]# ls -lsah vm3.img
100G -rw-r--r--. 1 root root 100G May 2 10:33 vm3.img
[root@ ]# qemu-img create -f raw -o preallocation=falloc vm4.img 500G
Formatting 'vm4.img', fmt=raw size=536870912000 preallocation=falloc
[root@ ]# ls -lsah vm4.img
500G -rw-r--r--. 1 root root 500G May 2 10:33 vm4.img
Once the size reaches 1 TB, you will see the issue:
[root@ ]# qemu-img create -f raw -o preallocation=falloc vm6.img 1T
Formatting 'vm6.img', fmt=raw size=1099511627776 preallocation=falloc
[root@ ]# ls -lsah vm6.img
8.0Z -rw-r--r--. 1 root root 1.0T May 2 10:35 vm6.img <-------- size on disk is far larger than expected
2. Recreate the image with the same name
-----------------------------------------
Observe what happens the second time an image is created with the same name:
[root@ ]# qemu-img create -f raw -o preallocation=falloc vm1.img 10G
Formatting 'vm1.img', fmt=raw size=10737418240 preallocation=falloc
[root@ ]# ls -lsah vm1.img
10G -rw-r--r--. 1 root root 10G May 2 10:40 vm1.img
[root@ ]# qemu-img create -f raw -o preallocation=falloc vm1.img 20G <-------- The same file name vm1.img is used
Formatting 'vm1.img', fmt=raw size=21474836480 preallocation=falloc
[root@ ]# ls -lsah vm1.img
30G -rw-r--r--. 1 root root 20G May 2 10:40 vm1.img <---------- size on disk is 30G, though the file was created with a size of 20G
I will provide a setup for the investigation.
--- Additional comment from SATHEESARAN on 2019-05-02 05:23:07 UTC ---
The setup details:
-------------------
rhsqa-grafton7.lab.eng.blr.redhat.com ( root/redhat )
volume: data ( replica 3, sharded )
The volume is currently mounted at: /mnt/test
Note: This is the RHVH installation.
@krutika, if you need more info, just ping me on IRC / Google Chat
--- Additional comment from Krutika Dhananjay on 2019-05-02 10:16:40 UTC ---
Found part of the issue.
It's just a case of integer overflow.
A 32-bit signed int is being used to store the delta between the post-stat and pre-stat block counts.
The range of a 32-bit signed int is [-2,147,483,648, 2,147,483,647], whereas the number of 512-byte blocks allocated while creating a preallocated 1 TB file is 1TB/512 = 2,147,483,648, which is just 1 more than INT_MAX (2,147,483,647). The value therefore wraps around to the negative half of the range, becoming -2,147,483,648.
When this number is copied into an int64, sign extension fills the most significant 32 bits with 1s, making the block count equal 554050781183 (or 0xffffffff80000000) in magnitude.
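For illustration, here's a minimal standalone sketch of that wrap-around and sign extension (variable names are illustrative; this is not the actual shard translator code):

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint64_t file_size = 1099511627776ULL;   /* "1T" as used by qemu-img, i.e. 2^40 bytes */

    /* Storing the block delta in a 32-bit signed int: 1TB/512 = 2147483648 is one more
     * than INT_MAX, so on the usual two's-complement implementations the conversion
     * wraps to -2147483648. */
    int32_t delta_blocks = (int32_t)(file_size / 512);

    /* Widening to int64_t sign-extends, filling the upper 32 bits with 1s. */
    int64_t widened = delta_blocks;

    printf("delta_blocks = %" PRId32 "\n", delta_blocks);           /* -2147483648 */
    printf("widened      = 0x%016" PRIx64 "\n", (uint64_t)widened); /* 0xffffffff80000000 */
    return 0;
}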
That's the block count that gets set on the backend, in the block-count segment of the trusted.glusterfs.shard.file-size xattr -
[root@rhsqa-grafton7 data]# getfattr -d -m . -e hex /gluster_bricks/data/data/vm3.img
getfattr: Removing leading '/' from absolute path names
# file: gluster_bricks/data/data/vm3.img
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0x3faffa7142b74e739f3a82b9359d33e6
trusted.gfid2path.6356251b968111ad=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f766d332e696d67
trusted.glusterfs.shard.block-size=0x0000000004000000
trusted.glusterfs.shard.file-size=0x00000100000000000000000000000000ffffffff800000000000000000000000 <-- notice the "ffffffff80000000" in the block-count segment
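Just to make the segments explicit, here's a quick decoder for that xattr value, assuming the layout implied by the annotation above (file size in the first big-endian 64-bit word, block count in the third):

#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Extract the big-endian 64-bit word at the given index from a hex string. */
static uint64_t word_at(const char *hex, int idx)
{
    char buf[17];
    memcpy(buf, hex + idx * 16, 16);
    buf[16] = '\0';
    return strtoull(buf, NULL, 16);
}

int main(void)
{
    /* trusted.glusterfs.shard.file-size value from the getfattr output above,
     * without the leading "0x". */
    const char *xattr = "00000100000000000000000000000000"
                        "ffffffff800000000000000000000000";

    printf("file size   = %" PRIu64 "\n", word_at(xattr, 0));      /* 1099511627776 (1T) */
    printf("block count = 0x%016" PRIx64 "\n", word_at(xattr, 2)); /* 0xffffffff80000000 */
    return 0;
}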
But ..
[root@rhsqa-grafton7 test]# stat vm3.img
File: ‘vm3.img’
Size: 1099511627776 Blocks: 18446744071562067968 IO Block: 131072 regular file
Device: 29h/41d Inode: 11473626732659815398 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Context: system_u:object_r:fusefs_t:s0
Access: 2019-05-02 14:11:11.693559069 +0530
Modify: 2019-05-02 14:12:38.245068328 +0530
Change: 2019-05-02 14:15:56.190546751 +0530
Birth: -
stat shows the block count as 18446744071562067968, which is way bigger than (554050781183 * 512).
It turns out that in the response path the block count further gets assigned to a uint64 number.
The same value, when interpreted as a uint64, becomes 18446744071562067968.
18446744071562067968 * 512 bytes is a whopping 8.0 zettabytes - exactly the 8.0Z that ls -h reports!
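A tiny sketch of that last step - the sign-extended value reinterpreted as unsigned and multiplied by the 512-byte block size - shows where the 8.0Z comes from (again illustrative, not the actual translator code):

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    /* The sign-extended block count from the xattr, reinterpreted as unsigned,
     * as apparently happens in the response path. */
    uint64_t blocks = 0xffffffff80000000ULL;   /* 18446744071562067968 */
    double bytes = (double)blocks * 512;

    printf("blocks = %" PRIu64 "\n", blocks);
    /* ls -h and du -h use binary units; 1 ZiB = 2^70 bytes. */
    printf("size   = %.1f ZiB\n", bytes / 1180591620717411303424.0);  /* ~8.0 */
    return 0;
}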
This bug wasn't seen earlier because the old way of preallocating files never used fallocate, so the signed 32-bit int variable delta_blocks would never exceed 131072 (one 64 MB shard's worth of 512-byte blocks).
Anyway, I'll be sending a fix for this soon.
Sas,
Do you have a single node with at least 1 TB of free space that you can lend me so I can test the fix? The bug is hit only when the image size is >= 1 TB.
-Krutika
--- Additional comment from Krutika Dhananjay on 2019-05-02 10:18:26 UTC ---
(In reply to Krutika Dhananjay from comment #10)
> Found part of the issue.
Sorry, this is not part of the issue but THE issue in its entirety. (That line is from an older draft I'd composed, which I forgot to change after root-causing the bug.)
REVIEW: https://review.gluster.org/22655 (features/shard: Fix integer overflow in block count accounting) posted (#1) for review on master by Krutika Dhananjay
REVIEW: https://review.gluster.org/22681 (features/shard: Fix block-count accounting upon truncate to lower size) posted (#1) for review on master by Krutika Dhananjay
REVIEW: https://review.gluster.org/22681 (features/shard: Fix block-count accounting upon truncate to lower size) merged (#6) on master by Xavi Hernandez