Description  Krutika Dhananjay
2019-05-03 06:44:21 UTC
+++ This bug was initially created as a clone of Bug #1668001 +++
Description of problem:
-----------------------
The size of the VM image file as reported from the fuse mount is incorrect.
For a 1 TB file, the size on disk is reported as 8 ZB.
Version-Release number of selected component (if applicable):
-------------------------------------------------------------
upstream master
How reproducible:
------------------
Always
Steps to Reproduce:
-------------------
1. On the Gluster storage domain, create a preallocated disk image of size 1 TB
2. Check the size of the file after its creation has succeeded
Actual results:
---------------
The size of the file on disk is reported as 8 ZB, though the file was created with a size of 1 TB
Expected results:
-----------------
The size of the file should match the size specified by the user
Additional info:
----------------
The volume in question is a replica 3 volume with sharding enabled
[root@rhsqa-grafton10 ~]# gluster volume info data
Volume Name: data
Type: Replicate
Volume ID: 7eb49e90-e2b6-4f8f-856e-7108212dbb72
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: rhsqa-grafton10.lab.eng.blr.redhat.com:/gluster_bricks/data/data
Brick2: rhsqa-grafton11.lab.eng.blr.redhat.com:/gluster_bricks/data/data
Brick3: rhsqa-grafton12.lab.eng.blr.redhat.com:/gluster_bricks/data/data (arbiter)
Options Reconfigured:
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
cluster.choose-local: off
client.event-threads: 4
server.event-threads: 4
storage.owner-uid: 36
storage.owner-gid: 36
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.granular-entry-heal: enable
cluster.enable-shared-storage: enable
--- Additional comment from SATHEESARAN on 2019-01-21 16:32:39 UTC ---
Size of the file as reported from the fuse mount:
[root@ ~]# ls -lsah /rhev/data-center/mnt/glusterSD/rhsqa-grafton10.lab.eng.blr.redhat.com\:_data/bbeee86f-f174-4ec7-9ea3-a0df28709e64/images/0206953c-4850-4969-9dad-15140579d354/eaa5e81d-103c-4ce6-947e-8946806cca1b
8.0Z -rw-rw----. 1 vdsm kvm 1.1T Jan 21 17:14 /rhev/data-center/mnt/glusterSD/rhsqa-grafton10.lab.eng.blr.redhat.com:_data/bbeee86f-f174-4ec7-9ea3-a0df28709e64/images/0206953c-4850-4969-9dad-15140579d354/eaa5e81d-103c-4ce6-947e-8946806cca1b
[root@ ~]# du -shc /rhev/data-center/mnt/glusterSD/rhsqa-grafton10.lab.eng.blr.redhat.com\:_data/bbeee86f-f174-4ec7-9ea3-a0df28709e64/images/0206953c-4850-4969-9dad-15140579d354/eaa5e81d-103c-4ce6-947e-8946806cca1b
16E /rhev/data-center/mnt/glusterSD/rhsqa-grafton10.lab.eng.blr.redhat.com:_data/bbeee86f-f174-4ec7-9ea3-a0df28709e64/images/0206953c-4850-4969-9dad-15140579d354/eaa5e81d-103c-4ce6-947e-8946806cca1b
16E total
Note that the disk image is preallocated with 1072GB of space
--- Additional comment from SATHEESARAN on 2019-04-01 19:25:15 UTC ---
(In reply to SATHEESARAN from comment #5)
> (In reply to Krutika Dhananjay from comment #3)
> > Also, do you still have the setup in this state? If so, can I'd like to take
> > a look.
> >
> > -Krutika
>
> Hi Krutika,
>
> The setup is no longer available. Let me recreate the issue and provide you
> the setup
This issue is very easily reproducible. Create a preallocated image on the replicate volume with sharding enabled.
Use 'qemu-img' to create the VM image.
See the following test:
[root@ ~]# qemu-img create -f raw -o preallocation=falloc /mnt/test/vm1.img 1T
Formatting '/mnt/test/vm1.img', fmt=raw size=1099511627776 preallocation='falloc'
[root@ ]# ls /mnt/test
vm1.img
[root@ ]# ls -lsah vm1.img
8.0Z -rw-r--r--. 1 root root 1.0T Apr 2 00:45 vm1.img
--- Additional comment from Krutika Dhananjay on 2019-04-11 06:07:35 UTC ---
So I tried this locally and I am not hitting the issue -
[root@dhcpxxxxx ~]# qemu-img create -f raw -o preallocation=falloc /mnt/vm1.img 10G
Formatting '/mnt/vm1.img', fmt=raw size=10737418240 preallocation=falloc
[root@dhcpxxxxx ~]# ls -lsah /mnt/vm1.img
10G -rw-r--r--. 1 root root 10G Apr 11 11:26 /mnt/vm1.img
[root@dhcpxxxxx ~]# qemu-img create -f raw -o preallocation=falloc /mnt/vm1.img 30G
Formatting '/mnt/vm1.img', fmt=raw size=32212254720 preallocation=falloc
[root@dhcpxxxxx ~]# ls -lsah /mnt/vm1.img
30G -rw-r--r--. 1 root root 30G Apr 11 11:32 /mnt/vm1.img
Of course, I didn't go beyond 30G due to space constraints on my laptop.
If you could share your setup where you're hitting this bug, I'll take a look.
-Krutika
--- Additional comment from SATHEESARAN on 2019-05-02 05:21:01 UTC ---
(In reply to Krutika Dhananjay from comment #7)
> So I tried this locally and I am not hitting the issue -
I can reproduce this very consistently in two ways:
1. Create a VM image >= 1 TB
--------------------------
[root@rhsqa-grafton7 test]# qemu-img create -f raw -o preallocation=falloc vm1.img 10G
Formatting 'vm1.img', fmt=raw size=10737418240 preallocation=falloc
[root@ ]# ls -lsah vm1.img
10G -rw-r--r--. 1 root root 10G May 2 10:30 vm1.img
[root@ ]# qemu-img create -f raw -o preallocation=falloc vm2.img 50G
Formatting 'vm2.img', fmt=raw size=53687091200 preallocation=falloc
[root@ ]# ls -lsah vm2.img
50G -rw-r--r--. 1 root root 50G May 2 10:30 vm2.img
[root@ ]# qemu-img create -f raw -o preallocation=falloc vm3.img 100G
Formatting 'vm3.img', fmt=raw size=107374182400 preallocation=falloc
[root@ ]# ls -lsah vm3.img
100G -rw-r--r--. 1 root root 100G May 2 10:33 vm3.img
[root@ ]# qemu-img create -f raw -o preallocation=falloc vm4.img 500G
Formatting 'vm4.img', fmt=raw size=536870912000 preallocation=falloc
[root@ ]# ls -lsah vm4.img
500G -rw-r--r--. 1 root root 500G May 2 10:33 vm4.img
Once the size reaches 1 TB, you will see the issue:
[root@ ]# qemu-img create -f raw -o preallocation=falloc vm6.img 1T
Formatting 'vm6.img', fmt=raw size=1099511627776 preallocation=falloc
[root@ ]# ls -lsah vm6.img
8.0Z -rw-r--r--. 1 root root 1.0T May 2 10:35 vm6.img <-------- size on disk is far larger than expected
2. Recreate the image with the same name
-----------------------------------------
Observe what happens the second time an image is created with the same name:
[root@ ]# qemu-img create -f raw -o preallocation=falloc vm1.img 10G
Formatting 'vm1.img', fmt=raw size=10737418240 preallocation=falloc
[root@ ]# ls -lsah vm1.img
10G -rw-r--r--. 1 root root 10G May 2 10:40 vm1.img
[root@ ]# qemu-img create -f raw -o preallocation=falloc vm1.img 20G <-------- The same file name vm1.img is used
Formatting 'vm1.img', fmt=raw size=21474836480 preallocation=falloc
[root@ ]# ls -lsah vm1.img
30G -rw-r--r--. 1 root root 20G May 2 10:40 vm1.img <---------- size on disk is 30G, though the file was created with a size of 20G
I will provide a setup for the investigation.
--- Additional comment from SATHEESARAN on 2019-05-02 05:23:07 UTC ---
The setup details:
-------------------
rhsqa-grafton7.lab.eng.blr.redhat.com ( root/redhat )
volume: data ( replica 3, sharded )
The volume is currently mounted at: /mnt/test
Note: This is the RHVH installation.
@krutika, if you need more info, just ping me on IRC / Google Chat
--- Additional comment from Krutika Dhananjay on 2019-05-02 10:16:40 UTC ---
Found part of the issue.
It's just a case of integer overflow.
A 32-bit signed int is being used to store the delta between the post-stat and pre-stat block counts.
The range of a 32-bit signed int is [-2,147,483,648, 2,147,483,647], whereas the number of 512-byte blocks allocated while creating a preallocated 1 TB file is 1TB/512 = 2,147,483,648, which is just 1 more than INT_MAX (2,147,483,647). The value therefore wraps around to the negative half of the range, becoming -2,147,483,648.
When this number is copied into an int64, sign extension fills the most significant 32 bits with 1s, making the block count equal 554050781183 (or 0xffffffff80000000) in magnitude.
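For illustration, here's a minimal standalone sketch of that wrap-around and sign extension (variable names are illustrative; this is not the actual shard translator code):

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint64_t file_size = 1099511627776ULL;   /* "1T" as used by qemu-img, i.e. 2^40 bytes */

    /* Storing the block delta in a 32-bit signed int: 1TB/512 = 2147483648 is one more
     * than INT_MAX, so on the usual two's-complement implementations the conversion
     * wraps to -2147483648. */
    int32_t delta_blocks = (int32_t)(file_size / 512);

    /* Widening to int64_t sign-extends, filling the upper 32 bits with 1s. */
    int64_t widened = delta_blocks;

    printf("delta_blocks = %" PRId32 "\n", delta_blocks);           /* -2147483648 */
    printf("widened      = 0x%016" PRIx64 "\n", (uint64_t)widened); /* 0xffffffff80000000 */
    return 0;
}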
That's the block count that gets set on the backend, in the block-count segment of the trusted.glusterfs.shard.file-size xattr -
[root@rhsqa-grafton7 data]# getfattr -d -m . -e hex /gluster_bricks/data/data/vm3.img
getfattr: Removing leading '/' from absolute path names
# file: gluster_bricks/data/data/vm3.img
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0x3faffa7142b74e739f3a82b9359d33e6
trusted.gfid2path.6356251b968111ad=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f766d332e696d67
trusted.glusterfs.shard.block-size=0x0000000004000000
trusted.glusterfs.shard.file-size=0x00000100000000000000000000000000ffffffff800000000000000000000000 <-- notice the "ffffffff80000000" in the block-count segment
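Just to make the segments explicit, here's a quick decoder for that xattr value, assuming the layout implied by the annotation above (file size in the first big-endian 64-bit word, block count in the third):

#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Extract the big-endian 64-bit word at the given index from a hex string. */
static uint64_t word_at(const char *hex, int idx)
{
    char buf[17];
    memcpy(buf, hex + idx * 16, 16);
    buf[16] = '\0';
    return strtoull(buf, NULL, 16);
}

int main(void)
{
    /* trusted.glusterfs.shard.file-size value from the getfattr output above,
     * without the leading "0x". */
    const char *xattr = "00000100000000000000000000000000"
                        "ffffffff800000000000000000000000";

    printf("file size   = %" PRIu64 "\n", word_at(xattr, 0));      /* 1099511627776 (1T) */
    printf("block count = 0x%016" PRIx64 "\n", word_at(xattr, 2)); /* 0xffffffff80000000 */
    return 0;
}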
But ..
[root@rhsqa-grafton7 test]# stat vm3.img
File: ‘vm3.img’
Size: 1099511627776 Blocks: 18446744071562067968 IO Block: 131072 regular file
Device: 29h/41d Inode: 11473626732659815398 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Context: system_u:object_r:fusefs_t:s0
Access: 2019-05-02 14:11:11.693559069 +0530
Modify: 2019-05-02 14:12:38.245068328 +0530
Change: 2019-05-02 14:15:56.190546751 +0530
Birth: -
stat shows the block count as 18446744071562067968, which is way bigger than (554050781183 * 512).
It turns out that in the response path the block count further gets assigned to a uint64 number.
The same value, when interpreted as a uint64, becomes 18446744071562067968.
18446744071562067968 * 512 bytes is a whopping 8.0 zettabytes - exactly the 8.0Z that ls -h reports!
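A tiny sketch of that last step - the sign-extended value reinterpreted as unsigned and multiplied by the 512-byte block size - shows where the 8.0Z comes from (again illustrative, not the actual translator code):

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    /* The sign-extended block count from the xattr, reinterpreted as unsigned,
     * as apparently happens in the response path. */
    uint64_t blocks = 0xffffffff80000000ULL;   /* 18446744071562067968 */
    double bytes = (double)blocks * 512;

    printf("blocks = %" PRIu64 "\n", blocks);
    /* ls -h and du -h use binary units; 1 ZiB = 2^70 bytes. */
    printf("size   = %.1f ZiB\n", bytes / 1180591620717411303424.0);  /* ~8.0 */
    return 0;
}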
This bug wasn't seen earlier because the old way of preallocating files never used fallocate, so the signed 32-bit int variable delta_blocks would never exceed 131072 (one 64 MB shard's worth of 512-byte blocks).
Anyway, I'll be sending a fix for this soon.
Sas,
Do you have a single node with at least 1 TB of free space that you can lend me so I can test the fix? The bug is hit only when the image size is >= 1 TB.
-Krutika
--- Additional comment from Krutika Dhananjay on 2019-05-02 10:18:26 UTC ---
(In reply to Krutika Dhananjay from comment #10)
> Found part of the issue.
Sorry, this is not part of the issue but THE issue in its entirety. (That line is from an older draft I'd composed, which I forgot to change after root-causing the bug.)
REVIEW: https://review.gluster.org/22655 (features/shard: Fix integer overflow in block count accounting) posted (#1) for review on master by Krutika Dhananjay
REVIEW: https://review.gluster.org/22681 (features/shard: Fix block-count accounting upon truncate to lower size) posted (#1) for review on master by Krutika Dhananjay
REVIEW: https://review.gluster.org/22681 (features/shard: Fix block-count accounting upon truncate to lower size) merged (#6) on master by Xavi Hernandez