Bug 1332861

Summary: Total disk usage size of an image from mount point and brick backend differs
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: RamaKasturi <knarra>
Component: sharding    Assignee: Krutika Dhananjay <kdhananj>
Status: CLOSED WONTFIX QA Contact: RamaKasturi <knarra>
Severity: medium Docs Contact:
Priority: unspecified    
Version: rhgs-3.1    CC: bmohanra, jbyers, rcyriac, rhinduja, sabose, sasundar
Target Milestone: ---    Keywords: ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Known Issue
Doc Text:
Sharding relies on the difference in block count before and after every write, as reported by the underlying file system, and adds that difference to the existing block count of a sharded file. XFS' speculative preallocation of blocks breaks this accounting: with speculative preallocation, the block count of the shards reported after a write can be greater than the number of blocks actually written. As a result, the block count of a sharded file may be recorded as higher than the number of blocks actually consumed on disk, and commands like du -sh may report a size larger than the physical space used by the file.
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-04-16 18:16:19 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1311843    

Description RamaKasturi 2016-05-04 08:52:13 UTC
Description of problem:
The total disk usage of a VM image (as reported by du -sch) differs between what is seen at the mount point and what is seen on the backend bricks.

Version-Release number of selected component (if applicable):
glusterfs-3.7.9-2.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install HC (hyperconverged setup).
2. Create a VM from the RHEV UI.
3. Check the total disk usage of the OS image from the mount point by running du -sch <image>.
4. Check the total disk usage of the same OS image from the backend brick by running du -sch <all shards> <first shard file> (example commands below).
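
For illustration, the checks in steps 3 and 4 look roughly like this (paths, image name and GFID are placeholders, not taken from this report):

On the fuse mount:
# du -sch /rhev/data-center/mnt/glusterSD/<server>:_<volume>/<sd-uuid>/images/<img-uuid>/<image>

On a backend brick (base file plus all of its shards):
# du -sch /<brick-path>/<sd-uuid>/images/<img-uuid>/<image> /<brick-path>/.shard/<gfid>.*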

Actual results:
The size reported from the backend brick differs from the size seen at the mount point.

Expected results:
The sizes should not differ, since both views describe the same sharded image.

Additional info:

Comment 2 SATHEESARAN 2016-05-06 06:33:48 UTC
Size of the image file as seen from the fuse mount
---------------------------------------------------
[root@]# qemu-img info 5d7d9ca4-61bf-4b93-ba49-89f1e7c5ff0c
image: 5d7d9ca4-61bf-4b93-ba49-89f1e7c5ff0c
file format: raw
virtual size: 40G (42949672960 bytes)
disk size: 38G

[root@]# ls -salh 5d7d9ca4-61bf-4b93-ba49-89f1e7c5ff0c
38G -rw-rw----. 1 vdsm kvm 40G May  6 11:37 5d7d9ca4-61bf-4b93-ba49-89f1e7c5ff0c

Size of all the shards on the brick
------------------------------------

[root@]# getfattr -d -m. -ehex 5d7d9ca4-61bf-4b93-ba49-89f1e7c5ff0c
# file: 5d7d9ca4-61bf-4b93-ba49-89f1e7c5ff0c
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x020000000000000057278aea000808ce
trusted.gfid=0x45997be9361b413fa6a06fcf043eb28f
trusted.glusterfs.427f3752-15b2-4921-ac24-1b4c06e792f4.xtime=0x572c348d000562c9
trusted.glusterfs.shard.block-size=0x0000000020000000
trusted.glusterfs.shard.file-size=0x0000000a0000000000000000000000000000000004bbfe8d0000000000000000
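
For reference, the trusted.glusterfs.shard.file-size value above can be decoded by hand. Assuming the value is four big-endian 64-bit fields with the file size first and the 512-byte block count third (a sketch, not an authoritative description of the format):

[root@]# printf '%d\n' 0x0000000a00000000
42949672960
[root@]# printf '%d\n' 0x0000000004bbfe8d
79429261

i.e. a 40G file size and 79429261 * 512 bytes, roughly 38G of accounted blocks, matching the 40G/38G figures reported from the fuse mount above.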

[root@]# du -sh ../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.* 5d7d9ca4-61bf-4b93-ba49-89f1e7c5ff0c -c
513M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.1
509M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.10
513M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.11
513M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.12
513M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.13
510M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.14
512M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.15
512M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.16
510M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.17
513M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.18
508M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.19
513M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.2
513M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.20
513M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.21
513M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.22
513M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.23
511M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.24
512M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.25
513M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.26
513M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.27
513M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.28
513M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.29
513M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.3
513M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.30
513M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.31
513M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.32
513M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.33
509M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.34
511M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.35
511M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.36
513M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.37
509M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.38
509M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.39
513M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.4
511M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.40
513M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.41
510M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.42
510M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.43
513M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.44
438M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.45
513M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.5
513M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.6
16K	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.60
513M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.7
128K	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.79
513M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.8
511M	../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.9
512M	5d7d9ca4-61bf-4b93-ba49-89f1e7c5ff0c
23G	total

I think the issue here is that a few files were deleted inside the VM, and the sizes of the shards reflect that.
But from the fuse mount, the image file never shrank back (discard/blkdiscard should probably help; see the note below).
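
For reference, a guest-side trim would look something like this (only effective if discard is enabled end to end for the virtual disk, which is an assumption here, not something verified in this setup):

[root@vm]# fstrim -av
(run inside the guest; reports how much space each mounted filesystem discarded)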

Comment 4 Krutika Dhananjay 2016-05-06 07:39:39 UTC
I tried this with a simple dd command on a sharded volume:

# dd if=/dev/urandom of=file2 bs=1024 seek=2048 count=1024 conv=notrunc
The command seeks 2MB into the file and writes 1MB of data.
du -sh should therefore show 1M, but it reports 2M instead.
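
For comparison, a similar transient over-count can be seen on a plain XFS directory with no gluster involved (illustrative only; whether and for how long the extra blocks are visible depends on the XFS version and mount options, and /mnt/xfstest is a hypothetical mount, not part of this setup):

# dd if=/dev/urandom of=/mnt/xfstest/file2 bs=1024 seek=2048 count=1024 conv=notrunc
# stat -c %b /mnt/xfstest/file2
(1MB of data should account for 2048 512-byte blocks; right after the write the number may be higher)
# sleep 300; stat -c %b /mnt/xfstest/file2
(typically back down once XFS trims the speculative preallocation)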

I added more logs in sharding and arrived at this: http://paste.fedoraproject.org/363254/51571514/

Some background:
Sharding stores the file size and block count in an xattr, which is updated by inode write operations (write(), truncate(), etc.). The block count is calculated by taking, for each shard participating in a write, the delta between the prebuf (file attributes before the wound write) and the postbuf (file attributes after the write), summing these deltas, and adding the sum to the block-count portion of the xattr (see the sketch below).
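
In shell-arithmetic terms, the per-write update is roughly the following (variable names are made up for illustration; this is not the actual translator code):

delta_sum=0
# for each shard participating in the write:
delta=$(( postbuf_blocks - prebuf_blocks ))
delta_sum=$(( delta_sum + delta ))
# once all participating shards have responded, the new value is persisted in the xattr:
new_block_count=$(( stored_block_count + delta_sum ))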

In my test, I found that for a write of 1024 bytes, shard saw a block-count delta of 16 blocks - far too high for a 1K write. It turns out this is because XFS preallocates blocks and sharding depends on the reported block count (and, needless to say, du relies on the block count too). Some time later XFS releases the extra allocated blocks, but by then it is too late: sharding has already persisted the inflated count in the xattr and relies heavily on it.

Yet to think of a fix for this. It's not so straightforward.

Comment 5 Krutika Dhananjay 2016-05-09 13:09:30 UTC
We had a call with Brian Foster (XFS team) to learn more about XFS' speculative preallocation. Here is a summary of the discussion:

1) One way to work around XFS speculative preallocation is to make the shard translator truncate each shard to the shard block size (512MB in our case) at creation time. In fact, I tested the du -sh issue with this change and the output was accurate -- in line with the actual number of blocks consumed, excluding the holes. But such a change can cause fragmentation of the underlying disk and, over time, degrade performance.

2) The storage/posix translator has a change (in the function iatt_from_stat()) that uses the file size to predict the number of 512-byte blocks consumed, irrespective of any extra blocks allocated by XFS. Unfortunately, this doesn't give an accurate block count for a sparse file. Fortunately, in the worst case the block count (the parameter du -sh relies on) is calculated as if the file had no holes; in other words, du -sh would report a disk usage equal to the file size. For instance, for a 40GB VM image with 35GB of actual data and the remaining 5GB consisting of holes, du -sh would show 40GB as the disk usage of the file, and it can never exceed this number.
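
Roughly, the approach in (2) amounts to deriving the block count from the file size rather than trusting the possibly preallocation-inflated count from XFS. A sketch of the idea in shell arithmetic (not the actual iatt_from_stat() code; names are made up):

# 512-byte blocks implied by the file size, i.e. as if the file had no holes:
blocks_from_size=$(( (file_size + 511) / 512 ))
# report this instead of the raw st_blocks, so du can never exceed the file size:
ia_blocks=$blocks_from_size

Workaround (1), by contrast, presumably helps because a shard already truncated to its full 512MB size is never extended by later writes, so there is no EOF for XFS to speculatively preallocate beyond; the fragmentation concern is the flip side of losing that preallocation.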

There are of course other ways to fix this issue - for example, doing a lazy update of the block-count xattr (once the block count has hopefully reached a stable value), or writing a translator loaded on the brick stack that remembers the last seen block count in memory and, on witnessing a decrease in the block count as part of a subsequent operation, calculates the delta and sends an upcall notification to the shard translator on the client so that it can persist an updated xattr value. But all these changes are much more involved (some could have performance implications) and cannot be delivered for LA. What we *can* do is document this behavior and what to expect in the worst case.

-Krutika

Comment 6 Sahina Bose 2016-05-11 09:15:41 UTC
To be documented as a known issue.

Comment 9 RamaKasturi 2016-05-12 14:08:14 UTC
I did a test by adding a 50G disk to the VM and then writing 50GB of data to the disk using dd. The dd operation completed, and the following are the values I see at the mount point.

[root@tettnang c880b513-ba36-4fbd-ae9a-593acc2e820a]# ls -lsah *
 54G -rw-rw----. 1 vdsm kvm  50G May 11 17:28 ef6d5385-aafe-4072-b3de-02ed91621c7c
1.0M -rw-rw----. 1 vdsm kvm 1.0M May 11 15:17 ef6d5385-aafe-4072-b3de-02ed91621c7c.lease
 512 -rw-r--r--. 1 vdsm kvm  323 May 11 15:17 ef6d5385-aafe-4072-b3de-02ed91621c7c.meta

The actual disk size is 50G, whereas du -sh shows it as 54G, which is not supposed to happen.

Comment 10 Krutika Dhananjay 2016-05-12 14:41:43 UTC
(In reply to RamaKasturi from comment #9)
> I did a test by adding 50G disk to the vm and then writing 50GB data to the
> disk using dd command. I see that the dd operation is complete and following
> are the values i see in my mount point.
> 
> [root@tettnang c880b513-ba36-4fbd-ae9a-593acc2e820a]# ls -lsah *
>  54G -rw-rw----. 1 vdsm kvm  50G May 11 17:28
> ef6d5385-aafe-4072-b3de-02ed91621c7c
> 1.0M -rw-rw----. 1 vdsm kvm 1.0M May 11 15:17
> ef6d5385-aafe-4072-b3de-02ed91621c7c.lease
>  512 -rw-r--r--. 1 vdsm kvm  323 May 11 15:17
> ef6d5385-aafe-4072-b3de-02ed91621c7c.meta
> 
> Actual disk size is 50G where as, du -sh shows that it is 54G which is not
> supposed to be shown

Right, that is possible. Sorry I couldn't think of this earlier. If write #n on a shard creates holes in the file and XFS preallocates some blocks, those preallocated blocks are counted into the block-count xattr. After some time XFS releases these blocks. Then, when write #n+1 is sent to the same region of the file that was previously preallocated and then deallocated, we end up counting those blocks again. So, theoretically, blocks in every region of a shard can be counted twice in the worst case.
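
To spell the worst case out as a timeline (block numbers are purely illustrative):

# write #n   : extends the shard; XFS preallocates extra blocks, so postbuf - prebuf = +16
#              even though fewer blocks were actually written; shard persists +16 in the xattr
# later      : XFS trims the unused preallocation; the xattr is never corrected downwards
# write #n+1 : fills the previously trimmed region, so postbuf - prebuf goes up again by the
#              blocks of that region; shard adds them once more, counting the same region twice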

Comment 11 SATHEESARAN 2016-05-31 02:52:31 UTC
This bug should be in the sharding component; changing it accordingly.