Description of problem:
The total disk usage (as reported by du -sch) of a VM image differs between the mount point and the backend bricks.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Install HC
2. Create a VM from the RHEV UI
3. Check the total disk usage of the OS image from the mount point by running du -sch <image>
4. Check the total disk usage of the same image from the backend brick by running du -sch <all shards> <first shard file>
The size reported from the backend bricks differs from what is seen at the mount point.
The sizes should not differ, since the shards on the bricks make up the same image.
Size of the image file as seen from the fuse mount
[root@]# qemu-img info 5d7d9ca4-61bf-4b93-ba49-89f1e7c5ff0c
file format: raw
virtual size: 40G (42949672960 bytes)
disk size: 38G
[root@]# ls -salh 5d7d9ca4-61bf-4b93-ba49-89f1e7c5ff0c
38G -rw-rw----. 1 vdsm kvm 40G May 6 11:37 5d7d9ca4-61bf-4b93-ba49-89f1e7c5ff0c
Size of all the shards on the brick
[root@]# getfattr -d -m. -ehex 5d7d9ca4-61bf-4b93-ba49-89f1e7c5ff0c
# file: 5d7d9ca4-61bf-4b93-ba49-89f1e7c5ff0c
[root@]# du -sh ../../../.shard/45997be9-361b-413f-a6a0-6fcf043eb28f.* 5d7d9ca4-61bf-4b93-ba49-89f1e7c5ff0c -c
I think the issue here is that a few files were deleted inside the VM, and the total size of the shards reflects that.
But from FUSE, the image file never shrank back to its original size (discard/blkdiscard would probably help).
I tried this with a simple dd command on a sharded volume:
# dd if=/dev/urandom of=file2 bs=1024 seek=2048 count=1024 conv=notrunc
The command seek()s 2MB into the file and writes 1MB of data.
du -sh should have shown 1M, but it reports 2M instead.
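The same layout can be reproduced without dd, e.g. with this Python sketch (path and sizes are illustrative; the reported disk usage depends on the underlying filesystem, which is exactly the point):

```python
import os
import tempfile

# Reproduce the dd test: skip the first 2 MiB (leaving a hole) and write
# 1 MiB of data, as in:
#   dd if=/dev/urandom of=file2 bs=1024 seek=2048 count=1024 conv=notrunc
path = os.path.join(tempfile.mkdtemp(), "file2")
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
os.lseek(fd, 2048 * 1024, os.SEEK_SET)   # seek 2 MiB into the file
os.write(fd, os.urandom(1024 * 1024))    # write 1 MiB of data
os.fsync(fd)
os.close(fd)

st = os.stat(path)
print("apparent size:", st.st_size)          # 3 MiB: 2 MiB hole + 1 MiB data
print("disk usage   :", st.st_blocks * 512)  # what du reports; filesystem-dependent
```

On a filesystem that preserves the hole, the disk usage stays near 1 MiB; with speculative preallocation in play, it can transiently be much larger.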
I added more logs in sharding and arrived at this: http://paste.fedoraproject.org/363254/51571514/
Sharding stores the file size and block count in an xattr, which is updated on inode write operations (write(), truncate(), etc.). The block count is calculated by taking the delta between the prebuf (file attributes before the wound write) and the postbuf (file attributes after the write) of each shard participating in the write operation, summing these deltas, and adding the sum to the block-count portion of the xattr.
In my test, I found that for a write of 1024 bytes, shard got a block-count delta of 16 blocks, a number far too high for a 1K write. It turns out this is because XFS preallocates blocks, while sharding depends on the reported block count (and, needless to say, du relies on the block count). Some time later XFS releases the extra allocated blocks, but too late: sharding has already persisted them in the xattr and relies heavily on it.
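The accounting problem can be sketched like this (the function and parameter names are invented for illustration, not the actual shard translator code):

```python
# Hypothetical sketch of shard's delta-based block accounting: each wound
# write contributes (postbuf blocks - prebuf blocks) to the persisted xattr.

def apply_write(xattr_block_count, prebuf_blocks, postbuf_blocks):
    # Add the per-shard delta observed at write-callback time.
    return xattr_block_count + (postbuf_blocks - prebuf_blocks)

# A 1 KiB write needs only 2 512-byte blocks, but XFS speculatively
# preallocates more (16 blocks in the test above), and that inflated
# delta is what gets persisted.
count = apply_write(0, prebuf_blocks=0, postbuf_blocks=16)
print(count)  # 16, even though XFS later trims the shard back to 2 blocks
```

The persisted count never sees the later trim, so the xattr stays inflated.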
Yet to think of a fix for this. It's not so straightforward.
We had a call with Brian Foster (XFS team) to learn more about XFS' speculative preallocation; here is a summary of the discussion:
1) One way to work around XFS speculative preallocation is to make the shard translator truncate each shard to the shard-block-size (512MB in our case) at creation time. In fact, I tested the du -sh issue with this change and the output was accurate -- in line with the actual number of blocks consumed, excluding the holes. But such a change can fragment the underlying disk and, over time, degrade performance.
2) The storage/posix translator has a change (in the function iatt_from_stat()) that uses the file size to estimate the number of 512-byte blocks consumed, irrespective of any extra blocks allocated by XFS. Unfortunately this does not give an accurate block count for a sparse file. But fortunately, in the worst case, the block count (the parameter that du -sh relies on) is calculated as if the file had no holes. In other words, du -sh would report the disk usage to be equal to the file size. For instance, for a 40GB vm image with 35GB worth of actual data and the remaining 5G consisting of holes, `du -sh` would show 40GB to be the disk usage of the file and can never exceed this number.
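A minimal sketch of that size-based estimate (an assumption about the change described above, not the actual GlusterFS code):

```python
# Derive the 512-byte block count from st_size instead of trusting
# st_blocks, so speculative preallocation can never inflate the result.

def blocks_from_size(st_size):
    return (st_size + 511) // 512   # round up to whole 512-byte blocks

GiB = 1024 ** 3
# A 40 GiB image with 35 GiB of data and 5 GiB of holes is reported as if
# it had no holes: du shows the full 40 GiB, never more.
print(blocks_from_size(40 * GiB) * 512 // GiB)  # 40
```

The trade-off is that holes are no longer reflected: the estimate is an upper bound equal to the apparent file size.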
There are, of course, other ways to fix this issue: doing a lazy update of the block-count xattr (once the block count has hopefully reached a stable value), or writing a translator loaded on the brick stack that remembers the last seen block count in memory and, on witnessing a decrease in the block count during a subsequent operation, calculates the delta, sends an upcall notification to the shard translator on the client, and has it send an updated xattr value for persisting. But all these changes are much more involved (some could have performance implications) and cannot be delivered for LA. What we *can* do is document this behavior and what to expect in the worst case.
To be documented as known issue
I did a test by adding a 50G disk to the VM and then writing 50GB of data to the disk using dd. The dd operation completed, and the following are the values I see at my mount point.
[root@tettnang c880b513-ba36-4fbd-ae9a-593acc2e820a]# ls -lsah *
54G -rw-rw----. 1 vdsm kvm 50G May 11 17:28 ef6d5385-aafe-4072-b3de-02ed91621c7c
1.0M -rw-rw----. 1 vdsm kvm 1.0M May 11 15:17 ef6d5385-aafe-4072-b3de-02ed91621c7c.lease
512 -rw-r--r--. 1 vdsm kvm 323 May 11 15:17 ef6d5385-aafe-4072-b3de-02ed91621c7c.meta
The actual disk size is 50G, whereas du -sh shows 54G, which should not happen.
(In reply to RamaKasturi from comment #9)
> I did a test by adding 50G disk to the vm and then writing 50GB data to the
> disk using dd command. I see that the dd operation is complete and following
> are the values i see in my mount point.
> [root@tettnang c880b513-ba36-4fbd-ae9a-593acc2e820a]# ls -lsah *
> 54G -rw-rw----. 1 vdsm kvm 50G May 11 17:28
> 1.0M -rw-rw----. 1 vdsm kvm 1.0M May 11 15:17
> 512 -rw-r--r--. 1 vdsm kvm 323 May 11 15:17
> Actual disk size is 50G where as, du -sh shows that it is 54G which is not
> supposed to be shown
Right, that is possible. Sorry I couldn't think of this earlier. If write #n on a shard creates holes in the file and XFS preallocates some blocks, those preallocated blocks are counted in the block-count xattr. After some time XFS releases these blocks. Then, when write #n+1 is sent to the region of the file that was previously preallocated and then deallocated, we end up counting those blocks again. So, theoretically, the blocks in every region of a shard can be counted twice in the worst case.
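The worst case can be worked through numerically (illustrative block counts, not measured values, following the delta accounting described earlier):

```python
# The same region of a shard gets counted twice across two writes when XFS
# preallocates, trims, and then reallocates blocks between them.

def apply_write(count, pre, post):
    # Delta accounting: add (postbuf blocks - prebuf blocks) to the xattr.
    return count + (post - pre)

count = 0
# Write #n: XFS preallocates 16 blocks for the region; all 16 are counted.
count = apply_write(count, pre=0, post=16)
# XFS later trims the speculative blocks back to 2; shard never sees this.
# Write #n+1 lands on the same region, reallocating it, and is counted again.
count = apply_write(count, pre=2, post=16)
print(count)  # 30 blocks persisted for a region that really occupies 16
```

Each trim-and-reallocate cycle the shard translator misses adds another overcount, which is why the persisted block count can approach twice the real usage.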
This bug belongs in the sharding component; changing the component accordingly.