Bug 1024202 - format a 16T size virtio-blk data disk in guest will take up so much space
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: e2fsprogs
Version: 6.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Eric Sandeen
QA Contact: Filesystem QE
Reported: 2013-10-29 02:33 EDT by Sibiao Luo
Modified: 2013-10-31 12:42 EDT
Doc Type: Bug Fix
Last Closed: 2013-10-30 15:20:43 EDT
Type: Bug
Description Sibiao Luo 2013-10-29 02:33:31 EDT
Description of problem:
Attaching a 16T virtio-blk data disk to a guest and formatting it as ext4 inside the guest takes up a large amount of space (more than 70G) on the host. This overcommitted my host storage and caused QEMU to report that no space was left on the device, which is harmful for users.
BTW, I also tried a RHEL7 guest on a RHEL7 host and did not hit this issue; it only took up about 1.4G of space on the host.

Version-Release number of selected component (if applicable):
host info:
# uname -r && rpm -q qemu-kvm
2.6.32-425.el6.x86_64
qemu-kvm-0.12.1.2-2.415.el6.x86_64
guest info:
2.6.32-425.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a 16T qcow2 disk on the host with qemu-img.
# df -h
Filesystem                   Size  Used Avail Use% Mounted on
/dev/mapper/vg_pq20-lv_root   50G  7.0G   40G  15% /
tmpfs                        505G     0  505G   0% /dev/shm
/dev/sda1                    485M   69M  391M  15% /boot
/dev/mapper/vg_pq20-lv_home   81G   12G   66G  15% /home
# qemu-img create -f qcow2 my16T.qcow2 16T
Formatting 'my16T.qcow2', fmt=qcow2 size=17592186044416 encryption=off cluster_size=65536 
# qemu-img info my16T.qcow2 
image: my16T.qcow2
file format: qcow2
virtual size: 16T (17592186044416 bytes)
disk size: 456K
cluster_size: 65536
2. Attach the 16T image to the guest as a data disk via the virtio-blk interface.
e.g:...-drive file=/home/my16T.qcow2,if=none,id=drive-data-disk,format=qcow2,cache=none,werror=stop,rerror=stop -device virtio-blk-pci,bus=pci.0,addr=0x7,drive=drive-data-disk,id=data-disk 
3. Format the disk as ext4 in the guest.

Actual results:
After step 3, the format fails to complete and the guest hangs.
guest ]# mkfs.ext4 /dev/vdb
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
1073741824 inodes, 4294967295 blocks
214748364 blocks (5.00%) reserved for the super user
First data block=0
131072 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
	102400000, 214990848, 512000000, 550731776, 644972544, 1934917632, 
	2560000000, 3855122432

Writing inode tables:  35736/131072

(qemu) info status 
VM status: running
(qemu) info block
drive-virtio-disk: removable=0 io-status=ok file=/home/RHEL-6.5-Snapshot-4-Server-x86_64.qcow2 ro=0 drv=qcow2 encrypted=0
drive-data-disk: removable=0 io-status=ok file=/home/my16T.qcow2 ro=0 drv=qcow2 encrypted=0
ide1-cd0: removable=1 locked=0 tray-open=0 io-status=ok [not inserted]
floppy0: removable=1 locked=0 tray-open=0 [not inserted]
sd0: removable=1 locked=0 tray-open=0 [not inserted]
(qemu) 
(qemu) block I/O error in device 'drive-data-disk': No space left on device (28)
block I/O error in device 'drive-data-disk': No space left on device (28)
block I/O error in device 'drive-data-disk': No space left on device (28)
block I/O error in device 'drive-data-disk': No space left on device (28)
block I/O error in device 'drive-data-disk': No space left on device (28)
block I/O error in device 'drive-data-disk': No space left on device (28)
block I/O error in device 'drive-data-disk': No space left on device (28)
block I/O error in device 'drive-data-disk': No space left on device (28)
...
block I/O error in device 'drive-data-disk': No space left on device (28)
block I/O error in device 'drive-data-disk': No space left on device (28)

(qemu) 
(qemu) info status 
VM status: paused (io-error)

host ]# qemu-img info my16T.qcow2
image: my16T.qcow2
file format: qcow2
virtual size: 16T (17592186044416 bytes)
disk size: 70G
cluster_size: 65536
host ]# ls -lh my16T.qcow2 
-rw-r--r--. 1 root root 70G Oct 28 22:26 my16T.qcow2

host ]# df -h
Filesystem                   Size  Used Avail Use% Mounted on
/dev/mapper/vg_pq20-lv_root   50G  7.0G   40G  15% /
tmpfs                        505G  4.0K  505G   1% /dev/shm
/dev/sda1                    485M   69M  391M  15% /boot
/dev/mapper/vg_pq20-lv_home   81G   81G     0 100% /home

Expected results:
The format should complete in the guest without taking up so much space (more than 70G) on the host.

Additional info:
# /usr/libexec/qemu-kvm -M pc -S -cpu host -enable-kvm -m 2048 -smp 2,sockets=2,cores=1,threads=1 -no-kvm-pit-reinjection -usb -device usb-tablet,id=input0 -name sluo -uuid 990ea161-6b67-47b2-b803-19fb01d30d30 -rtc base=localtime,clock=host,driftfix=slew -device virtio-serial-pci,id=virtio-serial0,max_ports=16,vectors=0,bus=pci.0,addr=0x3 -chardev socket,id=channel1,path=/tmp/helloworld1,server,nowait -device virtserialport,chardev=channel1,name=com.redhat.rhevm.vdsm,bus=virtio-serial0.0,id=port1 -chardev socket,id=channel2,path=/tmp/helloworld2,server,nowait -device virtserialport,chardev=channel2,name=com.redhat.rhevm.vdsm,bus=virtio-serial0.0,id=port2 -drive file=/home/RHEL-6.5-Snapshot-4-Server-x86_64.qcow2,if=none,id=drive-virtio-disk,format=qcow2,cache=none,aio=native,werror=stop,rerror=stop -device virtio-blk-pci,vectors=0,bus=pci.0,addr=0x4,scsi=off,drive=drive-virtio-disk,id=virtio-disk,bootindex=1 -netdev tap,id=hostnet0,vhost=on,script=/etc/qemu-ifup -device virtio-net-pci,netdev=hostnet0,id=virtio-net-pci0,mac=00:01:02:B6:40:21,bus=pci.0,addr=0x5 -device virtio-balloon-pci,id=ballooning,bus=pci.0,addr=0x6 -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0 -drive file=/home/my16T.qcow2,if=none,id=drive-data-disk,format=qcow2,cache=none,werror=stop,rerror=stop -device virtio-blk-pci,bus=pci.0,addr=0x7,drive=drive-data-disk,id=data-disk -k en-us -boot menu=on -qmp tcp:0:4444,server,nowait -serial unix:/tmp/ttyS0,server,nowait -vnc :1 -spice disable-ticketing,port=5931 -monitor stdio
Comment 1 Sibiao Luo 2013-10-29 02:38:17 EDT
Also tried a RHEL7 host and guest; the results are as follows:
host info:
# uname -r && rpm -q qemu-kvm
3.10.0-37.el7.x86_64
qemu-kvm-1.5.3-10.el7.x86_64
guest info:
3.10.0-37.el7.x86_64

- before format:
# qemu-img create -f qcow2 my16t.qcow2 16T
Formatting 'my16t.qcow2', fmt=qcow2 size=17592186044416 encryption=off cluster_size=65536 lazy_refcounts=off 
# qemu-img info my16t.qcow2 
image: my16t.qcow2
file format: qcow2
virtual size: 16T (17592186044416 bytes)
disk size: 388K
cluster_size: 65536
# ls -lh my16t.qcow2 
-rw-r--r--. 1 root root 448K Oct 29 13:01 my16t.qcow2

e.g:...-drive file=/home/my16t.qcow2,if=none,id=drive-data-disk,format=qcow2,cache=none,werror=stop,rerror=stop -device virtio-blk-pci,bus=pci.0,addr=0x7,scsi=off,drive=drive-data-disk,id=data-disk

guest ]# mkfs.ext4 /dev/vdb
mke2fs 1.42.8 (20-Jun-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
268435456 inodes, 4294967296 blocks
214748364 blocks (5.00%) reserved for the super user
First data block=0
131072 block groups
32768 blocks per group, 32768 fragments per group
2048 inodes per group
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
	102400000, 214990848, 512000000, 550731776, 644972544, 1934917632, 
	2560000000, 3855122432

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done         

You have new mail in /var/spool/mail/root

- after format:
# ls -lh my16t.qcow2 
-rw-r--r--. 1 root root 1.4G Oct 29 14:26 my16t.qcow2
# qemu-img info my16t.qcow2 
image: my16t.qcow2
file format: qcow2
virtual size: 16T (17592186044416 bytes)
disk size: 1.3G
cluster_size: 65536

Best Regards,
sluo
Comment 2 Sibiao Luo 2013-10-29 03:14:42 EDT
Also tried the RHEL 6.4 GA version, qemu-kvm-0.12.1.2-2.355.el6.x86_64, which hits the same issue, so this bug is not a regression.
host info:
# uname -r && rpm -q qemu-kvm
2.6.32-425.el6.x86_64
qemu-kvm-0.12.1.2-2.355.el6.x86_64

Best Regards,
sluo
Comment 3 Paolo Bonzini 2013-10-29 05:29:58 EDT
Please try a RHEL6 guest on a RHEL7 host, and a RHEL7 guest on a RHEL6 host.

- If it works with a RHEL7 guest on a RHEL6 host, move to e2fsprogs

- If it works with a RHEL6 guest on a RHEL7 host, qemu-kvm is the right component

- If both work, you can leave it as qemu-kvm.
Comment 4 Sibiao Luo 2013-10-30 05:24:12 EDT
(In reply to Paolo Bonzini from comment #3)
> Please try a RHEL6 guest on a RHEL7 host, and a RHEL7 guest on a RHEL6 host.
Thanks for your kind help, I have done as you instructed.
> - If it works with a RHEL7 guest on a RHEL6 host, move to e2fsprogs
> - If it works with a RHEL6 guest on a RHEL7 host, qemu-kvm is the right component
> 
> - If both work, you can leave it as qemu-kvm.

 *rhel7 host + rhel6.5 guest ------ hit the issue.

 *rhel6.5 host + rhel7 guest ------ no such issue.

Paolo Bonzini, could you help move this bug to the right component? Thanks in advance.

Best Regards,
sluo
Comment 5 Paolo Bonzini 2013-10-30 08:48:52 EDT
qemu-kvm is the right component.
Comment 6 Paolo Bonzini 2013-10-30 14:53:27 EDT
RHEL6 host, RHEL6 guest - 70 G (comment 0)
RHEL7 host, RHEL6 guest - 70 G (comment 4)
RHEL6 host, RHEL7 guest - 1.4 G (comment 4)
RHEL7 host, RHEL7 guest - 1.4 G (comment 0)

So this is guest-dependent.
Comment 7 Eric Sandeen 2013-10-30 15:20:43 EDT
Newer e2fsprogs uses lazy inode table initialization so it doesn't have to write 70G (eyeroll) of zeros at mkfs time.  This can be turned on in RHEL6 e2fsprogs as well with the "-E lazy_itable_init=1" option:

# truncate --size=16t fsfile
# rpm -q e2fsprogs
e2fsprogs-1.41.12-18.el6.x86_64
# mkfs.ext4 -E lazy_itable_init=1 fsfile 
mke2fs 1.41.12 (17-May-2010)
...
# du -h fsfile
261M	fsfile

Indeed, it only writes 261M of data at mkfs time.

However, this is just deferred until mount time, when the inode tables will eventually get written:

# mount -o loop fsfile mnt
# while sleep 30; do du -h fsfile; done
333M	fsfile
373M	fsfile
411M	fsfile
...


So you *will* wind up with 70G allocated eventually, with the RHEL7 guest and newer e2fsprogs as well.

It's a "feature."

If you don't want that to happen, you have 2 choices:

1) Don't create a 16T extN filesystem, if you don't want huge amounts of pre-allocated, static metadata, or

2) Use a filesystem like XFS which doesn't behave this way:

# truncate --size=16t fsfile
# mkfs.xfs fsfile
...
# du -h fsfile
2.0G	fsfile
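
As a rough cross-check of the figures in this bug, the inode table size can be estimated from the mke2fs output in the description and in comment 1. This is only a sketch under stated assumptions: it assumes 256-byte inodes (the usual ext4 default, but not shown in the output above), it ignores bitmaps, the journal and superblock backups, and the helper name itable_mib is made up for illustration.

```shell
#!/bin/sh
# Estimate the space taken by ext4 inode tables, in MiB.
# ASSUMPTION: inode size defaults to 256 bytes; the mke2fs output in this
# bug does not print it, so pass a third argument if yours differs.
# itable_mib is a hypothetical helper, not part of e2fsprogs.
itable_mib() {
    groups=$1
    inodes_per_group=$2
    inode_size=${3:-256}
    echo $(( groups * inodes_per_group * inode_size / 1048576 ))
}

# RHEL6 guest layout (8192 inodes per group), all 131072 block groups
itable_mib 131072 8192   # prints 262144
# RHEL6 guest layout, only the 35736 groups written before the I/O errors
itable_mib 35736 8192    # prints 71472
# RHEL7 guest layout (2048 inodes per group), all 131072 block groups
itable_mib 131072 2048   # prints 65536
```

Under the 256-byte assumption, the 35736 block groups that mke2fs had reached when the I/O errors started account for roughly 69.8 GiB of zeroed inode tables, which is consistent with the 70G image size seen when /home filled up.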
Comment 8 Sibiao Luo 2013-10-31 01:08:14 EDT
(In reply to Eric Sandeen from comment #7)
> Newer e2fsprogs uses lazy inode table initialization so it doesn't have to
> write 70G (eyeroll) of zeros at mkfs time.  This can be turned on in RHEL6
> e2fsprogs as well with the "-E lazy_itable_init=1" option:
> 
> # truncate --size=16t fsfile
> # rpm -q e2fsprogs
> e2fsprogs-1.41.12-18.el6.x86_64
> # mkfs.ext4 -E lazy_itable_init=1 fsfile 
> mke2fs 1.41.12 (17-May-2010)
> ...
> # du -h fsfile
> 261M	fsfile
> 
> Indeed, it only writes 261M of data at mkfs time.
> 
> However, this is just deferred until mount time, when the inode tables will
> eventually get written:
> 
> # mount -o loop fsfile mnt
> # while sleep 30; do du -h fsfile; done
> 333M	fsfile
> 373M	fsfile
> 411M	fsfile
> ...
> 
Thanks for the good explanation.
> So you *will* wind up with 70G allocated eventually, with the RHEL7 guest and
> newer e2fsprogs as well.
> 
> It's a "feature."
> 
> If you don't want that to happen, you have 2 choices:
> 
> 1) Don't create a 16T extN filesystem, if you don't want huge amounts of
> pre-allocated, static metadata, or
> 
> 2) Use a filesystem like XFS which doesn't behave this way:
> 
> # truncate --size=16t fsfile
> # mkfs.xfs fsfile
> ...
> # du -h fsfile
> 2.0G	fsfile
But I don't think that solves the (very real) problem. Why must users either avoid creating a 16T filesystem or use XFS? Since QEMU supports 16T disks, I think we should make all file systems work well.
And this does look like a bug to me, so why just close it as NOTABUG? Could you close it as CANTFIX if it's unfixable, so that we can highlight it in a note on our wiki page accordingly? Please correct me if I am mistaken, thanks.

Best regards,
sluo
Comment 9 Paolo Bonzini 2013-10-31 09:40:56 EDT
> But I don't think that solves the (very real) problem. Why must users either
> avoid creating a 16T filesystem or use XFS? Since QEMU supports 16T disks, I
> think we should make all file systems work well.

70G is 0.4% of the size of the file system you're using.  If ext4 uses that much space for metadata, there's nothing we can do about it.  XFS is simply smarter.

I think what you're looking for is automatic detection of zero writes.  QEMU even has the code already to do that, but it is expensive and almost never triggers so I am not sure it is something we should have by default.  Perhaps we could look at enabling that for allocating writes, which would noticeably speed up "dd if=/dev/zero of=/dev/vda" with qcow2 (O_o).  But for raw images, this is certainly desired behavior.

Kevin, if what I said in the last paragraph makes sense to you, can you open a new bug for it?
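
To illustrate why zero detection is expensive, here is a minimal sketch of the idea in shell. It is not QEMU's code: the helper name is_all_zero is hypothetical, and a file stands in for one write buffer. The point is that every byte has to be examined before the write can be skipped.

```shell
#!/bin/sh
# Succeed (exit status 0) only if every byte of the given file is zero.
# is_all_zero is a hypothetical helper for illustration; an empty file
# counts as all-zero here.
is_all_zero() {
    # Delete all NUL bytes and count what is left over.
    nonzero=$(tr -d '\000' < "$1" | wc -c)
    [ "$((nonzero))" -eq 0 ]
}

# Demo: a 64K buffer of zeros, the qcow2 cluster size used in this bug.
buf=$(mktemp)
head -c 65536 /dev/zero > "$buf"
if is_all_zero "$buf"; then
    echo "all zeros: the write could be skipped"
else
    echo "real data: the write must be stored"
fi
rm -f "$buf"
```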
Comment 10 Eric Sandeen 2013-10-31 11:34:13 EDT
> And this does look like a bug to me, so why just close it as NOTABUG? Could you close it as CANTFIX if it's unfixable

Because it is not a bug.  As Paolo says, using 0.4% of the filesystem for metadata is hardly a "bug" - it is the design of ext4.

And your requirements are arbitrary.  A huge 16T filesystem is "necessary" but 70G is "too much?"  What about 500T/20G?  Or 2T/500G?  What is the "non-buggy" ratio?
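
The ratios being compared here are simple to work out. A minimal sketch, assuming binary units (1T = 1024G) and using a hypothetical helper named metadata_pct:

```shell
#!/bin/sh
# Print metadata size as a percentage of filesystem size.
# Arguments: metadata size in G, filesystem size in T.
# ASSUMPTION: binary units, so 1T = 1024G. metadata_pct is a made-up name.
metadata_pct() {
    awk -v m="$1" -v t="$2" 'BEGIN { printf "%.4f\n", m / (t * 1024) * 100 }'
}

metadata_pct 70 16     # the case in this bug: prints 0.4272
metadata_pct 20 500    # 500T/20G:             prints 0.0039
metadata_pct 500 2     # 2T/500G:              prints 24.4141
```

The first result is the 0.4% figure quoted above.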


That being said, Paolo reminded me of something else.   I thought we had something in mke2fs that, instead of writing zeros, issues a discard IFF the device supports it and returns 0s for discarded blocks.

commit 6fcd6f84c235f4bf2bd9770f172837da9982eb6e
Author: Eric Sandeen <sandeen@redhat.com>
Date:   Fri Aug 20 16:41:14 2010 -0500

    mke2fs: use lazy inode init on some discard-able devices
    
    If a device supports discard -and- returns 0s for discarded blocks,
    then we can skip the inode table initialization -and- the inode table
    zeroing at mkfs time, and skip the lazy init as well since they are
    already zeroed out.

That hit e2fsprogs v1.41.13.  So if the device appears to support discard, and will return 0s (as indicated by the BLKDISCARDZEROES ioctl), the newer e2fsprogs shouldn't write the 0s at mkfs time or post-mount.  (I need to double check this, the code looks a bit weird).

But I still don't think that solves the problem in general; as the filesystem gets used, each new block group that gets written to will start using the bitmaps, and bring them back into "real" storage.  It may be slower than the itable init thread, but sufficient use of the filesystem will eventually bring arbitrarily large amounts of metadata "online."
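
For anyone who wants to see what a given guest disk advertises, the block queue attributes in sysfs can be read directly. This is a sketch under assumptions: it assumes the kernel exposes discard_max_bytes and discard_zeroes_data under /sys/block/<dev>/queue, and that discard_zeroes_data reflects the same flag that BLKDISCARDZEROES reports (newer kernels may always report 0 there). The function name discard_info and its optional second argument are made up for illustration.

```shell
#!/bin/sh
# Report what a block device advertises about discard.
# Usage: discard_info <dev> [sysfs_root]
# ASSUMPTION: the queue attributes live under /sys/block/<dev>/queue.
# The sysfs_root argument exists only so the function can be pointed at
# a fake tree for testing; discard_info is a hypothetical helper.
discard_info() {
    dev=$1
    sysfs_root=${2:-/sys/block}
    q="$sysfs_root/$dev/queue"
    if [ ! -r "$q/discard_max_bytes" ]; then
        echo "$dev: no discard information"
        return 1
    fi
    max=$(cat "$q/discard_max_bytes")
    zeroes=0
    if [ -r "$q/discard_zeroes_data" ]; then
        zeroes=$(cat "$q/discard_zeroes_data")
    fi
    if [ "$max" -gt 0 ] && [ "$zeroes" -eq 1 ]; then
        echo "$dev: discard supported, discarded blocks read back as zeros"
    elif [ "$max" -gt 0 ]; then
        echo "$dev: discard supported, contents after discard not guaranteed"
    else
        echo "$dev: discard not supported"
    fi
}

# Example (inside the guest): discard_info vdb
```

Only the first outcome corresponds to the case described in the commit message above, where the device supports discard and returns 0s for discarded blocks.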
Comment 11 Kevin Wolf 2013-10-31 12:04:43 EDT
(In reply to Paolo Bonzini from comment #9)
> Kevin, if what I said in the last paragraph makes sense to you, can you open
> a new bug for it?

I'm not sure if zero detection would be worth it. For one, as you already said,
we can't enable it by default because it impacts performance. Doing it only for
allocations means that it can't be done in the generic block layer any more and
that users will understand even less than today what is happening.

In the end, I also think it would be something between optimising the wrong
thing (special code in qcow2 for mke2fs, so that the empty filesystem takes
less space and grows only a bit later? really?) and harmful (preallocation can
be seen as a feature and we would defeat it).
Comment 12 Paolo Bonzini 2013-10-31 12:13:02 EDT
I guess we have 3 votes that this is not a bug now. :)

BTW, QEMU disks never have BLKDISCARDZEROES, even in RHEL7 where discard is supported.  This is because BLKDISCARDZEROES would prevent moving the system from a backing storage with zero-on-discard to a backing storage without.  You only get BLKDISCARDZEROES when accessing a SCSI disk directly with no emulation.
Comment 13 Kevin Wolf 2013-10-31 12:25:44 EDT
We could make it work on qcow2 images, there it doesn't depend on the backing
storage.
Comment 14 Paolo Bonzini 2013-10-31 12:42:52 EDT
qcow2 _is_ backing storage already. :) You still want to move qcow2 to raw and back without changing guest ABI.

We do expose a property that maps to BLKDISCARDZEROES in the guest (called discard_zeroes) but there's no error checking and you're not really supposed to use it except for testing.
