Bug 1024202
Summary: | Formatting a 16T virtio-blk data disk in the guest takes up too much space | |
---|---|---|---
Product: | Red Hat Enterprise Linux 6 | Reporter: | Sibiao Luo <sluo>
Component: | e2fsprogs | Assignee: | Eric Sandeen <esandeen>
Status: | CLOSED NOTABUG | QA Contact: | Filesystem QE <fs-qe>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 6.5 | CC: | acathrow, bsarathy, chayang, esandeen, juzhang, kwolf, michen, mkenneth, pbonzini, qzhang, sct, virt-maint, xfu
Target Milestone: | rc | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2013-10-30 19:20:43 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Sibiao Luo
2013-10-29 06:33:31 UTC
Also tried RHEL7 host and guest; the results are as follows:

host info:
# uname -r && rpm -q qemu-kvm
3.10.0-37.el7.x86_64
qemu-kvm-1.5.3-10.el7.x86_64
guest info:
3.10.0-37.el7.x86_64

- before format:
# qemu-img create -f qcow2 my16t.qcow2 16T
Formatting 'my16t.qcow2', fmt=qcow2 size=17592186044416 encryption=off cluster_size=65536 lazy_refcounts=off
# qemu-img info my16t.qcow2
image: my16t.qcow2
file format: qcow2
virtual size: 16T (17592186044416 bytes)
disk size: 388K
cluster_size: 65536
# ls -lh my16t.qcow2
-rw-r--r--. 1 root root 448K Oct 29 13:01 my16t.qcow2

e.g: ...-drive file=/home/my16t.qcow2,if=none,id=drive-data-disk,format=qcow2,cache=none,werror=stop,rerror=stop -device virtio-blk-pci,bus=pci.0,addr=0x7,scsi=off,drive=drive-data-disk,id=data-disk

guest ]# mkfs.ext4 /dev/vdb
mke2fs 1.42.8 (20-Jun-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
268435456 inodes, 4294967296 blocks
214748364 blocks (5.00%) reserved for the super user
First data block=0
131072 block groups
32768 blocks per group, 32768 fragments per group
2048 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 102400000, 214990848, 512000000, 550731776, 644972544, 1934917632, 2560000000, 3855122432

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

- after format:
# ls -lh my16t.qcow2
-rw-r--r--. 1 root root 1.4G Oct 29 14:26 my16t.qcow2
# qemu-img info my16t.qcow2
image: my16t.qcow2
file format: qcow2
virtual size: 16T (17592186044416 bytes)
disk size: 1.3G
cluster_size: 65536

Best Regards,
sluo

---

Also tried the rhel6.4 GA version, qemu-kvm-0.12.1.2-2.355.el6.x86_64, which hits the same issue, so this bug is not a regression.

host info:
# uname -r && rpm -q qemu-kvm
2.6.32-425.el6.x86_64
qemu-kvm-0.12.1.2-2.355.el6.x86_64

Best Regards,
sluo

---

Paolo Bonzini (comment #3):

Please try RHEL7 host on RHEL6 guest, and RHEL6 guest on RHEL7 host.
- If it works with RHEL7 host on RHEL6 guest, move to e2fsprogs
- If it works with RHEL6 guest on RHEL7 host, qemu-kvm is the right component
- If both work, you can leave it as qemu-kvm.

---

(In reply to Paolo Bonzini from comment #3)
> Please try RHEL7 host on RHEL6 guest, and RHEL6 guest on RHEL7 host.

Thanks for your kind help; I have done as you instructed.

> - If it works with RHEL7 host on RHEL6 guest, move to e2fsprogs
> - If it works with RHEL6 guest on RHEL7 host, qemu-kvm is the right component
> - If both work, you can leave it as qemu-kvm.

* rhel7 host + rhel6.5 guest ------ hit the issue.
* rhel6.5 host + rhel7 guest ------ no such issue.

Paolo Bonzini, could you help move this bug to the right component? Thanks in advance.

Best Regards,
sluo

---

qemu-kvm is the right component.

---

RHEL6 host, RHEL6 guest - 70 G (comment 0)
RHEL7 host, RHEL6 guest - 70 G (comment 4)
RHEL6 host, RHEL7 guest - 1.4 G (comment 4)
RHEL7 host, RHEL7 guest - 1.4 G (comment 0)

So this is guest-dependent.

---

Eric Sandeen (comment #7):

Newer e2fsprogs uses lazy inode table initialization so it doesn't have to write 70G (eyeroll) of zeros at mkfs time. This can be turned on in RHEL6 e2fsprogs as well with the "-E lazy_itable_init=1" option:

# truncate --size=16t fsfile
# rpm -q e2fsprogs
e2fsprogs-1.41.12-18.el6.x86_64
# mkfs.ext4 -E lazy_itable_init=1 fsfile
mke2fs 1.41.12 (17-May-2010)
...
# du -h fsfile
261M fsfile

Indeed, it only writes 261M of data at mkfs time.

However, this is just deferred until mount time, when the inode tables will eventually get written:

# mount -o loop fsfile mnt
# while sleep 30; do du -h fsfile; done
333M fsfile
373M fsfile
411M fsfile
...

so you *will* wind up with 70G allocated eventually, with the RHEL7 guest & newer e2fsprogs as well.

It's a "feature."

If you don't want that to happen, you have 2 choices:

1) Don't create a 16T extN filesystem, if you don't want huge amounts of pre-allocated, static metadata, or

2) Use a filesystem like XFS which doesn't behave this way:

# truncate --size=16t fsfile
# mkfs.xfs fsfile
...
# du -h fsfile
2.0G fsfile

---

(In reply to Eric Sandeen from comment #7)
> Newer e2fsprogs uses lazy inode table initialization so it doesn't have to
> write 70G (eyeroll) of zeros at mkfs time. This can be turned on in RHEL6
> e2fsprogs as well with the "-E lazy_itable_init=1" option:
>
> # truncate --size=16t fsfile
> # rpm -q e2fsprogs
> e2fsprogs-1.41.12-18.el6.x86_64
> # mkfs.ext4 -E lazy_itable_init=1 fsfile
> mke2fs 1.41.12 (17-May-2010)
> ...
> # du -h fsfile
> 261M fsfile
>
> Indeed, it only writes 261M of data at mkfs time.
>
> However, this is just deferred until mount time, when the inode tables will
> eventually get written:
>
> # mount -o loop fsfile mnt
> # while sleep 30; do du -h fsfile; done
> 333M fsfile
> 373M fsfile
> 411M fsfile
> ...

Thanks for your good explanation.

> so you *will* wind up with 70G allocated eventually, with the RHEL7 guest &
> newer e2fsprogs as well.
>
> It's a "feature."
>
> If you don't want that to happen, you have 2 choices:
>
> 1) Don't create a 16T extN filesystem, if you don't want huge amounts of
> pre-allocated, static metadata, or
>
> 2) Use a filesystem like XFS which doesn't behave this way:
>
> # truncate --size=16t fsfile
> # mkfs.xfs fsfile
> ...
> # du -h fsfile
> 2.0G fsfile

But I don't think that solves the (very real) problem: why must users not create a 16T filesystem, or switch to XFS? Since QEMU supports 16T disks, I think we should make all filesystems work well. And it is a bug here, so why just close it as NOTABUG? Could you close it as CANTFIX if it's unfixable, and we can add a note highlighting it in our wiki page accordingly. Please correct me if I'm wrong, thanks.

Best regards,
sluo

---

> But I don't think that solves the (very real) problem: why must users not
> create a 16T filesystem, or switch to XFS? Since QEMU supports 16T disks,
> I think we should make all filesystems work well.
70G is 0.4% of the size of the file system you're using. If ext4 uses that much space for metadata, there's nothing we can do about it. XFS is simply smarter.
I think what you're looking for is automatic detection of zero writes. QEMU even has the code already to do that, but it is expensive and almost never triggers, so I am not sure it is something we should have by default. Perhaps we could look at enabling that for allocating writes, which would noticeably speed up "dd if=/dev/zero of=/dev/vda" with qcow2 (O_o). But for raw images, this is certainly desired behavior.
Kevin, if what I said in the last paragraph makes sense to you, can you open a new bug for it?
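(Editor's note: a minimal C sketch of the zero-write detection discussed above, gated on allocating writes as Paolo suggests. The function names and the write-path shape are illustrative assumptions, not QEMU's actual internals, though QEMU does ship an optimized buffer-is-zero helper.)

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* True if every byte in buf is zero: check byte 0, then compare the
 * buffer against itself shifted by one byte. Real implementations
 * vectorize this scan, which is why it is considered expensive to run
 * on every write. */
static bool buf_is_zero(const uint8_t *buf, size_t len)
{
    return len == 0 || (buf[0] == 0 && memcmp(buf, buf + 1, len - 1) == 0);
}

/* Hypothetical qcow2 write path: pay the scan cost only for allocating
 * writes, so a fully-zero write to an unallocated cluster stays a hole
 * instead of allocating 64K of real storage in the image file. */
static bool should_allocate(bool cluster_allocated,
                            const uint8_t *buf, size_t len)
{
    if (!cluster_allocated && buf_is_zero(buf, len))
        return false;           /* leave the cluster unallocated */
    return true;                /* normal allocate-and-write path */
}

int main(void)
{
    uint8_t cluster[65536] = {0};
    printf("allocate? %d\n", should_allocate(false, cluster, sizeof cluster));
    cluster[123] = 1;           /* one nonzero byte forces allocation */
    printf("allocate? %d\n", should_allocate(false, cluster, sizeof cluster));
    return 0;
}
```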
> And it is a bug here, so why just close it as NOTABUG? Could you close it
> as CANTFIX if it's unfixable
Because it is not a bug. As Paolo says, using 0.4% of the filesystem for metadata is hardly a "bug" - it is the design of ext4.
And your requirements are arbitrary. A huge 16T filesystem is "necessary" but 70G is "too much?" What about 500T/20G? Or 2T/500G? What is the "non-buggy" ratio?
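(Editor's note: a back-of-the-envelope check of that figure, using the mkfs output quoted above and assuming ext4's default 256-byte inodes: 268435456 inodes × 256 bytes = 64 GiB of inode tables alone, plus a 4 KiB block bitmap and a 4 KiB inode bitmap for each of the 131072 block groups (roughly another 1 GiB) and the group descriptors, which lands right around the reported 70G of static metadata.)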
That being said, Paolo reminded me of something else. I thought we had something in mke2fs which, instead of writing zeros, would issue a discard IFF the device supports it and returns 0s for discarded blocks.
commit 6fcd6f84c235f4bf2bd9770f172837da9982eb6e
Author: Eric Sandeen <sandeen>
Date: Fri Aug 20 16:41:14 2010 -0500
mke2fs: use lazy inode init on some discard-able devices
If a device supports discard -and- returns 0s for discarded blocks,
then we can skip the inode table initialization -and- the inode table
zeroing at mkfs time, and skip the lazy init as well since they are
already zeroed out.
That hit e2fsprogs v1.41.13. So if the device appears to support discard, and will return 0s (as indicated by the BLKDISCARDZEROES ioctl), the newer e2fsprogs shouldn't write the 0s at mkfs time or post-mount. (I need to double check this, the code looks a bit weird).
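(Editor's note: a sketch of the probe that commit relies on, assuming the device node is passed as argv[1]; this is not e2fsprogs' actual code. BLKDISCARDZEROES, BLKDISCARD, and BLKGETSIZE64 are real ioctls from <linux/fs.h> on kernels of this era.)

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* BLKDISCARD, BLKDISCARDZEROES, BLKGETSIZE64 */

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <blockdev>\n", argv[0]);
        return 2;
    }
    int fd = open(argv[1], O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    unsigned int discard_zeroes = 0;
    /* Does the device guarantee that discarded blocks read back as zeros? */
    if (ioctl(fd, BLKDISCARDZEROES, &discard_zeroes) == 0 && discard_zeroes) {
        uint64_t range[2] = { 0, 0 };           /* {offset, length} */
        ioctl(fd, BLKGETSIZE64, &range[1]);     /* device size in bytes */
        /* Discarding the whole device then substitutes for writing zeros
         * over the inode tables, so lazy zeroing can be skipped entirely. */
        if (ioctl(fd, BLKDISCARD, range) == 0)
            printf("discarded %llu bytes; itable zeroing unnecessary\n",
                   (unsigned long long)range[1]);
    } else {
        printf("no zeroed-discard guarantee; mkfs must write the zeros\n");
    }
    close(fd);
    return 0;
}
```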
But I still don't think that solves the problem in general; as the filesystem gets used, each new block group that gets written to will start using the bitmaps, and bring them back into "real" storage. It may be slower than the itable init thread, but sufficient use of the filesystem will eventually bring arbitrarily large amounts of metadata "online."
(In reply to Paolo Bonzini from comment #9)
> Kevin, if what I said in the last paragraph makes sense to you, can you open
> a new bug for it?

I'm not sure if zero detection would be worth it. For one, as you already said, we can't enable it by default because it impacts performance. Doing it only for allocations means that it can't be done in the generic block layer any more, and that users will understand even less than today what is happening.

In the end, I also think it would be something between optimising the wrong thing (special code in qcow2 for mke2fs, so that the empty filesystem takes less space and grows only a bit later? really?) and harmful (preallocation can be seen as a feature and we would defeat it).

---

I guess we have 3 votes that this is not a bug now. :)

BTW, QEMU disks never have BLKDISCARDZEROES, even in RHEL7 where discard is supported. This is because BLKDISCARDZEROES would prevent moving the system from a backing storage with zero-on-discard to a backing storage without. You only get BLKDISCARDZEROES when accessing a SCSI disk directly with no emulation.

---

We could make it work on qcow2 images; there it doesn't depend on the backing storage.

---

qcow2 _is_ backing storage already. :) You still want to move qcow2 to raw and back without changing guest ABI.

---

We do expose a property that maps to BLKDISCARDZEROES in the guest (called discard_zeroes), but there's no error checking and you're not really supposed to use it except for testing.
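(Editor's note: the guest-visible form of that property is the discard_zeroes_data queue attribute in sysfs, which mirrors the BLKDISCARDZEROES ioctl on kernels of this vintage. A minimal sketch, assuming the virtio disk shows up as vdb:)

```c
#include <stdio.h>

int main(void)
{
    /* "1" means the device guarantees discarded blocks read back as
     * zeros; per the discussion above, QEMU-emulated disks report 0. */
    FILE *f = fopen("/sys/block/vdb/queue/discard_zeroes_data", "r");
    if (!f) { perror("fopen"); return 1; }
    int zeroes = 0;
    if (fscanf(f, "%d", &zeroes) == 1)
        printf("discard_zeroes_data = %d\n", zeroes);
    fclose(f);
    return 0;
}
```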