Description of problem:
When creating a qcow2 volume with preallocation, it's expected to use only as much disk space as it needs. It does so on NFS, but on block devices (such as iSCSI) it uses the entire size of the image, because preallocation writes to the last byte.

Version-Release number of selected component (if applicable):
0.13.0

How reproducible:
Solid.

Steps to Reproduce:
lvcreate --autobackup n --contiguous n --size 1024m --name bd43e02a-6424-4f1e-9caa-30f796304e37 d576a2c8-8b5e-43e5-93c9-f710931874e9
qemu-img create -f qcow2 -o preallocation=metadata /dev/d576a2c8-8b5e-43e5-93c9-f710931874e9/bd43e02a-6424-4f1e-9caa-30f796304e37 20971520K
(on a volume too small to hold that much data)

Actual results:
Formatting '/dev/d576a2c8-8b5e-43e5-93c9-f710931874e9/bd43e02a-6424-4f1e-9caa-30f796304e37', fmt=qcow2 size=21474836480 encryption=off cluster_size=0 preallocation='metadata'
/dev/d576a2c8-8b5e-43e5-93c9-f710931874e9/bd43e02a-6424-4f1e-9caa-30f796304e37: error while creating qcow2: No space left on device

Expected results:
Formatting '/dev/d576a2c8-8b5e-43e5-93c9-f710931874e9/bd43e02a-6424-4f1e-9caa-30f796304e37', fmt=qcow2 size=21474836480 encryption=off cluster_size=0

Additional info:
<kwolf> Hm, yes, this is known
<kwolf> I wonder though if it's necessary on block devices
<erez> if it's necessary to preallocate metadata?
<kwolf> On file systems we must increase the file size (but can leave it sparse), but with block devices things could look different
<kwolf> During preallocation, qcow2 does a write to the very last cluster allocated
<erez> without this feature we don't have thin provisioning..
<erez> I see
<erez> Why does it do it? To validate that it exists?
<kwolf> Because otherwise reads would access space after the EOF
<kwolf> Which fails
<kwolf> On block devices, this doesn't matter, obviously
<kwolf> Or does it?
<kwolf> Probably it does. Hm.
<erez> (danken) that's unfortunate... we won't be able to use it on block devices, which is the only place where this matters
<erez> but why does it matter that all addresses exist in the block device?
<erez> if qemu accesses a nonexisting one, it will block on ENOSPC anyway, right?
<kwolf> The problem is not with writes, but with reads
<kwolf> I don't remember the details, though
<kwolf> Maybe we could fix the read function to return zeros (as it's supposed to work)
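For comparison with the report above, a minimal sketch of how the same command behaves on a plain filesystem (paths and the image size here are placeholders, not taken from the report):

  # On a filesystem the image can stay sparse, so only qcow2 metadata gets written:
  qemu-img create -f qcow2 -o preallocation=metadata /mnt/nfs/test.qcow2 20G
  ls -ls /mnt/nfs/test.qcow2        # allocated blocks (first column) stay far below 20G
  qemu-img info /mnt/nfs/test.qcow2 # "disk size" reports only the metadata
  # On an LV there is no sparseness, so the write to the last allocated cluster
  # lands at the 20G offset and fails with ENOSPC when the LV is smaller.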
Can you please test the performance of qcow2 with the latest 6.1 code? We have lots of changes there and we might not need the preallocation command line option at all.
No response, closing. If I'm not mistaken, this might not be possible on raw block devices anyway.
Reopening. If you want qcow2 performance tested on 6.1/6.2, you should ask qemu QE or the performance team, not RHEV-M QE / engineering. Indeed, you cannot preallocate the clusters on iSCSI, but you can preallocate the tables so that all metadata is located sequentially on disk (rough numbers below).
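A rough back-of-the-envelope of how much metadata such a table-only preallocation would involve for the 20 GiB image in the reproducer (my arithmetic from the qcow2 on-disk format, assuming the default 64k cluster size and 8-byte L2 entries; not measured):

  # one L2 table = 64 KiB / 8 B = 8192 entries -> maps 8192 * 64 KiB = 512 MiB of data
  # 20 GiB image -> 20480 MiB / 512 MiB = 40 L2 tables -> 40 * 64 KiB = 2.5 MiB
  # plus the header, L1 table and refcount metadata: a few MiB at most, which
  # could all sit sequentially at the start of the device.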
> Indeed, you cannot preallocate the clusters on iSCSI, but you can preallocate
> the tables so that all metadata is located sequentially on disk.

Also, you could preallocate according to the device size. It could actually be nice to do this at runtime as well (every time we make the device bigger, allocate mappings for the added sections); see the sketch below.

Btw Dor, even if performance improved in 6.2, that does not mean preallocating would not improve performance even further, which would make this bz worthwhile regardless of current qcow2 performance. Note that Kevin said at KVM Forum that qcow2 causes a 50% performance degradation...
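To illustrate the "preallocate according to device size" idea, a hypothetical sketch of where the size could come from at creation time (device path and variable names are made up for the example; this by itself does not avoid the final-cluster write kwolf describes above):

  LV=/dev/<vg>/<lv>                    # placeholder device path
  SIZE=$(blockdev --getsize64 "$LV")   # actual size of the block device, in bytes
  qemu-img create -f qcow2 -o preallocation=metadata "$LV" "$SIZE"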
What's your use case for this? If I'm not mistaken, formatting the image with a filesystem will already write to sectors close to the end of the image, so if this were implemented, you would only move the LV growth from image creation time to installation time, which I guess doesn't make a big difference.
Had a short email conversation with Ayal about this. For the reasons stated in comment 10, it's obvious that preallocating clusters isn't possible in this case. The open question was whether preallocating the L2 tables would make sense, and whether we could take advantage of having all of them sequentially at the start of the image.

We came to the conclusion that, because we never read two tables at once, having them sequential doesn't help; and in the allocating case, the cost of one additional 64k write for the new L2 table per 512 MB of virtual disk size (assuming 64k clusters) would likely be lost in the noise. An optimisation for this would be wasted effort; there's much more to gain in other places.

Therefore, I'm closing the bug again.