Description of problem:

qemu-img has an optimization to zero out the entire image before copying data. This optimization is useful when the destination storage supports very fast zeroing (e.g. a file), but with block storage zeroing throughput is not predictable. It can be as fast as writing actual zeroes (e.g. LIO) or 100 times faster (high-end storage).

When converting mostly empty images, pre-zeroing does not matter, since zeroing is usually faster than writing zeroes. But when converting mostly full images, pre-zeroing slows down the copy, in my tests up to 45% slower, and it can be more for fully preallocated images.

Here are a few tests showing the issue.

## Test 1 - vm-based test environment and poor laptop storage

$ ./qemu-img info test.img
image: test.img
file format: raw
virtual size: 10 GiB (10737418240 bytes)
disk size: 8 GiB

$ time ./qemu-img convert -f raw -O raw -t none -T none -W test.img /dev/test/lv1

With qemu-img master (9e3903136d9acde2fb2dd9e967ba928050a6cb4a):

real    1m20.483s
user    0m0.490s
sys     0m0.739s

With patch [1] disabling pre-zero for block storage:

real    0m55.831s
user    0m0.610s
sys     0m0.956s

## Test 2 - real server and storage

Testing this LUN:

# multipath -ll
3600a098038304437415d4b6a59684a52 dm-3 NETAPP,LUN C-Mode
size=5.0T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 18:0:0:0 sdb 8:16 active ready running
| `- 19:0:0:0 sdc 8:32 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 20:0:0:0 sdd 8:48 active ready running
  `- 21:0:0:0 sde 8:64 active ready running

The destination is a 100g logical volume on this LUN:

# qemu-img info test-lv
image: test-lv
file format: raw
virtual size: 100 GiB (107374182400 bytes)
disk size: 0 B

The source image is a 100g image with 48g of data:

# qemu-img info fedora-31-100g-50p.raw
image: fedora-31-100g-50p.raw
file format: raw
virtual size: 100 GiB (107374182400 bytes)
disk size: 48.4 GiB

We can zero at 2.3 GiB/s:

# time blkdiscard -z test-lv

real    0m43.902s
user    0m0.002s
sys     0m0.130s

(I should really test with fallocate instead of blkdiscard, but the results look the same.)

# iostat -xdm dm-3 5
Device   r/s      w/s      rMB/s  wMB/s    rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
dm-3     20.80    301.40   0.98   2323.31  0.00    0.00    0.00   0.00   26.56    854.50   257.94  48.23     7893.41   0.73   23.58
dm-3     15.20    297.20   0.80   2321.67  0.00    0.00    0.00   0.00   26.43    836.06   248.72  53.80     7999.30   0.78   24.22

We can write at 445 MB/s:

# dd if=/dev/zero bs=2M count=51200 of=test-lv oflag=direct conv=fsync
107374182400 bytes (107 GB, 100 GiB) copied, 241.257 s, 445 MB/s

# iostat -xdm dm-3 5
Device   r/s      w/s      rMB/s  wMB/s   rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
dm-3     6.60     6910.00  0.39   431.85  0.00    0.00    0.00   0.00   2.48     2.70     15.19   60.73     64.00     0.14   98.84
dm-3     40.80    6682.60  1.59   417.61  0.00    0.00    0.00   0.00   1.71     2.73     14.92   40.00     63.99     0.15   97.60
dm-3     6.60     6887.40  0.39   430.46  0.00    0.00    0.00   0.00   2.15     2.66     14.92   60.73     64.00     0.14   98.22

Testing latest qemu-img:

# rpm -q qemu-img
qemu-img-4.2.0-22.module+el8.2.1+6758+cb8d64c2.x86_64

# time qemu-img convert -p -f raw -O raw -t none -W fedora-31-100g-50p.raw test-lv
    (100.00/100%)

real    2m2.337s
user    0m2.708s
sys     0m17.326s

# iostat -xdm dm-3 5

pre-zero phase:
Device   r/s      w/s      rMB/s  wMB/s    rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
dm-3     24.00    265.40   1.00   2123.20  0.00    0.00    0.00   0.00   36.81    543.52   144.99  42.48     8192.00   0.70   20.14
dm-3     9.60     283.60   0.59   2265.60  0.00    0.00    0.00   0.00   35.42    576.80   163.78  62.50     8180.44   0.70   20.58
dm-3     24.00    272.00   1.00   2176.00  0.00    0.00    0.00   0.00   22.89    512.40   139.77  42.48     8192.00   0.67   19.90

copy phase:
Device   r/s      w/s       rMB/s  wMB/s   rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
dm-3     27.20    10671.20  1.19   655.84  0.00    0.00    0.00   0.00   2.70     10.99    111.98  44.83     62.93     0.09   96.74
dm-3     6.40     11537.00  0.39   712.33  0.00    0.00    0.00   0.00   3.00     11.90    131.52  62.50     63.23     0.08   97.82
dm-3     27.20    12400.20  1.19   765.47  0.00    0.00    0.00   0.00   3.60     11.16    132.31  44.83     63.21     0.08   95.50
dm-3     9.60     11312.60  0.59   698.20  0.00    0.20    0.00   0.00   3.73     11.69    126.64  63.00     63.20     0.09   97.70

Testing latest qemu-img with patch [1] disabling pre-zero for block storage:

# rpm -q qemu-img
qemu-img-4.2.0-25.module+el8.2.1+6815+1c792dc8.nsoffer202006140516.x86_64

# time qemu-img convert -p -f raw -O raw -t none -W fedora-31-100g-50p.raw test-lv
    (100.00/100%)

real    1m42.083s
user    0m3.007s
sys     0m18.735s

# iostat -xdm dm-3 5
Device   r/s      w/s      rMB/s  wMB/s    rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
dm-3     6.60     7919.60  0.39   1136.67  0.00    0.00    0.00   0.00   14.70    15.32    117.43  60.73     146.97    0.10   77.84
dm-3     27.00    9065.00  1.19   571.38   0.00    0.20    0.00   0.00   2.52     14.64    128.21  45.13     64.54     0.11   97.46
dm-3     6.80     9467.40  0.40   814.75   0.00    0.00    0.00   0.00   2.74     12.15    110.25  60.82     88.12     0.10   90.46
dm-3     29.00    7713.20  1.32   996.48   0.00    0.40    0.00   0.01   5.40     14.48    107.98  46.60     132.29    0.11   83.76
dm-3     11.60    9661.60  0.70   703.54   0.00    0.40    0.00   0.00   2.26     11.22    103.56  61.72     74.57     0.10   97.98
dm-3     23.80    9639.20  0.99   696.82   0.00    0.00    0.00   0.00   1.98     11.54    106.49  42.80     74.03     0.10   93.68
dm-3     10.00    7184.60  0.60   1147.56  0.00    0.00    0.00   0.00   12.84    15.32    106.58  61.36     163.56    0.09   68.30
dm-3     35.00    6771.40  1.69   1293.37  0.00    0.00    0.00   0.00   17.44    18.06    119.48  49.58     195.59    0.10   66.46

Version-Release number of selected component (if applicable):
qemu-img-4.2.0-22.module+el8.2.1+6758+cb8d64c2.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Convert an image to raw format on a block device

Actual results:
Operation is slower because of pre-zero.

Expected results:
Pre-zero is skipped for block devices.
Additional info:

Qemu already disables pre-zero for block devices in handle_aiocb_write_zeroes_block(), but since:

commit a3d6ae2299eaab1bced05551d0a0abfbcd9d08d0
Author: Nir Soffer <nirsof>
Date:   Sun Mar 24 02:20:12 2019 +0200

    qemu-img: Enable BDRV_REQ_MAY_UNMAP in convert

Qemu is using handle_aiocb_write_zeroes_unmap(), assuming that fallocate() is fast since it uses BLKDEV_ZERO_NOFALLBACK in the kernel, so the kernel avoids slow manual zeroing.

On RHEL 7 this worked fine, since fallocate() did not support block devices with kernel 3.10. The call would always fail, and we would fall back to handle_aiocb_write_zeroes(), which calls handle_aiocb_write_zeroes_block(), which returns -ENOTSUP when called with QEMU_AIO_NO_FALLBACK.

On RHEL 8.2 with kernel 4.18, fallocate() now works for block devices, exposing the issue. Even when using BLKDEV_ZERO_NOFALLBACK, and zeroing is faster than manually writing zeroes, it is not fast enough to make pre-zero an optimization. We don't have a way to predict the zeroing throughput of block devices.

Marked as high severity since this may cause performance issues for RHV customers when copying huge disks.
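For reference, the syscall at the center of this report can be demonstrated with a minimal sketch (Linux-only, illustrative only, not qemu code; the flag values are from linux/falloc.h and punch_hole() is a made-up helper name):

```python
import ctypes
import ctypes.util
import os
import tempfile

# Flag values from linux/falloc.h (Linux-only sketch).
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02

libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6", use_errno=True)
libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_long, ctypes.c_long]

def punch_hole(fd, offset, length):
    """Zero a region with fallocate(PUNCH_HOLE | KEEP_SIZE), the call the
    pre-zero path ends up making; raises OSError if unsupported."""
    if libc.fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

# Demonstrate on a regular file; on RHEL 8 kernels the same call also
# succeeds on block devices, which is what exposed this bug.
with tempfile.NamedTemporaryFile() as f:
    f.write(b"\xff" * 4096)
    f.flush()
    punch_hole(f.fileno(), 0, 4096)
    f.seek(0)
    data = f.read(4096)

print(data == b"\x00" * 4096)  # True where the filesystem supports hole punching
```

On RHEL 7's kernel 3.10, the fallocate() call above fails with ENODEV on a block device, while on kernel 4.18 it succeeds (possibly slowly), which matches the behavior change described above.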
Posted fix upstream: https://lists.nongnu.org/archive/html/qemu-block/2020-06/msg00683.html
I found a similar bug: Bug 1722734 - 'qemu-img convert' images over RBD is very very slow. Could you please help check if they are the same issue? I am not sure about the backend of your block device. Many thanks.
(In reply to Xueqiang Wei from comment #3)
> I found a similar bug: Bug 1722734 - 'qemu-img convert' images over RBD is
> very very slow.
>
> Could you please help check if they are the same issue. Not sure the backend
> of your block device. Many thanks.

The backend in this case is iSCSI, not related to RBD. The RBD issue looks very different (5 seconds with a file vs 6 minutes with rbd), so I don't think it can be related to pre-zero.
In the case (which might not be here) where you know that the target device already contains zeroes, qemu-img convert recently added an option --target-is-zero.
(In reply to Richard W.M. Jones from comment #6)
> In the case (which might not be here) where you know that the target
> device already contains zeroes, qemu-img convert recently added an
> option --target-is-zero.

With block storage we don't have a way to know if a new logical volume is zeroed. There is no kernel interface reporting that, since there is no way to report such a thing from the storage.

The only case when we know that the target is zeroed is when using a new file on file-based storage. In that case zeroing is usually extremely fast (e.g. 500 GiB/s), but qemu does not have to zero anything, because a sparse file is already zeroed without extra work.

The only case when zeroing is slow is NFS < 4.2 when using a preallocated volume. In that case we already zeroed the image when creating it using:

    qemu-img create -f raw -o preallocation=falloc

If we create this volume in a copy volume flow, we next run:

    qemu-img convert ... -o preallocation=falloc /path/to/src-volume /path/to/dst-volume

So in this case --target-is-zero would be handy to avoid the cost of zeroing the image twice. But our current code avoids the double zeroing by creating a sparse file in the first step and letting qemu-img convert allocate it. I think we can simplify this flow using --target-is-zero and maybe get slightly better performance, but I'm not sure the performance difference would be noticeable.
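The "sparse file is already zeroed without extra work" point can be shown directly (a minimal sketch, not RHV or qemu code):

```python
import os
import tempfile

# A newly created sparse file reads back as zeroes without any writes,
# so there is nothing for a pre-zero pass to do on file-based storage.
fd, path = tempfile.mkstemp()
try:
    os.ftruncate(fd, 1 << 20)               # 1 MiB logical size, no data blocks
    data = os.pread(fd, 4096, 512 * 1024)   # read from the middle of the hole
    st = os.fstat(fd)
finally:
    os.close(fd)

os.unlink(path)

print(data == b"\x00" * 4096)        # True: holes read back as zeroes
print(st.st_blocks * 512 < 1 << 20)  # True: allocated bytes << logical size
```

A logical volume offers no such guarantee: its blocks contain whatever was on the LUN before, which is why qemu-img must zero the unallocated ranges unless told otherwise.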
Reproduced it on the 8.3.0 fast train.

Versions:
kernel-4.18.0-209.el8.x86_64
qemu-kvm-5.0.0-0.module+el8.3.0+6620+5d5e1420

1. Create a 2G image

# qemu-img create -f raw test.img 2G
Formatting 'test.img', fmt=raw size=2147483648

2. Convert it to full allocation

# qemu-img convert -p -f raw -O raw -S 0 test.img test.img
    (100.00/100%)
# qemu-img info test.img
image: test.img
file format: raw
virtual size: 2 GiB (2147483648 bytes)
disk size: 2 GiB

3. Create an LV on the block device

# pvcreate /dev/mapper/mpatha
# vgcreate vgtest /dev/mapper/mpatha
# lvcreate -l 100%FREE -n lvtest vgtest
# lvcreate -L 20G -n lvtest vgtest

4. Convert the image to the block device

# time qemu-img convert -f raw -O raw -t none -T none -W /home/bug/test.img /dev/vgtest/lvtest -p
    (100.00/100%)

real    15m8.734s
user    0m0.558s
sys     0m0.369s

Tested with --target-is-zero, it gets faster:

# time qemu-img convert -f raw -O raw -t none -T none -W /home/bug/test.img /dev/vgtest/lvtest -p --target-is-zero -n
    (100.00/100%)

real    0m12.757s
user    0m0.558s
sys     0m0.384s
(In reply to Xueqiang Wei from comment #10) > Tested with --target-is-zero, it get faster. Just to make sure, this is not the expected time for the improved version. --target-is-zero gives additional guarantees which mean that qemu-img has less work to do. Without it, qemu-img has to zero out blocks without data because otherwise the result could be corrupted. We'll never get this as fast as --target-is-zero.
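The extra work Kevin describes can be modeled with a toy per-block sketch (purely illustrative; convert() and the block model are invented for this example and are not qemu's actual code):

```python
def convert(src_blocks, dst, target_is_zero):
    """Toy model of the per-block convert decision.
    src_blocks: list of bytes or None (None = unallocated/zero block).
    dst: mutable list modeling the target device's current contents."""
    writes = 0
    for i, blk in enumerate(src_blocks):
        if blk is not None:
            dst[i] = blk          # data blocks must always be copied
            writes += 1
        elif not target_is_zero:
            dst[i] = b"\x00"      # unknown target contents: must zero, or
            writes += 1           # stale data would corrupt the result
        # else: the caller guaranteed dst is zero, so skip the write
    return writes

src = [b"\xaa", None, b"\xbb", None]   # half data, half zero blocks

dirty = [b"\xff"] * 4                  # target with stale contents
print(convert(src, dirty, target_is_zero=False))   # 4 writes (2 data + 2 zero)

zeroed = [b"\x00"] * 4                 # target known to be zero
print(convert(src, zeroed, target_is_zero=True))   # 2 writes (data only)
```

This is why --target-is-zero will always beat any pre-zero strategy on a mostly empty image: it removes the zero writes entirely rather than making them faster.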
(In reply to Kevin Wolf from comment #11)
> (In reply to Xueqiang Wei from comment #10)
> > Tested with --target-is-zero, it get faster.
>
> Just to make sure, this is not the expected time for the improved version.
> --target-is-zero gives additional guarantees which mean that qemu-img has
> less work to do. Without it, qemu-img has to zero out blocks without data
> because otherwise the result could be corrupted. We'll never get this as
> fast as --target-is-zero.

Kevin, thanks for your explanation. I have tested with the improved version (8.2.1 scratch build); the result is in comment 9. I just wanted to verify the parameter --target-is-zero, and it's really fast. Thanks.
This is fixed in qemu.git master as of commit edafc70c0c, which will be contained in 5.1.0-rc0.
(In reply to Kevin Wolf from comment #13)

Thanks Kevin. Can we backport it to 8.2.1? (see comment 8)
At this point, this means requesting a z-stream fix. The patch is simple, so I assume if you give an explanation why it's important to RHV to have this in 8.2.1 and request the z-stream flag, we should be able to get it approved.
(In reply to Kevin Wolf from comment #15)

From the RHV point of view, this is a performance regression compared to RHEL 7.

In RHEL 7, fallocate(PUNCH_HOLE) used in the pre-zero step failed, and qemu-img fell back to ioctl(BLKZEROUT), which was disabled when using NOFALLBACK, so the entire pre-zero step was skipped.

In RHEL 8, fallocate(PUNCH_HOLE) works for block devices - slowly - so the pre-zero step became a performance regression.

Since the difference can be significant (up to 45% in my tests), I think this is worth a backport.
(In reply to Nir Soffer from comment #16)
> (In reply to Kevin Wolf from comment #15)
> From RHV point of view, this is a performance regression compared to RHEL 7.
>
> In RHEL 7 fallocate(PUNCH_HOLE) used in the pre-zero step failed, and
> qemu-img failed back to ioctl(BLKZEROUT), which was disabled when using
> NOFALLBACK, so the entire prezero step was skipped.
>
> In RHEL 8, fallocate(PUNCH_HOLE) works for block devices - slowly - so the
> pre-zero step became a performance regression.
>
> Since this the difference can be significant (up to 45% in my tests), I
> think this worth a backport.

Hi Nir,

You mean backport the fix to 8.2.1.z, right?

Hi Kevin, Nir,

This bug is for the 8.3.0 fast train; I also hit it on the 8.3.0 slow train. Do we need to clone it for the 8.3.0 slow train?

Versions:
kernel-4.18.0-222.el8.x86_64
qemu-kvm-4.2.0-29.module+el8.3.0+7212+401047e6

1. Create a 2G image

# qemu-img create -f raw test.img 2G
Formatting 'test.img', fmt=raw size=2147483648

2. Convert it to full allocation

# qemu-img convert -p -f raw -O raw -S 0 test.img test.img
    (100.00/100%)
# qemu-img info test.img
image: test.img
file format: raw
virtual size: 2 GiB (2147483648 bytes)
disk size: 2 GiB

3. Create an LV on the block device

# pvcreate /dev/mapper/mpatha
# vgcreate vgtest /dev/mapper/mpatha
# lvcreate -L 20G -n lvtest vgtest

4. Convert the image to the block device

# time qemu-img convert -f raw -O raw -t none -T none -W /home/bug/test.img /dev/vgtest/lvtest -p
    (100.00/100%)

real    14m17.389s
user    0m0.570s
sys     0m0.984s
(In reply to Xueqiang Wei from comment #17) > This bug is for 8.3.0 fast train, I also hit it on 8.3.0 slow train. Do we > need to clone it for 8.3.0 slow train? I think it's not as important on slow train. We can clone it, but it would probably be for 8.3.1.
(In reply to Kevin Wolf from comment #18) > (In reply to Xueqiang Wei from comment #17) > > This bug is for 8.3.0 fast train, I also hit it on 8.3.0 slow train. Do we > > need to clone it for 8.3.0 slow train? > > I think it's not as important on slow train. We can clone it, but it would > probably be for 8.3.1. Kevin, Thanks a lot. I will clone it first, in order to track the issue for slow train.
I think Kevin already answered.
Tested with qemu-kvm-5.1.0-2.module+el8.3.0+7652+b30e6901; did not hit this issue, so setting status to VERIFIED.

Versions:
kernel-4.18.0-227.el8.x86_64
qemu-kvm-5.1.0-2.module+el8.3.0+7652+b30e6901

1. Create a 2G image

# qemu-img create -f raw test.img 2G
Formatting 'test.img', fmt=raw size=2147483648

2. Convert it to full allocation

# qemu-img convert -p -f raw -O raw -S 0 test.img test.img
    (100.00/100%)
# qemu-img info test.img
image: test.img
file format: raw
virtual size: 2 GiB (2147483648 bytes)
disk size: 2 GiB

3. Create an LV on the block device

# pvcreate /dev/mapper/mpatha
# vgcreate vgtest /dev/mapper/mpatha
# lvcreate -L 20G -n lvtest vgtest

4. Convert the image to the block device

# time qemu-img convert -f raw -O raw -t none -T none -W /home/test.img /dev/vgtest/lvtest -p
    (100.00/100%)

real    0m17.255s
user    0m0.235s
sys     0m0.633s
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (virt:8.3 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:5137