Bug 1648622
Summary:          [v2v] Migration performance regression

Product:          Red Hat Enterprise Linux 7
Component:        qemu-kvm-rhev
Version:          7.6
Hardware:         x86_64
OS:               Linux
Status:           CLOSED ERRATA
Severity:         high
Priority:         high
Reporter:         Ilan Zuckerman <izuckerm>
Assignee:         Maxim Levitsky <mlevitsk>
QA Contact:       Tingting Mao <timao>
CC:               areis, bthurber, chayang, coli, dagur, dmetzger, eblake, fdupont, izuckerm, jferlan, jinzhao, jprause, juzhang, kwolf, mlevitsk, mrezanin, mtessun, mxie, nsoffer, rjones, tzheng, virt-maint, yuhuang, zili
Target Milestone: rc
Target Release:   7.6
Keywords:         Performance, Regression, ZStream
Fixed In Version: qemu-kvm-rhev-2.12.0-37.el7
Bug Blocks:       1743322 (view as bug list)
Last Closed:      2020-03-31 14:34:48 UTC
Type:             Bug
Cloudforms Team:  V2V
Description
Ilan Zuckerman
2018-11-11 06:04:05 UTC
Based on what we see in the logs, the issue is a behavior change in qemu-img 2.12. In 2.10, qemu-img wrote both the data and the zero parts of the image. In 2.12, qemu-img first zeroes the entire device and then writes the data parts. In the v2v scale tests we use 100G images with 76G of data, so in this case qemu-img writes 100G of zeros and 76G of data, compared with 24G of zeros and 76G of data in 2.10. Writing the extra zeros takes much more time when writing zeros is not fast.

The pipeline used by virt-v2v is:

    vmware -> nbdkit -> qemu-img convert -> nbdkit -> imageio -> image

This can be tested in a simpler way using nbdkit (> 1.6), converting a local file through nbdkit:

    $ virt-builder fedora-28
    $ pvcreate /dev/mapper/xxxyyy
    $ vgcreate vg /dev/mapper/xxxyyy
    $ lvcreate -n lv -L 6G vg
    $ time /path/to/nbdkit file /dev/vg/lv \
        --run 'qemu-img convert fedora-28.img -p -n $nbd'

This mail thread gives more details:
http://lists.nongnu.org/archive/html/qemu-block/2018-11/msg00225.html

Ilan, can you reproduce the issue using nbdkit as explained in comment 4? We need to compare the times on a RHEL 7.5 host (qemu-img 2.10) and a RHEL 7.6 host (qemu-img 2.12), with the same storage you used for the v2v import.

Nir updated me that he is in contact with the qemu team so they can introduce a fix for what we suspect is the root cause of the issue explained in comment 4. Kevin, should we move this bug to qemu-kvm-rhev?

I think it's not entirely clear yet what the solution will be and which components it will affect, but QEMU will certainly be one of them, so reassigning it to qemu-kvm-rhev should be okay. As this is NBD, I suppose Eric would be the right assignee.

Based on comment 11, moving to qemu-kvm-rhev. Eric, can you take a look?

Hi, I am Tingting from Virt-qe. Following comment 4, I compared the convert time between qemu-kvm-rhev-2.10 and qemu-kvm-rhev-2.12. However, the result does not match the bug: the convert time is longer on qemu-kvm-rhev-2.10 (23m28.830s) than on qemu-kvm-rhev-2.12 (22m19.212s). I also noticed that after the convert, the target image is written with the *full* data of the source image on qemu-kvm-rhev-2.10 [1], while on qemu-kvm-rhev-2.12 it only contains the *actual* data of the source image [2]. Please let me know if I missed something, thanks.

Reproduction steps:

For qemu-kvm-rhev-2.12:

1. Check the version of qemu-kvm-rhev:

    # qemu-img --version
    qemu-img version 2.12.0 (qemu-kvm-rhev-2.12.0-14.el7)
    Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers

2. Create the source image:

    # qemu-img create -f qcow2 base.qcow2 100G
    Formatting 'base.qcow2', fmt=qcow2 size=107374182400 cluster_size=65536 lazy_refcounts=off refcount_bits=16

3. Write data to it:

    # cat test.sh
    #!/bin/bash
    i=0
    while [ $i -le 76 ]
    do
        qemu-io -c "write -P 1 ${i}G 1G" base.qcow2
        i=$(expr $i + 1)
    done

4. Check the info of the image:

    # qemu-img info base.qcow2
    image: base.qcow2
    file format: qcow2
    virtual size: 100G (107374182400 bytes)
    disk size: 77G
    cluster_size: 65536
    Format specific information:
        compat: 1.1
        lazy refcounts: false
        refcount bits: 16
        corrupt: false

5. Prepare an image exported over NBD:

    # qemu-img create -f raw target.img 105G
    Formatting 'target.img', fmt=raw size=112742891520
    # qemu-nbd -f raw target.img -p 8800 -t

6. Convert the source image to the target:

    # time qemu-img convert -f qcow2 -O raw base.qcow2 nbd:localhost:8800 -p -n
    (100.00/100%)
    real 22m19.212s
    user 0m7.424s
    sys 1m9.403s

7. Check the info of the target file:

    # qemu-img info target.img
    image: target.img
    file format: raw
    virtual size: 105G (112742891520 bytes)
    disk size: 77G  -----------------------------------> [2]

For qemu-kvm-rhev-2.10:

1. Downgrade the qemu version to 2.10:

    # qemu-img --version
    qemu-img version 2.10.0(qemu-kvm-rhev-2.10.0-21.el7_5.7)
    Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers

2. Prepare an NBD image file:

    # qemu-img create -f raw target.img 105G
    Formatting 'target.img', fmt=raw size=112742891520
    # qemu-nbd -f raw target.img -p 8900 -t

3. Convert the source image to another target:

    # time qemu-img convert -f qcow2 -O raw base.qcow2 nbd:localhost:9000 -p -n
    (100.00/100%)
    real 23m28.830s
    user 0m6.626s
    sys 1m6.677s

4. Check the info of the target file:

    # qemu-img info target.img
    image: target.img
    file format: raw
    virtual size: 105G (112742891520 bytes)
    disk size: 100G ----------------------------------------> [1]

The regression is very specific to the backing file. It affects RHV because there we are using a Linux block device with very slow support for zeroing. If you just use a local file (which supports fast, almost free zeroing) then you wouldn't notice any difference.

Thanks for Richard's info. Reproduced this issue with the NBD export backed by a block device as below; it indeed takes much more time on qemu-kvm-rhev-2.12 (50m34.078s) than on qemu-kvm-rhev-2.10 (23m57.366s).
Prepare one block disk:

    # pvcreate /dev/sdb
    # vgcreate lvtest /dev/sdb
    # lvcreate -L 105G -n target.img lvtest

For qemu-kvm-rhev-2.10:

    # qemu-img --version
    qemu-img version 2.10.0(qemu-kvm-rhev-2.10.0-21.el7_5.7)
    Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers
    # qemu-img info base.img
    image: base.img
    file format: raw
    virtual size: 100G (107374182400 bytes)
    disk size: 76G
    # time qemu-img convert base.img nbd:localhost:9000 -p -n
    (100.00/100%)
    real 23m57.366s
    user 0m6.496s
    sys 1m10.790s

For qemu-kvm-rhev-2.12.0-14.el7:

    # qemu-img --version
    qemu-img version 2.12.0 (qemu-kvm-rhev-2.12.0-14.el7)
    Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers
    # time qemu-img convert base.img nbd:localhost:9000 -p -n
    (100.00/100%)
    real 50m34.078s
    user 0m6.760s
    sys 1m12.567s

(In reply to Tingting Mao from comment #15)
> Also, I noticed that after 'convert', the image file which is based on nbd
> is written with the *full* data in source image on 'qemu-kvm-rhev-2.10'[1],
> while for 'qemu-kvm-rhev-2.12', it is just written with the *actual* data in
> source image file[2].

This looks like another bug in qemu-nbd on 2.10, not related to this bug. This bug is not related to qemu-nbd; we do not use it in the import process. It is better to check with nbdkit for reproducing and testing this bug, as explained in comment 4. However, since you reproduced the issue very clearly, I don't know if it is worth the time to repeat the test; maybe do it only with a small image (1G), just to see if we get the same picture.

Tested this issue with nbdkit. It's hard to see a time difference with a 1G image, so I used 100G again. It still shows the bug, as listed below.
Nbdkit for 1G image convert

For qemu-kvm-rhev-2.12:

    # qemu-img info source.img
    image: source.img
    file format: raw
    virtual size: 1.0G (1073741824 bytes)
    disk size: 778M
    # time nbdkit file file=/dev/vgtest/target1.img -p 9000 --run 'qemu-img convert source.img -p -n nbd:localhost:9000'
    (100.00/100%)
    real 0m10.596s
    user 0m0.094s
    sys 0m0.524s

For qemu-kvm-rhev-2.10:

    # qemu-img --version
    qemu-img version 2.10.0(qemu-kvm-rhev-2.10.0-21.el7_5.7)
    Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers
    # time nbdkit file file=/dev/vgtest/target2.img -p 9000 --run 'qemu-img convert source.img -p -n nbd:localhost:9000'
    (100.00/100%)
    real 0m10.434s
    user 0m0.110s
    sys 0m0.493s

Nbdkit for 100G image convert

For qemu-kvm-rhev-2.10:

    # qemu-img info base.img
    image: base.img
    file format: raw
    virtual size: 100G (107374182400 bytes)
    disk size: 76G
    # time nbdkit file file=/dev/vgtest/target.img -p 9000 --run 'qemu-img convert base.img -p -n nbd:localhost:9000'
    (100.00/100%)
    real 24m53.579s
    user 0m7.019s
    sys 1m12.853s

For qemu-kvm-rhev-2.12:

    # time nbdkit file file=/dev/vgtest/target.img -p 9000 --run 'qemu-img convert base.img -p -n nbd:localhost:9000'
    (6.01/100%) (100.00/100%)
    real 42m25.667s
    user 0m7.594s
    sys 1m11.913s

Which version of nbdkit was used in the test? Although it probably doesn't matter here, since LVs cannot do efficient zeroing anyway, it's still worth pointing out that nbdkit 1.2 (RHEL 7.6) does not have the optimizations that Nir made for efficient zeroing, whereas nbdkit 1.8 (RHEL 7.7) does have them. See also:
https://brewweb.engineering.redhat.com/brew/packageinfo?packageID=66264

(In reply to Richard W.M. Jones from comment #20)
> Which version of nbdkit was used in the test?

It's nbdkit 1.2:

    # rpm -qa | grep nbdkit
    nbdkit-basic-plugins-1.2.7-2.el7.x86_64
    nbdkit-1.2.7-2.el7.x86_64
    # uname -r
    3.10.0-944.el7.x86_64

This is surely the sort of thing we should be able to query through the NBD protocol. The current NBD_FLAG_ROTATIONAL flag seems most similar, since it indicates a preference/optimization for the underlying block device.

(In reply to Richard W.M. Jones from comment #29)
> This is surely the sort of thing we should be able to query through the NBD
> protocol. The current NBD_FLAG_ROTATIONAL flag seems most similar since it
> indicates a preference/optimization for the underlying block device.

This looks like abuse of an unrelated flag. I think fast write zeroes on block-based storage depends on the way the storage implements write zeroes. On XtremIO this seems to be always fast (50G/s); I guess it always deallocates blocks without doing any I/O. On HPar we see 100G/s if the area was not allocated, or 1G/s if it was allocated. So I think we can do:
- Add a FAST_WRITE_ZEROES flag to the NBD protocol
- Report this flag only for file-based storage supporting fallocate() (NFS > 4.2, GlusterFS, XFS, ext4, ...)

We don't have the information to set that flag. You have to try it and see whether it works. Which is exactly why I suggested extending the NBD protocol with a new operation instead of a capability flag.

(In reply to Kevin Wolf from comment #32)
How can you test if storage has fast write zeroes? When you use ioctl(BLKZEROOUT) or fallocate(ZERO_RANGE) they may succeed, but they may be fast or slow.

Sounds like it is related to (if not a duplicate of) bug 1647104.

(In reply to Kevin Wolf from comment #32)
> We don't have the information to set that flag. You have to try it and see
> whether it works. Which is exactly why I suggested extending the NBD
> protocol with a new operation instead of a capability flag.
NBD protocol addition proposed:
https://lists.debian.org/nbd/2019/03/msg00004.html
I will be implementing a proof of concept for it during the qemu 4.1 phase to evaluate how much it helps.

Created attachment 1556002 [details]
Tests results with upstream fix
This bug is fixed upstream by:
commit 1bd2e35c2992c679ef8b223153d47ffce76e7dc5
Merge: 905870b53c c6e3f520c8
Author: Peter Maydell <peter.maydell>
Date: Tue Mar 26 15:52:46 2019 +0000
Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging
Block layer patches:
- Fix slow pre-zeroing in qemu-img convert
- Test case for block job pausing on I/O errors
# gpg: Signature made Tue 26 Mar 2019 15:28:00 GMT
# gpg: using RSA key 7F09B272C88F2FD6
# gpg: Good signature from "Kevin Wolf <kwolf>" [full]
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74 56FE 7F09 B272 C88F 2FD6
* remotes/kevin/tags/for-upstream:
qemu-io: Add write -n for BDRV_REQ_NO_FALLBACK
qemu-img: Use BDRV_REQ_NO_FALLBACK for pre-zeroing
file-posix: Support BDRV_REQ_NO_FALLBACK for zero writes
block: Advertise BDRV_REQ_NO_FALLBACK in filter drivers
block: Add BDRV_REQ_NO_FALLBACK
block: Remove error messages in bdrv_make_zero()
iotests: add 248: test resume mirror after auto pause on ENOSPC
Signed-off-by: Peter Maydell <peter.maydell>
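The cost of the pre-zeroing pass that this series addresses can be illustrated with a small model. This is a sketch for intuition only, not QEMU code; `convert_cost` is a hypothetical helper, and the 100G/76G extent layout matches the v2v scale tests mentioned in the description:

```python
def convert_cost(extents, image_size, pre_zero):
    """Return (bytes_zeroed, bytes_written) for a modeled convert run.

    extents:  list of (offset, length, is_data) tuples covering the image.
    pre_zero: if True, zero the whole target first (the 2.12 behavior);
              if False, zero only the holes (the 2.10 behavior).
    """
    if pre_zero:
        zeroed = image_size
    else:
        zeroed = sum(length for _, length, is_data in extents if not is_data)
    written = sum(length for _, length, is_data in extents if is_data)
    return zeroed, written

G = 1024 ** 3
# 100G image with 76G of data and a 24G hole, as in the v2v scale tests
extents = [(0, 76 * G, True), (76 * G, 24 * G, False)]

old_style = convert_cost(extents, 100 * G, pre_zero=False)  # 2.10: 24G zeroed
new_style = convert_cost(extents, 100 * G, pre_zero=True)   # 2.12: 100G zeroed
```

On storage where zeroing costs real I/O, the model shows the 2.12 strategy paying for 100G of zeroes instead of 24G, which is the asymmetry behind the regression.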
I did benchmarks a few weeks ago showing that with this change qemu-img is
up to 6.8 times faster when converting to qemu-nbd, and 1.75 times faster
when converting to nbdkit (using the file plugin).
I tested with:
- qemu-img-rhev-2.12.0-21
- qemu-4.0.0-rc0
- Kevin slow zero fix on top of qemu-4.0.0-rc0
- Kevin slow zero fix + my unmap fix
Here is a summary of the results.
### Converting empty image
version server time change
===========================================================
Kevin slow zero fix nbdkit 0m5.660s +2399.0%
-----------------------------------------------------------
4.0.0-rc0 nbdkit 0m14.123s +951.5%
-----------------------------------------------------------
Kevin slow zero fix qemu-nbd 0m32.229s +415.0%
-----------------------------------------------------------
4.0.0-rc0 qemu-nbd 2m20.002s -4.4%
-----------------------------------------------------------
qemu-img-rhev-2.12.0-21 qemu-nbd 2m14.390s +0.0%
-----------------------------------------------------------
qemu-img-rhev-2.12.0-21 nbdkit error -
-----------------------------------------------------------
### Converting 75% full image
version server time change
===========================================================
Kevin slow zero fix nbdkit 0m15.510s +1005.5%
-----------------------------------------------------------
Kevin slow zero fix qemu-nbd 0m22.783s +680.0%
-----------------------------------------------------------
qemu-img-rhev-2.12.0-21 nbdkit 0m27.213s +573.1%
-----------------------------------------------------------
4.0.0-rc0 nbdkit 0m28.037s +556.2%
-----------------------------------------------------------
4.0.0-rc0 qemu-nbd 2m35.085s +0.0%
-----------------------------------------------------------
qemu-img-rhev-2.12.0-21 qemu-nbd 2m35.954s +0.0%
-----------------------------------------------------------
See the attachment for full details on how it was tested.
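The spread in the tables above comes down to how quickly the target can absorb zeroes, and as Kevin noted earlier in the thread, the only way to know whether write zeroes is fast is to try it. A minimal empirical probe, sketched under the assumption of Linux fallocate(2) with FALLOC_FL_ZERO_RANGE (not part of any tooling in this bug):

```python
import ctypes
import ctypes.util
import os
import time

FALLOC_FL_ZERO_RANGE = 0x10  # value from <linux/falloc.h>

def zero_rate(path, length=64 * 1024 * 1024):
    """Measure an effective zeroing rate in bytes/s for the file at `path`.

    Tries fallocate(FALLOC_FL_ZERO_RANGE) first; if the kernel or the
    filesystem rejects it, falls back to writing zero-filled buffers,
    which is the slow path this bug is about.
    """
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                               ctypes.c_int64, ctypes.c_int64]
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, length)
        start = time.monotonic()
        if libc.fallocate(fd, FALLOC_FL_ZERO_RANGE, 0, length) != 0:
            # Fallback: emulate zeroing with plain zero writes
            buf = b"\0" * (1 << 20)
            os.lseek(fd, 0, os.SEEK_SET)
            for _ in range(length // len(buf)):
                os.write(fd, buf)
        elapsed = time.monotonic() - start
        return length / max(elapsed, 1e-9)
    finally:
        os.close(fd)
```

A large gap between the rate reported for a local filesystem path and for an LV-backed path is exactly the asymmetry that made the pre-zeroing pass so expensive in these tests.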
I posted V2 of the patches, with all the patches I sent previously (unmodified) and this commit backported (no conflicts, thankfully).

Best regards,
Maxim Levitsky

Verified this bug as below; the write performance improved with both qemu-nbd and nbdkit, so setting this bug to verified. Thanks.

Result:

    qemu-nbd: 0m31.627s ----> 0m21.189s
    nbdkit:   0m18.145s ----> 0m10.222s

Tested with:

1. Nbdkit info:

    # rpm -qa | grep nbdkit
    nbdkit-devel-1.8.0-3.el7.x86_64
    nbdkit-debuginfo-1.8.0-3.el7.x86_64
    nbdkit-1.8.0-3.el7.x86_64
    nbdkit-plugin-python-common-1.8.0-3.el7.x86_64
    nbdkit-example-plugins-1.8.0-3.el7.x86_64
    nbdkit-plugin-python2-1.8.0-3.el7.x86_64
    nbdkit-plugin-vddk-1.8.0-3.el7.x86_64
    nbdkit-basic-plugins-1.8.0-3.el7.x86_64

2. NVMe disk info:

    # lsblk | tail -n 3
    nvme0n1     259:0 0 745.2G 0 disk
    ├─nvme0n1p1 259:1 0   105G 0 part
    └─nvme0n1p2 259:2 0     6G 0 part

3. The source image:

    # dd if=/dev/urandom of=test.img bs=5M count=1024
    # qemu-img info test.img
    image: test.img
    file format: raw
    virtual size: 5.0G (5368709120 bytes)
    disk size: 5.0G

Steps:

In qemu-kvm-rhev-2.12.0-38.el7:

With qemu-nbd:

1. Export the target NVMe partition over NBD:

    # qemu-nbd -f raw /dev/nvme0n1p2 -p 9000 -t

2. Convert the image:

    # time qemu-img convert test.img nbd:localhost:9000 -n -p
    (100.00/100%)
    real 0m21.189s
    user 0m1.003s
    sys 0m4.716s

With nbdkit:

    # time nbdkit file file=/dev/nvme0n1p2 -p 9000 --run 'qemu-img convert test.img -p -n nbd:localhost:9000'
    (100.00/100%)
    real 0m10.222s
    user 0m0.438s
    sys 0m3.592s

In qemu-kvm-rhev-2.12.0-31.el7:

With qemu-nbd:

1. Export the target NVMe partition over NBD:

    # qemu-nbd -f raw /dev/nvme0n1p2 -p 9000 -t

2. Convert the image:

    # time qemu-img convert test.img nbd:localhost:9000 -n -p
    (100.00/100%)
    real 0m31.627s
    user 0m2.243s
    sys 0m6.681s

With nbdkit:

    # time nbdkit file file=/dev/nvme0n1p2 -p 9000 --run 'qemu-img convert test.img -p -n nbd:localhost:9000'
    (100.00/100%)
    real 0m18.145s
    user 0m0.465s
    sys 0m3.986s

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:1216
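As a footnote, the percent-change figures quoted in the benchmark tables earlier in this bug can be recomputed from the raw wall-clock times. A trivial helper, hypothetical and only for reading the tables:

```python
def percent_change(baseline_s, new_s):
    """Throughput change versus a baseline run, as in the benchmark tables:
    positive means the new run is faster than the baseline."""
    return (baseline_s / new_s - 1.0) * 100.0

# Verified results from this bug:
# qemu-nbd: 0m31.627s -> 0m21.189s (roughly +49%)
# nbdkit:   0m18.145s -> 0m10.222s (roughly +78%)
qemu_nbd_gain = percent_change(31.627, 21.189)
nbdkit_gain = percent_change(18.145, 10.222)
```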