Description of problem:
Conversion performance is poor when converting guests with modular virt-v2v.

Version-Release number of selected component (if applicable):
virt-v2v-1.45.96-1.el9.x86_64
libguestfs-1.46.1-2.el9.x86_64
guestfs-tools-1.46.1-6.el9.x86_64
nbdkit-server-1.28.4-1.el9.x86_64
libvirt-libs-8.0.0-0rc1.1.el9.x86_64
qemu-img-6.2.0-3.el9.x86_64
virtio-win-1.9.19-5.el9_b.noarch
python3-ovirt-engine-sdk4-4.4.15-1.el9ev.x86_64
rhv-4.4.10.2-0.1.el8ev

How reproducible:
80%

Steps to Reproduce:
1. Convert 4 guests from different VMware hosts via vddk+rhv-upload at the same time to compare the performance of virt-v2v-1.45.96-1 and virt-v2v-1.45.3-3

# virt-v2v -ic vpx://root@vcenter_ip/data/esxi_host/?no_verify=1 -it vddk -io vddk-libdir=/home/$vddk -io vddk-thumbprint=B5:52:1F:B4:21:09:45:24:51:32:56:F6:63:6A:93:5D:54:08:2D:78 -ip /home/passwd -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api -op /home/rhvpasswd -os nfs_data -b ovirtmgmt $guest

v2v_version   ESXi7.0+vddk7.0.2      ESXi7.0+vddk6.7        ESXi6.7+vddk7.0.2      ESXi6.7+vddk6.7

1.45.96-1     Convert guest: 1m2s    Convert guest: 3m32s   Convert guest: 1m5s    Convert guest: 5m2s
              Copying disk: 41m45s   Copying disk: 36m10s   Copying disk: 40m2s    Copying disk: 36m19s

1.45.3-3      Convert guest: 1m8s    Convert guest: 3m7s    Convert guest: 1m8s    Convert guest: 4m13s
              Copying disk: 10m47s   Copying disk: 9m58s    Copying disk: 9m45s    Copying disk: 9m59s

2. Convert guests in the ways below at the same time to compare the performance of virt-v2v-1.45.96-1 and virt-v2v-1.45.3-3

# virt-v2v -ic vpx://root@vcenter_ip/data/esxi_host/?no_verify=1 -ip /home/passwd -o json -os /home $guest

# virt-v2v -i vmx -it ssh ssh://root@esxi_host/vmfs/volumes/esx6.7-matrix/esx6.7-rhel8.4-x86_64/esx6.7-rhel8.4-x86_64.vmx -ip /home/passwd -o local -os /home

-----------------------------------------------------------------------------
v2v_version   rhv_to_json             vmx+ssh_to_local

1.45.96-1     Convert guest: 16m46s   Convert guest: 1m22s
              Copying disk: 13m30s    Copying disk: 50s

1.45.3-3      Convert guest: 9m48s    Convert guest: 1m10s
              Copying disk: 8m20s     Copying disk: 2m9s
------------------------------------------------------------------------------

Actual results:
(1) Conversion performance of virt-v2v-1.45.96-1 is almost the same as virt-v2v-1.45.3-3 when converting guests from VMware via vddk and vmx+ssh
(2) Conversion performance of virt-v2v-1.45.96-1 is not as good as virt-v2v-1.45.3-3 when converting guests from VMware without vddk
(3) Copying disk performance of virt-v2v-1.45.96-1 is not as good as virt-v2v-1.45.3-3 when converting guests to RHV via rhv-upload
(4) Copying disk performance of virt-v2v-1.45.96-1 is better than virt-v2v-1.45.3-3 only when the target is local
(5) Copying disk performance of virt-v2v-1.45.96-1 is not as good as virt-v2v-1.45.3-3 when the target is json

Expected results:
Conversion performance of modular virt-v2v is as good as that of the old virt-v2v.

Additional info:
Can't test output '-o rhv' because of bug 2027598
FWIW some upstream analysis:
https://listman.redhat.com/archives/libguestfs/2022-January/thread.html#00055
https://listman.redhat.com/archives/libguestfs/2022-January/thread.html#00057
https://listman.redhat.com/archives/libguestfs/2022-January/thread.html#00058
So one thing that came out of the upstream analysis is that modular virt-v2v always flushes the data out to disk, whereas old virt-v2v did not do that.

As a result, to get fair comparisons you *must* do a "sync" after virt-v2v, and include the time taken for sync in the total time.  For example:

$ virt-v2v -i ... -o ...
$ time sync

real    0m48.795s
user    0m0.000s
sys     0m0.062s

And add 48 seconds to the total time.  It probably won't close the gap given the large differences shown in comment 0.
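An equivalent single measurement, for anyone scripting these comparisons (just a sketch; the -i/-o arguments are placeholders exactly as above):

$ time ( virt-v2v -i ... -o ... ; sync )

The "real" figure then includes both the copy and the final writeback, which is what a fair comparison needs.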
I'm not sure if we have a specific bug for -i disk -> -o rhv-upload, but I'm finally able to reproduce the slowdown in this case locally, and it is very clear and reproducible.

$ time ./run virt-v2v -i disk fedora-35.qcow2 -o rhv-upload -oc https://ovirt4410/ovirt-engine/api -op /tmp/ovirt-passwd -oo rhv-direct -os ovirt-data -on test3

Virt-v2v 1.44.2:             1m22
Virt-v2v 1.45.97@18b11018:   2m20

The test guest is the standard performance guest described here:
https://listman.redhat.com/archives/libguestfs/2022-January/msg00055.html
Not much clue what's going on here, but I posted some questions upstream:
https://listman.redhat.com/archives/libguestfs/2022-February/msg00109.html
It turns out that we were tickling a 60 second timeout in the oVirt code because modular virt-v2v changed the order in which some operations were done (in a neutral way - this is really a problem in oVirt).  Long story is here:
https://listman.redhat.com/archives/libguestfs/2022-February/thread.html#00111

I have a small patch which restores performance so it's now about the same as old virt-v2v within measurement error (sometimes a bit faster).

Virt-v2v 1.44.2:

$ time ./run virt-v2v -i disk /var/tmp/fedora-35.qcow2 -o rhv-upload -oc https://ovirt4410.home.annexia.org/ovirt-engine/api -op /tmp/ovirt-passwd -oo rhv-direct -os ovirt-data -on test11 -of raw
[   0.4] Opening the source -i disk /var/tmp/fedora-35.qcow2
[   0.5] Creating an overlay to protect the source from being modified
[   0.5] Opening the overlay
[   8.3] Inspecting the overlay
[  10.9] Checking for sufficient free disk space in the guest
[  10.9] Estimating space required on target for each disk
[  10.9] Converting Fedora Linux 35 (Thirty Five) to run on KVM
virt-v2v: warning: /files/boot/grub2/device.map/hd0 references unknown device "vda".  You may have to fix this entry manually after conversion.
virt-v2v: This guest has virtio drivers installed.
[  36.9] Mapping filesystem data to avoid copying unused and blank areas
[  37.7] Closing the overlay
[  38.4] Assigning disks to buses
[  38.4] Checking if the guest needs BIOS or UEFI to boot
[  38.4] Initializing the target -o rhv-upload -oc https://ovirt4410.home.annexia.org/ovirt-engine/api -op /tmp/ovirt-passwd -os ovirt-data
[  39.7] Copying disk 1/1 to qemu URI json:{ "file.driver": "nbd", "file.path": "/run/user/1000/v2vnbdkit.kQgYQb/nbdkit3.sock", "file.export": "/" } (raw)
    (100.00/100%)
[  72.8] Creating output metadata
[  73.3] Finishing off

real    1m13.644s
user    0m1.832s
sys     0m4.845s

Virt-v2v 1.45.97 + patch:

$ time ./run virt-v2v -i disk /var/tmp/fedora-35.qcow2 -o rhv-upload -oc https://ovirt4410.home.annexia.org/ovirt-engine/api -op /tmp/ovirt-passwd -oo rhv-direct -os ovirt-data -on test10 -of raw
[   0.0] Setting up the source: -i disk /var/tmp/fedora-35.qcow2
[   1.0] Opening the source
[   9.0] Inspecting the source
[  11.7] Checking for sufficient free disk space in the guest
[  11.7] Converting Fedora Linux 35 (Thirty Five) to run on KVM
virt-v2v: warning: /files/boot/grub2/device.map/hd0 references unknown device "vda".  You may have to fix this entry manually after conversion.
virt-v2v: This guest has virtio drivers installed.
[  38.1] Mapping filesystem data to avoid copying unused and blank areas
[  39.1] Closing the overlay
[  39.7] Assigning disks to buses
[  39.7] Checking if the guest needs BIOS or UEFI to boot
[  39.7] Setting up the destination: -o rhv-upload -oc https://ovirt4410.home.annexia.org/ovirt-engine/api -os ovirt-data
[  59.5] Copying disk 1/1
█ 100% [****************************************]
[  71.2] Creating output metadata
[  74.1] Finishing off

real    1m14.365s
user    0m8.183s
sys     0m13.769s

The patch is here:
https://listman.redhat.com/archives/libguestfs/2022-February/msg00128.html
This patch is now upstream in:
https://github.com/libguestfs/virt-v2v/commit/d69ba56b2f4bc642ce59bfc6bdd5c137480bf8c3

I have a further question for Nir and it may be possible to gain significantly more performance, so I'm going to leave this bug in POST for now:
https://listman.redhat.com/archives/libguestfs/2022-February/msg00130.html
I've created an upstream oVirt bug to discuss the slow disk creation problem: bug 2053103
Referring to comment 9, moving the bug back to ASSIGNED.
(In reply to Richard W.M. Jones from comment #3)
> So one thing that came out of the upstream analysis is that modular
> virt-v2v always flushes the data out to disk, whereas old virt-v2v
> did not do that.

Old virt-v2v used qemu-img convert, and it always flushes at the end.  It would be bad if it didn't.

Here is an example:

$ nbdkit -f -v file file=dst.raw 2>&1 | grep flush
nbdkit: file[1]: debug: file: can_flush
nbdkit: file.0: debug: file: flush

$ qemu-img convert -n -W fedora-35.raw nbd://localhost

flush was called.

When using old virt-v2v, we always had 2 flushes in the imageio logs.  One flush came from qemu-img, the other one from nbdkit (a bug).

> As a result, to get fair comparisons you *must* do a "sync" after
> virt-v2v, and include the time taken for sync in the total time.  For
> example:
>
> $ virt-v2v -i ... -o ...
> $ time sync
>
> real    0m48.795s
> user    0m0.000s
> sys     0m0.062s
>
> And add 48 seconds to the total time.

This should not be needed to compare times.
(In reply to mxie from comment #9)

I think we need logs to understand what's going on.  I'm not sure about the flows involving vddk, but for the local import to RHV it will help if you attach here the v2v log and the imageio logs from the host performing the import.
(In reply to Nir Soffer from comment #11)
> (In reply to Richard W.M. Jones from comment #3)
> > So one thing that came out of the upstream analysis is that modular
> > virt-v2v always flushes the data out to disk, whereas old virt-v2v
> > did not do that.
>
> Old virt-v2v used qemu-img convert, and it always flushes at the end.
> It would be bad if it didn't.
>
> Here is an example:
>
> $ nbdkit -f -v file file=dst.raw 2>&1 | grep flush
> nbdkit: file[1]: debug: file: can_flush
> nbdkit: file.0: debug: file: flush
>
> $ qemu-img convert -n -W fedora-35.raw nbd://localhost
>
> flush was called.

This isn't always true.  Old virt-v2v in modes such as -o local actually did something like this (ie. writing directly to the output file):

$ qemu-img convert -n -W overlay.qcow2 guest.img

If you strace the qemu-img command you'll see it doesn't fsync the output.

You can also try this with virt-v2v 1.44.2:

$ sync; ./run virt-v2v -i disk /var/tmp/fedora-35.qcow2 -o local -os /var/tmp/ -of raw; time sync

and you'll see the final sync command takes a few seconds (depending on the size of the input and speed of the disk).

I think qemu-img convert behaves differently if the output is an NBD server.  It appears if the server advertises flush then it will send it at the end.

> When using old virt-v2v, we always had 2 flushes in imageio logs.  One flush
> came from qemu-img, the other one from nbdkit (bug).
>
> > As a result, to get fair comparisons you *must* do a "sync" after
> > virt-v2v, and include the time taken for sync in the total time.  For
> > example:
> >
> > $ virt-v2v -i ... -o ...
> > $ time sync
> >
> > real    0m48.795s
> > user    0m0.000s
> > sys     0m0.062s
> >
> > And add 48 seconds to the total time.
>
> This should not be needed to compare times.

Maybe not for -o rhv-upload, but it definitely is for other outputs.

(In reply to Nir Soffer from comment #12)
> (In reply to mxie from comment #9)
> I think we need logs to understand what's going on.  I'm not sure about the
> flows involving vddk, but for the local import to RHV it will help if you
> attach here the v2v log and imageio logs from the host performing the import.

Yes I'd like to see the logs too.
(In reply to Richard W.M. Jones from comment #13)
> (In reply to Nir Soffer from comment #11)
> > Old virt-v2v used qemu-img convert, and it always flushes at the end.
> > It would be bad if it didn't.
>
> This isn't always true.  Old virt-v2v in modes such as -o local actually
> did something like this (ie. writing directly to the output file):
>
> $ qemu-img convert -n -W overlay.qcow2 guest.img
>
> If you strace the qemu-img command you'll see it doesn't fsync the output.

Yes, this is very bad:

$ strace -f -e fdatasync qemu-img convert fedora-35.raw dst.raw 2>&1 | grep fdatasync

$ strace -f -e fdatasync qemu-img convert -t unsafe fedora-35.raw dst.raw 2>&1 | grep fdatasync

Fortunately RHV always uses -t none so we always have a flush:

$ strace -f -e fdatasync qemu-img convert -t writeback fedora-35.raw dst.raw 2>&1 | grep fdatasync
[pid 30951] fdatasync(8)                = 0

$ strace -f -e fdatasync qemu-img convert -t none fedora-35.raw dst.raw 2>&1 | grep fdatasync
[pid 31187] fdatasync(8)                = 0

Kevin, avoiding flushes during the copy makes sense, but flushing at the end sounds like a better default to me for the use case of qemu-img convert.  Should we file a qemu-img bug for this?
(In reply to mxie from comment #14)
> (In reply to Nir Soffer from comment #12)
> > (In reply to mxie from comment #9)
> > I think we need logs to understand what's going on.
>
> All v2v conversions of comment 9 are executed on a standalone v2v server rather
> than a RHV node, and the v2v debug logs of comment 9 are in
> http://fileshare.englab.nay.redhat.com/pub/section3/libvirtmanual/mxie/pre-verify-bug2039255/

Are you using -oo rhv-direct=true when using -o rhv-upload?

If you don't, every request to the imageio server goes via the imageio proxy on the engine host.  This is typically 50% slower compared with sending directly to the host.

With small requests (modular virt-v2v uses 256k instead of 2m), this creates huge overhead that may explain the slowdown.

This is also *not* the recommended usage, and testing it in the context of performance testing is not a good use of our time.

When testing performance we should always use:

    virt-v2v -o rhv-upload -oo rhv-direct=true

And we should run virt-v2v on a RHV hypervisor node (not on the manager node).  Testing on another host which is not part of the cluster is nice to have.

Testing without -oo rhv-direct is nice to have for completeness, but I don't think we should spend time on this use case.
> Are you using -oo rhv-direct=true when using -o rhv-upload?
>
> If you don't, every request to the imageio server goes via the imageio proxy
> on the engine host.  This is typically 50% slower compared with sending
> directly to the host.
>
> With small requests (modular virt-v2v uses 256k instead of 2m), this creates
> huge overhead that may explain the slowdown.
>
> This is also *not* the recommended usage, and testing it in the context of
> performance testing is not a good use of our time.
>
> When testing performance we should always use:
>
>     virt-v2v -o rhv-upload -oo rhv-direct=true

Hi Richard, do you think it's necessary to add -oo rhv-direct=true to the rhv-upload conversions and retest step 1 of comment 9?

> And we should run virt-v2v on a RHV hypervisor node (not on the manager
> node).  Testing on another host which is not part of the cluster is nice to
> have.

The v2v version to be tested is a RHEL 9 build but the RHV 4.4 node is based on RHEL 8, so we can't run v2v on a RHV node.
My testing is using 4 separate machines:

  VMware ESXi ----> virt-v2v  ----> oVirt host   (& oVirt Engine separately)
                    Fedora 36       RHEL 8.5        RHEL 8.5

I'm _not_ using rhv-direct, because I thought that this only works when you run virt-v2v on the oVirt host.  Also for the same reason as mxie above, I cannot test recent virt-v2v on RHEL 8.

I'll also note: https://bugzilla.redhat.com/show_bug.cgi?id=2033096
I should say that when comparing virt-v2v 1.44 and virt-v2v 1.45.98, I'm running both on the same Fedora 36 / Rawhide. So even if using the rhv proxy is bad, I'm not sure it can be the cause of the problem.
Sorry, I accidentally cancelled the needinfo set for Kevin on comment 15.
(In reply to Richard W.M. Jones from comment #18)
> My testing is using 4 separate machines:
>
>   VMware ESXi ----> virt-v2v  ----> oVirt host   (& oVirt Engine separately)
>                     Fedora 36       RHEL 8.5        RHEL 8.5
>
> I'm _not_ using rhv-direct, because I thought that this only works when you
> run virt-v2v on the oVirt host.  Also for the same reason as mxie above,
> I cannot test recent virt-v2v on RHEL 8.

-oo rhv-direct=true works and should be the default in virt-v2v to avoid this confusion.
(In reply to Richard W.M. Jones from comment #19)
> I should say that when comparing virt-v2v 1.44 and virt-v2v 1.45.98,
> I'm running both on the same Fedora 36 / Rawhide.  So even if using
> the rhv proxy is bad, I'm not sure it can be the cause of the problem.

I think it amplifies the problem - for every NBD command, we:

1. send an HTTP request to the proxy
2. the proxy sends an HTTP request to the host
3. the host sends an NBD command to qemu-nbd
4. the host returns the reply to the proxy
5. the proxy returns the reply to us

I see a 1.8x speed up when using local virt-v2v, communicating with imageio via a unix socket, by increasing the request size from 256k to 4m.  When working with a remote server via a proxy, the overhead for each request is much larger.

Of course there may be an issue on the input side, which changed a lot, but comment 9 shows a clear issue on the output side:

> v2v_version   local_to_rhv_upload
>
> 1.45.98-1     Convert guest: 2m49s
>               Copying disk: 7m57s
>
> 1.45.3-3      Convert guest: 2m29s
>               Copying disk: 3m51s
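(A rough illustration of why the request size matters so much here, using an assumed per-request overhead rather than a measured one: copying 10 GiB in 256k requests means about 40,960 HTTP round trips, whereas 4m requests need only about 2,560.  If each extra hop through the proxy adds even 1 millisecond per request, that alone is roughly 41 seconds versus roughly 2.5 seconds of pure overhead, before any data is transferred.)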
(In reply to Nir Soffer from comment #21)
> -oo rhv-direct=true works and should be the default in virt-v2v to avoid this
> confusion.

I just realised that I _am_ in fact using -oo rhv-direct!  It was hiding in the very long (6 line) virt-v2v command I'm using.  If it works, then yes we should use it, and as you can see from bug 2033096 the aim is to make that the default.

Out of interest I just reran my tests with and without -oo rhv-direct to see if it could account for the difference.  Results below for me.  These are all local vddk -> -o rhv-upload, using VDDK 7.0.3 and ESXi 7:

                    -oo rhv-direct=true    -oo rhv-direct=false
virt-v2v 1.45.98    5m15                   5m19
virt-v2v 1.45.3     6m11                   5m51
virt-v2v 1.44.2     5m11                   4m59

At the moment I cannot reliably reproduce this bug.
(In reply to Richard W.M. Jones from comment #23)
> Out of interest I just reran my tests with and without -oo rhv-direct
> to see if it could account for the difference.  Results below for me.
> These are all local vddk -> -o rhv-upload, using VDDK 7.0.3 and ESXi 7:
>
>                     -oo rhv-direct=true    -oo rhv-direct=false
> virt-v2v 1.45.98    5m15                   5m19
> virt-v2v 1.45.3     6m11                   5m51
> virt-v2v 1.44.2     5m11                   4m59

The timings look identical - it looks like -oo rhv-direct is broken.  We either always use direct mode or never.

Do you have imageio logs for these tests?  We see the client address in the CLOSE log:

2022-02-14 19:28:00,215 INFO    (Thread-125) [http] CLOSE connection=125 client=::ffff:192.168.122.10 ...
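A quick way to pull those client fields out on the host running imageio (sketch only - the daemon log path below is the usual default location and may differ on your deployment):

$ grep -o 'client=[^ ]*' /var/log/ovirt-imageio/daemon.log

With -oo rhv-direct=true the non-local entries should show the IP of the machine running virt-v2v; via the proxy they show the engine's IP instead.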
With -oo rhv-direct=true:

client=local
client=local
client=::ffff:192.168.0.139
client=local
client=local
client=::ffff:192.168.0.139
client=::ffff:192.168.0.139
client=::ffff:192.168.0.139
client=::ffff:192.168.0.139
client=local

(192.168.0.139 == IP address of machine running virt-v2v)

With -oo rhv-direct=false:

client=local
client=local
client=::ffff:192.168.0.210
client=local
client=local
client=local
client=::ffff:192.168.0.210
client=::ffff:192.168.0.210
client=::ffff:192.168.0.210
client=::ffff:192.168.0.210

(192.168.0.210 == IP address of oVirt engine)
(In reply to Richard W.M. Jones from comment #26)

Your test looks correct.  So this shows that the bottleneck is the input side - if you send data slowly enough, it does not matter if you use the proxy or not.

Because nbdcopy uses async reads and writes, the reads are never blocked by slow writes.  In the imageio client we have 4 threads, but every one blocks either on read or on write, so a slow write also slows down reading from the source.

This is what I see on my test system, using oVirt 4.5, uploading a local file (fedora 35 + 3g of random data) from my laptop to an oVirt system running as VMs on the laptop.  I tested these combinations:

#   combination                                           seconds
--------------------------------------------------------------
1   Simulating virt-v2v[1] request size 256k              16.98
2   Simulating virt-v2v[1] request size 4m                 9.33
3   Simulating virt-v2v[1] via proxy, request size 256k   30.60
4   Simulating virt-v2v[1] via proxy, request size 4m     15.21
5   Normal upload[2] read size 4m                          8.49
6   Normal upload[2] read size 256k                        7.77
7   Normal upload[2] via proxy, read size 4m              14.68
8   Normal upload[2] via proxy, read size 256k            12.63

[1] detecting zeroes and sending small (256k) http requests.
    Sends one HTTP request per NBD read.
[2] The normal upload uses the read size only for reading from qemu-nbd,
    sending one HTTP request per extent (splitting large extents into 128m chunks).

You can see that there is a huge difference between sending one request per NBD read and one request per extent, and that using the proxy is much slower.  Also the advantage of a larger request size is very clear.
(In reply to Nir Soffer from comment #15)
> Kevin, avoiding flushes during the copy makes sense, but flushing at the end
> sounds like a better default to me for the use case of qemu-img convert.
> Should we file a qemu-img bug for this?

I'm not sure about the right behaviour there; it completely depends on what you're going to do with the copy next.  As this is with 'cache=unsafe', I think it makes sense to optimise for the shortest runtime of qemu-img and avoid the flush - it might be just a temporary file that you continue processing from the page cache, and flushing would only be unnecessary overhead.

If you do want to have the image safe on disk, you can always call 'sync' on the file next.
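Both options look like this in practice (illustrative commands only; the file names are placeholders, and coreutils sync needs to be recent enough to accept a file argument):

$ qemu-img convert -t none src.qcow2 dst.raw              # qemu flushes the target itself
$ qemu-img convert src.qcow2 dst.raw && sync dst.raw      # default (unsafe) cache, sync the file afterwards

The second form keeps the fast copy and only pays for the flush if the caller actually needs the data on stable storage.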
Verify the bug with the builds below:
virt-v2v-1.45.99-1.el9.x86_64
libguestfs-1.46.1-2.el9.x86_64
guestfs-tools-1.46.1-6.el9.x86_64
libvirt-libs-8.0.0-4.el9.x86_64
qemu-img-6.2.0-8.el9.x86_64
nbdkit-1.28.5-1.el9.x86_64
libnbd-1.10.5-1.el9.x86_64
virtio-win-1.9.19-5.el9_b.noarch

Steps:
1. Convert three different guests from ESXi7.0 to RHV (rhv-upload) via different vddk versions at the same time to compare the performance of virt-v2v-1.45.99-1 and virt-v2v-1.45.3-3; additionally, add the -oo rhv-direct/rhv-proxy option to the command line when converting guests with the different v2v versions

# virt-v2v -ic vpx://root@center_ip/data/esxi_host/?no_verify=1 -it vddk -io vddk-libdir=/home/vddk7.0.2 -io vddk-thumbprint=xx:xx:xx:... -ip /home/passwd -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api -op /home/rhvpasswd -os nfs_data -b ovirtmgmt esx7.0-win2022-x86_64 (-oo rhv-direct/rhv-proxy=true)

# virt-v2v -ic vpx://root@center_ip/data/esxi_host/?no_verify=1 -it vddk -io vddk-libdir=/home/vddk6.7 -io vddk-thumbprint=xx:xx:xx:... -ip /home/passwd -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api -op /home/rhvpasswd -os nfs_data -b ovirtmgmt esx7.0-rhel8.5-x86_64 (-oo rhv-direct/rhv-proxy=true)

# virt-v2v -ic vpx://root@center_ip/data/esxi_host/?no_verify=1 -it vddk -io vddk-libdir=/home/vddk6.5 -io vddk-thumbprint=xx:xx:xx:... -ip /home/passwd -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api -op /home/rhvpasswd -os nfs_data -b ovirtmgmt esx7.0-win11-x86_64 (-oo rhv-direct/rhv-proxy=true)

-------------------------------------------------------------------------------------------------
v2v_version                  ESXi7.0+vddk7.0.2      ESXi7.0+vddk6.7        ESXi7.0+vddk6.5

1.45.99-1                    Convert guest: 57s     Convert guest: 2m25s   Convert guest: 46s
                             Copying disk: 9m44s    Copying disk: 10m4s    Copying disk: 15m57s

1.45.3-3(rhv-direct=true)    Convert guest: 55s     Convert guest: 2m19s   Convert guest: 1m
                             Copying disk: 9m30s    Copying disk: 9m25s    Copying disk: 12m48s
--------------------------------------------------------------------------------------------------
v2v_version                  ESXi7.0+vddk7.0.2      ESXi7.0+vddk6.7        ESXi7.0+vddk6.5

1.45.99-1(rhv-proxy=true)    Convert guest: 55s     Convert guest: 2m26s   Convert guest: 55s
                             Copying disk: 10m33s   Copying disk: 10m33s   Copying disk: 16m7s

1.45.3-3                     Convert guest: 59s     Convert guest: 2m14s   Convert guest: 54s
                             Copying disk: 10m26s   Copying disk: 10m17s   Copying disk: 13m40s
--------------------------------------------------------------------------------------------------

Hi Richard,

According to the current test results, the performance of virt-v2v-1.45.99-1 has greatly improved and is now basically similar to that of virt-v2v-1.45.3-3.  However, virt-v2v-1.45.99-1 is still a little slower than the old virt-v2v-1.45.3-3 when the vddk version is 7.0.2 or 6.7, and the gap is obvious when the vddk version is 6.5.  Do you think the performance difference is acceptable?  If yes, I will continue to test the other scenarios to compare their performance.
Thanks for the testing.  Some broader points first:

- I'm only really concerned about VDDK 7 & ESXi 7.  By the time people are really using this in RHEL 9 (probably 9.1) they will have upgraded to both.

- We only need to do performance testing with -oo rhv-direct (which is now the default).  -oo rhv-proxy is designed for situations where you don't have direct network access to the storage, and those are always going to be slow because it has to go through a proxy.

- By the way, -oo rhv-direct[=true] is the default so it's no longer needed.  -oo rhv-proxy[=true] is the opposite (use a proxy, slow).

I think the performance numbers look fine now.  It's still a few % slower, but modular virt-v2v is more capable for a few reasons:

- modular virt-v2v will allow an external program to do the copying, which means that whole system performance will be better (eventually, once we implement this fully)

- nbdcopy doesn't trash the page cache when copying to local, again a benefit to whole system performance that's not visible for single copies

> the performance gap is obvious when vddk version is 6.5

VDDK 6.5 didn't support extents, so it'll end up copying much more data.  If customers ever hit this case we'll tell them to upgrade to the latest VDDK, which is usually a simple thing to do.
As VDDK 6.5 didn't support extents, regardless of its impact on virt-v2v-1.45.99-1 performance, continue to verify the bug with the builds below:
virt-v2v-1.45.99-1.el9.x86_64
libguestfs-1.46.1-2.el9.x86_64
guestfs-tools-1.46.1-6.el9.x86_64
libvirt-libs-8.0.0-4.el9.x86_64
qemu-img-6.2.0-8.el9.x86_64
nbdkit-1.28.5-1.el9.x86_64
libnbd-1.10.5-1.el9.x86_64
virtio-win-1.9.19-5.el9_b.noarch

Steps:
1. Convert two different guests from ESXi6.7 to RHV (rhv-upload) via vddk7.0.2 and vddk6.7 at the same time to compare the performance of virt-v2v-1.45.99-1 and virt-v2v-1.45.3-3

# virt-v2v -ic vpx://root@center_ip/data/esxi_host/?no_verify=1 -it vddk -io vddk-libdir=/home/vddk7.0.2 -io vddk-thumbprint=xx:xx:xx:... -ip /home/passwd -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api -op /home/rhvpasswd -os nfs_data -b ovirtmgmt esx7.0-win2022-x86_64

# virt-v2v -ic vpx://root@center_ip/data/esxi_host/?no_verify=1 -it vddk -io vddk-libdir=/home/vddk6.7 -io vddk-thumbprint=xx:xx:xx:... -ip /home/passwd -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api -op /home/rhvpasswd -os nfs_data -b ovirtmgmt esx7.0-rhel8.5-x86_64

-----------------------------------------------------------------------------
v2v_version                  ESXi6.7+vddk7.0.2      ESXi6.7+vddk6.7

1.45.99-1                    Convert guest: 1m1s    Convert guest: 2m31s
                             Copying disk: 9m44s    Copying disk: 10m19s

1.45.3-3(rhv-direct=true)    Convert guest: 1m2s    Convert guest: 2m23s
                             Copying disk: 7m34s    Copying disk: 7m36s
-----------------------------------------------------------------------------
v2v_version                  ESXi6.7+vddk7.0.2      ESXi6.7+vddk6.7

1.45.99-1(rhv-proxy=true)    Convert guest: 51s     Convert guest: 2m26s
                             Copying disk: 10m4s    Copying disk: 10m41s

1.45.3-3                     Convert guest: 55s     Convert guest: 2m14s
                             Copying disk: 7m40s    Copying disk: 7m51s
-----------------------------------------------------------------------------

2. Convert two different guests from ESXi6.5 to RHV (rhv-upload) via vddk7.0.2 and vddk6.7 at the same time to compare the performance of virt-v2v-1.45.99-1 and virt-v2v-1.45.3-3

# virt-v2v -ic vpx://root@center_ip/data/esxi_host/?no_verify=1 -it vddk -io vddk-libdir=/home/vddk7.0.2 -io vddk-thumbprint=xx:xx:xx:... -ip /home/passwd -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api -op /home/rhvpasswd -os nfs_data -b ovirtmgmt esx7.0-win2022-x86_64

# virt-v2v -ic vpx://root@center_ip/data/esxi_host/?no_verify=1 -it vddk -io vddk-libdir=/home/vddk6.7 -io vddk-thumbprint=xx:xx:xx:... -ip /home/passwd -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api -op /home/rhvpasswd -os nfs_data -b ovirtmgmt esx7.0-rhel8.5-x86_64

-----------------------------------------------------------------------------
v2v_version                  ESXi6.5+vddk7.0.2      ESXi6.5+vddk6.7

1.45.99-1                    Convert guest: 55s     Convert guest: 4m2s
                             Copying disk: 12m37s   Copying disk: 33m39s

1.45.3-3(rhv-direct=true)    Convert guest: 3m35s   Convert guest: 5m17s
                             Copying disk: 10m12s   Copying disk: 21m26s
-----------------------------------------------------------------------------
v2v_version                  ESXi6.5+vddk7.0.2      ESXi6.5+vddk6.7

1.45.99-1(rhv-proxy=true)    Convert guest: 53s     Convert guest: 2m28s
                             Copying disk: 12m58s   Copying disk: 34m1s

1.45.3-3                     Convert guest: 3m41s   Convert guest: 4m50s
                             Copying disk: 10m34s   Copying disk: 21m20s
-----------------------------------------------------------------------------

3. Convert guests in the ways below at the same time to compare the performance of virt-v2v-1.45.99-1 and virt-v2v-1.45.3-3

3.1
# virt-v2v -ic vpx://root.198.169/data/10.73.199.217/?no_verify=1 -ip /home/passwd -o local -os /home esx7.0-win2019-x86_64

# virt-v2v -i vmx -it ssh ssh://root.75.219/vmfs/volumes/esx6.7-matrix/esx6.7-rhel8.5-x86_64/esx6.7-rhel8.5-x86_64.vmx -ip /home/passwd -o rhv -os 10.73.195.48:/home/nfs_export

-----------------------------------------------------------------------------
v2v_version   VMware_to_local         vmx+ssh_to_rhv

1.45.99-1     Convert guest: 22m31s   Convert guest: 2m3s
              Copying disk: 74m49s    Copying disk: 6m38s

1.45.3-3      Convert guest: 10m25s   Convert guest: 2m
              Copying disk: 42m41s    Copying disk: 5m15s
-----------------------------------------------------------------------------

3.2 The performance gap between virt-v2v-1.45.99-1 and virt-v2v-1.45.3-3 is too large in step 3.1, so convert another guest from VMware to local without vddk to compare their performance again

# virt-v2v -ic vpx://root.198.169/data/10.73.199.217/?no_verify=1 -ip /home/passwd -o local -os /home esx7.0-sles15sp2-x86_64

-----------------------------------------------------------------------------
v2v_version   VMware_to_local

1.45.99-1     Convert guest: 13m38s
              Copying disk: 13m17s

1.45.3-3      Convert guest: 9m55s
              Copying disk: 8m13s
-----------------------------------------------------------------------------

4. Convert a guest from an ova file to openstack to compare the performance of virt-v2v-1.45.99-1 and virt-v2v-1.45.3-3

# virt-v2v -i ova /media/tools/ova-images/esx7.0-win2019-x86_64 -o openstack -oo server-id=v2v-appliance

-----------------------------------------------------------------------------
v2v_version   ova_to_openstack

1.45.99-1     Convert guest: 3m59s
              Copying disk: 6m53s

1.45.3-3      Convert guest: 3m51s
              Copying disk: 7m5s
-----------------------------------------------------------------------------

Hi Richard,

(1) Please check the result of step 1: the performance of virt-v2v-1.45.99-1 is a little worse than that of virt-v2v-1.45.3-3 when converting guests from ESXi6.7 to RHV (rhv-upload) with vddk7.0.2 and vddk6.7; the performance gap is roughly 2 to 3 minutes.

(2) Please check the result of step 2: when converting guests from ESXi6.5 to RHV (rhv-upload) with vddk, the performance of virt-v2v-1.45.99-1 is basically similar to that of virt-v2v-1.45.3-3 if the vddk version is 7.0.2, but the performance of virt-v2v-1.45.99-1 is much worse than that of virt-v2v-1.45.3-3 if the vddk version is 6.7.
(3) Please check the result of step 3: the performance of virt-v2v-1.45.99-1 is much worse than that of virt-v2v-1.45.3-3 when converting guests from VMware to local without vddk.
Thanks for doing this comprehensive testing.  I want to chop out some tests that I don't think are useful for performance testing:

- Anything that uses rhv-proxy (!rhv-direct) is always going to be slow because all data goes through a proxy, so there's no point testing it.  This mode is only needed for corner cases where you don't have direct access to the ovirt hosts.

- Also anything with VDDK < 7.0 is not worth testing because even with ancient RHV or VMware it's easy to upgrade VDDK.

- Also any test that doesn't use VDDK (ie. is using ssh or https) as these are known to be much slower and we only provide them for people who don't or can't use VDDK for licensing reasons.

(NB I'm just talking about *performance* testing.  We still need to test that all these different modes work.)

That leaves these performance tests:

(from Step 1)

v2v_version                  ESXi6.7+vddk7.0.2

1.45.99-1                    Convert guest: 1m1s
                             Copying disk: 9m44s

1.45.3-3(rhv-direct=true)    Convert guest: 1m2s
                             Copying disk: 7m34s

- Modular virt-v2v is about 25% slower, and there's no obvious reason looking at the logs.  I suspect that tuning the request size might help.  Let's look at this in RHEL 9.1.

(from Step 2)

v2v_version                  ESXi6.5+vddk7.0.2

1.45.99-1                    Convert guest: 55s
                             Copying disk: 12m37s

1.45.3-3(rhv-direct=true)    Convert guest: 3m35s
                             Copying disk: 10m12s

- Modular virt-v2v is marginally faster overall.  What's odd about this is actually that old virt-v2v took so long to do the conversion.  Old virt-v2v spends ages doing QueryAllocatedBlocks requests during conversion (modular virt-v2v does not).  I'm not sure I understand what's going on there.

(from Step 3.1)

- Tests use curl or ssh, not VDDK.

(from Step 3.2)

- Tests use curl, not VDDK.

(from Step 4)

v2v_version   ova_to_openstack

1.45.99-1     Convert guest: 3m59s
              Copying disk: 6m53s

1.45.3-3      Convert guest: 3m51s
              Copying disk: 7m5s

- There's not a very large difference here, but -i ova in modular virt-v2v is known to be a bit slower.  We can work on making this better for RHEL 9.1.

My conclusion is I'm not very worried :-)  Nothing is broken.  The performance differences are small.  We understand well which input and output modes are fast (VDDK) and which are slow (curl & ssh).  We'll work on performance improvements in RHEL 9.1.
> (from Step 2)
> - Modular virt-v2v is marginally faster overall.  What's odd about this is
>   actually that old virt-v2v took so long to do the conversion.  Old virt-v2v
>   spends ages doing QueryAllocatedBlocks requests during conversion (modular
>   virt-v2v does not).  I'm not sure I understand what's going on there.

Oh I think I see.  This is ESX 6.5 which had weird problems with extent mapping.

Note that ESX 6.5 will be out of support in Oct 2022, which is before RHEL 9.1 is released, so I doubt many customers will be using this combination.
Thanks rjones, I think the bug can be moved to VERIFIED status according to comment 36 ~ comment 41.
(In reply to mxie from comment #38)
> 3.1
>
> # virt-v2v -ic vpx://root.198.169/data/10.73.199.217/?no_verify=1 -ip
> /home/passwd -o local -os /home esx7.0-win2019-x86_64
>
> # virt-v2v -i vmx -it ssh
> ssh://root.75.219/vmfs/volumes/esx6.7-matrix/esx6.7-rhel8.5-x86_64/esx6.7-rhel8.5-x86_64.vmx
> -ip /home/passwd -o rhv -os 10.73.195.48:/home/nfs_export
>
> -----------------------------------------------------------------------------
> v2v_version   VMware_to_local         vmx+ssh_to_rhv
>
> 1.45.99-1     Convert guest: 22m31s   Convert guest: 2m3s
>               Copying disk: 74m49s    Copying disk: 6m38s
>
> 1.45.3-3      Convert guest: 10m25s   Convert guest: 2m
>               Copying disk: 42m41s    Copying disk: 5m15s
> -----------------------------------------------------------------------------

I was looking into the cases above (curl -> null|local, ssh -> null|local).  I said above that we don't really care about performance testing here, and that's true, but I wanted to see if we are missing any easy wins.  I set up a simple test without VMware and compared 1.45.3 and 1.45.99 (methodology at end).  However I was not able to see a case where modular virt-v2v is slower beyond measurement errors, and in fact it's a bit faster in some cases:

version   curl -> null   ssh -> null   curl -> local   ssh -> local
1.45.3    190.0          221.5         186.7           220.6
1.45.99   187.5          191.9         187.4           191.8

Methodology:

For both cases, I put a Fedora 20 virt-builder image on to a local web server.

For curl, create /var/tmp/input.xml from the example in the virt-v2v manual, modifying the disk section:

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw'/>
      <source protocol='http' name='/fedora-20.img'>
        <host name='webserver' port='80'/>
      </source>
      <target dev='hda' bus='ide'/>
    </disk>

For ssh, I modified an existing .vmx file to point to the Fedora image and hosted that on the same webserver.

$ virt-v2v -i libvirtxml /var/tmp/input.xml [ -o null | -o local -os /var/tmp ]
$ virt-v2v -i vmx -it ssh ssh://webserver/public_html/fedora-20.vmx [ -o null | -o local -os /var/tmp ]
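For anyone reproducing the curl methodology, a complete minimal input.xml might look roughly like this (a sketch only: the domain name, memory size, and web server host are placeholders, and the real example in the virt-v2v(1) manual should be preferred):

$ cat > /var/tmp/input.xml <<'EOF'
<domain type='kvm'>
  <name>fedora-20</name>
  <memory unit='MiB'>2048</memory>
  <vcpu>2</vcpu>
  <os>
    <type>hvm</type>
    <boot dev='hd'/>
  </os>
  <devices>
    <!-- disk section as above: fetch the image over HTTP from the web server -->
    <disk type='network' device='disk'>
      <driver name='qemu' type='raw'/>
      <source protocol='http' name='/fedora-20.img'>
        <host name='webserver' port='80'/>
      </source>
      <target dev='hda' bus='ide'/>
    </disk>
  </devices>
</domain>
EOF

The file is then passed directly to "virt-v2v -i libvirtxml /var/tmp/input.xml" as in the commands above.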
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (new packages: virt-v2v), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2022:2566