Description of problem:
Conversion performance is poor when converting guests with modular virt-v2v.

Version-Release number of selected component (if applicable):
virt-v2v-1.45.96-1.el9.x86_64
libguestfs-1.46.1-2.el9.x86_64
guestfs-tools-1.46.1-6.el9.x86_64
nbdkit-server-1.28.4-1.el9.x86_64
libvirt-libs-8.0.0-0rc1.1.el9.x86_64
qemu-img-6.2.0-3.el9.x86_64
virtio-win-1.9.19-5.el9_b.noarch
python3-ovirt-engine-sdk4-4.4.15-1.el9ev.x86_64
rhv-4.4.10.2-0.1.el8ev

How reproducible:
80%

Steps to Reproduce:
1. Convert 4 guests from different VMware hosts via vddk+rhv-upload at the same time to compare the performance of virt-v2v-1.45.96-1 and virt-v2v-1.45.3-3

# virt-v2v -ic vpx://root@vcenter_ip/data/esxi_host/?no_verify=1 -it vddk -io vddk-libdir=/home/$vddk -io vddk-thumbprint=B5:52:1F:B4:21:09:45:24:51:32:56:F6:63:6A:93:5D:54:08:2D:78 -ip /home/passwd -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api -op /home/rhvpasswd -os nfs_data -b ovirtmgmt $guest

v2v_version   ESXi7.0+vddk7.0.2      ESXi7.0+vddk6.7        ESXi6.7+vddk7.0.2      ESXi6.7+vddk6.7

1.45.96-1     Convert guest: 1m2s    Convert guest: 3m32s   Convert guest: 1m5s    Convert guest: 5m2s
              Copying disk: 41m45s   Copying disk: 36m10s   Copying disk: 40m2s    Copying disk: 36m19s

1.45.3-3      Convert guest: 1m8s    Convert guest: 3m7s    Convert guest: 1m8s    Convert guest: 4m13s
              Copying disk: 10m47s   Copying disk: 9m58s    Copying disk: 9m45s    Copying disk: 9m59s

2. Convert guests in the ways below at the same time to compare the performance of virt-v2v-1.45.96-1 and virt-v2v-1.45.3-3

# virt-v2v -ic vpx://root@vcenter_ip/data/esxi_host/?no_verify=1 -ip /home/passwd -o json -os /home $guest

# virt-v2v -i vmx -it ssh ssh://root@esxi_host/vmfs/volumes/esx6.7-matrix/esx6.7-rhel8.4-x86_64/esx6.7-rhel8.4-x86_64.vmx -ip /home/passwd -o local -os /home

-----------------------------------------------------------------------------
v2v_version   rhv_to_json             vmx+ssh_to_local

1.45.96-1     Convert guest: 16m46s   Convert guest: 1m22s
              Copying disk: 13m30s    Copying disk: 50s

1.45.3-3      Convert guest: 9m48s    Convert guest: 1m10s
              Copying disk: 8m20s     Copying disk: 2m9s
------------------------------------------------------------------------------

Actual results:
(1) Conversion performance of virt-v2v-1.45.96-1 is almost the same as virt-v2v-1.45.3-3 when converting guests from VMware via vddk and vmx+ssh
(2) Conversion performance of virt-v2v-1.45.96-1 is not as good as virt-v2v-1.45.3-3 when converting guests from VMware without vddk
(3) Copying disk performance of virt-v2v-1.45.96-1 is not as good as virt-v2v-1.45.3-3 when converting guests to RHV via rhv-upload
(4) Copying disk performance of virt-v2v-1.45.96-1 is better than virt-v2v-1.45.3-3 only when the target is local
(5) Copying disk performance of virt-v2v-1.45.96-1 is not as good as virt-v2v-1.45.3-3 when the target is json

Expected results:
Conversion performance of modular virt-v2v is as good as that of the old virt-v2v.

Additional info:
Can't test output '-o rhv' because of bug 2027598
FWIW some upstream analysis:
https://listman.redhat.com/archives/libguestfs/2022-January/thread.html#00055
https://listman.redhat.com/archives/libguestfs/2022-January/thread.html#00057
https://listman.redhat.com/archives/libguestfs/2022-January/thread.html#00058
So one thing that came out of the upstream analysis is that modular virt-v2v always flushes the data out to disk, whereas old virt-v2v did not do that.

As a result, to get fair comparisons you *must* do a "sync" after virt-v2v, and include the time taken for sync in the total time.  For example:

$ virt-v2v -i ... -o ...
$ time sync

real    0m48.795s
user    0m0.000s
sys     0m0.062s

And add 48 seconds to the total time.  It probably won't close the gap given the large differences shown in comment 0.
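An equivalent single measurement, for anyone scripting these comparisons (just a sketch; the -i/-o arguments are placeholders exactly as above):

$ time ( virt-v2v -i ... -o ... ; sync )

The "real" figure then includes both the copy and the final writeback, which is what a fair comparison needs.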
I'm not sure if we have a specific bug for -i disk -> -o rhv-upload, but I'm finally able to reproduce the slowdown in this case locally, and it is very clear and reproducible.

$ time ./run virt-v2v -i disk fedora-35.qcow2 -o rhv-upload -oc https://ovirt4410/ovirt-engine/api -op /tmp/ovirt-passwd -oo rhv-direct -os ovirt-data -on test3

Virt-v2v 1.44.2:             1m22
Virt-v2v 1.45.97@18b11018:   2m20

The test guest is the standard performance guest described here:
https://listman.redhat.com/archives/libguestfs/2022-January/msg00055.html
Not much clue what's going on here, but I posted some questions upstream:
https://listman.redhat.com/archives/libguestfs/2022-February/msg00109.html
It turns out that we were tickling a 60 second timeout in the oVirt code because modular virt-v2v changed the order in which some operations were done (in a neutral way - this is really a problem in oVirt).  Long story is here:
https://listman.redhat.com/archives/libguestfs/2022-February/thread.html#00111

I have a small patch which restores performance so it's now about the same as old virt-v2v within measurement error (sometimes a bit faster).

Virt-v2v 1.44.2:

$ time ./run virt-v2v -i disk /var/tmp/fedora-35.qcow2 -o rhv-upload -oc https://ovirt4410.home.annexia.org/ovirt-engine/api -op /tmp/ovirt-passwd -oo rhv-direct -os ovirt-data -on test11 -of raw
[   0.4] Opening the source -i disk /var/tmp/fedora-35.qcow2
[   0.5] Creating an overlay to protect the source from being modified
[   0.5] Opening the overlay
[   8.3] Inspecting the overlay
[  10.9] Checking for sufficient free disk space in the guest
[  10.9] Estimating space required on target for each disk
[  10.9] Converting Fedora Linux 35 (Thirty Five) to run on KVM
virt-v2v: warning: /files/boot/grub2/device.map/hd0 references unknown device "vda".  You may have to fix this entry manually after conversion.
virt-v2v: This guest has virtio drivers installed.
[  36.9] Mapping filesystem data to avoid copying unused and blank areas
[  37.7] Closing the overlay
[  38.4] Assigning disks to buses
[  38.4] Checking if the guest needs BIOS or UEFI to boot
[  38.4] Initializing the target -o rhv-upload -oc https://ovirt4410.home.annexia.org/ovirt-engine/api -op /tmp/ovirt-passwd -os ovirt-data
[  39.7] Copying disk 1/1 to qemu URI json:{ "file.driver": "nbd", "file.path": "/run/user/1000/v2vnbdkit.kQgYQb/nbdkit3.sock", "file.export": "/" } (raw)
    (100.00/100%)
[  72.8] Creating output metadata
[  73.3] Finishing off

real    1m13.644s
user    0m1.832s
sys     0m4.845s

Virt-v2v 1.45.97 + patch:

$ time ./run virt-v2v -i disk /var/tmp/fedora-35.qcow2 -o rhv-upload -oc https://ovirt4410.home.annexia.org/ovirt-engine/api -op /tmp/ovirt-passwd -oo rhv-direct -os ovirt-data -on test10 -of raw
[   0.0] Setting up the source: -i disk /var/tmp/fedora-35.qcow2
[   1.0] Opening the source
[   9.0] Inspecting the source
[  11.7] Checking for sufficient free disk space in the guest
[  11.7] Converting Fedora Linux 35 (Thirty Five) to run on KVM
virt-v2v: warning: /files/boot/grub2/device.map/hd0 references unknown device "vda".  You may have to fix this entry manually after conversion.
virt-v2v: This guest has virtio drivers installed.
[  38.1] Mapping filesystem data to avoid copying unused and blank areas
[  39.1] Closing the overlay
[  39.7] Assigning disks to buses
[  39.7] Checking if the guest needs BIOS or UEFI to boot
[  39.7] Setting up the destination: -o rhv-upload -oc https://ovirt4410.home.annexia.org/ovirt-engine/api -os ovirt-data
[  59.5] Copying disk 1/1
█ 100% [****************************************]
[  71.2] Creating output metadata
[  74.1] Finishing off

real    1m14.365s
user    0m8.183s
sys     0m13.769s

The patch is here:
https://listman.redhat.com/archives/libguestfs/2022-February/msg00128.html
This patch is now upstream in:
https://github.com/libguestfs/virt-v2v/commit/d69ba56b2f4bc642ce59bfc6bdd5c137480bf8c3

I have a further question for Nir and it may be possible to gain significantly more performance, so I'm going to leave this bug in POST for now:
https://listman.redhat.com/archives/libguestfs/2022-February/msg00130.html
I've created an upstream oVirt bug to discuss the slow disk creation problem: bug 2053103
Referring to comment 9, moving the bug back to ASSIGNED.
(In reply to Richard W.M. Jones from comment #3)
> So one thing that came out of the upstream analysis is that modular
> virt-v2v always flushes the data out to disk, whereas old virt-v2v
> did not do that.

Old virt-v2v used qemu-img convert, and it always flushes at the end.  It would be bad if it didn't.

Here is an example:

$ nbdkit -f -v file file=dst.raw 2>&1 | grep flush
nbdkit: file[1]: debug: file: can_flush
nbdkit: file.0: debug: file: flush

$ qemu-img convert -n -W fedora-35.raw nbd://localhost

flush was called.

When using old virt-v2v, we always had 2 flushes in the imageio logs.  One flush came from qemu-img, the other one from nbdkit (a bug).

> As a result, to get fair comparisons you *must* do a "sync" after
> virt-v2v, and include the time taken for sync in the total time.  For
> example:
>
> $ virt-v2v -i ... -o ...
> $ time sync
>
> real    0m48.795s
> user    0m0.000s
> sys     0m0.062s
>
> And add 48 seconds to the total time.

This should not be needed to compare times.
(In reply to mxie from comment #9)

I think we need logs to understand what's going on.  I'm not sure about the flows involving vddk, but for the local import to RHV it will help if you attach here the v2v log and the imageio logs from the host performing the import.
(In reply to Nir Soffer from comment #11)
> (In reply to Richard W.M. Jones from comment #3)
> > So one thing that came out of the upstream analysis is that modular
> > virt-v2v always flushes the data out to disk, whereas old virt-v2v
> > did not do that.
>
> Old virt-v2v used qemu-img convert, and it always flushes at the end.
> It would be bad if it didn't.
>
> Here is an example:
>
> $ nbdkit -f -v file file=dst.raw 2>&1 | grep flush
> nbdkit: file[1]: debug: file: can_flush
> nbdkit: file.0: debug: file: flush
>
> $ qemu-img convert -n -W fedora-35.raw nbd://localhost
>
> flush was called.

This isn't always true.  Old virt-v2v in modes such as -o local actually did something like this (ie. writing directly to the output file):

$ qemu-img convert -n -W overlay.qcow2 guest.img

If you strace the qemu-img command you'll see it doesn't fsync the output.

You can also try this with virt-v2v 1.44.2:

$ sync; ./run virt-v2v -i disk /var/tmp/fedora-35.qcow2 -o local -os /var/tmp/ -of raw; time sync

and you'll see the final sync command takes a few seconds (depending on the size of the input and speed of the disk).

I think qemu-img convert behaves differently if the output is an NBD server.  It appears if the server advertises flush then it will send it at the end.

> When using old virt-v2v, we always had 2 flushes in imageio logs.  One flush
> came from qemu-img, the other one from nbdkit (bug).
>
> > As a result, to get fair comparisons you *must* do a "sync" after
> > virt-v2v, and include the time taken for sync in the total time.  For
> > example:
> >
> > $ virt-v2v -i ... -o ...
> > $ time sync
> >
> > real    0m48.795s
> > user    0m0.000s
> > sys     0m0.062s
> >
> > And add 48 seconds to the total time.
>
> This should not be needed to compare times.

Maybe not for -o rhv-upload, but it definitely is for other outputs.

(In reply to Nir Soffer from comment #12)
> (In reply to mxie from comment #9)
> I think we need logs to understand what's going on.  I'm not sure about the
> flows involving vddk, but for the local import to RHV it will help if you
> attach here the v2v log and imageio logs from the host performing the import.

Yes I'd like to see the logs too.
(In reply to Richard W.M. Jones from comment #13)
> (In reply to Nir Soffer from comment #11)
> > Old virt-v2v used qemu-img convert, and it always flushes at the end.
> > It would be bad if it didn't.
>
> This isn't always true.  Old virt-v2v in modes such as -o local actually
> did something like this (ie. writing directly to the output file):
>
> $ qemu-img convert -n -W overlay.qcow2 guest.img
>
> If you strace the qemu-img command you'll see it doesn't fsync the output.

Yes, this is very bad:

$ strace -f -e fdatasync qemu-img convert fedora-35.raw dst.raw 2>&1 | grep fdatasync

$ strace -f -e fdatasync qemu-img convert -t unsafe fedora-35.raw dst.raw 2>&1 | grep fdatasync

Fortunately RHV always uses -t none so we always have a flush:

$ strace -f -e fdatasync qemu-img convert -t writeback fedora-35.raw dst.raw 2>&1 | grep fdatasync
[pid 30951] fdatasync(8)                = 0

$ strace -f -e fdatasync qemu-img convert -t none fedora-35.raw dst.raw 2>&1 | grep fdatasync
[pid 31187] fdatasync(8)                = 0

Kevin, avoiding flushes during the copy makes sense, but flushing at the end sounds like a better default to me for the use case of qemu-img convert.  Should we file a qemu-img bug for this?
(In reply to mxie from comment #14)
> (In reply to Nir Soffer from comment #12)
> > (In reply to mxie from comment #9)
> > I think we need logs to understand what's going on.
>
> All v2v conversions of comment 9 are executed on a standalone v2v server rather
> than a RHV node, and the v2v debug logs of comment 9 are in
> http://fileshare.englab.nay.redhat.com/pub/section3/libvirtmanual/mxie/pre-verify-bug2039255/

Are you using -oo rhv-direct=true when using -o rhv-upload?

If you don't, every request to the imageio server goes via the imageio proxy on the engine host.  This is typically 50% slower compared with sending directly to the host.

With small requests (modular virt-v2v uses 256k instead of 2m), this creates huge overhead that may explain the slowdown.

This is also *not* the recommended usage, and testing it in the context of performance testing is not a good use of our time.

When testing performance we should always use:

    virt-v2v -o rhv-upload -oo rhv-direct=true

And we should run virt-v2v on a RHV hypervisor node (not on the manager node).  Testing on another host which is not part of the cluster is nice to have.

Testing without -oo rhv-direct is nice to have for completeness, but I don't think we should spend time on this use case.
> Are you using -oo rhv-direct=true when using -o rhv-upload?
>
> If you don't, every request to the imageio server goes via the imageio proxy
> on the engine host.  This is typically 50% slower compared with sending
> directly to the host.
>
> With small requests (modular virt-v2v uses 256k instead of 2m), this creates
> huge overhead that may explain the slowdown.
>
> This is also *not* the recommended usage, and testing it in the context of
> performance testing is not a good use of our time.
>
> When testing performance we should always use:
>
>     virt-v2v -o rhv-upload -oo rhv-direct=true

Hi Richard, do you think it's necessary to add -oo rhv-direct=true to the rhv-upload conversions and retest step 1 of comment 9?

> And we should run virt-v2v on a RHV hypervisor node (not on the manager
> node).  Testing on another host which is not part of the cluster is nice to
> have.

The v2v version to be tested is a RHEL 9 build but the RHV 4.4 node is based on RHEL 8, so we can't run v2v on a RHV node.
My testing is using 4 separate machines:

  VMware ESXi ----> virt-v2v  ----> oVirt host   (& oVirt Engine separately)
                    Fedora 36       RHEL 8.5        RHEL 8.5

I'm _not_ using rhv-direct, because I thought that this only works when you run virt-v2v on the oVirt host.  Also for the same reason as mxie above, I cannot test recent virt-v2v on RHEL 8.

I'll also note: https://bugzilla.redhat.com/show_bug.cgi?id=2033096
I should say that when comparing virt-v2v 1.44 and virt-v2v 1.45.98, I'm running both on the same Fedora 36 / Rawhide. So even if using the rhv proxy is bad, I'm not sure it can be the cause of the problem.
Sorry, I accidentally cancelled the needinfo set for Kevin on comment 15.
(In reply to Richard W.M. Jones from comment #18)
> My testing is using 4 separate machines:
>
>   VMware ESXi ----> virt-v2v  ----> oVirt host   (& oVirt Engine separately)
>                     Fedora 36       RHEL 8.5        RHEL 8.5
>
> I'm _not_ using rhv-direct, because I thought that this only works when you
> run virt-v2v on the oVirt host.  Also for the same reason as mxie above,
> I cannot test recent virt-v2v on RHEL 8.

-oo rhv-direct=true works and should be the default in virt-v2v to avoid this confusion.
(In reply to Richard W.M. Jones from comment #19)
> I should say that when comparing virt-v2v 1.44 and virt-v2v 1.45.98,
> I'm running both on the same Fedora 36 / Rawhide.  So even if using
> the rhv proxy is bad, I'm not sure it can be the cause of the problem.

I think it amplifies the problem - for every NBD command, we:

1. send an HTTP request to the proxy
2. the proxy sends an HTTP request to the host
3. the host sends an NBD command to qemu-nbd
4. the host returns the reply to the proxy
5. the proxy returns the reply to us

I see a 1.8x speed up when using local virt-v2v, communicating with imageio via a unix socket, by increasing the request size from 256k to 4m.  When working with a remote server via a proxy, the overhead for each request is much larger.

Of course there may be an issue on the input side, which changed a lot, but comment 9 shows a clear issue on the output side:

> v2v_version   local_to_rhv_upload
>
> 1.45.98-1     Convert guest: 2m49s
>               Copying disk: 7m57s
>
> 1.45.3-3      Convert guest: 2m29s
>               Copying disk: 3m51s
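(A rough illustration of why the request size matters so much here, using an assumed per-request overhead rather than a measured one: copying 10 GiB in 256k requests means about 40,960 HTTP round trips, whereas 4m requests need only about 2,560.  If each extra hop through the proxy adds even 1 millisecond per request, that alone is roughly 41 seconds versus roughly 2.5 seconds of pure overhead, before any data is transferred.)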
(In reply to Nir Soffer from comment #21)
> -oo rhv-direct=true works and should be the default in virt-v2v to avoid this
> confusion.

I just realised that I _am_ in fact using -oo rhv-direct!  It was hiding in the very long (6 line) virt-v2v command I'm using.  If it works, then yes we should use it, and as you can see from bug 2033096 the aim is to make that the default.

Out of interest I just reran my tests with and without -oo rhv-direct to see if it could account for the difference.  Results below for me.  These are all local vddk -> -o rhv-upload, using VDDK 7.0.3 and ESXi 7:

                    -oo rhv-direct=true    -oo rhv-direct=false
virt-v2v 1.45.98    5m15                   5m19
virt-v2v 1.45.3     6m11                   5m51
virt-v2v 1.44.2     5m11                   4m59

At the moment I cannot reliably reproduce this bug.
(In reply to Richard W.M. Jones from comment #23)
> Out of interest I just reran my tests with and without -oo rhv-direct
> to see if it could account for the difference.  Results below for me.
> These are all local vddk -> -o rhv-upload, using VDDK 7.0.3 and ESXi 7:
>
>                     -oo rhv-direct=true    -oo rhv-direct=false
> virt-v2v 1.45.98    5m15                   5m19
> virt-v2v 1.45.3     6m11                   5m51
> virt-v2v 1.44.2     5m11                   4m59

The timings look identical - it looks like -oo rhv-direct is broken.  We either always use direct mode or never.

Do you have imageio logs for these tests?  We see the client address in the CLOSE log:

2022-02-14 19:28:00,215 INFO    (Thread-125) [http] CLOSE connection=125 client=::ffff:192.168.122.10 ...
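A quick way to pull those client fields out on the host running imageio (sketch only - the daemon log path below is the usual default location and may differ on your deployment):

$ grep -o 'client=[^ ]*' /var/log/ovirt-imageio/daemon.log

With -oo rhv-direct=true the non-local entries should show the IP of the machine running virt-v2v; via the proxy they show the engine's IP instead.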
With -oo rhv-direct=true:

client=local
client=local
client=::ffff:192.168.0.139
client=local
client=local
client=::ffff:192.168.0.139
client=::ffff:192.168.0.139
client=::ffff:192.168.0.139
client=::ffff:192.168.0.139
client=local

(192.168.0.139 == IP address of machine running virt-v2v)

With -oo rhv-direct=false:

client=local
client=local
client=::ffff:192.168.0.210
client=local
client=local
client=local
client=::ffff:192.168.0.210
client=::ffff:192.168.0.210
client=::ffff:192.168.0.210
client=::ffff:192.168.0.210

(192.168.0.210 == IP address of oVirt engine)
(In reply to Richard W.M. Jones from comment #26)

Your test looks correct.  So this shows that the bottleneck is the input side - if you send data slowly enough, it does not matter if you use the proxy or not.

Because nbdcopy uses async reads and writes, the reads are never blocked by slow writes.  In the imageio client we have 4 threads, but every one blocks either on read or on write, so a slow write also slows down reading from the source.

This is what I see on my test system, using oVirt 4.5, uploading a local file (fedora 35 + 3g of random data) from my laptop to an oVirt system running as VMs on the laptop.  I tested these combinations:

#   combination                                           seconds
--------------------------------------------------------------
1   Simulating virt-v2v[1] request size 256k              16.98
2   Simulating virt-v2v[1] request size 4m                 9.33
3   Simulating virt-v2v[1] via proxy, request size 256k   30.60
4   Simulating virt-v2v[1] via proxy, request size 4m     15.21
5   Normal upload[2] read size 4m                          8.49
6   Normal upload[2] read size 256k                        7.77
7   Normal upload[2] via proxy, read size 4m              14.68
8   Normal upload[2] via proxy, read size 256k            12.63

[1] detecting zeroes and sending small (256k) http requests.
    Sends one HTTP request per NBD read.
[2] The normal upload uses the read size only for reading from qemu-nbd,
    sending one HTTP request per extent (splitting large extents into 128m chunks).

You can see that there is a huge difference between sending one request per NBD read and one request per extent, and that using the proxy is much slower.  Also the advantage of a larger request size is very clear.
(In reply to Nir Soffer from comment #15)
> Kevin, avoiding flushes during the copy makes sense, but flushing at the end
> sounds like a better default to me for the use case of qemu-img convert.
> Should we file a qemu-img bug for this?

I'm not sure about the right behaviour there; it completely depends on what you're going to do with the copy next.  As this is with 'cache=unsafe', I think it makes sense to optimise for the shortest runtime of qemu-img and avoid the flush - it might be just a temporary file that you continue processing from the page cache, and flushing would only be unnecessary overhead.

If you do want to have the image safe on disk, you can always call 'sync' on the file next.
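Both options look like this in practice (illustrative commands only; the file names are placeholders, and coreutils sync needs to be recent enough to accept a file argument):

$ qemu-img convert -t none src.qcow2 dst.raw              # qemu flushes the target itself
$ qemu-img convert src.qcow2 dst.raw && sync dst.raw      # default (unsafe) cache, sync the file afterwards

The second form keeps the fast copy and only pays for the flush if the caller actually needs the data on stable storage.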
Verify the bug with the builds below:
virt-v2v-1.45.99-1.el9.x86_64
libguestfs-1.46.1-2.el9.x86_64
guestfs-tools-1.46.1-6.el9.x86_64
libvirt-libs-8.0.0-4.el9.x86_64
qemu-img-6.2.0-8.el9.x86_64
nbdkit-1.28.5-1.el9.x86_64
libnbd-1.10.5-1.el9.x86_64
virtio-win-1.9.19-5.el9_b.noarch

Steps:
1. Convert three different guests from ESXi7.0 to RHV (rhv-upload) via different vddk versions at the same time to compare the performance of virt-v2v-1.45.99-1 and virt-v2v-1.45.3-3; additionally, add the -oo rhv-direct/rhv-proxy option to the command line when converting guests with the different v2v versions

# virt-v2v -ic vpx://root@center_ip/data/esxi_host/?no_verify=1 -it vddk -io vddk-libdir=/home/vddk7.0.2 -io vddk-thumbprint=xx:xx:xx:... -ip /home/passwd -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api -op /home/rhvpasswd -os nfs_data -b ovirtmgmt esx7.0-win2022-x86_64 (-oo rhv-direct/rhv-proxy=true)

# virt-v2v -ic vpx://root@center_ip/data/esxi_host/?no_verify=1 -it vddk -io vddk-libdir=/home/vddk6.7 -io vddk-thumbprint=xx:xx:xx:... -ip /home/passwd -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api -op /home/rhvpasswd -os nfs_data -b ovirtmgmt esx7.0-rhel8.5-x86_64 (-oo rhv-direct/rhv-proxy=true)

# virt-v2v -ic vpx://root@center_ip/data/esxi_host/?no_verify=1 -it vddk -io vddk-libdir=/home/vddk6.5 -io vddk-thumbprint=xx:xx:xx:... -ip /home/passwd -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api -op /home/rhvpasswd -os nfs_data -b ovirtmgmt esx7.0-win11-x86_64 (-oo rhv-direct/rhv-proxy=true)

-------------------------------------------------------------------------------------------------
v2v_version                  ESXi7.0+vddk7.0.2      ESXi7.0+vddk6.7        ESXi7.0+vddk6.5

1.45.99-1                    Convert guest: 57s     Convert guest: 2m25s   Convert guest: 46s
                             Copying disk: 9m44s    Copying disk: 10m4s    Copying disk: 15m57s

1.45.3-3(rhv-direct=true)    Convert guest: 55s     Convert guest: 2m19s   Convert guest: 1m
                             Copying disk: 9m30s    Copying disk: 9m25s    Copying disk: 12m48s
--------------------------------------------------------------------------------------------------
v2v_version                  ESXi7.0+vddk7.0.2      ESXi7.0+vddk6.7        ESXi7.0+vddk6.5

1.45.99-1(rhv-proxy=true)    Convert guest: 55s     Convert guest: 2m26s   Convert guest: 55s
                             Copying disk: 10m33s   Copying disk: 10m33s   Copying disk: 16m7s

1.45.3-3                     Convert guest: 59s     Convert guest: 2m14s   Convert guest: 54s
                             Copying disk: 10m26s   Copying disk: 10m17s   Copying disk: 13m40s
--------------------------------------------------------------------------------------------------

Hi Richard,

According to the current test results, the performance of virt-v2v-1.45.99-1 has greatly improved and is now basically similar to that of virt-v2v-1.45.3-3.  However, virt-v2v-1.45.99-1 is still a little slower than the old virt-v2v-1.45.3-3 when the vddk version is 7.0.2 or 6.7, and the gap is obvious when the vddk version is 6.5.  Do you think the performance difference is acceptable?  If yes, I will continue to test the other scenarios to compare their performance.
Thanks for the testing.  Some broader points first:

- I'm only really concerned about VDDK 7 & ESXi 7.  By the time people are really using this in RHEL 9 (probably 9.1) they will have upgraded to both.

- We only need to do performance testing with -oo rhv-direct (which is now the default).  -oo rhv-proxy is designed for situations where you don't have direct network access to the storage, and those are always going to be slow because it has to go through a proxy.

- By the way, -oo rhv-direct[=true] is the default so it's no longer needed.  -oo rhv-proxy[=true] is the opposite (use a proxy, slow).

I think the performance numbers look fine now.  It's still a few % slower, but modular virt-v2v is more capable for a few reasons:

- modular virt-v2v will allow an external program to do the copying, which means that whole system performance will be better (eventually, once we implement this fully)

- nbdcopy doesn't trash the page cache when copying to local, again a benefit to whole system performance that's not visible for single copies

> the performance gap is obvious when vddk version is 6.5

VDDK 6.5 didn't support extents, so it'll end up copying much more data.  If customers ever hit this case we'll tell them to upgrade to the latest VDDK, which is usually a simple thing to do.
As VDDK 6.5 didn't support extents, regardless of its impact on virt-v2v-1.45.99-1 performance, continue to verify the bug with the builds below:
virt-v2v-1.45.99-1.el9.x86_64
libguestfs-1.46.1-2.el9.x86_64
guestfs-tools-1.46.1-6.el9.x86_64
libvirt-libs-8.0.0-4.el9.x86_64
qemu-img-6.2.0-8.el9.x86_64
nbdkit-1.28.5-1.el9.x86_64
libnbd-1.10.5-1.el9.x86_64
virtio-win-1.9.19-5.el9_b.noarch

Steps:
1. Convert two different guests from ESXi6.7 to RHV (rhv-upload) via vddk7.0.2 and vddk6.7 at the same time to compare the performance of virt-v2v-1.45.99-1 and virt-v2v-1.45.3-3

# virt-v2v -ic vpx://root@center_ip/data/esxi_host/?no_verify=1 -it vddk -io vddk-libdir=/home/vddk7.0.2 -io vddk-thumbprint=xx:xx:xx:... -ip /home/passwd -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api -op /home/rhvpasswd -os nfs_data -b ovirtmgmt esx7.0-win2022-x86_64

# virt-v2v -ic vpx://root@center_ip/data/esxi_host/?no_verify=1 -it vddk -io vddk-libdir=/home/vddk6.7 -io vddk-thumbprint=xx:xx:xx:... -ip /home/passwd -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api -op /home/rhvpasswd -os nfs_data -b ovirtmgmt esx7.0-rhel8.5-x86_64

-----------------------------------------------------------------------------
v2v_version                  ESXi6.7+vddk7.0.2      ESXi6.7+vddk6.7

1.45.99-1                    Convert guest: 1m1s    Convert guest: 2m31s
                             Copying disk: 9m44s    Copying disk: 10m19s

1.45.3-3(rhv-direct=true)    Convert guest: 1m2s    Convert guest: 2m23s
                             Copying disk: 7m34s    Copying disk: 7m36s
-----------------------------------------------------------------------------
v2v_version                  ESXi6.7+vddk7.0.2      ESXi6.7+vddk6.7

1.45.99-1(rhv-proxy=true)    Convert guest: 51s     Convert guest: 2m26s
                             Copying disk: 10m4s    Copying disk: 10m41s

1.45.3-3                     Convert guest: 55s     Convert guest: 2m14s
                             Copying disk: 7m40s    Copying disk: 7m51s
-----------------------------------------------------------------------------

2. Convert two different guests from ESXi6.5 to RHV (rhv-upload) via vddk7.0.2 and vddk6.7 at the same time to compare the performance of virt-v2v-1.45.99-1 and virt-v2v-1.45.3-3

# virt-v2v -ic vpx://root@center_ip/data/esxi_host/?no_verify=1 -it vddk -io vddk-libdir=/home/vddk7.0.2 -io vddk-thumbprint=xx:xx:xx:... -ip /home/passwd -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api -op /home/rhvpasswd -os nfs_data -b ovirtmgmt esx7.0-win2022-x86_64

# virt-v2v -ic vpx://root@center_ip/data/esxi_host/?no_verify=1 -it vddk -io vddk-libdir=/home/vddk6.7 -io vddk-thumbprint=xx:xx:xx:... -ip /home/passwd -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api -op /home/rhvpasswd -os nfs_data -b ovirtmgmt esx7.0-rhel8.5-x86_64

-----------------------------------------------------------------------------
v2v_version                  ESXi6.5+vddk7.0.2      ESXi6.5+vddk6.7

1.45.99-1                    Convert guest: 55s     Convert guest: 4m2s
                             Copying disk: 12m37s   Copying disk: 33m39s

1.45.3-3(rhv-direct=true)    Convert guest: 3m35s   Convert guest: 5m17s
                             Copying disk: 10m12s   Copying disk: 21m26s
-----------------------------------------------------------------------------
v2v_version                  ESXi6.5+vddk7.0.2      ESXi6.5+vddk6.7

1.45.99-1(rhv-proxy=true)    Convert guest: 53s     Convert guest: 2m28s
                             Copying disk: 12m58s   Copying disk: 34m1s

1.45.3-3                     Convert guest: 3m41s   Convert guest: 4m50s
                             Copying disk: 10m34s   Copying disk: 21m20s
-----------------------------------------------------------------------------

3. Convert guests in the ways below at the same time to compare the performance of virt-v2v-1.45.99-1 and virt-v2v-1.45.3-3

3.1
# virt-v2v -ic vpx://root.198.169/data/10.73.199.217/?no_verify=1 -ip /home/passwd -o local -os /home esx7.0-win2019-x86_64

# virt-v2v -i vmx -it ssh ssh://root.75.219/vmfs/volumes/esx6.7-matrix/esx6.7-rhel8.5-x86_64/esx6.7-rhel8.5-x86_64.vmx -ip /home/passwd -o rhv -os 10.73.195.48:/home/nfs_export

-----------------------------------------------------------------------------
v2v_version   VMware_to_local         vmx+ssh_to_rhv

1.45.99-1     Convert guest: 22m31s   Convert guest: 2m3s
              Copying disk: 74m49s    Copying disk: 6m38s

1.45.3-3      Convert guest: 10m25s   Convert guest: 2m
              Copying disk: 42m41s    Copying disk: 5m15s
-----------------------------------------------------------------------------

3.2 The performance gap between virt-v2v-1.45.99-1 and virt-v2v-1.45.3-3 is too large in step 3.1, so convert another guest from VMware to local without vddk to compare their performance again

# virt-v2v -ic vpx://root.198.169/data/10.73.199.217/?no_verify=1 -ip /home/passwd -o local -os /home esx7.0-sles15sp2-x86_64

-----------------------------------------------------------------------------
v2v_version   VMware_to_local

1.45.99-1     Convert guest: 13m38s
              Copying disk: 13m17s

1.45.3-3      Convert guest: 9m55s
              Copying disk: 8m13s
-----------------------------------------------------------------------------

4. Convert a guest from an ova file to openstack to compare the performance of virt-v2v-1.45.99-1 and virt-v2v-1.45.3-3

# virt-v2v -i ova /media/tools/ova-images/esx7.0-win2019-x86_64 -o openstack -oo server-id=v2v-appliance

-----------------------------------------------------------------------------
v2v_version   ova_to_openstack

1.45.99-1     Convert guest: 3m59s
              Copying disk: 6m53s

1.45.3-3      Convert guest: 3m51s
              Copying disk: 7m5s
-----------------------------------------------------------------------------

Hi Richard,

(1) Please check the result of step 1: the performance of virt-v2v-1.45.99-1 is a little worse than that of virt-v2v-1.45.3-3 when converting guests from ESXi6.7 to RHV (rhv-upload) with vddk7.0.2 and vddk6.7; the performance gap is roughly 2 to 3 minutes.

(2) Please check the result of step 2: when converting guests from ESXi6.5 to RHV (rhv-upload) with vddk, the performance of virt-v2v-1.45.99-1 is basically similar to that of virt-v2v-1.45.3-3 if the vddk version is 7.0.2, but the performance of virt-v2v-1.45.99-1 is much worse than that of virt-v2v-1.45.3-3 if the vddk version is 6.7.
(3) Please check the result of step 3: the performance of virt-v2v-1.45.99-1 is much worse than that of virt-v2v-1.45.3-3 when converting guests from VMware to local without vddk.
Thanks for doing this comprehensive testing.  I want to chop out some tests that I don't think are useful for performance testing:

- Anything that uses rhv-proxy (!rhv-direct) is always going to be slow because all data goes through a proxy, so there's no point testing it.  This mode is only needed for corner cases where you don't have direct access to the ovirt hosts.

- Also anything with VDDK < 7.0 is not worth testing because even with ancient RHV or VMware it's easy to upgrade VDDK.

- Also any test that doesn't use VDDK (ie. is using ssh or https) as these are known to be much slower and we only provide them for people who don't or can't use VDDK for licensing reasons.

(NB I'm just talking about *performance* testing.  We still need to test that all these different modes work.)

That leaves these performance tests:

(from Step 1)

v2v_version                  ESXi6.7+vddk7.0.2

1.45.99-1                    Convert guest: 1m1s
                             Copying disk: 9m44s

1.45.3-3(rhv-direct=true)    Convert guest: 1m2s
                             Copying disk: 7m34s

- Modular virt-v2v is about 25% slower, and there's no obvious reason looking at the logs.  I suspect that tuning the request size might help.  Let's look at this in RHEL 9.1.

(from Step 2)

v2v_version                  ESXi6.5+vddk7.0.2

1.45.99-1                    Convert guest: 55s
                             Copying disk: 12m37s

1.45.3-3(rhv-direct=true)    Convert guest: 3m35s
                             Copying disk: 10m12s

- Modular virt-v2v is marginally faster overall.  What's odd about this is actually that old virt-v2v took so long to do the conversion.  Old virt-v2v spends ages doing QueryAllocatedBlocks requests during conversion (modular virt-v2v does not).  I'm not sure I understand what's going on there.

(from Step 3.1)

- Tests use curl or ssh, not VDDK.

(from Step 3.2)

- Tests use curl, not VDDK.

(from Step 4)

v2v_version   ova_to_openstack

1.45.99-1     Convert guest: 3m59s
              Copying disk: 6m53s

1.45.3-3      Convert guest: 3m51s
              Copying disk: 7m5s

- There's not a very large difference here, but -i ova in modular virt-v2v is known to be a bit slower.  We can work on making this better for RHEL 9.1.

My conclusion is I'm not very worried :-)  Nothing is broken.  The performance differences are small.  We understand well which input and output modes are fast (VDDK) and which are slow (curl & ssh).  We'll work on performance improvements in RHEL 9.1.
> (from Step 2)
> - Modular virt-v2v is marginally faster overall.  What's odd about this is
>   actually that old virt-v2v took so long to do the conversion.  Old virt-v2v
>   spends ages doing QueryAllocatedBlocks requests during conversion (modular
>   virt-v2v does not).  I'm not sure I understand what's going on there.

Oh I think I see.  This is ESX 6.5 which had weird problems with extent mapping.

Note that ESX 6.5 will be out of support in Oct 2022, which is before RHEL 9.1 is released, so I doubt many customers will be using this combination.
Thanks rjones, I think the bug can be moved to VERIFIED status according to comment 36 ~ comment 41.
(In reply to mxie from comment #38)
> 3.1
>
> # virt-v2v -ic vpx://root.198.169/data/10.73.199.217/?no_verify=1 -ip
> /home/passwd -o local -os /home esx7.0-win2019-x86_64
>
> # virt-v2v -i vmx -it ssh
> ssh://root.75.219/vmfs/volumes/esx6.7-matrix/esx6.7-rhel8.5-x86_64/esx6.7-rhel8.5-x86_64.vmx
> -ip /home/passwd -o rhv -os 10.73.195.48:/home/nfs_export
>
> -----------------------------------------------------------------------------
> v2v_version   VMware_to_local         vmx+ssh_to_rhv
>
> 1.45.99-1     Convert guest: 22m31s   Convert guest: 2m3s
>               Copying disk: 74m49s    Copying disk: 6m38s
>
> 1.45.3-3      Convert guest: 10m25s   Convert guest: 2m
>               Copying disk: 42m41s    Copying disk: 5m15s
> -----------------------------------------------------------------------------

I was looking into the cases above (curl -> null|local, ssh -> null|local).  I said above that we don't really care about performance testing here, and that's true, but I wanted to see if we are missing any easy wins.  I set up a simple test without VMware and compared 1.45.3 and 1.45.99 (methodology at end).  However I was not able to see a case where modular virt-v2v is slower beyond measurement errors, and in fact it's a bit faster in some cases:

version   curl -> null   ssh -> null   curl -> local   ssh -> local
1.45.3    190.0          221.5         186.7           220.6
1.45.99   187.5          191.9         187.4           191.8

Methodology:

For both cases, I put a Fedora 20 virt-builder image on to a local web server.

For curl, create /var/tmp/input.xml from the example in the virt-v2v manual, modifying the disk section:

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw'/>
      <source protocol='http' name='/fedora-20.img'>
        <host name='webserver' port='80'/>
      </source>
      <target dev='hda' bus='ide'/>
    </disk>

For ssh, I modified an existing .vmx file to point to the Fedora image and hosted that on the same webserver.

$ virt-v2v -i libvirtxml /var/tmp/input.xml [ -o null | -o local -os /var/tmp ]
$ virt-v2v -i vmx -it ssh ssh://webserver/public_html/fedora-20.vmx [ -o null | -o local -os /var/tmp ]
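For anyone reproducing the curl methodology, a complete minimal input.xml might look roughly like this (a sketch only: the domain name, memory size, and web server host are placeholders, and the real example in the virt-v2v(1) manual should be preferred):

$ cat > /var/tmp/input.xml <<'EOF'
<domain type='kvm'>
  <name>fedora-20</name>
  <memory unit='MiB'>2048</memory>
  <vcpu>2</vcpu>
  <os>
    <type>hvm</type>
    <boot dev='hd'/>
  </os>
  <devices>
    <!-- disk section as above: fetch the image over HTTP from the web server -->
    <disk type='network' device='disk'>
      <driver name='qemu' type='raw'/>
      <source protocol='http' name='/fedora-20.img'>
        <host name='webserver' port='80'/>
      </source>
      <target dev='hda' bus='ide'/>
    </disk>
  </devices>
</domain>
EOF

The file is then passed directly to "virt-v2v -i libvirtxml /var/tmp/input.xml" as in the commands above.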
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (new packages: virt-v2v), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2022:2566