Bug 2032324

Summary: nbdcopy command failed when using v2v to convert guest from ESXi6.5 via vddk > 6.5 + rhv-upload to rhv4.4
Product: Red Hat Enterprise Linux 9
Reporter: mxie <mxie>
Component: virt-v2v
Assignee: Richard W.M. Jones <rjones>
Status: CLOSED ERRATA
QA Contact: Vera <vwu>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 9.0
CC: chhu, hongzliu, juzhou, kkiwi, lersek, mzhan, nsoffer, rjones, tyan, tzheng, vwu, xiaodwan
Target Milestone: rc
Keywords: Triaged
Target Release: ---
Hardware: x86_64
OS: Unspecified
Whiteboard:
Fixed In Version: virt-v2v-1.45.99-1.el9
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-05-17 13:41:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2033096    
Bug Blocks:    

Description mxie@redhat.com 2021-12-14 11:41:10 UTC
Description of problem:
nbdcopy command failed when using v2v to convert guest from ESXi6.5 via vddk > 6.5 + rhv-upload to rhv4.4

Version-Release number of selected component (if applicable):
virt-v2v-1.45.94-3.el9.x86_64
libguestfs-1.46.1-1.el9.x86_64
guestfs-tools-1.46.1-6.el9.x86_64
nbdkit-1.28.3-2.el9.x86_64
libvirt-libs-7.10.0-1.el9.x86_64
qemu-img-6.1.0-8.el9.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Convert a guest from ESXi6.5 to rhv via vddk6.7 with virt-v2v. There are lots of nbdkit errors during the conversion, and although the conversion finishes, there is no guest in rhv afterwards.
# virt-v2v -ic vpx://root.73.141/data/10.73.196.89/?no_verify=1 -it vddk -io vddk-libdir=/home/vddk6.7 -io vddk-thumbprint=1F:97:34:5F:B6:C2:BA:66:46:CB:1A:71:76:7D:6B:50:1E:03:00:EA   -ip /home/passwd  -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api  -op /home/rhvpasswd  -os nfs_data -b ovirtmgmt esx6.5-win2019-x86_64
[  16.2] Opening the source
[  24.5] Inspecting the source
[  31.1] Checking for sufficient free disk space in the guest
[  31.1] Converting Windows Server 2019 Standard to run on KVM
virt-v2v: This guest has virtio drivers installed.
[  41.0] Mapping filesystem data to avoid copying unused and blank areas
[  42.0] Closing the overlay
[  42.3] Assigning disks to buses
[  42.3] Checking if the guest needs BIOS or UEFI to boot
[  42.3] Copying disk 1/1
nbdkit: python.5: error: /tmp/v2v.DIBlsF/rhv-upload-plugin.py: pwrite: error: Traceback (most recent call last):
   File "/tmp/v2v.DIBlsF/rhv-upload-plugin.py", line 213, in pwrite
    r = http.getresponse()
   File "/usr/lib64/python3.9/http/client.py", line 1377, in getresponse
    response.begin()
   File "/usr/lib64/python3.9/http/client.py", line 320, in begin
    version, status, reason = self._read_status()
   File "/usr/lib64/python3.9/http/client.py", line 289, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
 http.client.RemoteDisconnected: Remote end closed connection without response

nbdkit: python.6: error: /tmp/v2v.DIBlsF/rhv-upload-plugin.py: zero: error: Traceback (most recent call last):
   File "/tmp/v2v.DIBlsF/rhv-upload-plugin.py", line 242, in zero
    http.request("PATCH", url.path, body=buf, headers=headers)
   File "/usr/lib64/python3.9/http/client.py", line 1285, in request
    self._send_request(method, url, body, headers, encode_chunked)
   File "/usr/lib64/python3.9/http/client.py", line 1296, in _send_request
    self.putrequest(method, url, **skips)
   File "/usr/lib64/python3.9/http/client.py", line 1122, in putrequest
    raise CannotSendRequest(self.__state)
 http.client.CannotSendRequest: Request-sent
......
█ 100% [****************************************]
nbdkit: python.3: error: /tmp/v2v.DIBlsF/rhv-upload-plugin.py: flush: error: Traceback (most recent call last):
   File "/tmp/v2v.DIBlsF/rhv-upload-plugin.py", line 291, in flush
    http.request("PATCH", url.path, body=buf, headers=headers)
   File "/usr/lib64/python3.9/http/client.py", line 1285, in request
    self._send_request(method, url, body, headers, encode_chunked)
   File "/usr/lib64/python3.9/http/client.py", line 1296, in _send_request
    self.putrequest(method, url, **skips)
   File "/usr/lib64/python3.9/http/client.py", line 1122, in putrequest
    raise CannotSendRequest(self.__state)
 http.client.CannotSendRequest: Request-sent

nbd+unix:///?socket=/tmp/v2v.iDn7nb/out0: nbd_flush: flush: command failed: Input/output error
virt-v2v: error: nbdcopy command failed, see earlier error messages

If reporting bugs, run virt-v2v with debugging enabled and include the 
complete output:

  virt-v2v -v -x [...]


2. Convert a guest from ESXi6.5 to rhv via vddk7.0.2 with virt-v2v; this has the same problem as step 1.
# virt-v2v -ic vpx://root.73.141/data/10.73.196.89/?no_verify=1 -it vddk -io vddk-libdir=/home/vddk7.0.2 -io vddk-thumbprint=1F:97:34:5F:B6:C2:BA:66:46:CB:1A:71:76:7D:6B:50:1E:03:00:EA   -ip /home/passwd  -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api  -op /home/rhvpasswd  -os nfs_data -b ovirtmgmt esx6.5-rhel8.5-x86_64
[  16.4] Opening the source
[  24.0] Inspecting the source
[  32.8] Checking for sufficient free disk space in the guest
[  32.8] Converting Red Hat Enterprise Linux 8.5 (Ootpa) to run on KVM
virt-v2v: This guest has virtio drivers installed.
[ 127.0] Mapping filesystem data to avoid copying unused and blank areas
[ 127.9] Closing the overlay
[ 128.2] Assigning disks to buses
[ 128.2] Checking if the guest needs BIOS or UEFI to boot
[ 128.2] Copying disk 1/1
nbdkit: python.0: error: /tmp/v2v.gcqkPb/rhv-upload-plugin.py: pwrite: error: Traceback (most recent call last):
   File "/tmp/v2v.gcqkPb/rhv-upload-plugin.py", line 213, in pwrite
    r = http.getresponse()
   File "/usr/lib64/python3.9/http/client.py", line 1377, in getresponse
    response.begin()
   File "/usr/lib64/python3.9/http/client.py", line 320, in begin
    version, status, reason = self._read_status()
   File "/usr/lib64/python3.9/http/client.py", line 289, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
 http.client.RemoteDisconnected: Remote end closed connection without response

nbdkit: python.3: error: /tmp/v2v.gcqkPb/rhv-upload-plugin.py: pwrite: error: Traceback (most recent call last):
   File "/tmp/v2v.gcqkPb/rhv-upload-plugin.py", line 213, in pwrite
    r = http.getresponse()
   File "/usr/lib64/python3.9/http/client.py", line 1377, in getresponse
    response.begin()
   File "/usr/lib64/python3.9/http/client.py", line 320, in begin
    version, status, reason = self._read_status()
   File "/usr/lib64/python3.9/http/client.py", line 289, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
 http.client.RemoteDisconnected: Remote end closed connection without response

nbdkit: python.7: error: /tmp/v2v.gcqkPb/rhv-upload-plugin.py: pwrite: error: Traceback (most recent call last):
   File "/tmp/v2v.gcqkPb/rhv-upload-plugin.py", line 213, in pwrite
    r = http.getresponse()
   File "/usr/lib64/python3.9/http/client.py", line 1377, in getresponse
    response.begin()
   File "/usr/lib64/python3.9/http/client.py", line 320, in begin
    version, status, reason = self._read_status()
   File "/usr/lib64/python3.9/http/client.py", line 289, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
 http.client.RemoteDisconnected: Remote end closed connection without response
......


Actual results:
nbdcopy command failed as described above; no guest appears in rhv after the conversion

Expected results:
Virt-v2v can convert guests from ESXi6.5 to rhv via vddk > 6.5 + rhv-upload successfully

Additional info:
1. Virt-v2v can convert guests from ESXi6.5 to rhv via vddk6.5 + rhv-upload successfully

# virt-v2v -ic vpx://root.73.141/data/10.73.196.89/?no_verify=1 -it vddk -io vddk-libdir=/home/vddk6.5 -io vddk-thumbprint=1F:97:34:5F:B6:C2:BA:66:46:CB:1A:71:76:7D:6B:50:1E:03:00:EA   -ip /home/passwd  -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api  -op /home/rhvpasswd  -os nfs_data -b ovirtmgmt esx6.5-rhel8.5-x86_64
[  25.3] Opening the source
[  32.9] Inspecting the source
[  42.1] Checking for sufficient free disk space in the guest
[  42.1] Converting Red Hat Enterprise Linux 8.5 (Ootpa) to run on KVM
virt-v2v: This guest has virtio drivers installed.
[ 139.1] Mapping filesystem data to avoid copying unused and blank areas
[ 140.0] Closing the overlay
[ 140.3] Assigning disks to buses
[ 140.3] Checking if the guest needs BIOS or UEFI to boot
[ 140.3] Copying disk 1/1
█ 100% [****************************************]
[ 515.3] Creating output metadata
[ 577.5] Finishing off

2. A similar problem was fixed in bug 1939375

Comment 2 Richard W.M. Jones 2021-12-14 11:58:47 UTC
The failure is actually on the RHV imageio server:

    raise RemoteDisconnected("Remote end closed connection without"
 http.client.RemoteDisconnected: Remote end closed connection without response

However it could happen because in new virt-v2v we are actually
flushing out the data at the end (which old virt-v2v did not do).  I can
see from the log that flush is supported by this imageio:

  Dec 13 22:00:09 nbdkit: debug: imageio features: flush=True zero=True unix_socket=None max_readers=8 max_writers=8

and therefore we would be using the PATCH operation with 'flush' as
true which we probably never did before now:

  https://github.com/libguestfs/virt-v2v/blob/1673fc4b640f754fbdc942a6636fd0e95b996144/output/rhv-upload-plugin.py#L235
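
For reference, a minimal sketch of what that flush PATCH looks like at the
HTTP level (the {"op": "flush"} body follows the imageio images API as used
by rhv-upload-plugin.py; the connection and ticket path here are
placeholders, not the plugin's actual code):

    import json
    from http.client import HTTPSConnection

    def send_flush(http: HTTPSConnection, path: str) -> None:
        # Build the imageio "flush" request body.
        buf = json.dumps({"op": "flush"}).encode()
        headers = {"Content-Type": "application/json",
                   "Content-Length": str(len(buf))}
        http.request("PATCH", path, body=buf, headers=headers)
        r = http.getresponse()
        r.read()  # always consume the body before reusing the connection
        if r.status != 200:
            raise RuntimeError("flush failed: %d %s" % (r.status, r.reason))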

Nir: Any ideas about this one?

Ming: Do we have any logs from the server side (imageio I think)
from when the error occurs?  Nir may be able to help locate the
right file.

Comment 3 mxie@redhat.com 2021-12-14 13:55:43 UTC
Convert guest 'esx6.5-win2019-x86_64' to reproduce the bug again, please check vdsm.log and daemon.log

Comment 6 Nir Soffer 2021-12-14 14:45:17 UTC
(In reply to mxie from comment #0)

> nbdkit: python.5: error: /tmp/v2v.DIBlsF/rhv-upload-plugin.py: pwrite:
> error: Traceback (most recent call last):
>    File "/tmp/v2v.DIBlsF/rhv-upload-plugin.py", line 213, in pwrite
>     r = http.getresponse()
>    File "/usr/lib64/python3.9/http/client.py", line 1377, in getresponse
>     response.begin()
>    File "/usr/lib64/python3.9/http/client.py", line 320, in begin
>     version, status, reason = self._read_status()
>    File "/usr/lib64/python3.9/http/client.py", line 289, in _read_status
>     raise RemoteDisconnected("Remote end closed connection without"
>  http.client.RemoteDisconnected: Remote end closed connection without
> response

Looks like imageio closed the connection. This may happen if you send an
invalid request and we have already sent a response, for example in pread(),
but I don't see how this can happen in pwrite().

We need imageio logs for this transfer, and the transfer id.

> nbdkit: python.6: error: /tmp/v2v.DIBlsF/rhv-upload-plugin.py: zero: error:
> Traceback (most recent call last):
>    File "/tmp/v2v.DIBlsF/rhv-upload-plugin.py", line 242, in zero
>     http.request("PATCH", url.path, body=buf, headers=headers)
>    File "/usr/lib64/python3.9/http/client.py", line 1285, in request
>     self._send_request(method, url, body, headers, encode_chunked)
>    File "/usr/lib64/python3.9/http/client.py", line 1296, in _send_request
>     self.putrequest(method, url, **skips)
>    File "/usr/lib64/python3.9/http/client.py", line 1122, in putrequest
>     raise CannotSendRequest(self.__state)
>  http.client.CannotSendRequest: Request-sent

This means you tried to send a request without consuming the previous
request's response.
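
For illustration, this failure mode is easy to provoke with plain http.client
(the host here is just a placeholder; nothing imageio-specific is involved):

    from http.client import HTTPConnection, CannotSendRequest

    conn = HTTPConnection("example.com")      # placeholder host
    conn.request("GET", "/")
    # getresponse() was *not* called, so the previous response is unconsumed.
    try:
        conn.request("GET", "/")              # second request on the same connection
    except CannotSendRequest as e:
        print("cannot send:", e)              # state is 'Request-sent'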

> Additional info:
> 1. Virt-v2v can convert guests from ESXi6.5 to rhv via vddk6.5 + rhv-upload
> successfully

Is this the same version of virt-v2v? How does the vddk version affect the
flow?

> 2.Similar problem has been fixed in bug1939375

This is only a similar symptom. That bug has no info about the underlying
issue and how it was fixed, so it is not helpful in any way.

Comment 7 Richard W.M. Jones 2021-12-14 14:57:00 UTC
> > nbdkit: python.6: error: /tmp/v2v.DIBlsF/rhv-upload-plugin.py: zero: error:
> > Traceback (most recent call last):

This looks like a consequence of the earlier failure.  I would ignore it.

> > 1. Virt-v2v can convert guests from ESXi6.5 to rhv via vddk6.5 + rhv-upload
> > successfully
> 
> Is this same version of virt-v2v? How vddk version affects the 
> flow?

The input and output sides of virt-v2v are separate so normally I'd say
that the version of VDDK (on the input side) couldn't affect the output
side.  However in this particular case, VDDK 6.5 didn't support extents
or asynch reads, whereas VDDK >= 6.7 supports both.  This would allow us
to copy more aggressively (and hence write to imageio more aggressively).
That might cause some difference I suppose.

> > 2.Similar problem has been fixed in bug1939375

Yes that bug seemed to have "fixed itself", but maybe it's the same thing
as this bug.  Anyway let's concentrate on this one.

Nir, what files should mxie retrieve for you in order to get all the
details needed?

Comment 8 Nir Soffer 2021-12-14 21:29:25 UTC
(In reply to Richard W.M. Jones from comment #2)
...
> However it could happen because in new virt-v2v we are actually
> flushing out the data at the end (which old virt-v2v did not do).

Old virt-v2v did flush, actually twice instead of once. Based on the imageio
log, we now flush exactly once per connection (better).

> I can
> see from the log that flush is supported by this imageio:
> 
>   Dec 13 22:00:09 nbdkit: debug: imageio features: flush=True zero=True
> unix_socket=None max_readers=8 max_writers=8
> 
> and therefore we would be using the PATCH operation with 'flush' as
> true which we probably never did before now:
> 
>  
> https://github.com/libguestfs/virt-v2v/blob/
> 1673fc4b640f754fbdc942a6636fd0e95b996144/output/rhv-upload-plugin.py#L235
> 
> Nir: Any ideas about this one?

In the attached log we see one transfer:

Adding a ticket:

2021-12-14 18:29:50,989 INFO    (Thread-38759) [tickets] [local] ADD ticket={'dirty': False, 'ops': ['write'], 'size': 21474836480, 'sparse': True, 'transfer_id': 'ab681da0-9d2d-4fb0-8bfd-c85e9096e7b1', 'uuid': 'bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c', 'timeout': 300, 'url': 'nbd:unix:/run/vdsm/nbd/bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c.sock'}

virt-v2v sent an OPTIONS request and then closed the connection:

2021-12-14 18:29:52,346 INFO    (Thread-38761) [images] [::ffff:10.73.224.92] OPTIONS ticket=bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c
2021-12-14 18:29:52,347 INFO    (Thread-38761) [backends.nbd] Open backend address='/run/vdsm/nbd/bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c.sock' export_name='' sparse=True max_connections=8
2021-12-14 18:29:52,350 INFO    (Thread-38761) [backends.nbd] Close backend address='/run/vdsm/nbd/bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c.sock'
2021-12-14 18:29:52,350 INFO    (Thread-38761) [http] CLOSE connection=38761 client=::ffff:10.73.224.92 [connection 1 ops, 0.003870 s] [dispatch 1 ops, 0.000797 s]

I think this is the initial connection used to probe the image.

Then it opens 4 new connections:

2021-12-14 18:30:26,359 INFO    (Thread-38767) [http] OPEN connection=38767 client=::ffff:10.73.224.92
2021-12-14 18:30:26,360 INFO    (Thread-38767) [backends.nbd] Open backend address='/run/vdsm/nbd/bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c.sock' export_name='' sparse=True max_connections=8

2021-12-14 18:30:26,370 INFO    (Thread-38768) [http] OPEN connection=38768 client=::ffff:10.73.224.92
2021-12-14 18:30:26,371 INFO    (Thread-38768) [backends.nbd] Open backend address='/run/vdsm/nbd/bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c.sock' export_name='' sparse=True max_connections=8

2021-12-14 18:30:26,433 INFO    (Thread-38769) [http] OPEN connection=38769 client=::ffff:10.73.224.92
2021-12-14 18:30:26,434 INFO    (Thread-38769) [backends.nbd] Open backend address='/run/vdsm/nbd/bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c.sock' export_name='' sparse=True max_connections=8

2021-12-14 18:30:26,443 INFO    (Thread-38770) [http] OPEN connection=38770 client=::ffff:10.73.224.92
2021-12-14 18:30:26,444 INFO    (Thread-38770) [backends.nbd] Open backend address='/run/vdsm/nbd/bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c.sock' export_name='' sparse=True max_connections=8

We don't have any info on the next 2 minutes, since we don't log every
request at INFO log level. We can enable DEBUG logs to see all requests.

The client did not write anything for 60 seconds, so imageio closed the connection:

2021-12-14 18:32:26,615 WARNING (Thread-38767) [http] Timeout reading or writing to socket: The read operation timed out
2021-12-14 18:32:26,615 INFO    (Thread-38767) [backends.nbd] Close backend address='/run/vdsm/nbd/bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c.sock'
2021-12-14 18:32:26,616 INFO    (Thread-38767) [http] CLOSE connection=38767 client=::ffff:10.73.224.92 [connection 1 ops, 120.255967 s] [dispatch 343 ops, 23.193331 s] [zero 191 ops, 10.383869 s, 7.01 MiB, 691.07 KiB/s] [zero.zero 191 ops, 10.380493 s, 7.01 MiB, 691.30 KiB/s] [write 151 ops, 12.653948 s, 24.84 MiB, 1.96 MiB/s] [write.read 151 ops, 0.323650 s, 24.84 MiB, 76.74 MiB/s] [write.write 151 ops, 12.324584 s, 24.84 MiB, 2.02 MiB/s]

This connection did write some data and zeroes at a very low rate, but it seems
that it did not send any request after 2021-12-14 18:31:26,615, so imageio
treated this connection as stale and closed it.

Why did nbdcopy not send any request for 60 seconds?

This looks like a connection closed by the client - maybe when cleaning up after
the error?

2021-12-14 18:32:26,642 INFO    (Thread-38770) [backends.nbd] Close backend address='/run/vdsm/nbd/bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c.sock'
2021-12-14 18:32:26,642 INFO    (Thread-38770) [http] CLOSE connection=38770 client=::ffff:10.73.224.92 [connection 1 ops, 120.198443 s] [dispatch 324 ops, 23.272728 s] [zero 162 ops, 9.586599 s, 192.50 MiB, 20.08 MiB/s] [zero.zero 162 ops, 9.583858 s, 192.50 MiB, 20.09 MiB/s] [write 161 ops, 13.546975 s, 21.94 MiB, 1.62 MiB/s] [write.read 161 ops, 0.302912 s, 21.94 MiB, 72.43 MiB/s] [write.write 161 ops, 13.238212 s, 21.94 MiB, 1.66 MiB/s]

Another connection times out...

2021-12-14 18:32:26,679 WARNING (Thread-38769) [http] Timeout reading or writing to socket: The read operation timed out
2021-12-14 18:32:26,679 INFO    (Thread-38769) [backends.nbd] Close backend address='/run/vdsm/nbd/bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c.sock'
2021-12-14 18:32:26,680 INFO    (Thread-38769) [http] CLOSE connection=38769 client=::ffff:10.73.224.92 [connection 1 ops, 120.245773 s] [dispatch 325 ops, 23.361805 s] [write 175 ops, 14.774178 s, 28.53 MiB, 1.93 MiB/s] [write.read 175 ops, 0.337966 s, 28.53 MiB, 84.42 MiB/s] [write.write 175 ops, 14.429570 s, 28.53 MiB, 1.98 MiB/s] [zero 149 ops, 8.443718 s, 58.82 MiB, 6.97 MiB/s] [zero.zero 149 ops, 8.441066 s, 58.82 MiB, 6.97 MiB/s]

And last connection times out...

2021-12-14 18:32:26,695 WARNING (Thread-38768) [http] Timeout reading or writing to socket: The read operation timed out
2021-12-14 18:32:26,695 INFO    (Thread-38768) [backends.nbd] Close backend address='/run/vdsm/nbd/bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c.sock'
2021-12-14 18:32:26,696 INFO    (Thread-38768) [http] CLOSE connection=38768 client=::ffff:10.73.224.92 [connection 1 ops, 120.325266 s] [dispatch 339 ops, 23.404887 s] [zero 167 ops, 9.125880 s, 395.62 MiB, 43.35 MiB/s] [zero.zero 167 ops, 9.122824 s, 395.62 MiB, 43.37 MiB/s] [write 171 ops, 14.135474 s, 18.56 MiB, 1.31 MiB/s] [write.read 171 ops, 0.268523 s, 18.56 MiB, 69.13 MiB/s] [write.write 171 ops, 13.860308 s, 18.56 MiB, 1.34 MiB/s]

Looks like nbdcopy did not read anything from the input side for 60
seconds, or was blocked on something for 60 seconds, since 3 connections
show a timeout on the imageio side.

The client opens 3 new connections - this is likely python reopening the connections
transparently after they were closed:

2021-12-14 18:32:38,815 INFO    (Thread-38786) [http] OPEN connection=38786 client=::ffff:10.73.224.92
2021-12-14 18:32:38,816 INFO    (Thread-38786) [images] [::ffff:10.73.224.92] OPTIONS ticket=bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c
2021-12-14 18:32:38,817 INFO    (Thread-38786) [backends.nbd] Open backend address='/run/vdsm/nbd/bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c.sock' export_name='' sparse=True max_connections=8

2021-12-14 18:32:38,826 INFO    (Thread-38787) [http] OPEN connection=38787 client=::ffff:10.73.224.92
2021-12-14 18:32:38,826 INFO    (Thread-38787) [images] [::ffff:10.73.224.92] OPTIONS ticket=bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c
2021-12-14 18:32:38,826 INFO    (Thread-38787) [backends.nbd] Open backend address='/run/vdsm/nbd/bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c.sock' export_name='' sparse=True max_connections=8

2021-12-14 18:32:38,836 INFO    (Thread-38788) [http] OPEN connection=38788 client=::ffff:10.73.224.92
2021-12-14 18:32:38,837 INFO    (Thread-38788) [images] [::ffff:10.73.224.92] OPTIONS ticket=bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c
2021-12-14 18:32:38,837 INFO    (Thread-38788) [backends.nbd] Open backend address='/run/vdsm/nbd/bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c.sock' export_name='' sparse=True max_connections=8

Finally the client sends a FLUSH request:

2021-12-14 18:44:22,974 INFO    (Thread-38788) [images] [::ffff:10.73.224.92] FLUSH ticket=bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c

Based on the virt-v2v code, we should get one FLUSH per connection,
but we got only one FLUSH - a bug?

And close the connections:

2021-12-14 18:44:23,598 INFO    (Thread-38786) [backends.nbd] Close backend address='/run/vdsm/nbd/bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c.sock'
2021-12-14 18:44:23,599 INFO    (Thread-38786) [http] CLOSE connection=38786 client=::ffff:10.73.224.92 [connection 1 ops, 704.782808 s] [dispatch 204 ops, 19.928551 s] [write 140 ops, 18.041646 s, 300.99 MiB, 16.68 MiB/s] [write.read 140 ops, 4.821075 s, 300.99 MiB, 62.43 MiB/s] [write.write 140 ops, 13.213603 s, 300.99 MiB, 22.78 MiB/s] [zero 63 ops, 1.788593 s, 155.05 MiB, 86.69 MiB/s] [zero.zero 63 ops, 1.787386 s, 155.05 MiB, 86.75 MiB/s]

2021-12-14 18:44:23,601 INFO    (Thread-38787) [backends.nbd] Close backend address='/run/vdsm/nbd/bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c.sock'
2021-12-14 18:44:23,602 INFO    (Thread-38787) [http] CLOSE connection=38787 client=::ffff:10.73.224.92 [connection 1 ops, 704.775691 s] [dispatch 203 ops, 22.131527 s] [write 138 ops, 19.951997 s, 298.70 MiB, 14.97 MiB/s] [write.read 138 ops, 5.318208 s, 298.70 MiB, 56.16 MiB/s] [write.write 138 ops, 14.627211 s, 298.70 MiB, 20.42 MiB/s] [zero 64 ops, 2.080120 s, 210.13 MiB, 101.02 MiB/s] [zero.zero 64 ops, 2.078934 s, 210.13 MiB, 101.08 MiB/s]

2021-12-14 18:44:23,731 INFO    (Thread-38788) [backends.nbd] Close backend address='/run/vdsm/nbd/bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c.sock'
2021-12-14 18:44:23,731 INFO    (Thread-38788) [http] CLOSE connection=38788 client=::ffff:10.73.224.92 [connection 1 ops, 704.894871 s] [dispatch 203 ops, 21.876973 s] [write 138 ops, 19.866507 s, 297.00 MiB, 14.95 MiB/s] [write.read 138 ops, 5.192211 s, 297.00 MiB, 57.20 MiB/s] [write.write 138 ops, 14.668017 s, 297.00 MiB, 20.25 MiB/s] [zero 63 ops, 1.910213 s, 310.27 MiB, 162.43 MiB/s] [zero.zero 63 ops, 1.909048 s, 310.27 MiB, 162.53 MiB/s] [flush 1 ops, 0.000254 s]

So the root cause is an inactive client, and imageio behaves as designed.


To better understand why we have a 60-second delay in nbdcopy, we should reproduce
again with DEBUG log level in imageio:

Create a drop-in file:

$ cat /etc/ovirt-imageio/conf.d/99-local.conf
[logger_root]
level = DEBUG

And restart the service:

    systemctl restart ovirt-imageio.service


If we cannot avoid the delays in nbdcopy, we can increase the server timeout.
It is not configurable now, but it can be changed in the source:

$ git diff -U5
diff --git a/ovirt_imageio/_internal/http.py b/ovirt_imageio/_internal/http.py
index 9feaec0..b70058a 100644
--- a/ovirt_imageio/_internal/http.py
+++ b/ovirt_imageio/_internal/http.py
@@ -170,11 +170,11 @@ class Connection(http.server.BaseHTTPRequestHandler):
     max_request_line = 4096

     # Number of second to wait for recv() or send(). When the timeout expires
     # we close the connection. This is important when working with clients that
     # keep the connection open after upload or download (e.g browsers).
-    timeout = 60
+    timeout = 120

     # For generating connection ids. Start from 1 to match the connection
     # thread name.
     _counter = itertools.count(1)


I can provide a build with a configurable timeout for testing, but I think we
need to understand why nbdcopy has such delays. We never had such delays
in the previous implementation when we tested IMS.

Comment 9 Nir Soffer 2021-12-14 21:59:49 UTC
(In reply to mxie from comment #1)
> Created attachment 1846206 [details]
> virt-v2v-1.45.94-3-esxi6.5-vddk6.7.log

This log does not match the daemon log posted later:

$ grep transfer_id *.log
daemon.log:2021-12-14 18:29:50,989 INFO    (Thread-38759) [tickets] [local] ADD ticket={'dirty': False, 'ops': ['write'], 'size': 21474836480, 'sparse': True, 'transfer_id': 'ab681da0-9d2d-4fb0-8bfd-c85e9096e7b1', 'uuid': 'bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c', 'timeout': 300, 'url': 'nbd:unix:/run/vdsm/nbd/bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c.sock'}
virt-v2v-1.45.94-3-esxi6.5-vddk6.7.log:Dec 13 22:00:08     "transfer_id": "1d3c6862-23bd-4946-a4c0-5a6c0b8eae05"

When you reproduce again with DEBUG logs (see end of comment 8), please include
both virt-v2v log and daemon.log for the same transfer.

Comment 10 Laszlo Ersek 2021-12-15 10:35:12 UTC
Two (not very smart) comments:

(In reply to Nir Soffer from comment #6)

> > nbdkit: python.6: error: /tmp/v2v.DIBlsF/rhv-upload-plugin.py: zero: error:
> > Traceback (most recent call last):
> >    File "/tmp/v2v.DIBlsF/rhv-upload-plugin.py", line 242, in zero
> >     http.request("PATCH", url.path, body=buf, headers=headers)
> >    File "/usr/lib64/python3.9/http/client.py", line 1285, in request
> >     self._send_request(method, url, body, headers, encode_chunked)
> >    File "/usr/lib64/python3.9/http/client.py", line 1296, in _send_request
> >     self.putrequest(method, url, **skips)
> >    File "/usr/lib64/python3.9/http/client.py", line 1122, in putrequest
> >     raise CannotSendRequest(self.__state)
> >  http.client.CannotSendRequest: Request-sent
> 
> This means you tried to send a request without consuming the previous
> request response.

Does this refer to a single connection, or to multiple connections together? (Sorry if my question is inexact, I know nothing about imageio.) I can understand if request/response is expected in lock-step on a single connection, but if multiple connections are used to transfer an image in some parallel fashion, then a pending response on a connection should not force another connection to delay a request. (Again I apologize if this is ridiculously naive; perhaps imageio already does this.)

(In reply to Nir Soffer from comment #8)

> Why nbdcopy did not send any request for 60 seconds?
> [...]
> So the root cause is inactive client, and imageio behaves as designed.

Rich's comment 7 indicates that on the input side of virt-v2v, vddk-6.7 should be *more* aggressive than vddk-6.5 (which does not produce this symptom), hence the push to imageio could also be more aggressive. However, I can imagine an opposite outcome: because with vddk-6.7 we massage the vmware server more vigorously, perhaps that exactly leads to some lock contention or other bottleneck in either the vmware image server, or the vddk library in virt-v2v. And if the input side of virt-v2v is blocked for whatever reason for a long time, that will cause silence on the output side too.

If we can catch this long period of silence "in the act", then forcing a core dump of the input nbdkit might be informative -- see where the nbdkit threads are stuck.

Just my two cents. Thanks!

Comment 11 Richard W.M. Jones 2021-12-15 10:36:39 UTC
> Look like nbdcopy did not read anything from the input side for 60
> seconds, or was blockedc on something for 60 seconds, since 3 connections
> show timeout on the imageio side.

nbdcopy could easily not send anything for 60 seconds or longer if the
input side is slow.  In this case the most likely explanation is simply
that we are reading extent information from VDDK, a known slow operation.
The difference from previous virt-v2v is we used qemu-img convert before.
qemu-img convert maps the input disk up front and then sends all the data,
nbdcopy maps the disk as it goes along.  Neither behaviour is wrong, just
different, but plenty of people complained about the qemu-img behaviour
because it looked as if virt-v2v was "doing nothing" (copying stuck at 0%)
for a long time, so that was why I changed it.

However I'm not convinced this is really the problem.  The error is:

  Remote end closed connection without response

Looking at the code in /usr/lib64/python3.10/http/client.py this error
seems to be happening when we have sent an HTTP request and while reading
the HTTP status line (before even the response headers) the connection is
closed abruptly.  HTTP connections are not supposed to be closed there
as far as I'm aware.  Could it be that the imageio timeout applies in
the wrong place?

> Based on virt-v2v code, we should get one FLUSH per connection,
> but we got only one FLUSH - bug?

It depends!  nbdcopy only has a single NBD connection open, and only
sends one NBD_CMD_FLUSH.  You are correct that the plugin will turn
this into one flush request per HTTP connection.  It seems like the
failure occurs during this loop on the client side, but I'm not clear why.

If it's not the reason that I gave above, maybe we need to check (somehow?)
if each connection is still open before sending the flush, or simply
ignore errors here.
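
A rough sketch of the per-connection check idea (the pool and send_flush
helpers are hypothetical placeholders, not the plugin's actual code; and as
discussed further down, silently skipping a dropped connection may not be safe):

    def flush_all(pool, send_flush):
        for http in pool:
            if http.sock is None:     # connection already closed on our side
                continue              # or reconnect / raise - see the discussion below
            send_flush(http)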

Comment 12 Richard W.M. Jones 2021-12-15 10:40:17 UTC
> Rich's comment 7 indicates that on the input side of virt-v2v, vddk-6.7 should be *more*
> agressive than vddk-6.5 (which does not produce this symptom), hence the push to imageio
> could be also more aggressive.

It's a bit of both.  VDDK >= 6.7 supports asynch reads which lets us read data faster.
But also VDDK >= 6.7 supports extents, so that means we use the (slow)
VixDiskLib_QueryAllocatedBlocks call to read extents during the copy, and that blocks us
for seconds.

I hate timeouts BTW.  Why do we even have them in imageio?

Comment 13 Laszlo Ersek 2021-12-15 11:25:25 UTC
(In reply to Richard W.M. Jones from comment #12)

> I hate timeouts BTW.  Why do we even have them in imageio?

I think all servers must have them; otherwise they could be easily DoS'd (intentionally or not). Sometimes the client just vanishes (network glitch?), and TCP-level keepalives (if enabled) often take too long to time out.

Comment 14 Nir Soffer 2021-12-15 12:36:36 UTC
(In reply to Laszlo Ersek from comment #10)
> Two (not very smart) comments:
> 
> (In reply to Nir Soffer from comment #6)
> 
> > > nbdkit: python.6: error: /tmp/v2v.DIBlsF/rhv-upload-plugin.py: zero: error:
> > > Traceback (most recent call last):
> > >    File "/tmp/v2v.DIBlsF/rhv-upload-plugin.py", line 242, in zero
> > >     http.request("PATCH", url.path, body=buf, headers=headers)
> > >    File "/usr/lib64/python3.9/http/client.py", line 1285, in request
> > >     self._send_request(method, url, body, headers, encode_chunked)
> > >    File "/usr/lib64/python3.9/http/client.py", line 1296, in _send_request
> > >     self.putrequest(method, url, **skips)
> > >    File "/usr/lib64/python3.9/http/client.py", line 1122, in putrequest
> > >     raise CannotSendRequest(self.__state)
> > >  http.client.CannotSendRequest: Request-sent
> > 
> > This means you tried to send a request without consuming the previous
> > request response.
> 
> Does this refer to a single connection, or to mutliple connections together?
> (Sorry if my question is inexact, I know nothing about imageio.) I can
> understand if request/response is expected in lock-step on a single
> connection, but if multiple connections are used to transfer an image in
> some parallel fashion, then a pending response on a connection should not
> force another connection to delay a request. (Again I apologize if this is
> ridiculously naive; perhaps imageio already does this.)

imageio is pretty naive and does not consider other connections when closing
a single connection because a client is inactive. But closing one connection
does not affect other connections.

In the virt-v2v case, the work should be spread evenly over all connections,
since we use a queue to manage them.

When pwrite()/zero() are called we do:

    with http_context(pool) as http:
        send request

http_context() returns the first connection in the queue and when done return
it to the end of the queue:

    @contextmanager
    def http_context(pool):
        http = pool.get()
        try:
            yield http
        finally:
            pool.put(http)

So we should not have one connection that never gets work and gets disconnected.

It is possible to write a bad client application that does not spread work
evenly on all connections, but I'm not sure we should spend time supporting
such a case.
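
A minimal sketch of the queue-based pool described above (make_connection()
stands in for however the plugin actually builds its HTTPS or UNIX-socket
connection to imageio):

    import queue

    def create_http_pool(make_connection, count=4):
        pool = queue.Queue()
        for _ in range(count):
            pool.put(make_connection())
        return pool

    # Usage, together with the http_context() shown above:
    #
    #     pool = create_http_pool(make_connection)
    #     with http_context(pool) as http:
    #         ...send one request and read its response...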

> (In reply to Nir Soffer from comment #8)
> 
> > Why nbdcopy did not send any request for 60 seconds?
> > [...]
> > So the root cause is inactive client, and imageio behaves as designed.
> 
> Rich's comment 7 indicates that on the input side of virt-v2v, vddk-6.7
> should be *more* agressive than vddk-6.5 (which does not produce this
> symptom), hence the push to imageio could be also more aggressive. However,
> I can imagine an opposite outcome: because with vddk-6.7 we massage the
> vmware server more vigorously, perhaps that exactly leads to some lock
> contention or other bottleneck in either the vmware image server, or the
> vddk library in virt-v2v. And if the input side of virt-v2v is blocked for
> whatever reason for a long time, that will cause silence on the output side
> too.
> 
> If we can catch this long period of silence "in the act", then forcing a
> core dump of the input nbdkit might be informative -- see where the nbdkit
> threads are stuck.

I agree that the next step is to understand what nbdcopy or the vddk plugin are doing
when this happens.

Adding DEBUG logs in imageio will make it clearer, since we log every request.
At INFO level we only log opening a connection, sending OPTIONS (like an nbd handshake),
flushing, and closing the connection.

Comment 16 Nir Soffer 2021-12-15 13:08:45 UTC
(In reply to Richard W.M. Jones from comment #11)
> > Look like nbdcopy did not read anything from the input side for 60
> > seconds, or was blockedc on something for 60 seconds, since 3 connections
> > show timeout on the imageio side.
> 
> nbdcopy could easily not send anything for 60 seconds or longer if the
> input side is slow.  In this case the most likely explanation is simply
> that we are reading extent information from VDDK, a known slow operation.
> The difference from previous virt-v2v is we used qemu-img convert before.
> qemu-img convert maps the input disk up front and then sends all the data,
> nbdcopy maps the disk as it goes along.  Neither behaviour is wrong, just
> different, but plenty of people complained about the qemu-img behaviour
> because it looked as if virt-v2v was "doing nothing" (copying stuck at 0%)
> for a long time, so that was why I changed it.

Mapping the image as we go should not lead to such delays; it should have
much lower delays compared to qemu-img, since we don't map the entire
image upfront.

nbdcopy uses 4 worker threads, each doing something like:

   while True:
       get next offset
       return if done

       get extents for range offset - offset + 128m
       for extent in extents:
           if extent is zero:
               zero range
           else:
               read data range
               split read data to zero/data block
               for block in read data:
                   if zero:
                       zero range
                   else:
                       write data

So we expect a small delay when reading the next 128m extents.
 
> However I'm not convinced this is really the problem.  The error is:
> 
>   Remote end closed connection without response
> 
> Looking at the code in /usr/lib64/python3.10/http/client.py this error
> seems to be happening when we have sent an HTTP request and while reading
> the HTTP status line (before even the response headers) the connection is
> closed abruptly.  HTTP connections are not supposed to be closed there
> as far as I'm aware.  Could it be that the imageio timeout applies in
> the wrong place?

The server thread is doing:

    while True:
        read status line
        read headers
        perform  work (may read body) ...

When the client is inactive, the thread is waiting to read the status line, so
this is the place where we can use a timeout.
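
A stripped-down sketch of that loop (imageio's Connection builds on the
stdlib http.server.BaseHTTPRequestHandler, so this is only an illustration
of where the socket timeout can fire, not the real code):

    import socket

    def serve_connection(conn: socket.socket, timeout: float = 60.0) -> None:
        conn.settimeout(timeout)
        rfile = conn.makefile("rb")
        try:
            while True:
                try:
                    # Blocks here while the client is idle between requests.
                    request_line = rfile.readline()
                except socket.timeout:
                    return                    # inactive client - drop the connection
                if not request_line:
                    return                    # client closed the connection
                # ...read headers, dispatch the request, send the response...
        finally:
            conn.close()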

Where do you suggest having the timeout?

> > Based on virt-v2v code, we should get one FLUSH per connection,
> > but we got only one FLUSH - bug?
> 
> It depends!  nbdcopy only has a single NBD connection open, and only
> sends one NBD_CMD_FLUSH.  You are correct that the plugin will turn
> this into one flush request per HTTP connection.  It seems like the
> failure occurs during this loop on the client side, but I'm not clear why.

We need to debug this; note that qemu-nbd does not ensure that one FLUSH will flush
everything.

> If it's not the reason that I gave above, maybe we need to check (somehow?)
> if each connection is still open before sending the flush, or simply
> ignore errors here.

You cannot ignore the errors; connections should not be closed under normal
conditions.

It may be possible to handle a closed connection transparently, but I'm not
sure it is safe. When imageio closes the client connection, it also closes the
underlying nbd connection to qemu-nbd, and it does not flush anything
since we have no way to report flush failures, so theoretically you may lose
data written to the old connection.

It would be better to be more strict and disable python's automatic reopening of
connections - this can be done using:

    http = HTTPSConnection(...)
    http.connect()
    http.auto_open = False

The flag is not documented, but is public, so it is unlikely to change.

Comment 17 Richard W.M. Jones 2021-12-15 13:28:07 UTC
> nbdcopy uses 4 worker threads, each doing something like:

In this case nbdcopy will be using the multi-thread loop, but with only one
thread (because the input side does not support multi-conn).  But the
rest is right.

However don't underestimate how slow VDDK is for mapping extents!  In fact
we now collect stats on this (see very end of the virt-v2v log):

Dec 13 22:32:13 nbdkit: debug: VixDiskLib_...                        µs calls           bytes
Dec 13 22:32:13 nbdkit: debug:   QueryAllocatedBlocks        1607496548  3643
Dec 13 22:32:13 nbdkit: debug:   ReadAsync                    167574880 20174      6681559040
Dec 13 22:32:13 nbdkit: debug:   GetInfo                        6599017     5

Notice that on average each QueryAllocatedBlocks call is taking ~ 1/2 a second
(also overlapping them is not supported), and those could add up (we require
several calls per work block, and maybe some calls take longer than others).
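
(A quick back-of-the-envelope check of those numbers: 1,607,496,548 µs / 3643
calls ≈ 441 ms per QueryAllocatedBlocks call, versus 167,574,880 µs / 20174
calls ≈ 8 ms per ReadAsync call, so extent mapping dominates the time by far.)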

> When client is inactive, the thread is waiting reading the status line, so
> this is the place we can use a timeout.

If the timeout is only at the point where you read the next request
from the client side then that is fine.  However the error message
seems to indicate that the timeout actually happens after the client
has sent the request and is waiting for the 200 status message in
reply (ie. at the point where imageio would be doing work).

>  http = HTTPSConnection(...)
>  http.connect()
>  http.auto_open = False

What's the implication of doing this?  Surely it means that there might
be a delay somewhere, eg. on the input side, and then some HTTP connections
in the pool would be closed, and at that point we'd fail, but for no real
reason.

Basically we cannot assume that VDDK can deliver data as fast as imageio
can consume it, for all kinds of reasons, such as VDDK sucking, the network
might be slow, we have the retry filter on the input side, etc.

Comment 18 Nir Soffer 2021-12-15 13:42:24 UTC
(In reply to Richard W.M. Jones from comment #17)
> > nbdcopy uses 4 worker threads, each doing something like:
> 
> In this case nbdcopy will be using the multi-thread loop, but with only one
> thread (because the input side does not support multi-conn).  But the
> rest is right.
> 
> However don't underestimate how slow VDDK is for mapping extents!  In fact
> we now collect stats on this (see very end of the virt-v2v log):
> 
> Dec 13 22:32:13 nbdkit: debug: VixDiskLib_...                        µs
> calls           bytes
> Dec 13 22:32:13 nbdkit: debug:   QueryAllocatedBlocks        1607496548  3643
> Dec 13 22:32:13 nbdkit: debug:   ReadAsync                    167574880
> 20174      6681559040
> Dec 13 22:32:13 nbdkit: debug:   GetInfo                        6599017     5
> 
> Notice that on average each QueryAllocatedBlocks call is taking ~ 1/2 a
> second,
> (also overlapping them is not supported), and those could add up (we require
> several calls per work block, and maybe some calls take longer than others).

If mapping extents is so slow that we can hit the 60-second timeout on the imageio
side, it is likely to be faster to read the data and detect zeroes in nbdcopy.

We really need a log with timestamps for nbdcopy or vddk to understand what's going
on.
 
> > When client is inactive, the thread is waiting reading the status line, so
> > this is the place we can use a timeout.
> 
> If the timeout is only at the point where you read the next request
> from the client side then that is fine.  However the error message
> seems to indicate that the timeout actually happens after the client
> has sent the request and is waiting for the 200 status message in
> reply (ie. at the point where imageio would be doing work).

We don't have any other timeout. The timeout is used only when reading
from the client socket.

> >  http = HTTPSConnection(...)
> >  http.connect()
> >  http.auto_open = False
> 
> What's the implication of doing this? 

With this setting, if a connection is closed by the server side, python will
not try to open a new connection automatically, and you will fail when trying
to use the connection.

This ensures that a closed connection will fail the transfer, since in this
case you may lose data written using the closed connection that was never
flushed.
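
For illustration, the failure you get with auto_open disabled is the stdlib
NotConnected error (the host here is a placeholder):

    from http.client import HTTPSConnection, NotConnected

    http = HTTPSConnection("imageio.example.com", 54322)   # placeholder host
    http.connect()
    http.auto_open = False

    # ...later the connection is closed, e.g. after a server-side drop is
    # detected and the error path calls close()...
    http.close()

    try:
        http.request("GET", "/")   # with auto_open=True this would silently reconnect
    except NotConnected:
        print("connection is gone - fail the transfer instead of reconnecting")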

Comment 19 Richard W.M. Jones 2021-12-15 13:58:53 UTC
Note to self that Nir pointed out that the log is actually full of
earlier I/O errors which appear to be ignored or possibly silently
retried.  He also sent me instructions for how to just run imageio
on its own which will be useful for trying to reproduce this locally.
I'm not going to be able to deal with this until back next year.

Comment 20 Nir Soffer 2021-12-15 14:34:24 UTC
Adding here the instructions how to test with a standalone imageio server
(no oVirt setup needed).

Clone imageio source - it is easier to test with upstream:

    git clone https://github.com/oVirt/ovirt-imageio.git

Shell 1 - running imageio server
---------------------------------

1. Build

    $ make

2. Start imageio server using debug configuration

    $ ./ovirt-imageio -c test

This server logs to stderr for easier debugging.

Shell 2 - running qemu-nbd
--------------------------

1. Create target image

    qemu-img create -f qcow2 /var/tmp/test.qcow2 6g

2. Install a ticket for this transfer

Copy the example ticket in examples/nbd.json and modify it as needed:

    $ cat myticket.json
    {
        "uuid": "myticket",
        "size": 6442450944,
        "url": "nbd:unix:/tmp/nbd.sock",
        "timeout": 3000,
        "sparse": true,
        "ops": ["read", "write"]
    }

    ./ovirt-imageioctl add-ticket -c test myticket.json

3. Start qemu-nbd to serve this ticket

    $ qemu-nbd -t -e8 -A -f qcow2 -k /tmp/nbd.sock /var/tmp/test.qcow2

At this point you can upload or download with this transfer url:

    https://localhost:54322/images/myticket

You should use the test certificate in the client:

    test/pki/system/ca.pem
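
For reference, a minimal Python client for poking at that standalone server
(port, ticket id and CA path come from the steps above; adjust TLS
verification as needed if the test certificate does not match "localhost"):

    import ssl
    from http.client import HTTPSConnection

    ctx = ssl.create_default_context(cafile="test/pki/system/ca.pem")
    conn = HTTPSConnection("localhost", 54322, context=ctx)
    conn.request("OPTIONS", "/images/myticket")
    resp = conn.getresponse()
    print(resp.status, resp.reason)
    print(resp.read().decode())    # JSON describing the server features
    conn.close()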

Comment 21 Richard W.M. Jones 2021-12-15 16:34:39 UTC
Thanks - I did a very simple test.  I set up an imageio server exactly as
you describe above.

I then ran nbdkit (from upstream) using the virt-v2v plugin directly:

$ ./nbdkit -fv python ~/d/virt-v2v/output/rhv-upload-plugin.py url=https://localhost:54322/images/myticket insecure=1 size=6442450944 &

and ran nbdsh:

$ nbdsh -u nbd://localhost

Now I am able to type nbdsh commands and see how the nbdkit debug messages and
the imageio debug messages all at the same time.  Quite interesting.

Commands work OK, eg:

nbd> buf = h.pread(512,0)
nbdkit: python.0: debug: python: pread count=512 offset=0
nbd> print(buf)
b'\x00\x00\x00 [etc]

nbd> h.zero(512,0)
nbdkit: python.11: debug: python: zero count=512 offset=0 may_trim=1 fua=0 fast=0

After waiting for approx 60 seconds I see that imageio closes all four connections
with a message like this (four times):

2021-12-15 16:30:49,479 INFO    (Thread-9 (process_request_thread)) [http] CLOSE connection=9 client=::1 [connection 1 ops, 98.183511 s] [dispatch 5 ops, 0.020925 s] [read 3 ops, 0.001616 s, 1.50 KiB, 928.20 KiB/s] [read.read 3 ops, 0.001148 s, 1.50 KiB, 1.28 MiB/s] [read.write 3 ops, 0.000374 s, 1.50 KiB, 3.91 MiB/s] [zero 2 ops, 0.016304 s, 1.00 KiB, 61.33 KiB/s] [zero.zero 2 ops, 0.016263 s, 1.00 KiB, 61.49 KiB/s]

Now all calls return errors:

nbd> h.zero(512,0)
nbdkit: python.8: debug: python: zero count=512 offset=0 may_trim=1 fua=0 fast=0
nbdkit: python.8: error: /home/rjones/d/virt-v2v/output/rhv-upload-plugin.py: zero: error: Traceback (most recent call last):
[...]

(The error arrives in nbdsh, so we're not losing it)

Sometimes, I'm not clear when or why, nbdkit seems to recover.

Anyway I think this is indeed a problem in virt-v2v, in rhv-upload-plugin and/or
with the Python HTTP client.

Now that I have a nice local reproducer, I will try to see how I can get Python to
recover automatically.

Comment 22 Richard W.M. Jones 2021-12-15 17:28:49 UTC
Summary of discussion on IRC:

 - We will try to use a longer timeout in imageio, 10 mins suggested.

 - It's still worth investigating why it is that http.client cannot recover
   from a closed connection and if it's possible to modify the client side
   to make it recover.

 - Discussion of using higher-level client libraries (Python Requests was mentioned).
   However we previously dropped this because of performance problems.

Comment 23 Nir Soffer 2021-12-15 18:23:22 UTC
I plan to improve the imageio side by supporting configurable socket 
timeout, based on the already exposed inactivity_timeout property of
the image transfer.

See https://github.com/oVirt/ovirt-imageio/issues/14

I guess we want to have a RHV bug for tracking this.

Comment 24 Nir Soffer 2021-12-15 19:10:12 UTC
I think the root cause is changing virt-v2v to open the output side
*before* converting the overlay.

In the past virt-v2v did:

1. open the vddk input
2. convert the image using an overlay file
3. run qemu-img convert with nbdkit rhv-output plugin to transfer the image

Now it does:

1. open the vddk input
2. open the nbdkit rhv-output plugin
3. convert the image using an overlay file
4. run nbdcopy to transfer the data from the input (vddk) to output (rhv-upload)

Converting the image takes a lot of time, and the imageio http connections time
out during the conversion.

Fixing the timeout on the imageio side can solve this issue for RHV 4.4.10, but
when working with the current RHV we cannot fix this in imageio.

This must be fixed in virt-v2v if we want to support older RHV versions.

Comment 25 Richard W.M. Jones 2021-12-15 19:37:15 UTC
We can't really change this now.  We will have to make the plugin reconnect
if the HTTP session drops (IOW how HTTP is normally supposed to work).

Comment 26 Nir Soffer 2021-12-15 19:49:37 UTC
(In reply to Richard W.M. Jones from comment #25)
> We can't really change this now.  We will have to make the plugin reconnect
> if the HTTP session drops (IOW how HTTP is normally supposed to work).

I don't think it will be safe to reconnect a dropped connection, since
qemu-nbd does not support MULTI_CONN yet for writable server.

So what can be done safely on virt-v2v side:

- option 1: open the imageio nbdkit only when the overlay is ready

- option 2: avoid the server timeout by closing idle connections before the 
  server times out, and flushing every connection before closing.

- option 3: close the initial connection to imageio, and open the http
  pool only when we get the first real request from nbdcopy.

Option 3 is what we did in the past to avoid the same issue, see:
https://github.com/libguestfs/virt-v2v/commit/1d5fc257765c444644e5bfc6525e86ff201755f0
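
A rough sketch of option 3, creating the pool lazily on the first real
request instead of in after_fork (names and sizes are illustrative, not the
actual plugin code; the commit above did something along these lines):

    import queue
    import threading

    _pool = None
    _pool_lock = threading.Lock()

    def get_pool(make_connection, size=4):
        # Create the HTTP pool only when the first pread/pwrite/zero arrives.
        global _pool
        with _pool_lock:
            if _pool is None:
                _pool = queue.Queue()
                for _ in range(size):
                    _pool.put(make_connection())
        return _pool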

Comment 27 Nir Soffer 2021-12-15 20:45:17 UTC
Previously I omitted some info from the log which looked like an unrelated bug.

When rhv-upload-plugin connects to imageio, it opens one connection to get
the server options, then closes that connection and creates a pool of connections.
So we expect to see a single OPTIONS request per transfer.

But in the attached daemon.log we see:

The initial connection:

2021-12-14 18:29:52,346 INFO    (Thread-38761) [http] OPEN connection=38761 client=::ffff:10.73.224.92
2021-12-14 18:29:52,346 INFO    (Thread-38761) [images] [::ffff:10.73.224.92] OPTIONS ticket=bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c
2021-12-14 18:29:52,347 INFO    (Thread-38761) [backends.nbd] Open backend address='/run/vdsm/nbd/bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c.sock' export_name='' sparse=True max_connections=8
2021-12-14 18:29:52,350 INFO    (Thread-38761) [backends.nbd] Close backend address='/run/vdsm/nbd/bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c.sock'
2021-12-14 18:29:52,350 INFO    (Thread-38761) [http] CLOSE connection=38761 client=::ffff:10.73.224.92 [connection 1 ops, 0.003870 s] [dispatch 1 ops, 0.000797 s]

This is expected, called from:

 79 def after_fork():
 80     global options, pool
 81 
 82     http = create_http(url)
 83     options = get_options(http, url)
 84     http.close()
 85 
 86     nbdkit.debug("imageio features: flush=%(can_flush)r "
 87                  "zero=%(can_zero)r unix_socket=%(unix_socket)r "
 88                  "max_readers=%(max_readers)r max_writers=%(max_writers)r"
 89                  % options)
 90 
 91     pool = create_http_pool(url, options)


34 seconds later, we see 4 new connections:

2021-12-14 18:30:26,359 INFO    (Thread-38767) [http] OPEN connection=38767 client=::ffff:10.73.224.92
2021-12-14 18:30:26,359 INFO    (Thread-38767) [images] [::ffff:10.73.224.92] OPTIONS ticket=bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c
2021-12-14 18:30:26,360 INFO    (Thread-38767) [backends.nbd] Open backend address='/run/vdsm/nbd/bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c.sock' export_name='' sparse=True max_connections=8
2021-12-14 18:30:26,370 INFO    (Thread-38768) [http] OPEN connection=38768 client=::ffff:10.73.224.92
2021-12-14 18:30:26,370 INFO    (Thread-38768) [images] [::ffff:10.73.224.92] OPTIONS ticket=bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c
2021-12-14 18:30:26,371 INFO    (Thread-38768) [backends.nbd] Open backend address='/run/vdsm/nbd/bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c.sock' export_name='' sparse=True max_connections=8
2021-12-14 18:30:26,433 INFO    (Thread-38769) [http] OPEN connection=38769 client=::ffff:10.73.224.92
2021-12-14 18:30:26,434 INFO    (Thread-38769) [images] [::ffff:10.73.224.92] OPTIONS ticket=bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c
2021-12-14 18:30:26,434 INFO    (Thread-38769) [backends.nbd] Open backend address='/run/vdsm/nbd/bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c.sock' export_name='' sparse=True max_connections=8
2021-12-14 18:30:26,443 INFO    (Thread-38770) [http] OPEN connection=38770 client=::ffff:10.73.224.92
2021-12-14 18:30:26,443 INFO    (Thread-38770) [images] [::ffff:10.73.224.92] OPTIONS ticket=bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c
2021-12-14 18:30:26,444 INFO    (Thread-38770) [backends.nbd] Open backend address='/run/vdsm/nbd/bf5d1d57-9fd5-4f19-ab8c-b3c2d580951c.sock' export_name='' sparse=True max_connections=8

This is not expected, since after_fork should be called exactly once!

So the issue can be:

1. initial connection created successfully
2. after some delay, after_fork is called 4 times again, creating a new http pool for each
   call, overwriting the global pool.
3. virt-v2v is inactive for 120 seconds, so imageio drops the connections.

Comment 28 Nir Soffer 2021-12-15 20:50:14 UTC
In the attached virt-v2v-1.45.94-3-esxi6.5-vddk6.7.log we see:

Dec 13 22:00:09 nbdkit: debug: imageio features: flush=True zero=True unix_socket=None max_readers=8 max_writers=8
Dec 13 22:00:09 nbdkit: debug: creating http pool connections=4
Dec 13 22:00:09 nbdkit: debug: creating https connection host=dell-per740-22.lab.eng.pek2.redhat.com port=54323
Dec 13 22:00:09 nbdkit: debug: creating https connection host=dell-per740-22.lab.eng.pek2.redhat.com port=54323
Dec 13 22:00:09 nbdkit: debug: creating https connection host=dell-per740-22.lab.eng.pek2.redhat.com port=54323
Dec 13 22:00:09 nbdkit: debug: creating https connection host=dell-per740-22.lab.eng.pek2.redhat.com port=54323

This looks correct, but it is not the right log.

We must have virt-v2v log and daemon.log from a failed transfer with DEBUG logs
to continue.

Comment 29 Nir Soffer 2021-12-15 21:25:04 UTC
I think what we see here is this:

In comment 28, we see virt-v2v opening a connection to the imageio proxy server running on
the engine host:

    port=54323

When connecting to the proxy, it opens a connection to the imageio server on the
host - this always sends an OPTIONS request. This explains why we see 4 OPTIONS
requests in the imageio server log.

To continue we need:

- Description of the test for each failure
  e.g. where virt-v2v is running (on rhv host? on engine? on another host?)
- imageio debug logs from engine host
- imageio debug logs from the host serving this transfer
- virt-v2v debug log

All logs should show the same transfer, check that transfer_id is the same
in all the logs.

Comment 30 Richard W.M. Jones 2021-12-15 22:49:32 UTC
After some discussion with Nir on IRC we decided that the default for the
-oo rhv-direct option is a problem here (without it, uploads go through a
slow proxy by default).  I have filed a bug to reverse the default, see:

https://bugzilla.redhat.com/show_bug.cgi?id=2033096

Comment 31 mxie@redhat.com 2021-12-16 04:01:28 UTC
(In reply to  Laszlo Ersek from comment #10)
> Does this refer to a single connection, or to mutliple connections together? 

It's a single v2v conversion on a standalone v2v server.

(In reply to Richard W.M. Jones from comment #30)
> After some discussion with Nir on IRC we decided that the -oo rhv-direct
> option is a problem here (because it uses a slow proxy by default).  I
> have filed a bug to reverse the default, see:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=2033096

But I didn't use option "-oo rhv-direct" in v2v conversion.

> To continue we need:
> 
> - Description of the test for each failure
>   e.g. where virt-v2v is running (on rhv host? on engine? on another host?)
> - imageio debug logs from engine host
> - imageio debug logs from the host serving this transfer
> - virt-v2v debug log
> 
> All logs should show the same transfer, check that transfer_id is the same
> in all the logs.

To collect the logs, convert a rhel7.9 guest from ESXi6.5 via vddk7.0.2 + rhv-upload on a standalone v2v server (my rhel9 test server). Note: the "-oo rhv-direct" option is not used in the v2v conversion.

#  virt-v2v -ic vpx://root.73.141/data/10.73.196.89/?no_verify=1 -it vddk -io vddk-libdir=/home/vddk7.0.2 -io vddk-thumbprint=1F:97:34:5F:B6:C2:BA:66:46:CB:1A:71:76:7D:6B:50:1E:03:00:EA   -ip /home/passwd  -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api  -op /home/rhvpasswd  -os nfs_data -b ovirtmgmt esx6.5-rhel7.9-x86_64 -v -x |& ts > virt-v2v-1.45.94-3-esx6.5-rhel7.9-vddk7.0.2.log
█ 100% [****************************************]


# cat  virt-v2v-1.45.94-3-esx6.5-rhel7.9-vddk7.0.2.log |grep transfer_id
Dec 15 21:22:44     "transfer_id": "71d5d697-cce5-4c2f-a88b-52da3dc95151"


The logs collected on the rhv env are:
rhv-node-daemon-esx6.5-rhel7.9-vddk7.0.2.log
rhv-node-vdsm-esx6.5-rhel7.9-vddk7.0.2.log
rhv-server-daemon-esx6.5-rhel7.9-vddk7.0.2.log

Comment 33 Nir Soffer 2021-12-16 23:27:00 UTC
(In reply to Nir Soffer from comment #23)
> I plan to improve the imageio side by supporting configurable socket 
> timeout, based on the already exposed inactivity_timeout property of
> the image transfer.
> 
> See https://github.com/oVirt/ovirt-imageio/issues/14

We now have patches for engine and ovirt-imageio:
engine: https://gerrit.ovirt.org/c/ovirt-engine/+/118075
imageio: https://github.com/oVirt/ovirt-imageio/pull/15

This can solve this issue for the next RHV release. It will not help
when virt-v2v is accessing the current RHV release (<=4.4.9).

Comment 34 Nir Soffer 2021-12-16 23:35:01 UTC
(In reply to mxie from comment #31)
> (In reply to  Laszlo Ersek from comment #10)
> > Does this refer to a single connection, or to mutliple connections together? 
> 
> It's a single v2v conversion on standalone v2v server.
> 
> (In reply to Richard W.M. Jones from comment #30)
> > After some discussion with Nir on IRC we decided that the -oo rhv-direct
> > option is a problem here (because it uses a slow proxy by default).  I
> > have filed a bug to reverse the default, see:
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=2033096
> 
> But I didn't use option "-oo rhv-direct" in v2v conversion.

Without the -oo rhv-direct option (the current default), the image is uploaded
to the RHV engine host, and then the image is uploaded to one of the RHV
hypervisors. This is not efficient and not recommended, and it should not be
the default mode in virt-v2v. This is why Richard filed the new bug.

Can you reproduce the same issue when using "-oo rhv-direct"?

This option works only if you can access the RHV hypervisors from the host
performing the upload.

Comment 35 Nir Soffer 2021-12-17 00:03:44 UTC
mxie, do you want to test the ovirt-imageio patch with your setup?

We can test this issue like this:

1. Download a build for centos 8

https://github.com/oVirt/ovirt-imageio/suites/4680140720/artifacts/127451989

2. Unzip and install on both rhv-server and rhv-nodes

    unzip rpm-centos-8.zip
    dnf upgrade rpm-centos-8/*.rpm

This will install ovirt-imageio-daemon-2.4.1-0.202112161451.gitd8eaabc.el8.x86_64
on the nodes.

This version supports a configurable inactivity timeout, allowing virt-v2v to
get a longer timeout that avoids the failures when the vddk input is blocked
for several minutes during the import.

Once the RHV manager includes the fix, no configuration will be needed on the
hosts, but for now we need to configure the timeout on the hosts.

3. Configure larger inactivity timeout

Add a drop-in file to ovirt-imageio on both the rhv-server and the rhv-nodes:

    $ cat /etc/ovirt-imageio/conf.d/99-bz2032324.conf 
    [daemon]
    inactivity_timeout = 600

And restart ovirt-imageio service:

    systemctl restart ovirt-imageio

I think with this configuration the import should work without errors.

Comment 36 mxie@redhat.com 2021-12-17 06:39:04 UTC
> Can you reproduce the same issue when using "-oo rhv-direct"?
> 
> This option works only if you can access the RHV hypervisors from the host
> performing the upload.

As the rhv4.4 node is based on rhel8 and the bug can only be reproduced with rhel9 v2v, I can't try "-oo rhv-direct" on the rhv node.


(In reply to Nir Soffer from comment #35)
> mxie, do you want to test the ovirt-imageio patch with your setup?
....
> 3. Configure larger inactivity timeout
> 
> Add drop-in file to ovirt-imageio on both rhv-server and rhv-nodes
> 
>     $ cat /etc/ovirt-imageio/conf.d/99-bz2032324.conf 
>     [daemon]
>     inactivity_timeout = 600

I tried to update ovirt-imageio on the rhv node, but I can't find 99-bz2032324.conf in /etc/ovirt-imageio/conf.d after updating:

# unzip rpm-centos-8.zip 
# rpm -U ovirt-imageio-*.rpm
# rpm -qa |grep ovirt-imageio
ovirt-imageio-daemon-2.4.1-0.202112161451.gitd8eaabc.el8.x86_64
ovirt-imageio-common-debuginfo-2.4.1-0.202112161451.gitd8eaabc.el8.x86_64
ovirt-imageio-debugsource-2.4.1-0.202112161451.gitd8eaabc.el8.x86_64
ovirt-imageio-common-2.4.1-0.202112161451.gitd8eaabc.el8.x86_64
ovirt-imageio-client-2.4.1-0.202112161451.gitd8eaabc.el8.x86_64
# systemctl restart vdsmd 
# ls /etc/ovirt-imageio/conf.d/
60-vdsm.conf
# cat /etc/ovirt-imageio/conf.d/60-vdsm.conf |grep daemon
nothing

Comment 37 Nir Soffer 2021-12-17 16:42:59 UTC
(In reply to mxie from comment #36)
> > Can you reproduce the same issue when using "-oo rhv-direct"?
> > 
> > This option works only if you can access the RHV hypervisors from the host
> > performing the upload.
> 
> As rhv4.4 node is based on rhel8, the bug only can be reproduced with rhel9
> v2v, I can't try "-oo rhv-direct" on rhv node

virt-v2v in rhel 9 does not have the -oo rhv-direct option?

This option must be available, and it works when you run virt-v2v
on a host which is not a rhv hypervisor.

> (In reply to Nir Soffer from comment #35)
> > mxie, do you want to test the ovirt-imageio patch with your setup?
> ....
> > 3. Configure larger inactivity timeout
> > 
> > Add drop-in file to ovirt-imageio on both rhv-server and rhv-nodes
> > 
> >     $ cat /etc/ovirt-imageio/conf.d/99-bz2032324.conf 
> >     [daemon]
> >     inactivity_timeout = 600
> 
> I tried to update ovirt-imageio on rhv node, but I can't find
> 99-bz2032324.conf in etc/ovirt-imageio/conf.d after updating

Sorry if I was not clear enough. You need to create that file.

Comment 38 Richard W.M. Jones 2021-12-17 16:51:40 UTC
$ virt-v2v -o rhv-upload -oo \?
Output options (-oo) which can be used with -o rhv-upload:

  -oo rhv-cafile=CA.PEM           Set ‘ca.pem’ certificate bundle filename.
  -oo rhv-cluster=CLUSTERNAME     Set RHV cluster name.
  -oo rhv-direct[=true|false]     Use direct transfer mode (default: false).
[..]

You can either use -oo rhv-direct or -oo rhv-direct=true; both are
the same.  Omitting it means false.

Comment 39 Nir Soffer 2021-12-18 20:38:54 UTC
I posted fixes to virt-v2v here:
https://listman.redhat.com/archives/libguestfs/2021-December/msg00187.html

Comment 40 mxie@redhat.com 2021-12-20 16:17:19 UTC
(In reply to Nir Soffer from comment #37)
> > As rhv4.4 node is based on rhel8, the bug only can be reproduced with rhel9
> > v2v, I can't try "-oo rhv-direct" on rhv node
> 
> virt-v2v in rhel 9 does not have the -oo rhv_direct?

Sorry, I misunderstood. The v2v conversion still fails after adding the -oo rhv-direct option to the command line.

#  virt-v2v -ic vpx://root.73.141/data/10.73.196.89/?no_verify=1 -it vddk -io vddk-libdir=/home/vddk6.7 -io vddk-thumbprint=1F:97:34:5F:B6:C2:BA:66:46:CB:1A:71:76:7D:6B:50:1E:03:00:EA   -ip /home/passwd  -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api  -op /home/rhvpasswd  -os nfs_data -b ovirtmgmt esx6.5-win11-x86_64 -oo rhv-direct=true
......
nbdkit: python.0: error: /tmp/v2v.T9bb1M/rhv-upload-plugin.py: zero: error: Traceback (most recent call last):
   File "/tmp/v2v.T9bb1M/rhv-upload-plugin.py", line 242, in zero
    http.request("PATCH", url.path, body=buf, headers=headers)
   File "/usr/lib64/python3.9/http/client.py", line 1285, in request
    self._send_request(method, url, body, headers, encode_chunked)
   File "/usr/lib64/python3.9/http/client.py", line 1296, in _send_request
    self.putrequest(method, url, **skips)
   File "/usr/lib64/python3.9/http/client.py", line 1122, in putrequest
    raise CannotSendRequest(self.__state)
 http.client.CannotSendRequest: Request-sent

█ 100% [****************************************]
nbdkit: python.4: error: /tmp/v2v.T9bb1M/rhv-upload-plugin.py: flush: error: Traceback (most recent call last):
   File "/tmp/v2v.T9bb1M/rhv-upload-plugin.py", line 291, in flush
    http.request("PATCH", url.path, body=buf, headers=headers)
   File "/usr/lib64/python3.9/http/client.py", line 1285, in request
    self._send_request(method, url, body, headers, encode_chunked)
   File "/usr/lib64/python3.9/http/client.py", line 1296, in _send_request
    self.putrequest(method, url, **skips)
   File "/usr/lib64/python3.9/http/client.py", line 1122, in putrequest
    raise CannotSendRequest(self.__state)
 http.client.CannotSendRequest: Request-sent

nbd+unix:///?socket=/tmp/v2v.9zJub1/out0: nbd_flush: flush: command failed: Input/output error
virt-v2v: error: nbdcopy command failed, see earlier error messages

If reporting bugs, run virt-v2v with debugging enabled and include the 
complete output:

  virt-v2v -v -x [...]


> Sorry if I was not clear enough. You need to create that file.

The bug can't be reproduced after setting inactivity_timeout = 600 for the ovirt-imageio service on the rhv server and rhv node:

# cat /etc/ovirt-imageio/conf.d/99-bz2032324.conf
[daemon]
inactivity_timeout = 600

# systemctl restart ovirt-imageio

# virt-v2v -ic vpx://root.73.141/data/10.73.196.89/?no_verify=1 -it vddk -io vddk-libdir=/home/vddk6.7 -io vddk-thumbprint=1F:97:34:5F:B6:C2:BA:66:46:CB:1A:71:76:7D:6B:50:1E:03:00:EA   -ip /home/passwd  -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api  -op /home/rhvpasswd  -os nfs_data -b ovirtmgmt esx6.5-win11-x86_64 
[  25.8] Opening the source
nbdkit: vddk[3]: error: write reply: NBD_CMD_BLOCK_STATUS: Broken pipe[  32.3] Inspecting the source
[  37.4] Checking for sufficient free disk space in the guest
[  37.4] Converting Windows 10 Enterprise to run on KVM
virt-v2v: This guest has virtio drivers installed.
[  40.6] Mapping filesystem data to avoid copying unused and blank areas
[  42.2] Closing the overlay
[  42.3] Assigning disks to buses
[  42.3] Checking if the guest needs BIOS or UEFI to boot
[  42.3] Copying disk 1/1
▀  58% [************************----------------]

Comment 41 Nir Soffer 2021-12-20 22:35:12 UTC
(In reply to mxie from comment #40)
> (In reply to Nir Soffer from comment #37)
> > > As rhv4.4 node is based on rhel8, the bug only can be reproduced with rhel9
> > > v2v, I can't try "-oo rhv-direct" on rhv node
> > 
> > virt-v2v in rhel 9 does not have the -oo rhv_direct?
> 
> Sorry, I misunderstood,v2v conversion is still failed after adding -oo
> rhv_direct option to the command line

Yes, this is expected.

> > Sorry if I was not clear enough. You need to create that file.
> 
> The bug can't be reproduced after setting inactivity_timeout = 600 for
> ovirt-imageio service on rhv server and rhv node
> 
> # cat /etc/ovirt-imageio/conf.d/99-bz2032324.conf
> [daemon]
> inactivity_timeout = 600
> 
> # systemctl restart ovirt-imageio
> 
> # virt-v2v -ic vpx://root.73.141/data/10.73.196.89/?no_verify=1 -it
> vddk -io vddk-libdir=/home/vddk6.7 -io
> vddk-thumbprint=1F:97:34:5F:B6:C2:BA:66:46:CB:1A:71:76:7D:6B:50:1E:03:00:EA 
> -ip /home/passwd  -o rhv-upload -oc
> https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api  -op
> /home/rhvpasswd  -os nfs_data -b ovirtmgmt esx6.5-win11-x86_64 
> [  25.8] Opening the source
> nbdkit: vddk[3]: error: write reply: NBD_CMD_BLOCK_STATUS: Broken pipe[ 
> 32.3] Inspecting the source
> [  37.4] Checking for sufficient free disk space in the guest
> [  37.4] Converting Windows 10 Enterprise to run on KVM
> virt-v2v: This guest has virtio drivers installed.
> [  40.6] Mapping filesystem data to avoid copying unused and blank areas
> [  42.2] Closing the overlay
> [  42.3] Assigning disks to buses
> [  42.3] Checking if the guest needs BIOS or UEFI to boot
> [  42.3] Copying disk 1/1
> ▀  58% [************************----------------]

Thanks, this confirms that we understand the issue properly.

If you want to test with a more complete fix on the imageio side,
you can test this build:
https://github.com/oVirt/ovirt-imageio/actions/runs/1603922920

You still need to set a timeout on the host, unless you also test with the
latest ovirt-engine from the ovirt-release-master-snapshot repo:
https://copr.fedorainfracloud.org/coprs/ovirt/ovirt-master-snapshot/package/ovirt-engine/

With the latest engine, no configuration is needed on the host, since the
engine passes the inactivity_timeout configured by virt-v2v to the imageio
daemon, and the daemon configures the virt-v2v socket using this timeout.
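
As an illustration of that flow, a hedged sketch of creating an image
transfer with a custom inactivity timeout via the oVirt Python SDK; the URL,
credentials and disk id are placeholders, and the inactivity_timeout
attribute of types.ImageTransfer is assumed to be available in the SDK and
engine versions that carry the fix:

    # Hedged sketch: request a larger inactivity timeout when creating the
    # image transfer; engine forwards it to the imageio daemon serving it.
    # URL, credentials and disk id below are placeholders.
    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url="https://engine.example.com/ovirt-engine/api",
        username="admin@internal",
        password="password",
        ca_file="ca.pem",
    )

    transfers_service = connection.system_service().image_transfers_service()
    transfer = transfers_service.add(
        types.ImageTransfer(
            disk=types.Disk(id="DISK-UUID"),
            direction=types.ImageTransferDirection.UPLOAD,
            inactivity_timeout=600,   # seconds of allowed client inactivity
        )
    )

    connection.close()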

But to verify this bug, we should test with current and previous RHV
releases.

Comment 42 mxie@redhat.com 2021-12-22 10:29:47 UTC
> # cat /etc/ovirt-imageio/conf.d/99-bz2032324.conf
> [daemon]
> inactivity_timeout = 600
> 
> # systemctl restart ovirt-imageio

Sorry for not waiting for the v2v conversion to finish in comment 40. It seems inactivity_timeout = 600 is not enough.

# virt-v2v -ic vpx://root.73.141/data/10.73.196.89/?no_verify=1 -it vddk -io vddk-libdir=/home/vddk7.0.2 -io vddk-thumbprint=1F:97:34:5F:B6:C2:BA:66:46:CB:1A:71:76:7D:6B:50:1E:03:00:EA   -ip /home/passwd  -o rhv-upload -oc https://dell-per740-48.lab.eng.pek2.redhat.com/ovirt-engine/api  -op /home/rhvpasswd  -oo rhv-cluster=NFS -os nfs_data -b ovirtmgmt esx6.5-win11-x86_64
[  25.3] Opening the source
[  32.1] Inspecting the source
[  37.3] Checking for sufficient free disk space in the guest
[  37.3] Converting Windows 10 Enterprise to run on KVM
virt-v2v: This guest has virtio drivers installed.
[  40.8] Mapping filesystem data to avoid copying unused and blank areas
[  42.6] Closing the overlay
[  42.7] Assigning disks to buses
[  42.7] Checking if the guest needs BIOS or UEFI to boot
[  42.7] Copying disk 1/1
█ 100% [****************************************]
[1793.4] Creating output metadata
Traceback (most recent call last):
  File "/tmp/v2v.eA3Po8/rhv-upload-finalize.py", line 172, in <module>
    finalize_transfer(connection, transfer_id, disk_id)
  File "/tmp/v2v.eA3Po8/rhv-upload-finalize.py", line 130, in finalize_transfer
    raise RuntimeError(
RuntimeError: timed out waiting for transfer 97986f30-9ba0-4c4d-8755-ebd93a4833e4 to finalize, transfer is finalizing_success
virt-v2v: error: failed to finalize the transfers, see earlier errors

If reporting bugs, run virt-v2v with debugging enabled and include the 
complete output:

  virt-v2v -v -x [...]

Comment 43 Nir Soffer 2021-12-22 11:06:10 UTC
(In reply to mxie from comment #42)
> > # cat /etc/ovirt-imageio/conf.d/99-bz2032324.conf
> > [daemon]
> > inactivity_timeout = 600
> > 
> > # systemctl restart ovirt-imageio
> 
> Sorry for not waiting for v2v conversion to finish in commment40, Seems
> inactivity_timeout=600 is not enough 
> 
> # virt-v2v -ic vpx://root.73.141/data/10.73.196.89/?no_verify=1 -it
> vddk -io vddk-libdir=/home/vddk7.0.2 -io
> vddk-thumbprint=1F:97:34:5F:B6:C2:BA:66:46:CB:1A:71:76:7D:6B:50:1E:03:00:EA 
> -ip /home/passwd  -o rhv-upload -oc
> https://dell-per740-48.lab.eng.pek2.redhat.com/ovirt-engine/api  -op
> /home/rhvpasswd  -oo rhv-cluster=NFS -os nfs_data -b ovirtmgmt
> esx6.5-win11-x86_64
> [  25.3] Opening the source
> [  32.1] Inspecting the source
> [  37.3] Checking for sufficient free disk space in the guest
> [  37.3] Converting Windows 10 Enterprise to run on KVM
> virt-v2v: This guest has virtio drivers installed.
> [  40.8] Mapping filesystem data to avoid copying unused and blank areas
> [  42.6] Closing the overlay
> [  42.7] Assigning disks to buses
> [  42.7] Checking if the guest needs BIOS or UEFI to boot
> [  42.7] Copying disk 1/1
> █ 100% [****************************************]
> [1793.4] Creating output metadata
> Traceback (most recent call last):
>   File "/tmp/v2v.eA3Po8/rhv-upload-finalize.py", line 172, in <module>
>     finalize_transfer(connection, transfer_id, disk_id)
>   File "/tmp/v2v.eA3Po8/rhv-upload-finalize.py", line 130, in
> finalize_transfer
>     raise RuntimeError(
> RuntimeError: timed out waiting for transfer
> 97986f30-9ba0-4c4d-8755-ebd93a4833e4 to finalize, transfer is
> finalizing_success
> virt-v2v: error: failed to finalize the transfers, see earlier errors

This is another regression in virt-v2v - it tries to finalize the transfer
*before* closing the connections to imageio, and imageio refuses to remove
a ticket while it has connected clients. Previous versions of virt-v2v
finalized the transfer *after* closing the connections to imageio.

This is fixed in ovirt-imageio upstream - it will time out only if there
are in-flight requests blocked on storage.

    https://github.com/oVirt/ovirt-imageio/commit/e4d08ceb3961ebda64616992367e3270c1512e26

But this version will be available only in the next RHV version, likely
4.4.11.

This must be fixed in virt-v2v by closing the nbdkit output *before*
finalizing the transfer if we want to be compatible with current and previous
RHV versions.
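
A minimal sketch of the ordering described above, with illustrative
placeholder names (finish_upload, http_connections, transfers_service) rather
than the real helpers in the v2v output scripts; the finalize call follows
the SDK pattern assumed in the earlier sketch:

    # Close every imageio connection *before* finalizing, so that imageio in
    # current/previous RHV releases does not refuse to remove the ticket.
    def finish_upload(http_connections, transfers_service, transfer_id):
        # 1. No client may still hold the ticket when finalize is requested.
        for conn in http_connections:
            conn.close()

        # 2. Only then ask engine to finalize the transfer.
        transfer_service = transfers_service.image_transfer_service(transfer_id)
        transfer_service.finalize()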

If you want to test RHV with the ovirt-imageio fix, you can install this
build:
https://github.com/oVirt/ovirt-imageio/suites/4717512189/artifacts/129622181

Comment 44 Richard W.M. Jones 2021-12-22 11:12:32 UTC
> This is another regression in virt-v2v - it tries to finalize the transfer
> *before* closing the connections to imageio, and this imageio refuse to
> remove a ticket when it has connected clients. Previous versions of virt-v2v
> finalized the transfer *after* closing the connections to imageio.

Can someone file a new bug about this? I will fix it in the new year, but
we shouldn't forget about it in the meantime.

Comment 45 tingting zheng 2021-12-22 12:47:20 UTC
(In reply to Richard W.M. Jones from comment #44)
> > This is another regression in virt-v2v - it tries to finalize the transfer
> > *before* closing the connections to imageio, and this imageio refuse to
> > remove a ticket when it has connected clients. Previous versions of virt-v2v
> > finalized the transfer *after* closing the connections to imageio.
> 
> Can someone file a new bug about this - will fix it in the new year, but
> we shouldn't forget about it in the meantime.

I will ask Ming Xie to file a new bug for that.

Comment 46 mxie@redhat.com 2021-12-23 05:38:43 UTC
(In reply to Richard W.M. Jones from comment #44)
> > This is another regression in virt-v2v - it tries to finalize the transfer
> > *before* closing the connections to imageio, and this imageio refuse to
> > remove a ticket when it has connected clients. Previous versions of virt-v2v
> > finalized the transfer *after* closing the connections to imageio.
> 
> Can someone file a new bug about this - will fix it in the new year, but
> we shouldn't forget about it in the meantime.

Filed bug 2035154 - virt-v2v wrongly tries to finalize the transfer before closing the connections to imageio.

Comment 47 Vera 2022-02-11 09:20:04 UTC
Verified with the version: 
virt-v2v-1.45.98-1.el9.x86_64
libguestfs-1.46.1-2.el9.x86_64
guestfs-tools-1.46.1-6.el9.x86_64
nbdkit-1.28.5-1.el9.x86_64
qemu-img-6.2.0-7.el9.x86_64
libvirt-libs-8.0.0-4.el9.x86_64

Steps:
1. Convert a guest from ESXi6.5 to rhv via vddk6.7 with virt-v2v; the v2v conversion is successful.
# virt-v2v -ic vpx://root.73.141/data/10.73.196.89/?no_verify=1 -it vddk -io vddk-libdir=/vddk6.7/vmware-vix-disklib-distrib -io vddk-thumbprint=1F:97:34:5F:B6:C2:BA:66:46:CB:1A:71:76:7D:6B:50:1E:03:00:EA   -ip /v2v-ops/esxpw  -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api  -op /v2v-ops/rhvpasswd  -os nfs_data -b ovirtmgmt esx6.5-win2019-x86_64
[   0.0] Setting up the source: -i libvirt -ic vpx://root.73.141/data/10.73.196.89/?no_verify=1 -it vddk esx6.5-win2019-x86_64
[   1.7] Opening the source
[  12.5] Inspecting the source
[  20.8] Checking for sufficient free disk space in the guest
[  20.8] Converting Windows Server 2019 Standard to run on KVM
virt-v2v: This guest has virtio drivers installed.
[  31.4] Mapping filesystem data to avoid copying unused and blank areas
[  32.2] Closing the overlay
[  32.5] Assigning disks to buses
[  32.5] Checking if the guest needs BIOS or UEFI to boot
[  32.5] Setting up the destination: -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api -os nfs_data
[  55.1] Copying disk 1/1
█ 100% [****************************************]
[1858.9] Creating output metadata
[1866.3] Finishing off


2. Convert a guest from ESXi6.5 to rhv via vddk7.0.2 with virt-v2v; the v2v conversion is successful.
# virt-v2v -ic vpx://root.73.141/data/10.73.196.89/?no_verify=1 -it vddk -io vddk-libdir=/root/vddk_libdir/vddklib_7 -io vddk-thumbprint=1F:97:34:5F:B6:C2:BA:66:46:CB:1A:71:76:7D:6B:50:1E:03:00:EA -ip /v2v-ops/esxpw   -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api  -op /v2v-ops/rhvpasswd  -os nfs_data -b ovirtmgmt esx6.5-rhel8.5-x86_64
[   0.0] Setting up the source: -i libvirt -ic vpx://root.73.141/data/10.73.196.89/?no_verify=1 -it vddk esx6.5-rhel8.5-x86_64
[   2.0] Opening the source
[  10.6] Inspecting the source
[  27.7] Checking for sufficient free disk space in the guest
[  27.7] Converting Red Hat Enterprise Linux 8.5 (Ootpa) to run on KVM
virt-v2v: This guest has virtio drivers installed.
[ 154.9] Mapping filesystem data to avoid copying unused and blank areas
[ 155.5] Closing the overlay
[ 155.7] Assigning disks to buses
[ 155.7] Checking if the guest needs BIOS or UEFI to boot
[ 155.7] Setting up the destination: -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api -os nfs_data
[ 178.3] Copying disk 1/1
█ 100% [****************************************]
[2445.3] Creating output metadata
[2450.7] Finishing off

Moving to Verified:Tested.

Comment 50 Vera 2022-02-15 08:39:40 UTC
Verified with the version: 
libguestfs-1.46.1-2.el9.x86_64
guestfs-tools-1.46.1-6.el9.x86_64
nbdkit-1.28.5-1.el9.x86_64
libvirt-libs-8.0.0-4.el9.x86_64
virt-v2v-1.45.98-1.el9.x86_64
qemu-img-6.2.0-8.el9.x86_64


Steps:
1. Convert a guest from ESXi6.5 to rhv via vddk6.7 with virt-v2v; the v2v conversion is successful.
# virt-v2v -ic vpx://root.73.141/data/10.73.196.89/?no_verify=1 -it vddk -io vddk-libdir=/vddk6.7/vmware-vix-disklib-distrib -io vddk-thumbprint=1F:97:34:5F:B6:C2:BA:66:46:CB:1A:71:76:7D:6B:50:1E:03:00:EA   -ip /v2v-ops/esxpw  -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api  -op /v2v-ops/rhvpasswd  -os nfs_data -b ovirtmgmt esx6.5-win2022-x86_64
[   0.0] Setting up the source: -i libvirt -ic vpx://root.73.141/data/10.73.196.89/?no_verify=1 -it vddk esx6.5-win2022-x86_64
[   1.8] Opening the source
[  10.4] Inspecting the source
[  18.5] Checking for sufficient free disk space in the guest
[  18.5] Converting Windows Server 2022 Standard to run on KVM
virt-v2v: This guest has virtio drivers installed.
[  26.8] Mapping filesystem data to avoid copying unused and blank areas
[  28.0] Closing the overlay
[  28.2] Assigning disks to buses
[  28.2] Checking if the guest needs BIOS or UEFI to boot
[  28.2] Setting up the destination: -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api -os nfs_data
[  42.6] Copying disk 1/1
█ 100% [****************************************]
[1392.6] Creating output metadata
[1403.0] Finishing off


2. Convert a guest from ESXi6.5 to rhv via vddk7.0.2 with virt-v2v; the v2v conversion is successful.
# virt-v2v -ic vpx://root.73.141/data/10.73.196.89/?no_verify=1 -it vddk -io vddk-libdir=/root/vddk_libdir/vddklib_7 -io vddk-thumbprint=1F:97:34:5F:B6:C2:BA:66:46:CB:1A:71:76:7D:6B:50:1E:03:00:EA -ip /v2v-ops/esxpw   -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api  -op /v2v-ops/rhvpasswd  -os nfs_data -b ovirtmgmt esx6.5-rhel8.5-x86_64
[   0.0] Setting up the source: -i libvirt -ic vpx://root.73.141/data/10.73.196.89/?no_verify=1 -it vddk esx6.5-rhel8.5-x86_64
[   2.0] Opening the source
[  10.2] Inspecting the source
[  26.0] Checking for sufficient free disk space in the guest
[  26.0] Converting Red Hat Enterprise Linux 8.5 (Ootpa) to run on KVM
virt-v2v: This guest has virtio drivers installed.
[ 145.4] Mapping filesystem data to avoid copying unused and blank areas
[ 145.9] Closing the overlay
[ 146.2] Assigning disks to buses
[ 146.2] Checking if the guest needs BIOS or UEFI to boot
[ 146.2] Setting up the destination: -o rhv-upload -oc https://dell-per740-22.lab.eng.pek2.redhat.com/ovirt-engine/api -os nfs_data
[ 170.1] Copying disk 1/1
█ 100% [****************************************]
[2301.4] Creating output metadata
[2311.8] Finishing off


Moving to Verified.

Comment 52 errata-xmlrpc 2022-05-17 13:41:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (new packages: virt-v2v), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:2566