Bug 1167774

Summary: virt-p2v fails with error:"nbd.c:nbd_receive_negotiate():L501: read failed"
Product: Red Hat Enterprise Linux 7 Reporter: tingting zheng <tzheng>
Component: libguestfsAssignee: John Eckersberg <jeckersb>
Status: CLOSED CURRENTRELEASE QA Contact: Virtualization Bugs <virt-bugs>
Severity: high Docs Contact:
Priority: high    
Version: 7.1CC: dyuan, juzhou, mbooth, mzhan, nicolas, ptoscano, rjones, xiaodwan
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: P2V
Fixed In Version: libguestfs-1.28.1-1.18.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-09-07 19:23:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1168626    
Attachments:
Description Flags
virt-p2v log none

Description tingting zheng 2014-11-25 11:46:15 UTC
Created attachment 961170 [details]
virt-p2v log

Description
virt-p2v fails with error:"nbd.c:nbd_receive_negotiate():L501: read failed" 

Version:
virt-v2v-1.28.1-1.13.el7.x86_64
livecd-p2v-201411201732.iso

How reproducible:
100%

Steps to Reproduce:
1.Launch virt-p2v boot GUI via iso or virt-p2v image.

2.Use virt-p2v to convert a host with rhel6.6 installed.

3.The conversion fails with the below error:
[   0.0] Creating an overlay to protect the source from being modified
qemu-img create -q -f qcow2 -b 'nbd:localhost:46672' -o 'compat=1.1,backing_fmt=raw' /var/tmp/v2vovl464e42.qcow2
nbd.c:nbd_receive_negotiate():L501: read failed
qemu-img: /var/tmp/v2vovl464e42.qcow2: Could not open 'nbd:localhost:46672': Could not open image: Invalid argument: Invalid argument
virt-v2v: error: qemu-img command failed, see earlier errors

Actual results:
As describes.

Expected results:
virt-p2v can convert physcial host successfully.

Additional info:
Attach the log file.

Comment 2 Richard W.M. Jones 2014-11-25 13:50:10 UTC
Independently reported on the mailing list:

https://www.redhat.com/archives/libguestfs/2014-November/msg00126.html

Comment 3 Richard W.M. Jones 2014-11-25 16:40:01 UTC
Background:

First read: http://libguestfs.org/virt-p2v.1.html#how-virt-p2v-works

My theory is this is a race condition.  In p2v/conversion.c we
call open_data_connection to open each nbd-over-ssh connection.

Then we start qemu-nbd (why did I do this afterwards???)

Then we start the conversion.

At no point do we wait for the data connections & qemu-nbd to be
running and ready to accept connections before we start the
conversion, so presumably on a sufficiently fast machine this
could fail.

I have not been able to reproduce the bug, but two independent
people have so we have a high confidence there's a real bug here.

Comment 4 John Eckersberg 2014-11-26 17:43:59 UTC
Reproduced on my laptop, with everything built from the 1.29.10 tag:

[   0.0] Creating an overlay to protect the source from being modified
qemu-img create -q -f qcow2 -b 'nbd:localhost:60263' -o 'compat=1.1,backing_fmt=raw' /var/tmp/v2vovl83cdd0.qcow2
nbd.c:nbd_receive_negotiate():L449: read failed
qemu-img: Could not open 'nbd:localhost:60263': Invalid argument
virt-v2v: error: qemu-img command failed, see earlier errors

Comment 7 Richard W.M. Jones 2014-12-09 15:45:52 UTC
Setting back to ASSIGNED.  I believe the fix is incomplete
because I can see a different error (quite rarely):

Failed to bind socket: Address already in use
virt-p2v: error during conversion: waiting for qemu-nbd to start: recv: Resource temporarily unavailable

Comment 8 John Eckersberg 2014-12-11 16:17:11 UTC
Here's what's happening when things go wrong.

- There's a really weird established connection with both a source and destination port of 50123, if ss is to be believed:

$ ss -antp sport = :50123 or dport = :50123
State       Recv-Q Send-Q        Local Address:Port          Peer Address:Port 
ESTAB       0      0                 127.0.0.1:50123            127.0.0.1:50123  users:(("virt-p2v",pid=9992,fd=6))

- Nothing is actually listening on 50123:
$ ss -nlt | grep 50123 | wc -l
0

- And I can't initiate a new connection:

$ socat tcp:127.0.0.1:50123 -
2014/12/11 11:10:00 socat[19383] E connect(3, AF=2 127.0.0.1:50123, 16): Connection refused

- But I also can't bind to it either:

$ socat tcp-listen:50123,bind=127.0.0.1 -
2014/12/11 11:10:45 socat[20228] E bind(3, {AF=2 127.0.0.1:50123}, 16): Address already in use


Because it's preventing binding on 127.0.0.1:50123, qemu-nbd fails with the error above of "Failed to bind socket: Address already in use".  Since recv() is being called on a socket connected to who knows where, it hits the 10 second receive timeout and fails with -EAGAIN.

I'm still not sure how it's getting into the weird state in the first place.  The rest makes sense.

Comment 9 John Eckersberg 2014-12-12 16:42:52 UTC
(In reply to John Eckersberg from comment #8)
> Here's what's happening when things go wrong.
> 
> - There's a really weird established connection with both a source and
> destination port of 50123, if ss is to be believed:
> 
> $ ss -antp sport = :50123 or dport = :50123
> State       Recv-Q Send-Q        Local Address:Port          Peer
> Address:Port 
> ESTAB       0      0                 127.0.0.1:50123           
> 127.0.0.1:50123  users:(("virt-p2v",pid=9992,fd=6))
> 
> - Nothing is actually listening on 50123:
> $ ss -nlt | grep 50123 | wc -l
> 0
> 
> - And I can't initiate a new connection:
> 
> $ socat tcp:127.0.0.1:50123 -
> 2014/12/11 11:10:00 socat[19383] E connect(3, AF=2 127.0.0.1:50123, 16):
> Connection refused
> 
> - But I also can't bind to it either:
> 
> $ socat tcp-listen:50123,bind=127.0.0.1 -
> 2014/12/11 11:10:45 socat[20228] E bind(3, {AF=2 127.0.0.1:50123}, 16):
> Address already in use
> 
> 
> Because it's preventing binding on 127.0.0.1:50123, qemu-nbd fails with the
> error above of "Failed to bind socket: Address already in use".  Since
> recv() is being called on a socket connected to who knows where, it hits the
> 10 second receive timeout and fails with -EAGAIN.
> 
> I'm still not sure how it's getting into the weird state in the first place.
> The rest makes sense.

I understand this now.  When calling connect() with an unbound socket, the kernel will find an available port to use as the source port.  This problem occurs when the kernel assigns 50123 as the source port, and then we try to connect() also to 50123.  This ends up in a special case of TCP handshake[1] where we end up skipping any sort of LISTEN and go almost directly to ESTABLISHED with both ends of the connection being the same socket.  The timeout is set to 10 seconds, and calling recv() on the socket which we have not written any data into blocks for the 10 seconds and returns with -EAGAIN.

I have a fix forthcoming, which is just to coerce the chosen source port to be something other than 50123.

[1] http://www.ietf.org/rfc/rfc793.txt, search "Simultaneous initiation"

Comment 10 Richard W.M. Jones 2014-12-15 18:50:02 UTC
The further fix went upstream as
09080a28878a17b991c812e2b93a8d8394383d04

Comment 14 Richard W.M. Jones 2014-12-16 10:09:56 UTC
http://oirase.annexia.org/virt-p2v/
livecd-p2v-201412160442-* are the latest versions of p2v, based on
Fedora 21.

I did a very quick test and it appears to function, at least
as far as starting the conversion.

Comment 15 tingting zheng 2014-12-18 08:24:32 UTC
Tested with:
libguestfs-1.28.1-1.18.el7.x86_64
virt-v2v-1.28.1-1.18.el7.x86_64
livecd-p2v-201412160442.iso

The host can be converted successfully.

btw,compared with old version of virt-p2v,the new version can be much more fast and less space used,amazing tool!

The host with 150G disk will only use less than 6G after conversion.
# ll -lsh
total 5.5G
5.5G -rw-r--r--. 1 root root 150G Dec 18 03:01 p2v-laptop-sda

Comment 16 Nicolas Ecarnot 2015-01-07 14:46:08 UTC
I confirm I hit this bug today, with many successful P2V, and many other failing, for the reason declared in this BZ.

All the attempt were made in the same conditions, with the firewall DISABLED on the p2v server, and many sshd_config flags set to yes (see the mailing list thread cited above).

I downloaded and tested the version cited above (201412160442) and two concurrent P2V are running OK at the time I'm writing.

That may be too soon to say it's OK, but this sounds good.

Thank you Richard.

-- 
Nicolas Ecarnot

Comment 17 John Eckersberg 2015-01-07 15:04:35 UTC
(In reply to Nicolas Ecarnot from comment #16)
> I confirm I hit this bug today, with many successful P2V, and many other
> failing, for the reason declared in this BZ.
> 
> All the attempt were made in the same conditions, with the firewall DISABLED
> on the p2v server, and many sshd_config flags set to yes (see the mailing
> list thread cited above).
> 
> I downloaded and tested the version cited above (201412160442) and two
> concurrent P2V are running OK at the time I'm writing.
> 
> That may be too soon to say it's OK, but this sounds good.
> 
> Thank you Richard.
> 
> -- 
> Nicolas Ecarnot

Thanks for the feedback, hopefully it continues to solve the problem for you.  Certainly let us know either way, if it continues to work for you after some time, or if you see the error again.  More data is always good!

Comment 19 tingting zheng 2015-05-29 06:36:34 UTC
Tested with:
libguestfs-1.28.1-1.36.el7.x86_64
virt-v2v-1.28.1-1.36.el7.x86_64
livecd-p2v-201505191242.iso	

I installed rhel6.7 system on host and it can be converted successfully.

After conversion,the guest can be booted successfully.

Refer to the above comments,move the bug to VERIFIED.