Bug 1167774
Summary: | virt-p2v fails with error:"nbd.c:nbd_receive_negotiate():L501: read failed" | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | tingting zheng <tzheng> | ||||
Component: | libguestfs | Assignee: | John Eckersberg <jeckersb> | ||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Virtualization Bugs <virt-bugs> | ||||
Severity: | high | Docs Contact: | |||||
Priority: | high | ||||||
Version: | 7.1 | CC: | dyuan, juzhou, mbooth, mzhan, nicolas, ptoscano, rjones, xiaodwan | ||||
Target Milestone: | rc | ||||||
Target Release: | --- | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Whiteboard: | P2V | ||||||
Fixed In Version: | libguestfs-1.28.1-1.18.el7 | Doc Type: | If docs needed, set a value | ||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2018-09-07 19:23:02 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 1168626 | ||||||
Attachments: |
|
Independently reported on the mailing list: https://www.redhat.com/archives/libguestfs/2014-November/msg00126.html Background: First read: http://libguestfs.org/virt-p2v.1.html#how-virt-p2v-works My theory is this is a race condition. In p2v/conversion.c we call open_data_connection to open each nbd-over-ssh connection. Then we start qemu-nbd (why did I do this afterwards???) Then we start the conversion. At no point do we wait for the data connections & qemu-nbd to be running and ready to accept connections before we start the conversion, so presumably on a sufficiently fast machine this could fail. I have not been able to reproduce the bug, but two independent people have so we have a high confidence there's a real bug here. Reproduced on my laptop, with everything built from the 1.29.10 tag: [ 0.0] Creating an overlay to protect the source from being modified qemu-img create -q -f qcow2 -b 'nbd:localhost:60263' -o 'compat=1.1,backing_fmt=raw' /var/tmp/v2vovl83cdd0.qcow2 nbd.c:nbd_receive_negotiate():L449: read failed qemu-img: Could not open 'nbd:localhost:60263': Invalid argument virt-v2v: error: qemu-img command failed, see earlier errors Upstream commit: https://github.com/libguestfs/libguestfs/commit/33098d23020bd3824a2954823a0dbaff751c814d Setting back to ASSIGNED. I believe the fix is incomplete because I can see a different error (quite rarely): Failed to bind socket: Address already in use virt-p2v: error during conversion: waiting for qemu-nbd to start: recv: Resource temporarily unavailable Here's what's happening when things go wrong. - There's a really weird established connection with both a source and destination port of 50123, if ss is to be believed: $ ss -antp sport = :50123 or dport = :50123 State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 0 127.0.0.1:50123 127.0.0.1:50123 users:(("virt-p2v",pid=9992,fd=6)) - Nothing is actually listening on 50123: $ ss -nlt | grep 50123 | wc -l 0 - And I can't initiate a new connection: $ socat tcp:127.0.0.1:50123 - 2014/12/11 11:10:00 socat[19383] E connect(3, AF=2 127.0.0.1:50123, 16): Connection refused - But I also can't bind to it either: $ socat tcp-listen:50123,bind=127.0.0.1 - 2014/12/11 11:10:45 socat[20228] E bind(3, {AF=2 127.0.0.1:50123}, 16): Address already in use Because it's preventing binding on 127.0.0.1:50123, qemu-nbd fails with the error above of "Failed to bind socket: Address already in use". Since recv() is being called on a socket connected to who knows where, it hits the 10 second receive timeout and fails with -EAGAIN. I'm still not sure how it's getting into the weird state in the first place. The rest makes sense. (In reply to John Eckersberg from comment #8) > Here's what's happening when things go wrong. > > - There's a really weird established connection with both a source and > destination port of 50123, if ss is to be believed: > > $ ss -antp sport = :50123 or dport = :50123 > State Recv-Q Send-Q Local Address:Port Peer > Address:Port > ESTAB 0 0 127.0.0.1:50123 > 127.0.0.1:50123 users:(("virt-p2v",pid=9992,fd=6)) > > - Nothing is actually listening on 50123: > $ ss -nlt | grep 50123 | wc -l > 0 > > - And I can't initiate a new connection: > > $ socat tcp:127.0.0.1:50123 - > 2014/12/11 11:10:00 socat[19383] E connect(3, AF=2 127.0.0.1:50123, 16): > Connection refused > > - But I also can't bind to it either: > > $ socat tcp-listen:50123,bind=127.0.0.1 - > 2014/12/11 11:10:45 socat[20228] E bind(3, {AF=2 127.0.0.1:50123}, 16): > Address already in use > > > Because it's preventing binding on 127.0.0.1:50123, qemu-nbd fails with the > error above of "Failed to bind socket: Address already in use". Since > recv() is being called on a socket connected to who knows where, it hits the > 10 second receive timeout and fails with -EAGAIN. > > I'm still not sure how it's getting into the weird state in the first place. > The rest makes sense. I understand this now. When calling connect() with an unbound socket, the kernel will find an available port to use as the source port. This problem occurs when the kernel assigns 50123 as the source port, and then we try to connect() also to 50123. This ends up in a special case of TCP handshake[1] where we end up skipping any sort of LISTEN and go almost directly to ESTABLISHED with both ends of the connection being the same socket. The timeout is set to 10 seconds, and calling recv() on the socket which we have not written any data into blocks for the 10 seconds and returns with -EAGAIN. I have a fix forthcoming, which is just to coerce the chosen source port to be something other than 50123. [1] http://www.ietf.org/rfc/rfc793.txt, search "Simultaneous initiation" The further fix went upstream as 09080a28878a17b991c812e2b93a8d8394383d04 http://oirase.annexia.org/virt-p2v/ livecd-p2v-201412160442-* are the latest versions of p2v, based on Fedora 21. I did a very quick test and it appears to function, at least as far as starting the conversion. Tested with: libguestfs-1.28.1-1.18.el7.x86_64 virt-v2v-1.28.1-1.18.el7.x86_64 livecd-p2v-201412160442.iso The host can be converted successfully. btw,compared with old version of virt-p2v,the new version can be much more fast and less space used,amazing tool! The host with 150G disk will only use less than 6G after conversion. # ll -lsh total 5.5G 5.5G -rw-r--r--. 1 root root 150G Dec 18 03:01 p2v-laptop-sda I confirm I hit this bug today, with many successful P2V, and many other failing, for the reason declared in this BZ. All the attempt were made in the same conditions, with the firewall DISABLED on the p2v server, and many sshd_config flags set to yes (see the mailing list thread cited above). I downloaded and tested the version cited above (201412160442) and two concurrent P2V are running OK at the time I'm writing. That may be too soon to say it's OK, but this sounds good. Thank you Richard. -- Nicolas Ecarnot (In reply to Nicolas Ecarnot from comment #16) > I confirm I hit this bug today, with many successful P2V, and many other > failing, for the reason declared in this BZ. > > All the attempt were made in the same conditions, with the firewall DISABLED > on the p2v server, and many sshd_config flags set to yes (see the mailing > list thread cited above). > > I downloaded and tested the version cited above (201412160442) and two > concurrent P2V are running OK at the time I'm writing. > > That may be too soon to say it's OK, but this sounds good. > > Thank you Richard. > > -- > Nicolas Ecarnot Thanks for the feedback, hopefully it continues to solve the problem for you. Certainly let us know either way, if it continues to work for you after some time, or if you see the error again. More data is always good! Tested with: libguestfs-1.28.1-1.36.el7.x86_64 virt-v2v-1.28.1-1.36.el7.x86_64 livecd-p2v-201505191242.iso I installed rhel6.7 system on host and it can be converted successfully. After conversion,the guest can be booted successfully. Refer to the above comments,move the bug to VERIFIED. |
Created attachment 961170 [details] virt-p2v log Description virt-p2v fails with error:"nbd.c:nbd_receive_negotiate():L501: read failed" Version: virt-v2v-1.28.1-1.13.el7.x86_64 livecd-p2v-201411201732.iso How reproducible: 100% Steps to Reproduce: 1.Launch virt-p2v boot GUI via iso or virt-p2v image. 2.Use virt-p2v to convert a host with rhel6.6 installed. 3.The conversion fails with the below error: [ 0.0] Creating an overlay to protect the source from being modified qemu-img create -q -f qcow2 -b 'nbd:localhost:46672' -o 'compat=1.1,backing_fmt=raw' /var/tmp/v2vovl464e42.qcow2 nbd.c:nbd_receive_negotiate():L501: read failed qemu-img: /var/tmp/v2vovl464e42.qcow2: Could not open 'nbd:localhost:46672': Could not open image: Invalid argument: Invalid argument virt-v2v: error: qemu-img command failed, see earlier errors Actual results: As describes. Expected results: virt-p2v can convert physcial host successfully. Additional info: Attach the log file.