Bug 1202453 - libvirt tunnelled migration fails with "migration job: unexpectedly failed"
Summary: libvirt tunnelled migration fails with "migration job: unexpectedly failed"
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Virtualization Tools
Classification: Community
Component: libvirt
Version: unspecified
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Jiri Denemark
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-03-16 16:27 UTC by Lukas Vacek
Modified: 2016-05-02 14:34 UTC (History)
6 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2016-05-02 14:34:28 UTC
Embargoed:


Attachments
domain.xml (4.62 KB, text/plain)
2015-03-16 16:27 UTC, Lukas Vacek
destination libvirtd log (522 bytes, text/plain)
2015-03-16 16:30 UTC, Lukas Vacek
source node libvirtd log (985 bytes, text/plain)
2015-03-16 16:30 UTC, Lukas Vacek
destination libvirt/qemu/guest.log (2.70 KB, text/plain)
2015-03-16 16:33 UTC, Lukas Vacek
libvirtd source log log_filters="3:rpc 3:remote 3:util.json 3:util.event 3:node_device 3:util.object 3:util.netlink 3:access" (2.58 MB, application/x-gzip)
2015-03-31 13:48 UTC, Lukas Vacek
libvirtd destination log log_filters="3:rpc 3:remote 3:util.json 3:util.event 3:node_device 3:util.object 3:util.netlink 3:access" (3.57 MB, application/x-gzip)
2015-03-31 13:48 UTC, Lukas Vacek
new qemu/guest.log on source (45 bytes, text/plain)
2015-03-31 13:51 UTC, Lukas Vacek
new qemu/guest.log on destination (4.91 KB, text/plain)
2015-03-31 13:52 UTC, Lukas Vacek

Description Lukas Vacek 2015-03-16 16:27:06 UTC
Created attachment 1002391 [details]
domain.xml

There is a bug in libvirt (built from current master: 51f9f03a4ca50b070c0fbfb29748d49f583e15e1) when live migrating a VM with large storage attached to it - the migration fails with "error: operation failed: migration job: unexpectedly failed". I'm not sure what the threshold for the storage size is to trigger the bug, but a guest with 30GB storage fails to migrate in our test lab. This only happens when the --tunnelled parameter is passed to "virsh migrate".

# virsh migrate --live --p2p --copy-storage-inc --tunnelled ubuntuutopic "qemu+tcp://lab5/system"
error: operation failed: migration job: unexpectedly failed

#on the other hand, this WORKS OK:
virsh migrate --live --p2p --copy-storage-inc ubuntuutopic "qemu+tcp://lab5/system"

libvirt: current master - 51f9f03a4ca50b070c0fbfb29748d49f583e15e1
qemu: 2.0.0+dfsg-2ubuntu1.10
linux kernel: 3.13.0-46-generic #79-Ubuntu

Versions are same on both boxes.

libvirtd.conf only changed to listen on TCP and not to require authentication.

Logs and the domain xml attached.

Steps to Reproduce:
1. create a new domain on host1 (if you can't reproduce, you might need to create a domain with bigger storage)
2. setup host2 - precreate an empty qcow2 disk in the corresponding location, change libvirtd config to listen on tcp port
3. run "virsh migrate --live --p2p --copy-storage-inc --tunnelled GUEST_VM qemu+tcp://host2/system" on host1
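The host2 setup in step 2 can be sketched roughly as follows; the image path and size are illustrative assumptions, not taken from the report:

```shell
# On host2 (destination): precreate an empty qcow2 image at the same
# path the domain XML on host1 uses. Path and size are examples.
qemu-img create -f qcow2 /var/lib/libvirt/images/GUEST_VM.qcow2 30G

# Make libvirtd listen on TCP without authentication (trusted lab only).
# Append to /etc/libvirt/libvirtd.conf:
#   listen_tls = 0
#   listen_tcp = 1
#   auth_tcp = "none"
# and start the daemon with the --listen argument.
```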

Actual results:
error: operation failed: migration job: unexpectedly failed


Expected results:
migration succeeds just like when --tunnelled is not used

Domain and logs attached.

Comment 1 Lukas Vacek 2015-03-16 16:30:12 UTC
Created attachment 1002394 [details]
destination libvirtd log

Comment 2 Lukas Vacek 2015-03-16 16:30:57 UTC
Created attachment 1002395 [details]
source node libvirtd log

Comment 3 Lukas Vacek 2015-03-16 16:33:57 UTC
Created attachment 1002397 [details]
destination libvirt/qemu/guest.log

Comment 4 Jiri Denemark 2015-03-16 16:57:28 UTC
As the error message from the source daemon suggests, the reason is a different way of transferring disk images with p2p vs. tunnelled migration. The preferred way is to use NBD, but that is unfortunately impossible with tunnelled migration, so libvirt falls back to the old way of storage migration. I'm not sure how well this older method is supported by the QEMU community, but you can try to raise the issue with them. There doesn't seem to be any bug in libvirt here, except for the lack of NBD support with tunnelled migration - but that's rather a request for a new feature.

Comment 5 Lukas Vacek 2015-03-16 17:26:55 UTC
Thanks for quick answer.

Just two things.

1) The migration works fine when --tunnelled is not used. Based on that, I'd assume native QEMU migration works fine.
2) Could libvirt provide better error logs when this failure occurs?

Comment 6 Jiri Denemark 2015-03-16 19:16:39 UTC
1) There are two implementations of storage migration in QEMU. The old variant ("migrate -b" monitor command) and the new variant using NBD. The usage of --tunnelled forces libvirt to switch from NBD to the old implementation when asking QEMU to migrate. It's QEMU doing the migration including storage in both cases. According to the logs NBD based storage migration works fine for you while the old implementation doesn't work.

2) The error actually comes from QEMU so unless it provides anything better to us, we can't report it. And there's nothing interesting in the qemu log on destination host, which doesn't make things any better. Can you also check that log file on the source host?

Comment 7 Lukas Vacek 2015-03-16 22:01:06 UTC
Thanks for clarification.

Comment 8 Lukas Vacek 2015-03-18 12:50:49 UTC
Hi Jiri

I did more tests on my side and it turns out it very well might be an issue in libvirt, so I have to reopen this issue.

I have done the following tests:

Test A)
1) start libvirt on hostA and hostB
2) start GUEST on hostA
3) create an empty disk on hostB with qemu-img create (might not be necessary with recent enough libvirt)
4) start migration using "virsh migrate --live --p2p --copy-storage-inc --tunnelled GUEST qemu+tcp://hostB/system"

# at this point migration fails with "unexpectedly failed"

5) Now I stopped libvirt on hostA and hostB
6) I manually started qemu with -incoming on hostB
7) I connected via QMP to hostA and executed "migrate blk=true inc=true uri=tcp:10.0.1.31:49152" in qmp-shell

# migration fails with "unexpectedly failed"

At this point I suspected the problem to be in qemu. However, when I do everything manually, the live migration works, i.e.:

Test B)
1) stop libvirt on hostA and hostB
2) start GUEST on hostA
3) create an empty disk on hostB with qemu-img create
4) start qemu with -incoming on hostB
5) connect via QMP to hostA and execute "migrate blk=true inc=true uri=tcp:10.0.1.31:49152" in qmp-shell

migration works!
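A rough sketch of Test B; the guest command line, QMP socket paths, memory size, and port are illustrative placeholders, not the exact values used in the report:

```shell
# hostB: start the destination QEMU paused, waiting for the migration
# stream on a plain TCP port (command line abbreviated/illustrative).
qemu-system-x86_64 -enable-kvm -m 2048 \
    -drive file=/var/lib/libvirt/images/GUEST.qcow2,if=virtio \
    -qmp unix:/tmp/qmp-dst.sock,server,nowait \
    -incoming tcp:0.0.0.0:49152

# hostA: attach qmp-shell (shipped in qemu's scripts/qmp/) to the
# already-running source QEMU and kick off block + incremental migration:
qmp-shell /tmp/qmp-src.sock
# (QEMU) migrate blk=true inc=true uri=tcp:10.0.1.31:49152
```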

btw. this issue might be more important than it seems because OpenStack Nova defaults to tunnelled migration.

Thanks!

Comment 9 Jiri Denemark 2015-03-19 13:13:18 UTC
(In reply to Lukas Vacek from comment #8)
> Test B)
> 1) stop libvirt on hostA and hostB
> 2) start GUEST on hostA
> 3) create an empty disk on hostB with qemu-img create

I see it now. Another difference between migrating storage using NBD vs. the old way is in this step 3. Current libvirt (as of 1.2.13) will precreate the disk on the destination host, but only when NBD is used. If it's not used, the files need to be properly created on the destination before starting the migration (I think Nova takes care of this). With older libvirt, the files need to exist even if NBD is used.
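In other words, for the non-NBD path the destination image has to exist with the right format and virtual size before the migration starts. A minimal sketch, with hypothetical path and size:

```shell
# Source host: read off the format and virtual size of the disk
# being migrated.
qemu-img info /var/lib/libvirt/images/guest.qcow2

# Destination host: precreate an empty image of the same format and
# virtual size at the identical path (values are examples).
qemu-img create -f qcow2 /var/lib/libvirt/images/guest.qcow2 30G
```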

Comment 10 Lukas Vacek 2015-03-19 14:08:15 UTC
Agreed. But I don't think it's the cause of the issue because I precreate the files exactly the same way in Test A and Test B.

Comment 11 Jiri Denemark 2015-03-19 14:14:13 UTC
Heh, I'm blind.

Anyway, could you please post the logs I asked for on IRC a few days ago? Turn on debug logs (http://wiki.libvirt.org/page/DebugLogs), run the migration, and attach the libvirtd.log and guest.log files from both source and destination hosts.
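For reference, a libvirtd.conf logging setup along the lines of the DebugLogs wiki page; the filter string below is the one visible in the attachment names in this bug, while the log_level and output path follow the wiki's conventions and are assumptions here:

```shell
# /etc/libvirt/libvirtd.conf on both hosts; restart libvirtd afterwards.
log_level = 1
log_filters="3:rpc 3:remote 3:util.json 3:util.event 3:node_device 3:util.object 3:util.netlink 3:access"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
```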

Comment 12 Lukas Vacek 2015-03-31 13:48:05 UTC
Created attachment 1009066 [details]
libvirtd source log log_filters="3:rpc 3:remote 3:util.json 3:util.event 3:node_device 3:util.object 3:util.netlink 3:access"

Comment 13 Lukas Vacek 2015-03-31 13:48:36 UTC
Created attachment 1009067 [details]
libvirtd destination log log_filters="3:rpc 3:remote 3:util.json 3:util.event 3:node_device 3:util.object 3:util.netlink 3:access"

Comment 14 Lukas Vacek 2015-03-31 13:51:46 UTC
Created attachment 1009068 [details]
new qemu/guest.log on source

Comment 15 Lukas Vacek 2015-03-31 13:52:01 UTC
Created attachment 1009070 [details]
new qemu/guest.log on destination

Comment 16 Lukas Vacek 2015-03-31 13:52:53 UTC
First of all, sorry I didn't get to this earlier. We did some reorganizing of our lab environment, so I was only able to reproduce the test with logging enabled now.

It dies with a different error now. However, direct qemu migration works, as does non-tunnelled libvirt migration.

root@lab1:/var/lib/libvirt# virsh migrate --live --p2p --copy-storage-inc --tunnelled ubuntuutopic "qemu+tcp://lab2/system"
error: Unable to read from monitor: Connection reset by peer

Libvirt debug logs attached.

Comment 17 Kashyap Chamarthy 2015-04-08 07:38:40 UTC
(In reply to Lukas Vacek from comment #16)
> First of all, sorry I didn't get to this earlier. We did some reorganizing
> of our lab env so I could reproduce the test with logs on only now.
> 
> It dies with another error now. However, direct qemu migration works as does
> not-tunnelled libvirt migration.
> 
> root@lab1:/var/lib/libvirt# virsh migrate --live --p2p --copy-storage-inc
> --tunnelled ubuntuutopic "qemu+tcp://lab2/system"
> error: Unable to read from monitor: Connection reset by peer

Just a side question: can you also reproduce it with qemu+ssh? I was testing a slight variant of the above CLI yesterday with qemu+ssh on Fedora 22, and it worked:

    $ virsh migrate --verbose --copy-storage-all --p2p --live cvm1 \
        qemu+ssh://root@desthost/system

(NOTE: The above assumes root on src can SSH to dst without any password prompt, so for testing you might want to quickly create SSH keys with an empty passphrase, assuming it's a trusted network.)
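Setting up the passwordless root SSH mentioned in the note might look like this; the key path and hostname are placeholders, and this should only be done on a trusted network:

```shell
# Generate a key with an empty passphrase and install it for root on
# the destination host (hostname and key path are examples).
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa_migtest
ssh-copy-id -i ~/.ssh/id_rsa_migtest.pub root@desthost

# Then the qemu+ssh variant should run without prompting:
virsh migrate --verbose --copy-storage-all --p2p --live cvm1 \
    qemu+ssh://root@desthost/system
```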

> Libvirt debug logs attached.

Comment 18 Lukas Vacek 2015-04-08 08:17:53 UTC
Just wondering, is qemu+tcp working for you or not?

Comment 19 Kashyap Chamarthy 2015-04-09 11:40:24 UTC
Yes, qemu+tcp is working for me. I tested four variants (see further below) with these versions:

    kernel-4.0.0-0.rc5.git4.1.fc22.x86_64
    libvirt-daemon-kvm-1.2.13-2.fc22.x86_64
    qemu-system-x86-2.3.0-0.2.rc1.fc22.x86_64

Config setup
------------
I had this config in destination's libvirtd.conf:

    $ cat /etc/libvirt/libvirtd.conf | grep -v ^$ | grep -v ^#
    listen_tls = 0
    listen_tcp = 1
    auth_tcp = "none"

And started the libvirtd daemon on the destination with:

    $ cat /etc/sysconfig/libvirtd | grep -v ^$ | grep -v ^#
    LIBVIRTD_ARGS="--listen"


Since I'm testing in a trusted network, I also had SSH access (via public/private keys) to root on the destination host without any password prompts.


Tests
-----

I just tested three variants of migration with qemu+tcp, successfully:


(1) Native migration, client to two libvirtd servers

    $ virsh migrate --verbose --copy-storage-all \
        --live cvm1 qemu+tcp://kashyapc@devstack3/system

(2)  Native migration, client to and peer2peer between, two libvirtd servers

    $ virsh migrate --verbose --copy-storage-all \ 
         --p2p --live cvm1 qemu+tcp://kashyapc@devstack3/system

(3) Tunnelled migration, client and peer2peer between two libvirtd servers

    $ virsh migrate --verbose  --copy-storage-all \
        --p2p --tunnelled --live cvm1 qemu+tcp://kashyapc@devstack3/system


Successful libvirtd log (with debug filter set) for the 3rd variant:

    https://kashyapc.fedorapeople.org/virt/temp/tunnelled-p2p-migration-qemu-tcp-libvirtd-log.txt


Additionally, I also tested the below (without the explicit
'--copy-storage-all' flag; it works too):

    $ virsh migrate --verbose  --p2p --tunnelled \
        --live cvm1 qemu+tcp://kashyapc@devstack3/system

Comment 20 Kashyap Chamarthy 2015-04-09 12:00:07 UTC
Closing the bug, per comment #19. Feel free to reopen in case you can provide a reliable reproducer with appropriate logs.

Comment 21 Lukas Vacek 2015-04-09 12:04:37 UTC
I'd like to test with qemu+ssh, but after I provided the debug logs I downgraded qemu on our lab boxes.

I think it would be best to raise a separate issue for the qemu+ssh problem.

Thanks,
Lucas

Comment 22 Kashyap Chamarthy 2015-04-09 12:17:30 UTC
(In reply to Kashyap Chamarthy from comment #19)

[. . .]

[Just correcting the terminology for migration scenarios (2) and (3).]

Assuming I'm reading this doc correctly. (Libvirt devs, please correct me if I'm wrong.)

    http://libvirt.org/migration.html#scenarios
 
> Tests
> -----
> 
> I just tested three variants of migration with qemu+tcp, successfully:
> 
> 
> (1) Native migration, client to two libvirtd servers
> 
>     $ virsh migrate --verbose --copy-storage-all \
>         --live cvm1 qemu+tcp://kashyapc@devstack3/system
> 
> (2)  Native migration, client to and peer2peer between, two libvirtd servers

The below is called "Native migration, peer2peer between two libvirtd servers"

Refer: http://libvirt.org/migration.html#nativepeer2peer

> 
>     $ virsh migrate --verbose --copy-storage-all \ 
>          --p2p --live cvm1 qemu+tcp://kashyapc@devstack3/system
> 
> (3) Tunnelled migration, client and peer2peer between two libvirtd servers

The below is called "Tunnelled migration, peer2peer between two libvirtd servers"

Refer: http://libvirt.org/migration.html#scenariotunnelpeer2peer2

> 
>     $ virsh migrate --verbose  --copy-storage-all \
>         --p2p --tunnelled --live cvm1 qemu+tcp://kashyapc@devstack3/system

[. . .]

Comment 23 Lukas Vacek 2015-04-27 11:22:24 UTC
bump

Comment 24 Frank 2015-05-28 21:10:36 UTC
Hi,

I ran into the same problem, and using qemu+tcp instead of qemu+ssh solved it. However, it took me a lot of hours to figure this out. :(

I'd like to add that the error seems to depend on the VM's workload. I was able to reproduce the error with a higher workload, while the live migration worked fine with a lighter workload.

Best,
Frank

Comment 25 Cole Robinson 2016-04-10 20:43:11 UTC
It's been a while since the last report. Is anyone still seeing this with a more recent libvirt + distro?

Comment 26 Cole Robinson 2016-05-02 14:34:28 UTC
Since there's no response, closing as DEFERRED. But if anyone is still affected with newer libvirt versions, please re-open and we can triage from there.
