Description of problem:
It is not possible to migrate --live with qemu 1.4.

    migrate --live --verbose domain1 qemu+ssh://root/system

always fails, with unhelpful error messages, for example:

    "migration job: unexpectedly failed"

and in the qemu log file:

    error : qemuMonitorIO:605 : internal error End of file from monitor

It looks like the qemu process just crashed. It does so right after migrating the disks (when I use --copy-storage-all).
It works for me with older versions of qemu (1.2.xxx), but migration of non-shared storage does not work well in those old qemu versions.
I did test migration without libvirt, using just qemu and manual commands, and it migrated well. So it looks like the problem is the combination of libvirt (any version) and qemu 1.4.

Version-Release number of selected component (if applicable):
libvirt 1.0.0 through 1.0.3-r1 (on gentoo)
qemu (with kvm) version 1.3.9x, 1.4

How reproducible:
Every time I run a migration, it fails.

Steps to Reproduce:
1. create domain using libvirt and qemu
2. virsh migrate --live --verbose ${domid} qemu+ssh://root@${target}/system

Actual results:
"migration job: unexpectedly failed"

Expected results:
domain migrated and running on the target host

Additional info:
Other people on IRC said they have this problem too. No one confirmed that it is working for them.
Created attachment 714494 [details] detailed_logs_and_info attached file contains full logs on source and target libvirt during the attempted migration, domain definition and virsh commands leading to the problem.
David, I think the problem is that on the destination qemu dies with:

    qemu: warning: error while loading state section id 2
    load of migration failed

Honestly, I don't know what is meant by 'section id 2' or how to avoid it. I get exactly the same error when trying to migrate with qemu built from git.
hi, thanks for trying it. I am also clueless; those logs really do not say enough, even in full debug mode. Here is another hint from the IRC logs, suggesting that it doesn't work for at least 4 people who tried qemu 1.4 on different operating systems:

<mardraum> hi, having trouble with live migration when testing ubuntu raring hosts, looks like libvirt 1.0.2 and qemu 1.4.0, getting "error: operation failed: migration job: unexpectedly failed" with command line eg virsh migrate --live --verbose somevm qemu+ssh://otherhost/system - this worked fine before upgrading hosts, dns and reverse for machines all still exists. friend is now also seeing this on his arch hosts
(In reply to comment #2)
> qemu: warning: error while loading state section id 2
> load of migration failed

The qemu folks will have to say what that means exactly. As I understand it, the destination qemu doesn't like the data provided by the source qemu, but why I couldn't say. Are the source and destination qemu versions different?
some more info from irc: unfortunately, 'section id 2' doesn't mean anything, because '2' is a dynamic ID
David, is the src qemu the same version as the dst?
yes, they were the very same version. Both servers were gentoo installations -- so to be sure, i cloned one to another and changed just hostname, ssh keys and network config. The problem persists. After cloning the machine, xml files for migrated domain in /var/run/libvirt/qemu/ are now totally the same on source and target host (they had differences in my previous trials). But it still fails.
also, after the cloning, those domains are run with the same command (but it didn't help with the error, which persists):

SOURCE:
LC_ALL=C PATH=/bin:/sbin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin HOME=/root USER=root /usr/bin/kvm -name migt10 -S -M pc-i440fx-1.4 -enable-kvm -m 500 -smp 1,sockets=1,cores=1,threads=1 -uuid 0d73c5c3-43d0-f75b-31de-6aa919b0176b -nographic -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/migt10.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -vga cirrus -incoming fd:22 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3

TARGET:
LC_ALL=C PATH=/bin:/sbin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin HOME=/root USER=root /usr/bin/kvm -name migt10 -S -M pc-i440fx-1.4 -enable-kvm -m 500 -smp 1,sockets=1,cores=1,threads=1 -uuid 0d73c5c3-43d0-f75b-31de-6aa919b0176b -nographic -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/migt10.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -vga cirrus -incoming tcp:0.0.0.0:49152 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3
I just happened to reproduce the problem myself too. The command line libvirt generates was the very same (except for the -incoming part, obviously). Moreover, as you, David, confirmed, the generated command line is the same in your case as well, hence I don't see much room for a libvirt bug here. I am attaching debug logs as well.
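For anyone who wants to repeat the comparison of the two generated command lines, here is a minimal sketch in Python. The helper names (`split_options`, `differing_options`) are made up for illustration, the two strings are abbreviated from the command lines quoted in this bug, and the option parsing is a simplification that assumes every value follows its option as a separate word.

```python
import shlex


def split_options(cmdline):
    """Group a qemu command line into (option, value) pairs.

    Simplifying assumption: a word starting with '-' is an option, and the
    next word is its value unless that word also starts with '-'.
    """
    tokens = shlex.split(cmdline)
    pairs = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        has_value = (
            tok.startswith("-")
            and i + 1 < len(tokens)
            and not tokens[i + 1].startswith("-")
        )
        if has_value:
            pairs.append((tok, tokens[i + 1]))
            i += 2
        else:
            pairs.append((tok, None))
            i += 1
    return pairs


def differing_options(first, second):
    """Return the sorted option names whose values differ between two command lines."""
    diff = set(split_options(first)) ^ set(split_options(second))
    return sorted({name for name, _value in diff})


# Abbreviated versions of the two command lines from this bug.
SOURCE = ("/usr/bin/kvm -name migt10 -S -M pc-i440fx-1.4 -enable-kvm -m 500 "
          "-incoming fd:22 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3")
TARGET = ("/usr/bin/kvm -name migt10 -S -M pc-i440fx-1.4 -enable-kvm -m 500 "
          "-incoming tcp:0.0.0.0:49152 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3")

if __name__ == "__main__":
    print(differing_options(SOURCE, TARGET))
```

With the abbreviated inputs above, the only option reported as different is -incoming, which matches the observation in this comment.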
Created attachment 716058 [details] libvirt debug logs
Does migration work without libvirt?
(In reply to comment #11)
> Does migration work without libvirt?

I tried migration without libvirt a few weeks ago, as stefanha on #qemu asked me to do so. It worked nicely, even with --copy-storage-all (which I really need).
> It worked nicely, even with --copy-storage-all (which i really need).

well, qemu does not have this option, I meant migrate -b
(In reply to comment #8) i apologize for pasting 2 TARGET lines. On SOURCE, there should be different -incoming: -incoming fd:22
In my case, that's hard to tell. I mean, I am using a tap device which is passed via a FD, and the migration goes through a FD as well. The migration is actually to a FD provided by libvirt. But if I drop all FD-related things, then yes: I am able to migrate by hand, and not able to migrate via libvirt. But what if there's a bug in FD handling within QEMU? I am not sure how to pass a FD to a process from the shell so that I can test the whole scenario. And I've tested David's domain XML as well.
simplest domain xml for my last tests was: http://pastebin.com/uzpHBXDe
I was asked by eblake to test save & restore, so I ran:

# virsh save migt10 /var/tmp/mig --running
Domain migt10 saved to /var/tmp/mig

# virsh restore /var/tmp/mig
Domain restored from /var/tmp/mig

so it works; the domain is running now:

# virsh dominfo migt10
Id:             6
Name:           migt10
UUID:           0d73c5c3-43d0-f75b-31de-6aa919b0176b
OS Type:        hvm
State:          running
CPU(s):         1
CPU time:       0.6s
Max memory:     512000 KiB
Used memory:    512000 KiB
Persistent:     yes
Autostart:      disable
Managed save:   no
Security model: none
Security DOI:   0
So I have nailed down the location of the root cause. If libvirt detects that qemu supports passing a FD to migrate to, this feature is used. So the process looks like this:

1. libvirt connects to the dst side => a FD is produced
2. libvirt passes the FD to qemu:
   {"execute":"getfd","arguments":{"fdname":"migrate"},"id":"libvirt-8"}
3. libvirt starts migration:
   {"execute":"migrate","arguments":{"detach":true,"blk":false,"inc":false,"uri":"fd:migrate"},"id":"libvirt-9"}

However, if the capability is not detected, the process looks slightly different:

1. libvirt starts migration:
   {"execute":"migrate","arguments":{"detach":true,"blk":false,"inc":false,"uri":"tcp:host:49152"},"id":"libvirt-9"}

While the former always fails, the latter has not failed a single time. That explains why migration by hand works (neither David nor I passed a FD). David, you can try it yourself:

1) start the domain
2) open /var/run/libvirt/qemu/migt10.xml
3) delete the line: <flag name='migrate-qemu-fd'/>
4) restart the daemon (if you are using gentoo, be aware that the libvirtd init script may shut down your domains)
5) proceed with migration

So just a rough idea: isn't this 'section id 2' somehow related to the FD passed to qemu? Maybe a FD created the section, and since no FD is being passed on the destination, there's no such section and the migration fails. I'd better ask somebody from the qemu devel group.
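To make the difference between the two monitor conversations easier to see, here is a small Python sketch that only builds the JSON messages quoted above. The function names are hypothetical, and nothing is sent to a monitor socket. In particular, a real getfd also has to carry the descriptor itself as SCM_RIGHTS ancillary data on the monitor's UNIX socket, which this sketch leaves out.

```python
import json


def fd_migration_commands(fdname="migrate", first_id=8):
    """Messages for the FD-based path: hand over a FD, then migrate to it."""
    return [
        {"execute": "getfd",
         "arguments": {"fdname": fdname},
         "id": f"libvirt-{first_id}"},
        {"execute": "migrate",
         "arguments": {"detach": True, "blk": False, "inc": False,
                       "uri": f"fd:{fdname}"},
         "id": f"libvirt-{first_id + 1}"},
    ]


def direct_migration_commands(host, port, cmd_id=9):
    """Message for the direct path: qemu connects to the destination itself."""
    return [
        {"execute": "migrate",
         "arguments": {"detach": True, "blk": False, "inc": False,
                       "uri": f"tcp:{host}:{port}"},
         "id": f"libvirt-{cmd_id}"},
    ]


def encode(command):
    """Serialize one message compactly, in the style seen in the debug logs."""
    return json.dumps(command, separators=(",", ":"))


if __name__ == "__main__":
    for cmd in fd_migration_commands():
        print(encode(cmd))
    for cmd in direct_migration_commands("host", 49152):
        print(encode(cmd))
```

The only differences between the two paths are the extra getfd message and the uri argument (fd:migrate versus tcp:host:49152); the remaining migrate arguments are identical.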
After my discussion with Stefan Hajnoczi from Qemu: this really is a qemu bug. However, libvirt should work around it for older Qemus which have already been released. I've proposed a patch upstream to that effect: https://www.redhat.com/archives/libvir-list/2013-March/msg01486.html
The patch has been pushed upstream:

commit ceb31795af40f6127a541076b905935ff83e5b11
Author:     Michal Privoznik <mprivozn>
AuthorDate: Tue Mar 26 15:45:16 2013 +0100
Commit:     Michal Privoznik <mprivozn>
CommitDate: Tue Mar 26 17:16:27 2013 +0100

    qemu: Set migration FD blocking

    Since we switched from direct host migration scheme to the one,
    where we connect to the destination and then just pass a FD to a
    qemu, we have uncovered a qemu bug. Qemu expects migration FD to
    block. However, we are passing a nonblocking one which results in
    cryptic error messages like:

      qemu: warning: error while loading state section id 2
      load of migration failed

    The bug is already known to Qemu folks, but we should workaround
    already released Qemus. Patch has been originally proposed by Stefan
    Hajnoczi <stefanha>

v1.0.4-rc1-9-gceb3179

The patch is going to be part of the upcoming 1.0.4 release.
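For readers unfamiliar with what "setting a FD blocking" means in practice, the following Python sketch shows the general technique: clearing O_NONBLOCK on a descriptor before handing it to another process. This is not the libvirt patch itself, which is written in C; the helper names are made up for illustration, and a local socket pair stands in for the real migration connection.

```python
import fcntl
import os
import socket


def is_nonblocking(fd):
    """Report whether O_NONBLOCK is currently set on the descriptor."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFL) & os.O_NONBLOCK)


def set_blocking(fd):
    """Clear O_NONBLOCK so that reads and writes on fd block again."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags & ~os.O_NONBLOCK)


if __name__ == "__main__":
    # Stand-in for the connection to the destination host.
    left, right = socket.socketpair()
    left.setblocking(False)           # the state that triggered the failure
    print("before:", is_nonblocking(left.fileno()))
    set_blocking(left.fileno())       # the state qemu expects
    print("after:", is_nonblocking(left.fileno()))
    left.close()
    right.close()
```

Note that O_NONBLOCK is a file status flag, so it belongs to the open file description and not to the descriptor number. A process that receives the descriptor therefore sees whatever mode the sender left it in.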