Bug 499750 - libvirt VM migration fails with "error: Unknown failure"
Summary: libvirt VM migration fails with "error: Unknown failure"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: libvirt
Version: 12
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Cole Robinson
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 540715 562017 582111
Depends On:
Blocks: F11VirtTarget
Reported: 2009-05-07 22:31 UTC by Mark McLoughlin
Modified: 2010-10-07 18:58 UTC (History)

Fixed In Version: libvirt-0.7.1-18.fc12
Clone Of:
Environment:
Last Closed: 2010-07-08 18:18:22 UTC
Type: ---
Embargoed:


Attachments
client.log (10.44 KB, text/plain), 2009-05-07 22:35 UTC, Mark McLoughlin
daemon.log (6.14 KB, text/plain), 2009-05-07 22:35 UTC, Mark McLoughlin
daemon-strace.log (186.67 KB, text/plain), 2009-05-07 22:36 UTC, Mark McLoughlin

Description Mark McLoughlin 2009-05-07 22:31:40 UTC
(Filing this by proxy for mike.hinz)

+++ This bug was initially created as a clone of Bug #499704 +++

Description of problem:

Attempting to migrate a running or stopped VM fails in all cases.  

Version-Release number of selected component (if applicable):

virsh # version
Compiled against library: libvir 0.6.2
Using library: libvir 0.6.2
Using API: QEMU 0.6.2
Running hypervisor: QEMU 0.10.1

[root@vmh2 Download]# uname -a
Linux vmh2 2.6.29.1-111.fc11.x86_64 #1 SMP Fri Apr 24 10:57:09 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux


How reproducible:

Always

Steps to Reproduce:
1.  Connect to the local machine's hypervisor as follows and see the local machines:

virsh # connect qemu:///system

virsh # list --all
 Id Name                 State
----------------------------------
  3 vm1                  running
  - vm2                  shut off


2.  Verify connectivity to the hypervisor of the remote target system as follows:

virsh # connect qemu+tcp://vmh3/system

virsh # uri
qemu+tcp://vmh3/system

virsh # list --all
 Id Name                 State
----------------------------------
  4 vm1-vmh3             running



3.  Attempt the migration as follows:

virsh # connect qemu:///system

virsh # migrate vm2 qemu+tcp://vmh3/system
error: Unknown failure

The above shows a successful connection to the local hypervisor, followed by a failed migration to the remote hypervisor, even though step 2 clearly shows that we can connect to the remote hypervisor.

We can demonstrate the same failure with the tcp, ssh, or tls transport.
  
Actual results:

The operation fails and throws errors as follows:

virsh # migrate --live vm1 qemu+tcp://vmh3/system
error: Unknown failure

Expected results:

The VM migration should start and succeed for either stopped or running VMs

Additional info:

This is in a lab environment with all firewalls and SELinux disabled on all physical machines.  Connectivity always succeeds via the tcp, ssh, and tls methods.  However, migration always fails regardless of the connectivity method attempted.

[root@vmh2 CA]# rpm -q kvm python-virtinst virt-viewer virt-manager
package kvm is not installed
python-virtinst-0.400.3-7.fc11.noarch
virt-viewer-0.0.3-4.fc11.x86_64
virt-manager-0.7.0-4.fc11.x86_64
[root@vmh2 CA]#

[root@vmh2 CA]# uname -a
Linux vmh2 2.6.29.1-111.fc11.x86_64 #1 SMP Fri Apr 24 10:57:09 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

--- Additional comment from mike.hinz on 2009-05-07 14:57:11 EDT ---

Created an attachment (id=342915)
cpu info from hardware

Added as per virtualization bug reporting wiki.

--- Additional comment from mike.hinz on 2009-05-07 14:58:24 EDT ---

Created an attachment (id=342918)
lspci info

Added as per request from virtualization bug reporting wiki

--- Additional comment from mike.hinz on 2009-05-07 15:00:10 EDT ---

Created an attachment (id=342919)
virsh capabilities output

Output of virsh capabilities as per virtualization bug reporting wiki.

Comment 1 Mark McLoughlin 2009-05-07 22:35:13 UTC
Created attachment 342946 [details]
client.log

Comment 2 Mark McLoughlin 2009-05-07 22:35:54 UTC
Created attachment 342947 [details]
daemon.log

Comment 3 Mark McLoughlin 2009-05-07 22:36:47 UTC
Created attachment 342948 [details]
daemon-strace.log

Comment 4 Bug Zapper 2009-06-09 15:25:01 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 5 Daniel Berrangé 2009-08-03 16:28:02 UTC
The 'daemon.log' file from comment #2 shows that the VM was successfully started up on the destination.  The next thing it shows is libvirt requesting an abort of the migration attempt:

16:52:41.768: debug : virDomainMigrateFinish2:3043 : dconn=0x7fcc24000a30, dname=vm2, cookie=(nil), cookielen=0, uri=tcp:vmh3:49152, flags=0, retcode=-1
16:52:41.768: debug : qemudShutdownVMDaemon:1518 : Shutting down VM 'vm2'


The 'client.log' file from comment #1 seems to show the libvirt client is operating normally.

So, the only answer left is that something must have gone wrong in the source host's libvirtd daemon during migration.

Thus I think we need to get a libvirt debugging log session from the source libvirtd daemon, so we can see what's happening with the virDomainMigratePerform2 method.

Comment 6 Mike Hinz 2009-08-03 20:35:58 UTC
To hopefully move this along, I've done some more testing.  Based on this afternoon's discussion on IRC, below are the command I ran and the contents of the log file /var/log/libvirt/qemu/vm2clone.log.

virsh # list --all
 Id Name                 State
----------------------------------
  - vm-full-gold         shut off
  - vm1clone             shut off
  - vm2clone             shut off

virsh # migrate vm2clone qemu+tcp://192.168.50.20/system
error: Unknown failure

Then the log file on the source physical host gives this:

LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin /usr/bin/qemu-kvm -S -M pc -m 2000 -smp 1 -name vm2clone -uuid 080c4e4a-572b-ff8a-93d8-1bed40c2d093 -monitor pty -pidfile /var/run/libvirt/qemu//vm2clone.pid -boot c -drive file=,if=ide,media=cdrom,index=2 -drive file=/mnt/nfs-store/vm2clone.img,if=virtio,index=0,boot=on -net nic,macaddr=54:52:00:09:1b:b8,vlan=0 -net tap,fd=19,vlan=0 -serial pty -parallel none -usb -usbdevice tablet -vnc 127.0.0.1:0
char device redirected to /dev/pts/2
char device redirected to /dev/pts/3

I'll continue with what the logs show on the destination physical host in the next comment.

Comment 7 Mike Hinz 2009-08-03 20:46:04 UTC
To follow the above comment, please see the following from the logs on the physical destination machine during the migration attempt:

From /var/log/libvirt/qemu/vm2clone.log:

LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin /usr/bin/qemu-kvm -S -M pc -m 2000 -smp 1 -name vm2clone -uuid 080c4e4a-572b-ff8a-93d8-1bed40c2d093 -monitor pty -pidfile /var/run/libvirt/qemu//vm2clone.pid -boot c -drive file=,if=ide,media=cdrom,index=2 -drive file=/mnt/nfs-store/vm2clone.img,if=virtio,index=0,boot=on -net nic,macaddr=54:52:00:09:1b:b8,vlan=0 -net tap,fd=22,vlan=0 -serial pty -parallel none -usb -usbdevice tablet -vnc 127.0.0.1:0 -incoming tcp:0.0.0.0:49171
char device redirected to /dev/pts/2
char device redirected to /dev/pts/3


During the IRC session, danpb had me create the log file /var/log/libvirt/daemon.log by editing libvirtd.conf on the target machine and setting:

log_filters="1:qemu"
log_outputs="1:file:/var/log/libvirt/daemon.log"

The output of that log file on the destination physical host is as follows:

15:31:31.938: info : Received unexpected signal 17
15:31:31.943: info : Received unexpected signal 17
15:31:32.994: debug : qemudDomainSetMemoryBalloon:2530 : vm2clone: balloon reply: balloon 2000

15:31:32.995: debug : qemudShutdownVMDaemon:1526 : Shutting down VM 'vm2clone'

15:31:32.996: error : invalid domain pointer in no domain with matching uuid
15:31:32.996: debug : qemudDispatchClientFailure:1407 : Deregistering to relay remote events


This is with Fedora 11, fully updated to the latest.  I had an earlier error relating to the sound device in the VM, but I've removed that device.  Also, this VM utilizes the default NAT'd networking, but I've also tried migration using a host with a bridge setup.  That also fails.  

Please let me know what additional info may be needed.

Regards.  Mike
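
Comment 5 asks for the same kind of debug log from the source host's libvirtd. A minimal sketch of applying the two settings quoted above on that host follows. It assumes the daemon configuration lives at /etc/libvirt/libvirtd.conf; the function name is purely illustrative, and libvirtd has to be restarted afterwards for the settings to take effect.

```shell
# Hypothetical helper: append the two logging settings quoted above to a
# libvirtd.conf file, unless a setting with the same key is already active.
enable_libvirtd_debug_log() {
    conf=${1:-/etc/libvirt/libvirtd.conf}   # assumed default location
    log=${2:-/var/log/libvirt/daemon.log}

    # Only add a key that is not already set (uncommented) in the file.
    grep -q '^log_filters=' "$conf" ||
        printf 'log_filters="1:qemu"\n' >> "$conf"
    grep -q '^log_outputs=' "$conf" ||
        printf 'log_outputs="1:file:%s"\n' "$log" >> "$conf"
}

# Usage on the source host (as root), followed by a daemon restart:
#   enable_libvirtd_debug_log
#   service libvirtd restart
```

The function leaves existing log_filters or log_outputs lines alone, so an already customised configuration has to be adjusted by hand.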

Comment 8 Jay Modi 2009-09-11 01:52:19 UTC
I am experiencing the same issue with an up to date F11 system. Please let me know what information I can provide.

Comment 9 Jay Modi 2009-10-19 09:10:41 UTC
After the latest updates to Fedora 11, I am now able to successfully migrate a running VM using virt-manager; but I cannot migrate the same VM if it is turned off.

Comment 10 Chris Lalancette 2009-10-19 09:33:38 UTC
(In reply to comment #9)
> After the latest updates to Fedora 11, I am now able to successfully migrate a
> running VM using virt-manager; but I cannot migrate the same VM if it is turned
> off.  

Right, this makes sense.  In general, migration is supposed to be "live"; that is, clients of the VM don't notice that it's been moved from one physical place to another.  Therefore, libvirt doesn't really have the concept of migrating a "turned off" VM.

All this would really do is copy the XML from one host to another, since there's no memory to copy, and the disk still has to be shared.  If that's what you want, it's probably a better idea to write a simple script that connects to both the source and the destination, runs "dumpxml" on the source, and then runs "define" on the destination.  This might also be an interesting feature request for a new virsh command, but it won't require any new APIs.

Chris Lalancette
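
The dumpxml/define approach described in this comment could look roughly like the sketch below. The function name and the VIRSH override variable are illustrative, not part of virsh. It copies only the domain definition, so it assumes the disk image is already visible at the same path on both hosts (for example on shared NFS storage).

```shell
VIRSH=${VIRSH:-virsh}   # overridable so the function can be run against a stub

# Hypothetical helper: copy a shut-off domain's definition between hosts.
# Disk images are NOT copied; they must already be reachable at the same
# path on the destination.
copy_domain_definition() {
    src_uri=$1 dst_uri=$2 domain=$3
    xml=$(mktemp) || return 1

    # Dump the XML on the source, then define it on the destination.
    "$VIRSH" -c "$src_uri" dumpxml "$domain" > "$xml" &&
        "$VIRSH" -c "$dst_uri" define "$xml"
    rc=$?

    rm -f "$xml"
    return $rc
}

# Example with the hosts from this report:
#   copy_domain_definition qemu:///system qemu+tcp://vmh3/system vm2
```

The definition is left in place on the source, so the same domain is then defined on both hosts; removing it from the source is a separate step.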

Comment 11 Chris Lalancette 2009-10-19 09:37:52 UTC
(In reply to comment #7)
> This is with Fedora 11, fully updated to the latest.  I had an earlier error
> relating to the sound device in the VM, but I've removed that device.  Also,
> this VM utilizes the default NAT'd networking, but I've also tried migration
> using a host with a bridge setup.  That also fails.  
> 
> Please let me know what additional info may be needed.

Mike,
     I've recently tracked down a similar problem in the Fedora 12 packages.  In that case, it was due to the fact that "hostname" on the destination machine didn't return something reasonable.  Now, we should definitely do better in libvirt than "unknown error"; I have a patch pending to fix that.  However, can you try doing the following:

1)  Open up port 49152 in the firewall on the destination (if not already done)
2)  On the destination host, make sure that the "hostname" command returns something reasonable (like vmh3.example.org), and that "nslookup vmh3.example.org" also resolves properly.
3)  On the source, run:

# virsh migrate --live vm1 qemu+tcp://vmh3/system tcp://vmh3:49152

And then let us know the results of all of this?

Thanks,
Chris Lalancette
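
Step 2 above can be sketched as a small check. This is a sketch under an assumption: that the destination hands its own hostname to the source as part of the migration URI (as in the uri=tcp:vmh3:49152 line quoted in comment 5), so a name that is empty, "localhost", or resolves to a loopback address cannot work. The function name is illustrative. It uses getent rather than nslookup because getent resolves through the system's name service configuration, which includes /etc/hosts.

```shell
# Hypothetical helper: sanity-check the name a host reports for itself.
# Run on the destination; pass a name explicitly to check a different one.
check_migration_hostname() {
    name=${1:-$(hostname)}

    case $name in
        ''|localhost|localhost.*)
            echo "hostname '$name' is not usable by a remote source" >&2
            return 1 ;;
    esac

    # getent resolves through NSS, so /etc/hosts entries are honoured.
    addr=$(getent hosts "$name" | awk 'NR == 1 { print $1 }')
    case $addr in
        '')
            echo "'$name' does not resolve" >&2
            return 1 ;;
        127.*|::1)
            echo "'$name' resolves to loopback address $addr" >&2
            return 1 ;;
    esac

    echo "$name resolves to $addr"
}

# Usage:
#   check_migration_hostname          # on the destination host
#   check_migration_hostname vmh3     # on the source, for the destination's name
```

Running it on the source as well confirms that the source resolves the destination's name to an address it can actually reach.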

Comment 12 Mark McLoughlin 2009-11-20 19:34:58 UTC
Mike: any chance you can try Chris's suggestions? Thanks

Comment 13 Sean Stoops 2009-12-22 01:30:39 UTC
Chris:  Is it absolutely necessary that 'nslookup' resolves the hostname?  I'm running into what appears to be the same issue covered with this bug on Debian and libvirt-0.7.0.  Each virt host uses /etc/hosts to resolve the other, thus nslookup won't actually resolve since it does not use /etc/hosts.

Comment 14 Naoki 2010-03-29 07:50:26 UTC
Same problem with :

Client 1 - Fedora 12
# rpm -q libvirt qemu-kvm glusterfs-client 
libvirt-0.7.1-15.fc12.x86_64
qemu-kvm-0.11.0-13.fc12.x86_64
glusterfs-client-3.0.3-1.fc11.x86_64

Client 2 - Fedora 12
# rpm -q libvirt qemu-kvm glusterfs-client 
libvirt-0.7.1-15.fc12.x86_64
qemu-kvm-0.11.0-13.fc12.x86_64
glusterfs-client-3.0.3-1.fc11.x86_64


Error on migrate :

# virsh migrate --live gfstest qemu+ssh://x6270-b5.sys.intra/system
error: Unknown failure

Error in messages is : 
"libvirtd: 16:50:01.437: error : qemudDomainMigratePerform:7292 : operation failed: migrate failed: info migrate#012Migration status: failed#015#012"

A manual migration (xml copy and image already visible) works perfectly.

Comment 15 Naoki 2010-03-30 03:06:38 UTC
OK, after noticing that the libvirt shipping in F12 is dated Sept 2009, I upgraded to the latest 0.7.7 from the F13 branch. The improved log messages now give me:

Source:
"error: operation failed: Migration unexpectedly failed"


Destination:
Mar 30 12:02:07 x6270-b5 libvirtd: 12:02:07.062: info : qemudDispatchServer:1369 : Turn off polkit auth for privileged client 4716
Mar 30 12:02:07 x6270-b5 libvirtd: 12:02:07.108: info : qemuSecurityDACSetOwnership:40 : Setting DAC context on '/var/lib/libvirt/images/gfstest-disk0' to '0:0'
Mar 30 12:02:07 x6270-b5 libvirtd: 12:02:07.115: info : qemudDispatchSignalEvent:390 : Received unexpected signal 17
Mar 30 12:02:07 x6270-b5 kernel: device vnet0 entered promiscuous mode
Mar 30 12:02:07 x6270-b5 kernel: virbr0: port 2(vnet0) entering learning state
Mar 30 12:02:07 x6270-b5 libvirtd: 12:02:07.119: info : qemudDispatchSignalEvent:390 : Received unexpected signal 17
Mar 30 12:02:07 x6270-b5 libvirtd: 12:02:07.148: info : udevGetDeviceProperty:116 : udev reports device 'vnet0' does not have property 'DRIVER'
Mar 30 12:02:07 x6270-b5 libvirtd: 12:02:07.148: info : udevGetDeviceProperty:116 : udev reports device 'vnet0' does not have property 'PCI_CLASS'
Mar 30 12:02:07 x6270-b5 libvirtd: 12:02:07.148: info : udevSetParent:1222 : Could not find udev parent for device with sysfs path '/sys/devices/virtual/net/vnet0'
Mar 30 12:02:07 x6270-b5 libvirtd: 12:02:07.344: info : qemuSecurityDACRestoreSecurityFileLabel:87 : Restoring DAC context on '/var/lib/libvirt/images/gfstest-disk0'
Mar 30 12:02:07 x6270-b5 libvirtd: 12:02:07.345: info : qemuSecurityDACSetOwnership:40 : Setting DAC context on '/var/lib/libvirt/images/gfstest-disk0' to '0:0'
Mar 30 12:02:07 x6270-b5 kernel: virbr0: port 2(vnet0) entering disabled state
Mar 30 12:02:07 x6270-b5 kernel: device vnet0 left promiscuous mode
Mar 30 12:02:07 x6270-b5 kernel: virbr0: port 2(vnet0) entering disabled state
Mar 30 12:02:07 x6270-b5 libvirtd: 12:02:07.417: info : udevRemoveOneDevice:1202 : Failed to find device to remove that has udev name '/sys/devices/virtual/net/vnet0'

However, there are no 'error' messages in there, just some 'info' messages.
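
A quick way to check a long capture like the one above for anything more serious than 'info' is to filter on the level field. This is a minimal sketch: the function name is illustrative, and the pattern assumes the ": <level> :" layout seen in the libvirtd lines quoted in this report.

```shell
# Hypothetical helper: keep only warning/error lines from libvirtd output.
# Reads the named files, or standard input if no file is given.
libvirtd_problems() {
    grep -E ': (warning|error) :' "$@"
}

# Usage:
#   libvirtd_problems /var/log/libvirt/daemon.log
#   grep libvirtd /var/log/messages | libvirtd_problems
```

An empty result only means libvirtd logged nothing at those levels; a failure reported by QEMU itself may not show up here.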

Comment 16 Bug Zapper 2010-04-27 14:12:29 UTC
This message is a reminder that Fedora 11 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 11.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '11'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 11's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 11 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 17 Naoki 2010-04-28 02:01:25 UTC
I'm seeing this in F12/F13 so perhaps a version change is needed.

Comment 18 Chris Lalancette 2010-04-28 13:51:48 UTC
(In reply to comment #17)
> I'm seeing this in F12/F13 so perhaps a version change is needed.    

Hm, are you still seeing the "Unknown failure", even with F-13?  Our error reporting should be much improved in F-13 libvirt, so I would expect a different error if you are still having problems.

Chris Lalancette

Comment 19 Cole Robinson 2010-05-26 20:07:32 UTC
F-11 is pretty old at this point, and the migration error reporting fixes are non-trivial, so they aren't safe to backport. Moving this bug to F12.

I will be building a new F12 libvirt package in a few days which should improve error reporting here.

Comment 20 Cole Robinson 2010-05-26 20:35:07 UTC
*** Bug 540715 has been marked as a duplicate of this bug. ***

Comment 21 Cole Robinson 2010-05-26 20:36:09 UTC
FYI, I've filed a qemu bug about improved migration error reporting; even when the 'unknown error' issue is fixed, qemu doesn't give us much more info.

https://bugzilla.redhat.com/show_bug.cgi?id=596506

Comment 22 Cole Robinson 2010-05-26 20:37:48 UTC
*** Bug 562017 has been marked as a duplicate of this bug. ***

Comment 23 Cole Robinson 2010-05-26 20:42:30 UTC
*** Bug 582111 has been marked as a duplicate of this bug. ***

Comment 24 Fedora Update System 2010-06-17 16:55:48 UTC
libvirt-0.7.1-18.fc12 has been submitted as an update for Fedora 12.
http://admin.fedoraproject.org/updates/libvirt-0.7.1-18.fc12

Comment 25 Fedora Update System 2010-06-21 13:02:30 UTC
libvirt-0.7.1-18.fc12 has been pushed to the Fedora 12 testing repository.  If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
 su -c 'yum --enablerepo=updates-testing update libvirt'
You can provide feedback for this update here: http://admin.fedoraproject.org/updates/libvirt-0.7.1-18.fc12

Comment 26 Cole Robinson 2010-06-29 19:16:04 UTC
I've also updated the libvirt FAQ with info about common migration errors like this 'Unknown failure'

http://wiki.libvirt.org/page/FAQ

Comment 27 Fedora Update System 2010-07-08 18:17:20 UTC
libvirt-0.7.1-18.fc12 has been pushed to the Fedora 12 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 28 Kai Meyer 2010-10-07 18:58:12 UTC
I'm having the exact same issues on RHEL 5.5. Looking into anything that would hurt DNS resolution on the server's hostname, I found that the hostname was typo'ed in /etc/hosts. Fixing /etc/hosts resolved this problem for me on RHEL 5.5.

