Created attachment 531319 [details]
sample VM config file

Description of problem:
After performing a live migration of a Windows Server 2008 R2 (SP1) VM, a reboot of that VM fails. With a Windows Server 2003 R2 (SP2) VM this problem does not occur.

Version-Release number of selected component (if applicable):
qemu-kvm: version 0.12.1.2 release 2.160.el6_1.8
libvirt:  version 0.8.7 release 18.el6_1.1
kernel:   version 2.6.32 release 131.17.1.el6
virtio-win on Windows 2003 R2: version 51.62.102.200 (10-8-2011)
virtio-win on Windows 2008 R2: version 61.62.102.200 (10-8-2011) or version 6.0.209.605 (20-9-2010)

How reproducible:
Always

Steps to Reproduce:
1. Start the VM on host1.
2. Perform a live migration to host2.
3. Open virt-manager, log on to Windows on this VM, and reboot the VM from within Windows.

Actual results:
The VM shuts off.

Expected results:
The VM should reboot.

Additional info:
The cluster is a two-node cluster with a GFS2 filesystem on top of drbd83.
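The live migration in step 2 is done with virsh, roughly like this (the domain name and destination URI below are examples, not the exact ones used):

  virsh migrate --live W2K8R2DC qemu+ssh://host2/system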
Does it happen consistently? What do you mean by fail? Crash? Any output? About the storage: if you use GFS2, why the need for drbd? Can you detail it a bit more? What happens if NFS is used?
> Does it happen consistently?
Yes it does.

> What do you mean by fail? Crash? Any output?
The VM does not reboot, it just stops running. The VM "shuts down" instead of restarting. It doesn't appear to crash, and when I start the VM again I get no warnings. Before migrating, the VM restarts as expected after a reboot.

I do get a crash when I try to change a virtio-win network adapter property such as "Offload Tx IP checksum" AFTER a live migration. Before a migration I can change this property without the VM crashing.

> if you use GFS2, why need for drbd?
I use drbd to synchronize the block devices used for GFS2. On each server I have a partition /dev/sdb1; this partition is used to create a replicated block device between the two cluster nodes. On top of sdb1 a block device /dev/drbd0 is created. /dev/drbd0 is a PV for clustered LVM, and an LV is used for GFS2. See the sketch below.
See: http://www.drbd.org/users-guide/ch-gfs.html

> What happens if NFS is used?
I don't know, I don't use NFS.

I'll start a test without a virtio network adapter but with an e1000 adapter instead. What can I do to help?
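Roughly, the storage stack is built up like this (the VG/LV names are illustrative, not the exact ones I use):

  # /dev/drbd0 is the replicated device on top of /dev/sdb1 on both nodes
  pvcreate /dev/drbd0
  vgcreate -c y vg_cluster /dev/drbd0
  lvcreate -L 100G -n lv_images vg_cluster
  mkfs.gfs2 -p lock_dlm -t mycluster:images -j 2 /dev/vg_cluster/lv_images
  mount /dev/vg_cluster/lv_images /var/lib/libvirt/images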
> I'll start a test without a virtio network adapter but with an e1000 adapter instead.

When I use <model type='e1000'/> instead of <model type='virtio'/>, the VM doesn't crash after a migration when I change the "TCP checksum offload" property. A reboot also works as expected: the VM doesn't shut down. The problem appears to be virtio related.
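For clarity, the only change is the model element inside the interface definition of the VM XML; roughly (the MAC address and bridge name are placeholders):

  <interface type='bridge'>
    <mac address='52:54:00:00:00:01'/>
    <source bridge='br0'/>
    <model type='e1000'/>   <!-- was: <model type='virtio'/> -->
  </interface>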
I have done some more testing, and the problem appears to be the vhost_net kernel module.

I found the following:
http://www.redhat.com/archives/libvir-list/2011-March/msg00310.html

> <dt><code>name</code></dt>
> <dd>
>   The optional <code>name</code> attribute forces which type of
>   backend driver to use. The value can be either 'qemu' (a
>   user-space backend) or 'vhost' (a kernel backend, which
>   requires the vhost module to be provided by the kernel); an
>   attempt to require the vhost driver without kernel support
>   will be rejected. If this attribute is not present, then the
>   domain defaults to 'vhost' if present, but silently falls back
>   to 'qemu' without error.
>   <span class="since">Since 0.8.8 (QEMU and KVM only)</span>
> </dd>
> <dt><code>txmode</code></dt>

When I start the VM with the qemu userspace network driver instead of the vhost kernel driver, live migration works fine. So I added the following to the "interface" XML section in the VM configuration file:

<driver name='qemu'/>

modinfo vhost_net shows version 0.0.1.

Is there a newer (fixed) version of vhost_net available?
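With that line added, the interface section looks roughly like this (MAC address and bridge name are placeholders again); the driver element forces the userspace qemu backend instead of vhost:

  <interface type='bridge'>
    <mac address='52:54:00:00:00:01'/>
    <source bridge='br0'/>
    <model type='virtio'/>
    <driver name='qemu'/>
  </interface>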
Can you please try using NFS or iSCSI instead of GFS2/drbd? Let's try to isolate it; there are potential issues that can come from the shared storage.
(In reply to comment #6)
> When I start the VM with the qemu userspace network driver instead of the
> vhost kernel driver, live migration works fine.
>
> modinfo vhost_net shows version 0.0.1.
>
> Is there a newer (fixed) version of vhost_net available?

[root@f16 ~]# uname -r
3.2.0-rc1+
[root@f16 ~]# modinfo vhost_net | grep version
version:        0.0.1

The vhost_net version string hasn't changed upstream, but there have been many changes to vhost_net both upstream and in the RHEL kernel.

Could you help test these two scenarios?
  NFS & virtio_net & vhost_net off
  NFS & virtio_net & vhost_net on

Could you also provide the qemu command line, the qemu output and any other error logs?
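To double-check which backend a running guest actually uses, something like this on the host should tell:

  lsmod | grep vhost_net                   # is the module loaded?
  ps -ef | grep qemu-kvm | grep vhost=on   # does the tap netdev have vhost enabled?
  ls -l /dev/vhost-net                     # is the vhost character device present?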
> Could you help test these two scenarios?
>   NFS & virtio_net & vhost_net off
>   NFS & virtio_net & vhost_net on
>
> Could you also provide the qemu command line, the qemu output and any other error logs?

I'll have to set up an NFS server first. I'll try to perform those tests in week 51.
I used this guide to set up an NFS server:
http://aaronwalrath.wordpress.com/2011/03/18/configure-nfs-server-v3-and-v4-on-scientific-linux-6-and-red-hat-enterprise-linux-rhel-6/

When I try to start a VM using NFS I get the following error:

[root@vmhost1a libvirt]# virsh create nfstest.xml
error: Failed to create domain from nfstest.xml
error: unable to set user and group to '107:107' on '/var/lib/libvirt/images/W2K8R2DC-disk0': Invalid argument

NFS is mounted on /var/lib/libvirt/images.

It resembles: https://bugzilla.redhat.com/show_bug.cgi?id=709454
(In reply to comment #11)
> I used this guide to set up an NFS server:
> http://aaronwalrath.wordpress.com/2011/03/18/configure-nfs-server-v3-and-v4-on-scientific-linux-6-and-red-hat-enterprise-linux-rhel-6/
>
> When I try to start a VM using NFS I get the following error:
>
> [root@vmhost1a libvirt]# virsh create nfstest.xml

Please attach your xml file.

> error: Failed to create domain from nfstest.xml
> error: unable to set user and group to '107:107' on
> '/var/lib/libvirt/images/W2K8R2DC-disk0': Invalid argument
>
> NFS is mounted on /var/lib/libvirt/images.
>
> It resembles: https://bugzilla.redhat.com/show_bug.cgi?id=709454

That is a libvirt bug. I am not clear about your test environment; could you test with the qemu command line directly? Otherwise we will keep getting blocked by other problems.
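As a side note, that kind of ownership error on NFS is often related to root_squash on the export; if you want to rule that out, exporting with no_root_squash is a common workaround (the export path below is only an example, adjust to your server):

  # /etc/exports on the NFS server
  /srv/nfs/images  *(rw,sync,no_root_squash)
  # then re-export:
  exportfs -ra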
> That is a libvirt bug.
> Could you test with the qemu command line directly? Otherwise we will keep
> getting blocked by other problems.

OK, I have always used virsh; I'll try qemu directly instead. I'll also upgrade the system to EL6.2 soon, which includes a newer libvirt release.
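For the direct qemu test I plan to use something along these lines (the disk path, MAC address and port are placeholders, not the exact configuration):

  # destination host: same command plus "-incoming tcp:0:4444"
  /usr/libexec/qemu-kvm -m 2048 -smp 2 \
    -drive file=/var/lib/libvirt/images/W2K8R2DC-disk0,if=virtio \
    -netdev tap,id=net0,vhost=on \
    -device virtio-net-pci,netdev=net0,mac=52:54:00:00:00:01 \
    -vnc :1
  # then in the qemu monitor on the source: migrate -d tcp:<dest-ip>:4444
  # for the vhost-off run, change vhost=on to vhost=off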
I tested the live migration with the vhost_net driver enabled on EL6.2 (CentOS 6.2). This time it all worked perfectly. The live migration was even noticeably faster than before. It looks like this bug is solved.