Bug 750786 - reboot failed after live migration on Windows 2008 R2 (SP1)
Summary: reboot failed after live migration on Windows 2008 R2 (SP1)
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: qemu-kvm
Version: 6.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Amos Kong
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Keywords:
Depends On:
Blocks:
 
Reported: 2011-11-02 11:37 UTC by Maurits van de Lande
Modified: 2015-05-25 00:06 UTC
CC List: 10 users

Clone Of:
Environment:
Dual AMD Opteron 6128 server
Two network bonds in mode 4 (802.3ad). One bond is used for KVM networking, the other for host access, drbd, and clustering.
Last Closed: 2012-01-11 22:54:22 UTC


Attachments
sample VM config file (2.07 KB, text/xml)
2011-11-02 11:37 UTC, Maurits van de Lande

Description Maurits van de Lande 2011-11-02 11:37:01 UTC
Created attachment 531319
sample VM config file

Description of problem:
After performing a live migration of a Windows Server 2008 R2 (SP1) VM, rebooting that VM fails.
With a Windows Server 2003 R2 (SP2) VM, this problem does not occur.


Version-Release number of selected component (if applicable):

qemu-kvm: version 0.12.1.2 release 2.160.el6_1.8
libvirt : version 0.8.7    release 18.el6_1.1
kernel  : version 2.6.32   release 131.17.1.el6
virtio-win on Windows 2003 R2 version 51.62.102.200 (10-8-2011)
virtio-win on Windows 2008 R2 version 61.62.102.200 (10-8-2011) or
                              version 6.0.209.605 (20-9-2010)

How reproducible:
always

Steps to Reproduce:
1. start VM on host1
2. perform a live migration to host2
3. Open virt-manager, log on to Windows on this VM, and reboot the VM from within Windows (rough commands for steps 1 and 2 are sketched below).
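
Roughly, steps 1 and 2 as virsh commands (the domain name and destination host name here are placeholders for my real ones):

 # step 1: start the VM on host1
 virsh start W2K8R2DC
 # step 2: live-migrate it to host2
 virsh migrate --live W2K8R2DC qemu+ssh://vmhost1b/system
 # step 3 is interactive: log on to the Windows guest in virt-manager
 # and restart it from the Start menu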
  
Actual results:
The VM shuts off.

Expected results:
The VM should reboot.

Additional info:
The cluster is a two-node cluster with a GFS2 filesystem and drbd83.

Comment 2 Dor Laor 2011-11-02 12:37:53 UTC
Does it happen consistently? What do you mean by fail? A crash? Any output?
About the storage: if you use GFS2, why the need for drbd? Can you detail it a bit more? What happens if NFS is used?

Comment 3 Maurits van de Lande 2011-11-02 13:06:20 UTC
>Does it happen consistently?
Yes it does.

>What do you mean by fail? A crash? Any output?
The VM does not reboot; it just stops running. The VM "shuts down" instead of restarting. It doesn't appear to crash: when I start the VM again I get no warnings. Before migrating, the VM restarts as expected after a reboot.

I do get a crash when I try to change a virtio-win network adapter property such as "Offload Tx IP checksum" AFTER a live migration. Before a migration I can change this property without the VM crashing.

>if you use GFS2, why the need for drbd?
I use drbd to synchronize the block devices used for GFS2.
On each server I have a partition /dev/sdb1; this partition is used to create a replicated block device between the two cluster nodes. On top of sdb1 a block device /dev/drbd0 is created. /dev/drbd0 is a PV for clustered LVM, and an LV on it is used for GFS2.
see: http://www.drbd.org/users-guide/ch-gfs.html
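
In outline, the stack on each node was built roughly like this (the VG, LV, and cluster names are placeholders for my real ones; the exact steps follow the drbd guide above):

 pvcreate /dev/drbd0                    # the replicated drbd device becomes an LVM PV
 vgcreate -cy vg_cluster /dev/drbd0     # -cy makes the VG clustered (needs clvmd)
 lvcreate -n lv_images -l 100%FREE vg_cluster
 mkfs.gfs2 -p lock_dlm -t mycluster:images -j 2 /dev/vg_cluster/lv_images   # one journal per node
 mount /dev/vg_cluster/lv_images /var/lib/libvirt/images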

>What happens if NFS is used?
I don't know, I don't use NFS.

I'll start a test without a virtio network adapter but with an e1000 adapter.

What can I do to help?

Comment 4 Maurits van de Lande 2011-11-02 13:20:48 UTC
> I'll start a test without a virtio network adapter but with an e1000 adapter.

When I use <model type='e1000'/> instead of <model type='virtio'/>, the VM doesn't crash after a migration when I change the "TCP checksum offload" property.
A reboot also works as expected; the VM doesn't shut down.

The problem appears to be "virtio" related.

Comment 6 Maurits van de Lande 2011-11-03 17:15:43 UTC
>I have done some more testing, and the problem appears to be in the vhost_net kernel module.

I found the following: 
http://www.redhat.com/archives/libvir-list/2011-March/msg00310.html

>     <dt><code>name</code></dt>
>      <dd>
>        The optional <code>name</code> attribute forces which type of
>        backend driver to use. The value can be either 'qemu' (a
>        user-space backend) or 'vhost' (a kernel backend, which
>        requires the vhost module to be provided by the kernel); an
>        attempt to require the vhost driver without kernel support
>        will be rejected.  If this attribute is not present, then the
>        domain defaults to 'vhost' if present, but silently falls back
>        to 'qemu' without error.
>        <span class="since">Since 0.8.8 (QEMU and KVM only)</span>
>      </dd>
>     <dt><code>txmode</code></dt>

When I start the VM with the qemu userspace network driver instead of the vhost kernel driver, live migration works fine.
So I added the following to the "interface" XML section in the VM configuration file:

 <driver name='qemu'/>
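
So the whole interface section now looks roughly like this (the MAC address and bridge name are placeholders for my real ones):

 <interface type='bridge'>
   <mac address='52:54:00:aa:bb:cc'/>
   <source bridge='br0'/>
   <model type='virtio'/>
   <driver name='qemu'/>
 </interface>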

modinfo vhost_net shows version 0.0.1

Is there a newer (fixed) version of vhost_net available?

Comment 8 Dor Laor 2011-12-08 11:23:02 UTC
Can you please try using NFS or iSCSI instead of GFS2/drbd?
Let's try to isolate it; there are potential issues that can come from the shared storage.

Comment 9 Amos Kong 2011-12-15 10:35:24 UTC
(In reply to comment #6)
> >I have done some more testing, and the problem appears to be in the vhost_net kernel module.
> 
> I found the following: 
> http://www.redhat.com/archives/libvir-list/2011-March/msg00310.html
> 
> >     <dt><code>name</code></dt>
> >      <dd>
> >        The optional <code>name</code> attribute forces which type of
> >        backend driver to use. The value can be either 'qemu' (a
> >        user-space backend) or 'vhost' (a kernel backend, which
> >        requires the vhost module to be provided by the kernel); an
> >        attempt to require the vhost driver without kernel support
> >        will be rejected.  If this attribute is not present, then the
> >        domain defaults to 'vhost' if present, but silently falls back
> >        to 'qemu' without error.
> >        <span class="since">Since 0.8.8 (QEMU and KVM only)</span>
> >      </dd>
> >     <dt><code>txmode</code></dt>
> 
> When I start the VM with the qemu userspace network driver instead of the vhost
> kernel driver, live migration works fine.
> So I added the following to the "interface" XML section in the VM configuration
> file:
> 
>  <driver name='qemu'/>
> 
> modinfo vhost_net shows version 0.0.1
> 
> Is there a newer (fixed) version of vhost_net available?

[root@f16 ~]# uname -r
3.2.0-rc1+
[root@f16 ~]# modinfo vhost_net |grep version
version:        0.0.1

The vhost_net version string hasn't changed upstream, but there have been many changes to vhost_net in both the upstream and RHEL kernels.

Could you help to test those two scenarios?
 NFS & Virtio_net & Vhost_net off
 NFS & Virtio_net & Vhost_net on

Could you also provide the qemu command line, qemu output, and any other error logs?
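
For example, something along these lines would cover the two scenarios (the image path, MAC address, and memory size are just examples):

 # scenario 1: NFS image, virtio_net, vhost off (userspace backend)
 /usr/libexec/qemu-kvm -m 2048 -smp 2 \
     -drive file=/var/lib/libvirt/images/W2K8R2DC-disk0,if=virtio \
     -netdev tap,id=hostnet0,vhost=off \
     -device virtio-net-pci,netdev=hostnet0,mac=52:54:00:aa:bb:cc

 # scenario 2: the same, but with the vhost kernel backend
 /usr/libexec/qemu-kvm -m 2048 -smp 2 \
     -drive file=/var/lib/libvirt/images/W2K8R2DC-disk0,if=virtio \
     -netdev tap,id=hostnet0,vhost=on \
     -device virtio-net-pci,netdev=hostnet0,mac=52:54:00:aa:bb:cc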

Comment 10 Maurits van de Lande 2011-12-15 11:01:35 UTC
>Could you help to test those two scenarios?
> NFS & Virtio_net & Vhost_net off
> NFS & Virtio_net & Vhost_net on
>
>Could you also provide the qemu command line, qemu output, and any other error logs?

I'll have to set up an NFS server first. I'll try to perform those tests in week 51.

Comment 11 Maurits van de Lande 2011-12-15 15:40:06 UTC
I used this guide to set up an NFS server:
http://aaronwalrath.wordpress.com/2011/03/18/configure-nfs-server-v3-and-v4-on-scientific-linux-6-and-red-hat-enterprise-linux-rhel-6/
 
When I try to start a VM using NFS, I get the following error:

[root@vmhost1a libvirt]# virsh create nfstest.xml
error: Failed to create domain from nfstest.xml
error: unable to set user and group to '107:107' on '/var/lib/libvirt/images/W2K8R2DC-disk0': Invalid argument

NFS is mounted on /var/lib/libvirt/images.

It resembles: https://bugzilla.redhat.com/show_bug.cgi?id=709454
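
If the problem is the chown being refused by a root-squashed export, I could try one of these workarounds (not verified yet; the export path and options are just an example):

 # on the NFS server: let root on the clients chown files
 echo '/exports/images  *(rw,sync,no_root_squash)' >> /etc/exports
 exportfs -ra

 # or, on each host: set dynamic_ownership = 0 in /etc/libvirt/qemu.conf
 # so libvirt stops changing image ownership, then restart libvirtd
 service libvirtd restart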

Comment 12 Amos Kong 2011-12-19 06:00:58 UTC
(In reply to comment #11)
> I used this guide to set up an NFS server:
> http://aaronwalrath.wordpress.com/2011/03/18/configure-nfs-server-v3-and-v4-on-scientific-linux-6-and-red-hat-enterprise-linux-rhel-6/
> 
> When I try to start a VM using NFS, I get the following error:
> 
> [root@vmhost1a libvirt]# virsh create nfstest.xml

Please attach your XML file.

> error: Failed to create domain from nfstest.xml
> error: unable to set user and group to '107:107' on
> '/var/lib/libvirt/images/W2K8R2DC-disk0': Invalid argument
> NFS is mounted on /var/lib/libvirt/images.
> 
> It resembles: https://bugzilla.redhat.com/show_bug.cgi?id=709454

It's a libvirt bug.

I am not clear about your test environment. Could you test with the qemu command line directly? Otherwise we will keep being blocked by other problems.

Comment 13 Maurits van de Lande 2011-12-19 15:07:13 UTC
>It's a libvirt bug.
>Could you test with the qemu command line directly?
>Otherwise we will keep being blocked by other problems.

OK, I have always used virsh. I'll try qemu instead.

I'll also upgrade the system to EL6.2 soon; it includes a newer libvirt release.

Comment 14 Maurits van de Lande 2012-01-11 21:35:35 UTC
I tested the live migration with the vhost_net driver enabled on EL6.2 (CentOS 6.2). This time it all worked perfectly. Even the live migration was noticeably faster than before.

It looks like this bug is solved.

