Bug 750786

Summary: reboot failed after live migration on Windows 2008 R2 (SP1)

Product: Red Hat Enterprise Linux 6
Component: qemu-kvm
Version: 6.1
Hardware: x86_64
OS: Linux
Status: CLOSED NOTABUG
Severity: medium
Priority: unspecified
Target Milestone: rc
Target Release: ---
Reporter: Maurits van de Lande <m.vandelande>
Assignee: Amos Kong <akong>
QA Contact: Virtualization Bugs <virt-bugs>
Docs Contact:
CC: acathrow, ailan, bsarathy, juzhang, michen, mkenneth, m.vandelande, rhod, tburke, virt-maint
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment: Dual AMD Opteron 6128 server. Two network bonds in mode 4 (802.3ad); one bond is used for KVM networking, the other for host access, drbd and clustering.
Last Closed: 2012-01-11 22:54:22 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments: sample VM config file (flags: none)

Description Maurits van de Lande 2011-11-02 11:37:01 UTC
Created attachment 531319 [details]
sample VM config file

Description of problem:
After performing a live migration of a Windows Server 2008 R2 (SP1) VM, a reboot of that VM fails.
When using a Windows Server 2003 R2 (SP2) VM, this problem does not occur.


Version-Release number of selected component (if applicable):

qemu-kvm: version 0.12.1.2 release 2.160.el6_1.8
libvirt : version 0.8.7    release 18.el6_1.1
kernel  : version 2.6.32   release 131.17.1.el6
virtio-win on Windows 2003 R2 version 51.62.102.200 (10-8-2011)
virtio-win on Windows 2008 R2 version 61.62.102.200 (10-8-2011) or
                              version 6.0.209.605 (20-9-2010)

How reproducible:
always

Steps to Reproduce:
1. Start the VM on host1.
2. Perform a live migration to host2 (for example with virsh, as sketched below).
3. Open virt-manager, log on to Windows on this VM, and reboot the VM from within Windows.
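
A minimal sketch of step 2 using virsh (the domain name and destination URI here are placeholders, not taken from this setup):

 virsh migrate --live W2K8R2DC qemu+ssh://host2/system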
  
Actual results:
The VM shuts off.

Expected results:
The VM should reboot.

Additional info:
The cluster is a two-node cluster with a GFS2 filesystem and drbd83.

Comment 2 Dor Laor 2011-11-02 12:37:53 UTC
Does it happen consistently? What do you mean by fail? A crash? Any output?
About the storage: if you use GFS2, why the need for drbd? Can you detail it a bit more? What happens if NFS is used?

Comment 3 Maurits van de Lande 2011-11-02 13:06:20 UTC
>Does it happen consistently?
Yes it does.

>What do you mean by fail? A crash? Any output?
The VM does not reboot; it just stops running. The VM "shuts down" instead of restarting. It doesn't appear to crash; when I start the VM again I get no warnings. Before migrating, the VM restarts as expected after a reboot.

I do get a crash when I try to change a virtio-win network adapter property like "Offload Tx IP checksum" AFTER a live migration. Before a migration I can change this property without the VM crashing.

>if you use GFS2, why the need for drbd?
I use drbd to synchronize the block devices used for GFS2.
On each server I have a partition /dev/sdb1, which is used to create a replicated block device between the two cluster nodes: on top of sdb1 a block device /dev/drbd0 is created. /dev/drbd0 is a PV for clustered LVM, and an LV on top of it is used for GFS2 (a sketch of the stack follows the link below).
see: http://www.drbd.org/users-guide/ch-gfs.html
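
A minimal sketch of that stack on one node (the drbd resource name, VG/LV names and the cluster:fs-name lock table are placeholders, not taken from this setup):

 # /dev/sdb1 backs the drbd resource; bringing it up exposes /dev/drbd0
 drbdadm up r0
 # use the replicated device as a PV in a clustered volume group
 pvcreate /dev/drbd0
 vgcreate -cy vg_cluster /dev/drbd0
 lvcreate -n lv_images -l 100%FREE vg_cluster
 # one GFS2 journal per node for a two-node cluster
 mkfs.gfs2 -p lock_dlm -t mycluster:images -j 2 /dev/vg_cluster/lv_images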

>What happens if NFS is used?
I don't know, I don't use NFS.

I'll start a test without a virtio network adapter but with an e1000 adapter.

What can I do to help?

Comment 4 Maurits van de Lande 2011-11-02 13:20:48 UTC
> I'll start a test without a virtio network adapter but with an e1000 adapter.

When I use <model type='e1000'/> instead of <model type='virtio'/>, the VM doesn't crash after a migration when I change the "TCP checksum offload" property.
A reboot also works as expected; the VM doesn't shut down.

The problem appears to be "virtio" related.

Comment 6 Maurits van de Lande 2011-11-03 17:15:43 UTC
I have done some more testing, and the problem appears to be the vhost_net kernel module.

I found the following: 
http://www.redhat.com/archives/libvir-list/2011-March/msg00310.html

>     <dt><code>name</code></dt>
>      <dd>
>        The optional <code>name</code> attribute forces which type of
>        backend driver to use. The value can be either 'qemu' (a
>        user-space backend) or 'vhost' (a kernel backend, which
>        requires the vhost module to be provided by the kernel); an
>        attempt to require the vhost driver without kernel support
>        will be rejected.  If this attribute is not present, then the
>        domain defaults to 'vhost' if present, but silently falls back
>        to 'qemu' without error.
>        <span class="since">Since 0.8.8 (QEMU and KVM only)</span>
>      </dd>
>     <dt><code>txmode</code></dt>

When I start the VM with the qemu user-space network driver instead of the vhost kernel driver, live migration works fine.
So I added the following to the <interface> XML section in the VM configuration file:

 <driver name='qemu'/>
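
For context, a minimal sketch of how the full <interface> element might look with that line added (the MAC address and bridge name are placeholders, not taken from this configuration):

 <interface type='bridge'>
   <mac address='52:54:00:aa:bb:cc'/>
   <source bridge='br0'/>
   <model type='virtio'/>
   <driver name='qemu'/>
 </interface>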

modinfo vhost_net shows version 0.0.1

Is there a newer (fixed) version of vhost_net available?
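
As an aside, one way to check whether vhost is actually in use for a running guest (assuming nothing about this particular setup): when vhost is active the kernel spawns a worker thread named after the qemu process's PID, visible with:

 ps ax | grep vhost
 # e.g.  1234 ?  S  0:00 [vhost-1230]   (1230 being the qemu-kvm PID)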

Comment 8 Dor Laor 2011-12-08 11:23:02 UTC
Can you please try using NFS or iSCSI instead of GFS2/drbd?
Let's try to isolate it; there are potential issues that can come from the shared storage.

Comment 9 Amos Kong 2011-12-15 10:35:24 UTC
(In reply to comment #6)
> I have done some more testing, and the problem appears to be the vhost_net kernel module.
> 
> I found the following: 
> http://www.redhat.com/archives/libvir-list/2011-March/msg00310.html
> 
> >     <dt><code>name</code></dt>
> >      <dd>
> >        The optional <code>name</code> attribute forces which type of
> >        backend driver to use. The value can be either 'qemu' (a
> >        user-space backend) or 'vhost' (a kernel backend, which
> >        requires the vhost module to be provided by the kernel); an
> >        attempt to require the vhost driver without kernel support
> >        will be rejected.  If this attribute is not present, then the
> >        domain defaults to 'vhost' if present, but silently falls back
> >        to 'qemu' without error.
> >        <span class="since">Since 0.8.8 (QEMU and KVM only)</span>
> >      </dd>
> >     <dt><code>txmode</code></dt>
> 
> When I start the VM with the qemu user-space network driver instead of the
> vhost kernel driver, live migration works fine.
> So I added the following to the <interface> XML section in the VM
> configuration file:
> 
>  <driver name='qemu'/>
> 
> modinfo vhost_net shows version 0.0.1
> 
> Is there a newer (fixed) version of vhost_net available?

[root@f16 ~]# uname -r
3.2.0-rc1+
[root@f16 ~]# modinfo vhost_net |grep version
version:        0.0.1

The vhost_net module version hasn't changed upstream, but there have been many changes to vhost_net in both the upstream and RHEL kernels.

Could you help test these two scenarios (a sketch of a suitable qemu command line follows below)?
 NFS & virtio_net & vhost_net off
 NFS & virtio_net & vhost_net on

Could you also provide the qemu command line, qemu output and any other error logs?
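
As a reference point, a minimal sketch of a qemu-kvm command line for these tests (the memory size, disk path and MAC address are placeholders; vhost is toggled with the vhost= option on the tap netdev):

 qemu-kvm -m 2048 -smp 2 \
     -drive file=/var/lib/libvirt/images/W2K8R2DC-disk0,if=virtio,cache=none \
     -netdev tap,id=hostnet0,vhost=on \
     -device virtio-net-pci,netdev=hostnet0,mac=52:54:00:aa:bb:cc \
     -vnc :1

Setting vhost=off (or omitting the option) selects the user-space backend instead.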

Comment 10 Maurits van de Lande 2011-12-15 11:01:35 UTC
>Could you help to test those two scenarios?
> NFS & Virtio_net & Vhost_net off
> NFS & Virtio_net & Vhost_net on
>
>Could you help to provide qemu commandline, qemu output and other error log?

I'll have to set up an NFS server first. I'll try to perform those tests in week 51.

Comment 11 Maurits van de Lande 2011-12-15 15:40:06 UTC
I used this guide to set up an NFS server:
http://aaronwalrath.wordpress.com/2011/03/18/configure-nfs-server-v3-and-v4-on-scientific-linux-6-and-red-hat-enterprise-linux-rhel-6/
 
When I try to start a VM using NFS I get the following error:

[root@vmhost1a libvirt]# virsh create nfstest.xml
error: Failed to create domain from nfstest.xml
error: unable to set user and group to '107:107' on '/var/lib/libvirt/images/W2K8R2DC-disk0': Invalid argument

NFS is mounted on /var/lib/libvirt/images.
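For reference, the mount was done along these lines (the server name and export path are placeholders, not taken from this setup):

 mount -t nfs nfsserver:/exports/images /var/lib/libvirt/images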

It resembles: https://bugzilla.redhat.com/show_bug.cgi?id=709454

Comment 12 Amos Kong 2011-12-19 06:00:58 UTC
(In reply to comment #11)
> I used this guide to setup an nfs server :
> http://aaronwalrath.wordpress.com/2011/03/18/configure-nfs-server-v3-and-v4-on-scientific-linux-6-and-red-hat-enterprise-linux-rhel-6/
> 
> When I try to start a VM using nfs I get the following error
> 
> [root@vmhost1a libvirt]# virsh create nfstest.xml

Please attach your XML file.

> error: Failed to create domain from nfstest.xml
> error: unable to set user and group to '107:107' on
> '/var/lib/libvirt/images/W2K8R2DC-disk0': Invalid argument
> NFS is mounted on /var/lib/libvirt/images.
> 
> It resembles: https://bugzilla.redhat.com/show_bug.cgi?id=709454

It's a libvirt bug.

I am not clear about your test environment. Could you help test with the qemu command line directly? Otherwise we will keep being blocked by other problems.

Comment 13 Maurits van de Lande 2011-12-19 15:07:13 UTC
>It's a libvirt bug.
>Could you help test with the qemu command line
>directly? Otherwise we will keep being blocked by other problems.

Okay, I have always used virsh. I'll try qemu directly instead.

I'll also upgrade the system to EL6.2 soon; it includes a newer libvirt release.

Comment 14 Maurits van de Lande 2012-01-11 21:35:35 UTC
I tested live migration with the vhost_net driver enabled on EL6.2 (CentOS 6.2). This time it all worked perfectly, and the live migration itself was even noticeably faster than before.

It looks like this bug is solved.