Bug 1107835 - Windows XP VM hangs after live migration
Summary: Windows XP VM hangs after live migration
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: oVirt
Classification: Retired
Component: ovirt-engine-webadmin
Version: 3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.5.0
Assignee: Francesco Romani
QA Contact: Ilanit Stein
URL:
Whiteboard: virt
Depends On:
Blocks: 1083529
Reported: 2014-06-10 18:26 UTC by Markus Stockhausen
Modified: 2016-02-10 19:49 UTC

Fixed In Version: ovirt-3.5.0-beta2
Clone Of:
Environment:
Last Closed: 2014-10-17 12:35:13 UTC
oVirt Team: Virt
Embargoed:


Attachments
vdsm.log (98.94 KB, text/plain)
2014-06-10 18:28 UTC, Markus Stockhausen
video (1.99 MB, video/x-ms-wmv)
2014-06-10 18:50 UTC, Markus Stockhausen
taskmanager of stalled vm (2.03 MB, video/x-ms-wmv)
2014-06-11 19:26 UTC, Markus Stockhausen


Links
System | ID | Private | Priority | Status | Summary | Last Updated
oVirt gerrit | 29233 | 0 | master | MERGED | vm: hyperv: make hw clock friendlier to windows | Never
oVirt gerrit | 30255 | 0 | ovirt-3.5 | MERGED | vm: hyperv: make hw clock friendlier to windows | Never

Description Markus Stockhausen 2014-06-10 18:26:56 UTC
Description of problem:

A Windows XP VM stalls after live migration


Version-Release number of selected component (if applicable):

oVirt 3.4.1
Hypervisor nodes: FC20
Hypervisor kernel: 3.14.5-200
qemu 1.6.2
Spice 0.12.4

How reproducible:

100%

Steps to Reproduce:
1. Configure Windows XP VM
1a. Virtio network card / Virtio system disk / SPICE display
2. Start XP VM
3. Log on to VM via SPICE console
4. Open an Explorer window
5. Wait 10 minutes
6. Migrate VM
7. Open SPICE console
8. Try to open the Start menu

Actual results:

User cannot open the Start menu, and the open Explorer window can no longer be dragged.

Expected results:

VM should work normally

Additional info:

This is a spin-off of BZ1104697. This bug is for analyzing the reason for the VM stall. The original bug is for analysis of the slow SPICE console that can be seen during the tests.

Comment 1 Markus Stockhausen 2014-06-10 18:28:33 UTC
Created attachment 907354
vdsm.log

Comment 2 Markus Stockhausen 2014-06-10 18:50:46 UTC
Created attachment 907359
video

Comment 3 Markus Stockhausen 2014-06-10 18:53:33 UTC
Video attached.

The VM has an XP telnet server installed. As you can see, the machine still responds to network packets. Nevertheless, in its stalled state it does not allow logging in to the telnet server (which is possible when the machine is in its normal state).

Comment 4 Markus Stockhausen 2014-06-11 19:25:20 UTC
A few further tests revealed a situation where at least the Task Manager was still active and responsive in the SPICE console. Nevertheless, it did not show any updates.

Comment 5 Markus Stockhausen 2014-06-11 19:26:08 UTC
Created attachment 907818
taskmanager of stalled vm

Comment 6 Markus Stockhausen 2014-06-14 18:14:48 UTC
My observations led to the conclusion that it must be somehow related to guest clock handling. So I installed the old Windows 98 executable choice.exe. It prompts for input and ends without input after a predefined time.

E.g.

choice.exe /t:y,3 InputSomething

- Will prompt "InputSomething[Y,N]?"
- Allows the user to input Y or N key
- Will end after 3 seconds if no input is given

After the VM starts, the program behaves as expected: it ends after 3 seconds. Once the machine has gone into the pathological state (after X migrations), the time trigger no longer works. Nevertheless, the program still reacts to user input.
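
The same kind of check can be sketched in Python, assuming a Python 3 interpreter is available inside the guest (nothing in this report says one was installed). The function names, the 3 second wait and the 1 second tolerance are illustrative choices, not part of the original test:

```python
import time


def timed_wait(seconds, sleep=time.sleep, clock=time.monotonic):
    """Wait for `seconds` and return the elapsed time reported by `clock`."""
    start = clock()
    sleep(seconds)
    return clock() - start


def timer_looks_healthy(seconds=3.0, tolerance=1.0, **kwargs):
    """Return True if the wait took roughly as long as was requested."""
    elapsed = timed_wait(seconds, **kwargs)
    return abs(elapsed - seconds) <= tolerance
```

Calling timer_looks_healthy() before and after a migration gives a rough comparison. If the guest timers have stopped firing, the call would presumably never return, just like the choice.exe timeout, and that hang would be the symptom itself.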

To narrow things down further I followed several pieces of advice about guest timing issues. My XP SP3 VM has the following additional settings active:

- Parameter /usepmtimer has been added to the boot.ini entry for system start

- HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Services\Processor\Start has been set to a value of 4. This avoids processor idling, and in oVirt the VM CPU is always at 100%

Nevertheless, neither setting helps to mitigate the problem.

Comment 7 Markus Stockhausen 2014-06-14 21:02:25 UTC
Don't really know why, but inspired by the threads linked below I added a hook to modify the libvirt XML to

  ...
  <timer name='rtc' track='guest'/>
  ...

This can be achieved with the following Python script:

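#!/usr/bin/python
# VDSM hook: let the guest RTC track the guest clock.
import hooking

domxml = hooking.read_domxml()
# Select the rtc timer by name instead of assuming it is the first <timer>.
for t in domxml.getElementsByTagName('timer'):
    if t.getAttribute('name') == 'rtc':
        t.setAttribute('track', 'guest')
hooking.write_domxml(domxml)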
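
The change the hook makes can be tried outside VDSM with a minimal sketch like the one below. It assumes that hooking.read_domxml() hands back an xml.dom.minidom document. The sample domain XML is a made-up, heavily trimmed example, and selecting the timer by its name attribute is a choice made for this sketch:

```python
from xml.dom import minidom

# Hypothetical, trimmed domain XML; a real one from libvirt is much larger.
SAMPLE_DOMXML = """<domain type='kvm'>
  <clock offset='variable' adjustment='0'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
  </clock>
</domain>"""


def set_rtc_track_guest(domxml):
    """Set track='guest' on every <timer name='rtc'>; return how many changed."""
    changed = 0
    for timer in domxml.getElementsByTagName('timer'):
        if timer.getAttribute('name') == 'rtc':
            timer.setAttribute('track', 'guest')
            changed += 1
    return changed


dom = minidom.parseString(SAMPLE_DOMXML)
count = set_rtc_track_guest(dom)
# Print the resulting clock section so the change can be inspected.
print(count, dom.getElementsByTagName('clock')[0].toxml())
```

Only the rtc timer gains the track attribute; the pit timer in the sample is left untouched.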

With this hook, qemu is called with the extended command line "-rtc clock=vm,...". Afterwards I wrote a test script on the engine that migrates the VM every 30 seconds via:

#!/bin/bash
# Migrate the VM every 30 seconds and count the runs.
i=0
while true; do
  ovirt-shell -c -E "action vm colvm36 migrate"
  i=$((i + 1))
  echo "Machine migrated $i times."
  sleep 30
done

While I'm writing these lines my XP VM has passed 57 consecutive live migrations without any problem. This is a stellar jump versus the usual hangs after 5-10 live migrations in the default configuration.
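
For longer unattended runs, the loop could also be written so that it stops on the first failure. This is only a sketch. It reuses the ovirt-shell command line from the script above and assumes, without having verified it, that ovirt-shell signals a failed action through a non-zero exit status. The function name and its parameters are made up for illustration:

```python
import subprocess
import time

# Command line taken from the shell script above; colvm36 is the reporter's VM.
MIGRATE_CMD = ["ovirt-shell", "-c", "-E", "action vm colvm36 migrate"]


def migrate_repeatedly(max_runs, pause=30, run=subprocess.call, sleep=time.sleep):
    """Trigger up to max_runs migrations; stop early on a non-zero exit code.

    Returns the number of migrations that were triggered successfully.
    """
    done = 0
    for _ in range(max_runs):
        if run(MIGRATE_CMD) != 0:
            break
        done += 1
        print("Machine migrated %d times." % done)
        sleep(pause)
    return done
```

A call such as migrate_repeatedly(100) would then end either after 100 runs or as soon as a migration request is rejected. Whether the guest itself still responds has to be checked separately, for example through the SPICE console.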

Links: 
http://stackoverflow.com/questions/17784178/qemu-failed-in-loadvm-while-the-guest-system-is-windows-xp
http://lists.gnu.org/archive/html/qemu-devel/2009-10/msg00762.html

Comment 8 Markus Stockhausen 2014-06-14 21:03:47 UTC
P.S. This XP VM is a domain member.

Comment 9 Markus Stockhausen 2014-06-17 19:46:43 UTC
Similar bug where qemu parametrization could be enhanced: BZ1110305

Comment 10 Michal Skrivanek 2014-06-25 14:41:46 UTC
Do you see the same improvement in bug 1110305 after your modification?

Comment 11 Markus Stockhausen 2014-06-25 16:57:27 UTC
In contrast to BZ1110305, I can confirm that setting the track=guest option dramatically improves the stability of the VM during live migrations. Sorry for having no further technical explanation.

What makes this bug different from BZ1110305:

- the hypervisor's relax option will not affect Windows XP VM behaviour (as far as I understand)

- this bug is about live migrations

- BZ1110305 is about running VMs that give BSOD due to high load.

Comment 12 Francesco Romani 2014-06-26 07:28:39 UTC
VDSM patch posted for review

Comment 13 Francesco Romani 2014-06-26 07:30:24 UTC
Markus, thanks for the extensive investigation and for the excellent BZ entry!

Comment 14 Markus Stockhausen 2014-06-26 07:49:27 UTC
Thanks for the quick fix. Just to understand it right:

The patches will allow setting a "hyperv" flag. With it, two switches are enabled:

- qemu relax_hv option => To improve Windows 7 stability
- qemu clock track=guest option => To improve Windows XP migration stability.

So we make no difference between XP and Win7 and simply activate these switches for all Windows VMs?

Comment 15 Francesco Romani 2014-06-26 08:03:01 UTC
This is correct. These patches are part of a series which will collectively improve the hyperv support. This is why all the settings are toggled by the new 'hypervEnable' boolean.

All the new settings are going in the direction of the libvirt recommended settings, so they should be good for all Windows versions.

Moreover, though I need to check, I'm not sure Engine distinguishes between Windows releases, e.g. WinXP vs Win7.
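
To make the two switches concrete, here is a minimal sketch of how they appear in libvirt domain XML: a relaxed element under features/hyperv, and track='guest' on the rtc timer. This is not the VDSM patch. The sample domain and the helper names are invented for illustration, and the real code path driven by 'hypervEnable' may differ:

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed domain definition used only for illustration.
SAMPLE = """<domain type='kvm'>
  <features><acpi/></features>
  <clock offset='variable' adjustment='0'>
    <timer name='rtc' tickpolicy='catchup'/>
  </clock>
</domain>"""


def get_or_create(parent, tag):
    """Return the first child called `tag`, creating it if it is missing."""
    child = parent.find(tag)
    if child is None:
        child = ET.SubElement(parent, tag)
    return child


def apply_windows_settings(domain):
    """Enable hyperv 'relaxed' and let the rtc timer track the guest clock."""
    hyperv = get_or_create(get_or_create(domain, 'features'), 'hyperv')
    get_or_create(hyperv, 'relaxed').set('state', 'on')
    for timer in domain.findall("clock/timer[@name='rtc']"):
        timer.set('track', 'guest')
    return domain


domain = apply_windows_settings(ET.fromstring(SAMPLE))
print(ET.tostring(domain, encoding='unicode'))
```

Existing elements such as acpi are kept as they are, so the function can be applied to a domain that already carries other features.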

Comment 16 Markus Stockhausen 2014-06-26 08:15:59 UTC
Thanks for the clarification. 

The patches are VDSM only? If so, should I file a new BZ to enable those features in the engine?

Comment 17 Francesco Romani 2014-06-26 08:19:28 UTC
Engine support is required (and already in the works) to fully resolve this BZ.
I don't think you need a separate one.

Comment 18 Francesco Romani 2014-07-15 08:42:27 UTC
VDSM support merged. The new code will be transparently enabled for windows guests once this patch gets merged http://gerrit.ovirt.org/#/c/29238/

Comment 19 Francesco Romani 2014-07-18 08:32:36 UTC
Engine support merged in both master and 3.5:

engine master:
http://gerrit.ovirt.org/#/c/29238/

engine 3.5.0
http://gerrit.ovirt.org/#/c/30188/


VDSM master:
http://gerrit.ovirt.org/#/c/27619/
http://gerrit.ovirt.org/#/c/29233/

It turns out the VDSM patch was merged after 3.5 branched.
Posted backports:
http://gerrit.ovirt.org/#/c/30254/
http://gerrit.ovirt.org/#/c/30255/

Comment 20 Ilanit Stein 2014-08-12 09:19:19 UTC
Verified on ovirt-engine 3.5 - rc1
vdsm vdsm-4.16.1-6.gita4a4614.el6.x86_64

Comment 21 Sandro Bonazzola 2014-10-17 12:35:13 UTC
oVirt 3.5 has been released and should include the fix for this issue.

