Description of problem:
A Windows XP VM stalls after live migration.

Version-Release number of selected component (if applicable):
oVirt 3.4.1
Hypervisor nodes: FC20
Hypervisor kernel: 3.14.5-200
qemu 1.6.2
Spice 0.12.4

How reproducible:
100%

Steps to Reproduce:
1. Configure a Windows XP VM
1a. Virtio network card / Virtio system disk / SPICE display
2. Start the XP VM
3. Log on to the VM via the SPICE console
4. Open an Explorer window
5. Wait 10 minutes
6. Migrate the VM
7. Open the SPICE console
8. Try to open the Start menu

Actual results:
The user cannot open the Start menu, and the already open Explorer window can no longer be dragged.

Expected results:
The VM should work normally.

Additional info:
This is a spin-off of BZ1104697. With this bug the reason for the VM stall should be analyzed. The original bug is for analysis of the slow SPICE console that can be seen during the tests.
Created attachment 907354 [details] vdsm.log
Created attachment 907359 [details] video
Video attached. The VM has an XP telnet server installed. As you can see, the machine can still respond to network packets. Nevertheless, in its stalled state it does not allow logging in to the telnet server (which is possible when the machine is in a normal state).
A few further tests revealed a situation where at least the Task Manager was still active and responsive in the SPICE console. Nevertheless, it did not provide any updates.
Created attachment 907818 [details] taskmanager of stalled vm
My observations led to the conclusion that it must somehow be related to guest clock handling. So I installed the old Windows 98 executable choice.exe. This prompts for an input and exits without input after a predefined time, e.g.:

  choice.exe /t:y,3 InputSomething

- Will prompt "InputSomething[Y,N]?"
- Allows the user to press the Y or N key
- Will end after 3 seconds if no input is given

After start of the VM the program behaves as expected: it ends after 3 seconds. Once the machine has gone into the pathologic state (after X migrations), the time trigger no longer works. Nevertheless the program still reacts to user input.

To further nail things down I followed several pieces of advice about guest timing issues. My XP SP3 VM has the following additional settings active (see the sketch of both settings below):
- The parameter /usepmtimer has been added to the boot.ini entry for system start.
- HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Services\Processor\Start has been set to a value of 4. This avoids processor idling; as a result the VM CPU shows 100% in oVirt.

Nevertheless, neither setting helps to mitigate the problem.
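For reference, a minimal sketch of what the two guest-side settings above look like; the ARC path in boot.ini is only an example for a default XP SP3 install, and the .reg file name is just a placeholder:

  boot.ini (append /usepmtimer to the existing OS entry):
  [operating systems]
  multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Microsoft Windows XP Professional" /noexecute=optin /fastdetect /usepmtimer

  processor-start.reg (sets the Processor driver's Start value to 4 = disabled):
  Windows Registry Editor Version 5.00

  [HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Services\Processor]
  "Start"=dword:00000004

These simply reproduce the settings described in this comment; as noted above, they did not mitigate the problem.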
I don't really know why, but inspired by the threads linked below I added a hook that modifies the libvirt XML to contain:

  ...
  <timer name='rtc' track='guest'/>
  ...

This can be achieved with the following Python vdsm hook (it assumes the first <timer> element in the domain XML is the rtc timer):

  #!/usr/bin/python
  import os
  import sys
  import traceback
  import hooking

  # Read the domain XML prepared by vdsm, mark the (first) timer
  # element as tracking the guest clock, and write the XML back.
  domxml = hooking.read_domxml()
  t = domxml.getElementsByTagName('timer')[0]
  t.setAttribute('track', 'guest')
  hooking.write_domxml(domxml)

This makes qemu get called with the extended command line "-rtc clock=vm,..." (a quick way to verify this is sketched below).

Afterwards I created a test script on the engine that migrates the VM every 30 seconds:

  #!/bin/bash
  i=0
  while [ 1 -eq 1 ]; do
      ovirt-shell -c -E "action vm colvm36 migrate"
      i=`expr $i + 1`
      echo Machine migrated $i times.
      sleep 30
  done

While I'm writing these lines my XP VM has passed 57 consecutive online migrations without any problem. This is a stellar jump compared to the usual hangups after 5-10 online migrations in the default configuration.

Links:
http://stackoverflow.com/archive/html/qemu-devel/2009-10/msg00762.html
http://stackoverflow.com/questions/17784178/qemu-failed-in-loadvm-while-the-guest-system-is-windows-xp
http://lists.gnu.org/archive/html/qemu-devel/2009-10/msg00762.html
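A rough sketch of how to double-check on the destination host that the hook took effect; "colvm36" is just the VM name from the test script above, adjust as needed:

  # Check the effective qemu command line for the rtc clock mode
  ps -ef | grep [q]emu | grep -o 'clock=[a-z]*'

  # Or inspect the live domain XML for the timer element
  virsh -r dumpxml colvm36 | grep -A 3 '<clock'

If the hook was applied, the first command should report clock=vm and the second should show the <timer name='rtc' track='guest'/> line.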
P.S. This XP VM is a domain member.
Similar bug where qemu parametrization could be enhanced: BZ1110305
Do you see the same improvement in bug 1110305 after your modification?
In contrast to BZ1110305, I can confirm that setting the track=guest option dramatically improves the stability of the VM during live migrations. Sorry for having no further technical explanation.

What makes this bug different from BZ1110305:
- The relax option of the hypervisor will not affect Windows XP VM behaviour (as far as I understand).
- This bug is about live migrations.
- BZ1110305 is about running VMs that hit a BSOD due to high load.
VDSM patch posted for review
Markus, thanks for the extensive investigation and for the excellent BZ entry!
Thanks for the quick fix. Just to make sure I understand it correctly: the patches will allow setting a "hyperv" flag, and with that two switches are enabled:
- the qemu relax_hv option => to improve Windows 7 stability
- the qemu clock track=guest option => to improve Windows XP migration stability

So we make no distinction between XP and Win7 and simply activate these switches for all Windows VMs?
This is correct. These patches are part of a series which will collectively improve the hyperv support. This is why all the settings are toggled by the new 'hypervEnable' boolean. All the new settings go in the direction of the libvirt recommended settings, so they should be good for all Windows versions. Moreover, but I need to check, I'm not sure the engine distinguishes between Windows releases, e.g. WinXP vs Win7.
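For context, a rough sketch of the libvirt domain XML the two switches correspond to, as I read the libvirt documentation; the exact XML emitted by the vdsm patches may differ:

  <features>
    <hyperv>
      <relaxed state='on'/>
    </hyperv>
  </features>

  <clock offset='variable' adjustment='0' basis='utc'>
    <timer name='rtc' track='guest'/>
  </clock>

On the qemu command line these should roughly show up as hv_relaxed among the -cpu flags and clock=vm in the -rtc option, the latter matching the hook result from the earlier comment.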
Thanks for the clarification. Are the patches VDSM-only? If so, should I file a new BZ to enable those features in the engine?
Engine support is required (and already in the works) to fully resolve this BZ. I don't think you need a separate one.
VDSM support merged. The new code will be transparently enabled for Windows guests once this patch gets merged: http://gerrit.ovirt.org/#/c/29238/
Engine support merged in both master and 3.5:

engine master: http://gerrit.ovirt.org/#/c/29238/
engine 3.5.0: http://gerrit.ovirt.org/#/c/30188/

VDSM master:
http://gerrit.ovirt.org/#/c/27619/
http://gerrit.ovirt.org/#/c/29233/

It turns out the VDSM patch was merged after 3.5 branched. Posted backports:
http://gerrit.ovirt.org/#/c/30254/
http://gerrit.ovirt.org/#/c/30255/
Verified on:
ovirt-engine 3.5 RC1
vdsm-4.16.1-6.gita4a4614.el6.x86_64
oVirt 3.5 has been released and should include the fix for this issue.