Description of problem:
A Windows XP guest hangs after migration on an AMD platform and does not respond when accessed via VNC. The problem does not occur on an Intel platform, and it does not occur with a Win2003 32-bit guest. It is reproducible with both live and normal migration, and with both local and remote migration.

Version-Release number of selected component (if applicable):
RHEL-Server-5.5 x86_64 host with packages:
kernel-2.6.18-194.el5
xen-3.0.3-105.el5

How reproducible:
95% (only saw 1 success against more than 20 failures)

Steps to Reproduce:
1. Install a clean Windows XP 32-bit guest, with no extra applications such as PV drivers.
2. Create the guest:

$ cat xp.cfg
name='xp'
maxmem = 1024
memory = 1024
vcpus = 1
builder = "hvm"
kernel = "/usr/lib/xen/boot/hvmloader"
boot = "c"
pae = 1
acpi = 1
apic = 1
localtime = 0
on_poweroff = "destroy"
on_reboot = "restart"
on_crash = "restart"
sdl = 0
vnc = 1
vncunused = 1
vnclisten = "0.0.0.0"
device_model = "/usr/lib64/xen/bin/qemu-dm"
disk = [ "file:/share/WinXP-32-hvm.raw,hda,w" ]

$ xm create xp.cfg

3. Migrate the guest:

$ xm migrate 1 127.0.0.1

After migration the guest state may be '-b----' or 'r-----', but it does not respond when accessed via VNC. The same happens with 'migrate -l <DomU> <local_ip>' and 'migrate [-l] <DomU> <remote_ip>'.

$ ps aux | grep qemu | grep -v grep    # before migration
root  4539  4.9  0.3  91612 17364 ?  Rl  16:23  0:03 /usr/lib64/xen/bin/qemu-dm -d 1 -m 1024 -boot c -vcpus 1 -acpi -domain-name xp -net nic,vlan=1,macaddr=00:16:36:30:00:22,model=rtl8139 -net tap,vlan=1,bridge=xenbr0 -vnc 0.0.0.0:1 -vncunused

$ ps aux | grep qemu | grep -v grep    # after migration
root  4962  0.8  0.4  91596 17260 ?  Sl  16:32  0:00 /usr/lib64/xen/bin/qemu-dm -d 2 -m 1024 -boot c -vcpus 1 -acpi -domain-name xp -net nic,vlan=1,macaddr=00:16:36:30:00:22,model=rtl8139 -net tap,vlan=1,bridge=xenbr0 -vnc 0.0.0.0:2 -vncunused -loadvm /var/lib/xen/qemu-save-2.img

When performing a local migration, there is an error message in xend.log:

[...]
[2010-06-18 18:40:34 xend.XendDomainInfo 3557] DEBUG (XendDomainInfo:944) XendDomainInfo.completeRestore done
[2010-06-18 18:40:34 xend 3557] DEBUG (DevController:158) Waiting for devices vif.
[2010-06-18 18:40:34 xend 3557] DEBUG (DevController:158) Waiting for devices usb.
[2010-06-18 18:40:34 xend 3557] DEBUG (DevController:158) Waiting for devices vbd.
[2010-06-18 18:40:34 xend 3557] DEBUG (DevController:164) Waiting for 768.
[2010-06-18 18:40:34 xend.XendDomainInfo 3557] DEBUG (XendDomainInfo:1250) XendDomainInfo.handleShutdownWatch
[2010-06-18 18:40:34 xend 3557] DEBUG (DevController:509) hotplugStatusCallback /local/domain/0/backend/vbd/2/768/hotplug-status.
[2010-06-18 18:40:34 xend 3557] DEBUG (DevController:523) hotplugStatusCallback 5.
[2010-06-18 18:40:34 xend 3557] ERROR (XendCheckpoint:277) Device 768 (vbd) could not be connected.
File /share/WinXP-32-hvm.raw is loopback-mounted through /dev/loop0,
which is mounted in a guest domain,
and so cannot be mounted now.
Traceback (most recent call last):
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendCheckpoint.py", line 275, in restore
    dominfo.waitForDevices() # Wait for backends to set up
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 2371, in waitForDevices
    self.waitForDevices_(c)
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1429, in waitForDevices_
    return self.getDeviceController(deviceClass).waitForDevices()
  File "/usr/lib64/python2.4/site-packages/xen/xend/server/DevController.py", line 160, in waitForDevices
    return map(self.waitForDevice, self.deviceIDs())
  File "/usr/lib64/python2.4/site-packages/xen/xend/server/DevController.py", line 194, in waitForDevice
    raise VmError("Device %s (%s) could not be connected.\n%s" %
VmError: Device 768 (vbd) could not be connected.
File /share/WinXP-32-hvm.raw is loopback-mounted through /dev/loop0,
which is mounted in a guest domain,
and so cannot be mounted now.
[...]

Actual results:
The guest hangs after migration.

Expected results:
The guest should work well after migration.

Additional info:
1. Only happens on an AMD host.
2. Happens with a Windows XP 32-bit guest; does not happen with Win2003 32/64-bit guests.
3. Also tried save/restore of the guest: it may hang after being saved/restored twice, but there is no error message in xend.log. This also only happened on the AMD host with the Windows XP guest.
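For the remote-migration variant mentioned above, a quick sanity check on the destination host is to confirm that xend relocation is enabled. This is only a hedged sketch; the config values shown in the comments are illustrative defaults, not taken from this report:

    # Check the relocation settings in /etc/xen/xend-config.sxp on the
    # destination host; typical lines look like:
    #   (xend-relocation-server yes)
    #   (xend-relocation-port 8002)
    #   (xend-relocation-hosts-allow '')
    grep xend-relocation /etc/xen/xend-config.sxp
    # Only needed if the file was changed:
    service xend restart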
Created attachment 425100 [details]
xm dmesg log

[1] Create a WinXP DomU (id=1)
[2] migrate 1 127.0.0.1 (id=2)
    DomU hangs after step 2 (95% reproducible)
[3] Create a WinXP DomU (id=3)
[4] xm save 3 vm-3.img
[5] xm restore vm-3.img (id=4)
    DomU hangs after step 5 (easy to reproduce after repeating steps 4 and 5 twice)
Created attachment 425101 [details]
xend.log

[1] Create a WinXP DomU (id=1)
[2] migrate 1 127.0.0.1 (id=2)
    DomU hangs after step 2 (95% reproducible)
[3] Create a WinXP DomU (id=3)
[4] xm save 3 vm-3.img
[5] xm restore vm-3.img (id=4)
    DomU hangs after step 5 (easy to reproduce after repeating steps 4 and 5 twice)
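For the save/restore path in steps [3]-[5], a hedged repro loop along these lines should trigger the hang within a couple of iterations (assumptions: the domain is named 'xp' and /var/lib/xen has room for the save image; neither detail is taken from this report):

    # Repeat save/restore a few times; on the AMD host the guest usually
    # freezes by the second iteration.
    for i in 1 2 3; do
        xm save xp /var/lib/xen/vm-xp.img
        xm restore /var/lib/xen/vm-xp.img
        sleep 60    # give the restored guest time to settle before checking VNC
    done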
Created attachment 425104 [details] host dmidecode
Does it happen only on localhost migration or remote migration as well? Michal
It happens with both localhost and remote migration, and with both live and normal migration.
This error in xend.log is worth investigating a bit:

File /share/WinXP-32-hvm.raw is loopback-mounted through /dev/loop0,
which is mounted in a guest domain,
and so cannot be mounted now.
Traceback (most recent call last):
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendCheckpoint.py", line 275, in restore
    dominfo.waitForDevices() # Wait for backends to set up
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 2371, in waitForDevices
    self.waitForDevices_(c)
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1429, in waitForDevices_
    return self.getDeviceController(deviceClass).waitForDevices()
  File "/usr/lib64/python2.4/site-packages/xen/xend/server/DevController.py", line 160, in waitForDevices
    return map(self.waitForDevice, self.deviceIDs())
  File "/usr/lib64/python2.4/site-packages/xen/xend/server/DevController.py", line 194, in waitForDevice
    raise VmError("Device %s (%s) could not be connected.\n%s" %
VmError: Device 768 (vbd) could not be connected.
File /share/WinXP-32-hvm.raw is loopback-mounted through /dev/loop0,
which is mounted in a guest domain,
and so cannot be mounted now.

That said, do the guests have PV drivers installed? If not, and if the PV drivers help, the importance of this bug would decrease immensely. The main point is that it is almost impossible to see where the guest is stuck (windbg obviously cannot read Xen core dumps).
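If it helps, a minimal way to check for the race the error message suggests is below (assumptions: dom0 shell access; the image path and the vbd backend path are taken from the hotplug log line above):

    # If the old domain still holds the loop device while the restored
    # domain's vbd is being connected, the image shows up here.
    losetup -a | grep WinXP-32-hvm.raw
    # Inspect the vbd backend nodes for leftover entries from the old domain.
    xenstore-ls /local/domain/0/backend/vbd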
(In reply to comment #6)
> ...
> That said, do the guests have PV drivers installed?

No PV drivers are installed.

> If not, and if the PV drivers help

Yes, the PV drivers helped. Migration (both normal and live) succeeds after the PV drivers are installed, but I'm still confused why it failed only on the AMD host, and why Win2003 32/64-bit guests can migrate successfully on the AMD host without PV drivers.
Even upstream doesn't have working migration of non-PV Windows guests, so I think this bug should be very low priority. Investigation of Intel vs. AMD behavior can still be interesting, so for now I'm leaving the bug open.
The error in xend.log (comment 6) suggests this is a dup of bug 622501, or at least blocked by that bug. To be revisited after that patch is acked.
*** Bug 684709 has been marked as a duplicate of this bug. ***
Revisiting the bug. Apparently it still happens even with the patch for bug 622501. The patch at http://permalink.gmane.org/gmane.comp.emulators.xen.devel/94955 could help; we probably want that patch anyway.
Qixiang, can you please test with the package at: https://brewweb.devel.redhat.com/taskinfo?taskID=3176319 Thanks
I tried the patch pointed out in comment 13, but it didn't help. I also turned on vlapic timer debug output (xen.gz command line 'hvm_debug=128') and tried to find some clues, but all I found was likely a red herring. I saw that the 32-bit winxp guest switches the vlapic timer divisor to 1 from the default of 2. However, after a restore the default of 2 is restored when the vcpu is reset. I hacked the HV to not allow the divisor to be changed, thus leaving it at 2, but that didn't help either...
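For anyone retrying this, here is a hedged sketch of how the vlapic debug output was presumably collected (assumption: hvm_debug=128 appended to the xen.gz line in /boot/grub/menu.lst, followed by a host reboot; the grep pattern is a guess at the message format):

    # Hypervisor boot option (on the xen.gz line, not the vmlinuz line):
    #   hvm_debug=128
    # After reboot, the extra vlapic timer messages end up in the Xen log:
    xm dmesg | grep -i vlapic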
Some notes from testing on an AMD family 10h machine with:

xen-libs-3.0.3-126.el5
kernel-xen-2.6.18-248.el5
xen-3.0.3-126.el5

* It doesn't reproduce every time, but more frequently than not - I did have runs where s/r worked repeatedly, but also runs where it didn't.
* The clock still ticks (seen with 'xm list') when "moving" the mouse on a frozen guest's VNC, i.e. the cursor doesn't move with the mouse, but the clock ticks like something is happening - a couple of times after a restore I saw the cursor had even moved.
* The guest got a BSOD 0x000000B8 once.
* There doesn't seem to be anything different in 'xm dmesg' between the working and non-working cases (but I need to double check that).

Adding Frank to CC for any AMD insight he may have.
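Presumably "the clock still ticks" above refers to the Time(s) column in 'xm list'; a hedged way to watch it while the guest's VNC console is frozen (assumption: the domain is named 'xp'):

    # The Time(s) column keeps growing even though the VNC console does not
    # respond, i.e. the domain is still being scheduled.
    watch -n 5 "xm list xp"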
This looks like a bug in the AMD processor driver inside Windows. To get a definitive answer, check in your guest for a C:\Windows\MEMORY.DMP file. If you have it, we can check what caused the BSOD (which is a pretty rare error code, and almost always due to a programming error).
This guest doesn't s/r on a more recent Xen host either. The very first s/r I tried on SLES (4.0.1_21326_06-0.4.1, 2.6.32.27-0.2-xen) gave me a BSOD 0x0000007F. I allowed the memory to dump and it created a C:\Windows\MEMORY.DMP file. Then I tried another s/r round and got the same BSOD. Paolo, if you'd like to take a look at the memory.dmp file, it's in my scratch dir, ~drjones/memory.zip.
See also http://xenbits.xen.org/hg/staging/xen-unstable.hg/rev/17235
"svm: Reported SS.DPL must equal CPL, as this is assumed by generic HVM"

http://lists.xensource.com/archives/html/xen-devel/2008-03/msg00295.html mentions it fixes a failure when migrating non-PV-on-HVM guests on AMD.
Paolo posted http://post-office.corp.redhat.com/archives/rhkernel-list/2011-July/msg00608.html
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Patch(es) available in kernel-2.6.18-282.el5 You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5 Detailed testing feedback is always welcomed.
Can you get a BSOD by trying multiple times? Analyzing a MEMORY.DMP from there would be easier. However, I can try with xm dump-core too.
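For the xm dump-core route, a minimal sketch (assumptions: the hung Windows XP domain currently has id 2 and /var/tmp has space for the 1 GB image; both details are illustrative):

    # Capture the guest memory image of the hung domain for offline analysis.
    xm dump-core 2 /var/tmp/xp-hung.core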
(In reply to comment #29)
> Can you get a BSOD by trying multiple times? Analyzing a MEMORY.DMP from there
> would be easier. However, I can try with xm dump-core too.

I failed to get a BSOD with winxp and win2k3; I will find a different AMD processor and try again.
*** Bug 723854 has been marked as a duplicate of this bug. ***