Bug 605617 - Windows XP guest hangs after migration on AMD platform
Summary: Windows XP guest hangs after migration on AMD platform
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.5
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Paolo Bonzini
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Duplicates: 684709 723854
Depends On: 622501
Blocks: 514491 KernelXenUpstreamHV
 
Reported: 2010-06-18 12:21 UTC by Qixiang Wan
Modified: 2011-11-08 16:41 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-11-08 16:41:59 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
xm dmesg log (20.67 KB, text/plain)
2010-06-18 12:28 UTC, Qixiang Wan
xend.log (248.44 KB, text/plain)
2010-06-18 12:28 UTC, Qixiang Wan
host dmidecode (17.72 KB, text/plain)
2010-06-18 12:34 UTC, Qixiang Wan

Description Qixiang Wan 2010-06-18 12:21:05 UTC
Description of problem:
The Windows XP guest hangs after migration on an AMD platform and does not respond when accessed via VNC. The problem does not occur on an Intel platform, and it does not occur with a Windows 2003 32-bit guest.

The problem is reproducible with both live and normal migration, and with both local and remote migration (the command variants are listed below).
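
For reference, assuming the guest is named 'xp' and <remote_host> is another RHEL 5.5 Xen host with xend relocation enabled, the four variants are:

$ xm migrate xp 127.0.0.1          # normal local migration
$ xm migrate -l xp 127.0.0.1       # live local migration
$ xm migrate xp <remote_host>      # normal remote migration
$ xm migrate -l xp <remote_host>   # live remote migration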

Version-Release number of selected component (if applicable):
RHEL-Server-5.5 x86_64 host with package:
kernel-2.6.18-194.el5
xen-3.0.3-105.el5

How reproducible:
95% (only 1 success in more than 20 attempts)

Steps to Reproduce:

1. Install a clean Windows XP 32-bit guest with no extra software such as PV drivers.

2. Create the guest:
$ cat xp.cfg 
name='xp'
maxmem = 1024
memory = 1024
vcpus = 1
builder = "hvm"
kernel = "/usr/lib/xen/boot/hvmloader"
boot = "c"
pae = 1
acpi = 1
apic = 1
localtime = 0
on_poweroff = "destroy"
on_reboot = "restart"
on_crash = "restart"
sdl = 0
vnc = 1
vncunused = 1
vnclisten = "0.0.0.0"
device_model = "/usr/lib64/xen/bin/qemu-dm"
disk = [ "file:/share/WinXP-32-hvm.raw,hda,w" ]

$ xm create xp.cfg

3. Migrate the guest:
$ xm migrate 1 127.0.0.1

After migration, the guest state may be '-b----' or 'r-----', but the guest does not respond when accessed via VNC. The same happens with 'migrate -l <DomU> <local_ip>' and with 'migrate [-l] <DomU> <remote_ip>'.
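
A quick way to see the state column is 'xm list'; the output below is only illustrative (IDs, memory, and times will differ):

$ xm list
Name                ID   Mem(MiB) VCPUs State   Time(s)
Domain-0             0       3456     4 r-----    212.1
xp                   2       1024     1 -b----     48.7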

$ ps aux | grep qemu | grep -v grep  # before migration
root      4539  4.9  0.3  91612 17364 ?        Rl   16:23   0:03 /usr/lib64/xen/bin/qemu-dm -d 1 -m 1024 -boot c -vcpus 1 -acpi -domain-name xp -net nic,vlan=1,macaddr=00:16:36:30:00:22,model=rtl8139 -net tap,vlan=1,bridge=xenbr0 -vnc 0.0.0.0:1 -vncunused

$ ps aux | grep qemu | grep -v grep  # after migration
root      4962  0.8  0.4  91596 17260 ?        Sl   16:32   0:00 /usr/lib64/xen/bin/qemu-dm -d 2 -m 1024 -boot c -vcpus 1 -acpi -domain-name xp -net nic,vlan=1,macaddr=00:16:36:30:00:22,model=rtl8139 -net tap,vlan=1,bridge=xenbr0 -vnc 0.0.0.0:2 -vncunused -loadvm /var/lib/xen/qemu-save-2.img

When performing a local migration, there is an error message in xend.log:
[...]
[2010-06-18 18:40:34 xend.XendDomainInfo 3557] DEBUG (XendDomainInfo:944) XendDomainInfo.completeRestore done
[2010-06-18 18:40:34 xend 3557] DEBUG (DevController:158) Waiting for devices vif.
[2010-06-18 18:40:34 xend 3557] DEBUG (DevController:158) Waiting for devices usb.
[2010-06-18 18:40:34 xend 3557] DEBUG (DevController:158) Waiting for devices vbd.
[2010-06-18 18:40:34 xend 3557] DEBUG (DevController:164) Waiting for 768.
[2010-06-18 18:40:34 xend.XendDomainInfo 3557] DEBUG (XendDomainInfo:1250) XendDomainInfo.handleShutdownWatch
[2010-06-18 18:40:34 xend 3557] DEBUG (DevController:509) hotplugStatusCallback /local/domain/0/backend/vbd/2/768/hotplug-status.
[2010-06-18 18:40:34 xend 3557] DEBUG (DevController:523) hotplugStatusCallback 5.
[2010-06-18 18:40:34 xend 3557] ERROR (XendCheckpoint:277) Device 768 (vbd) could not be connected.
File /share/WinXP-32-hvm.raw is loopback-mounted through /dev/loop0,
which is mounted in a guest domain,
and so cannot be mounted now.
Traceback (most recent call last):
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendCheckpoint.py", line 275, in restore
    dominfo.waitForDevices() # Wait for backends to set up
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 2371, in waitForDevices
    self.waitForDevices_(c)
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1429, in waitForDevices_
    return self.getDeviceController(deviceClass).waitForDevices()
  File "/usr/lib64/python2.4/site-packages/xen/xend/server/DevController.py", line 160, in waitForDevices
    return map(self.waitForDevice, self.deviceIDs())
  File "/usr/lib64/python2.4/site-packages/xen/xend/server/DevController.py", line 194, in waitForDevice
    raise VmError("Device %s (%s) could not be connected.\n%s" %
VmError: Device 768 (vbd) could not be connected.
File /share/WinXP-32-hvm.raw is loopback-mounted through /dev/loop0,
which is mounted in a guest domain,
and so cannot be mounted now.
[...]


Actual results:
guest hangs after migration

Expected results:
guest should work well after migration

Additional info:
1. Only happens on an AMD host.
2. Happens with a Windows XP 32-bit guest; does not happen with Windows 2003 32/64-bit guests.
3. Also tried save/restore of the guest: it may hang after two or more save/restore cycles, but there is no error message in xend.log. This also happens only on an AMD host with a Windows XP guest (a minimal save/restore loop is sketched below).
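
A minimal sketch of that loop, assuming the guest is named 'xp'; the save file path is only illustrative:

$ for i in 1 2 3; do xm save xp /var/lib/xen/xp-save.img; xm restore /var/lib/xen/xp-save.img; done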

Comment 1 Qixiang Wan 2010-06-18 12:28:04 UTC
Created attachment 425100 [details]
xm dmesg log

[1] create WinXP DomU (id=1)  
[2] migrate 1 127.0.0.1 (id=2)

DomU hangs after step 2 (95% reproducible)

[3] create WinXP DomU (id=3)
[4] xm save 3 vm-3.img
[5] xm restore vm-3.img (id=4)

DomU hangs after step 5 (easy to reproduce after repeating steps 4 and 5 twice)

Comment 2 Qixiang Wan 2010-06-18 12:28:42 UTC
Created attachment 425101 [details]
xend.log

[1] create WinXP DomU (id=1)  
[2] migrate 1 127.0.0.1 (id=2)

DomU hangs after step 2 (95% reproducible)

[3] create WinXP DomU (id=3)
[4] xm save 3 vm-3.img
[5] xm restore vm-3.img (id=4)

DomU hangs after step 5 (easy to reproduce after repeating steps 4 and 5 twice)

Comment 3 Qixiang Wan 2010-06-18 12:34:46 UTC
Created attachment 425104 [details]
host dmidecode

Comment 4 Michal Novotny 2010-06-18 13:47:54 UTC
Does it happen only with localhost migration, or with remote migration as well?

Michal

Comment 5 Qixiang Wan 2010-06-18 17:18:20 UTC
It happens with both localhost and remote migration, and with both live and normal migration.

Comment 6 Paolo Bonzini 2010-06-24 15:07:21 UTC
This error in xend.log is worth investigating a bit:

File /share/WinXP-32-hvm.raw is loopback-mounted through /dev/loop0,
which is mounted in a guest domain,
and so cannot be mounted now.
Traceback (most recent call last):
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendCheckpoint.py", line 275, in restore
    dominfo.waitForDevices() # Wait for backends to set up
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 2371, in waitForDevices
    self.waitForDevices_(c)
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1429, in waitForDevices_
    return self.getDeviceController(deviceClass).waitForDevices()
  File "/usr/lib64/python2.4/site-packages/xen/xend/server/DevController.py", line 160, in waitForDevices
    return map(self.waitForDevice, self.deviceIDs())
  File "/usr/lib64/python2.4/site-packages/xen/xend/server/DevController.py", line 194, in waitForDevice
    raise VmError("Device %s (%s) could not be connected.\n%s" %
VmError: Device 768 (vbd) could not be connected.
File /share/WinXP-32-hvm.raw is loopback-mounted through /dev/loop0,
which is mounted in a guest domain, and so cannot be mounted now.

That said, do the guests have PV drivers installed?  If not, and if the PV drivers help, this bug would decrease in importance immensely.  The main point is that it is almost impossible to see where the guest is stuck (windbg obviously cannot read Xen core dumps).

Comment 7 Qixiang Wan 2010-06-28 09:31:58 UTC
(In reply to comment #6)
> ...
> That said, do the guests have PV drivers installed?  
No PV drivers are installed.

> If not, and if the PV drivers help
Yes, the PV drivers helped. Migration (both normal and live) succeeds after the PV drivers are installed, but I'm still confused about why it fails only on an AMD host, and why Windows 2003 32/64-bit guests can migrate successfully on an AMD host without PV drivers.

Comment 8 Paolo Bonzini 2010-06-28 11:44:15 UTC
Even upstream doesn't have working migration of non-PV Windows guests, so I think this bug should be very low priority.

Investigation of Intel vs. AMD behavior can still be interesting, so for now I'm leaving the bug open.

Comment 11 Paolo Bonzini 2011-01-18 12:30:00 UTC
The error in xend.log (comment 6) suggests this is a dup of bug 622501, or at least blocked by that bug.  To be revisited after that patch is acked.

Comment 12 Paolo Bonzini 2011-03-14 12:20:47 UTC
*** Bug 684709 has been marked as a duplicate of this bug. ***

Comment 13 Paolo Bonzini 2011-03-14 12:22:25 UTC
Revisiting the bug.  Apparently it still happens even with the patch for bug 622501.

The patch at http://permalink.gmane.org/gmane.comp.emulators.xen.devel/94955 could help; we probably want that patch anyway.

Comment 14 Miroslav Rezanina 2011-03-14 14:34:48 UTC
Qixiang,
can you please test with the package at:

https://brewweb.devel.redhat.com/taskinfo?taskID=3176319

Thanks

Comment 16 Andrew Jones 2011-03-14 18:13:57 UTC
I tried the patch pointed out in comment 13, but it didn't help. I also turned on vlapic timer debug output (xen.gz command line 'hvm_debug=128') and tried to find some clues, but all I found was likely a red herring. I saw that the 32-bit winxp guest switches the divisor to 1 from the default of 2. However, after a restore the default of 2 is restored when the vcpu is reset. I hacked the HV to not allow the divisor to be changed, thus leaving it 2, but that didn't help either...
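
For reference, hvm_debug is a Xen hypervisor boot option, so it goes on the xen.gz line in grub.conf. A minimal sketch; the kernel version, file names, and root device below are only illustrative and will differ per host:

title Red Hat Enterprise Linux Server (2.6.18-248.el5xen)
        root (hd0,0)
        kernel /xen.gz-2.6.18-248.el5 hvm_debug=128
        module /vmlinuz-2.6.18-248.el5xen ro root=/dev/VolGroup00/LogVol00
        module /initrd-2.6.18-248.el5xen.img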

Comment 17 Andrew Jones 2011-03-18 17:58:06 UTC
Some notes from testing on an AMD family 10h machine with

xen-libs-3.0.3-126.el5
kernel-xen-2.6.18-248.el5
xen-3.0.3-126.el5

* doesn't reproduce every time, but more frequently than not
   - I did have runs where save/restore worked repeatedly, but also runs where it didn't
* the clock still ticks (seen with 'xm list') when "moving" the mouse on a frozen guest's VNC console, i.e. the cursor doesn't follow the mouse, but the clock ticks as if something is happening
   - a couple of times after a restore I saw that the cursor had even moved
* the guest got a BSOD 0x000000B8 once
* there doesn't seem to be anything different in 'xm dmesg' between the working and non-working cases (but I need to double-check that)

Adding Frank to CC for any AMD insight he may have.

Comment 18 Paolo Bonzini 2011-03-19 09:59:39 UTC
This looks like a bug in the AMD processor driver inside Windows.  To get a definitive answer, check in your guest for a C:\Windows\MEMORY.DMP file.  If you have it, we can check what caused the BSOD (which is a pretty rare error code, and almost always due to a programming error).

Comment 19 Andrew Jones 2011-04-04 12:21:34 UTC
This guest doesn't survive save/restore on a more recent Xen host either. The very first save/restore I tried on SLES (4.0.1_21326_06-0.4.1, 2.6.32.27-0.2-xen) gave me a BSOD 0x0000007F. I allowed the memory to dump and it created a C:\Windows\MEMORY.DMP file. Then I tried another save/restore round and got the same BSOD.

Paolo, if you'd like to take a look at the memory.dmp file, it's in my scratch dir: ~drjones/memory.zip.

Comment 21 Paolo Bonzini 2011-05-30 14:17:49 UTC
See also http://xenbits.xen.org/hg/staging/xen-unstable.hg/rev/17235
svm: Reported SS.DPL must equal CPL, as this is assumed by generic HVM

http://lists.xensource.com/archives/html/xen-devel/2008-03/msg00295.html
mentions that this fixes a failure when migrating non-PV-on-HVM guests on AMD.

Comment 25 RHEL Program Management 2011-08-04 04:12:04 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 27 Jarod Wilson 2011-08-23 13:58:48 UTC
Patch(es) available in kernel-2.6.18-282.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

Comment 29 Paolo Bonzini 2011-08-29 13:57:36 UTC
Can you get a BSOD by trying multiple times?  Analyzing a MEMORY.DMP from there would be easier.  However, I can try with xm dump-core too.
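
For reference, a guest core can be captured from dom0 with xm dump-core; the domain name and output path below are only illustrative:

$ xm dump-core xp /var/lib/xen/dump/xp-core.dump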

Comment 30 Qixiang Wan 2011-08-30 07:44:09 UTC
(In reply to comment #29)
> Can you get a BSOD by trying multiple times?  Analyzing a MEMORY.DMP from there
> would be easier.  However, I can try with xm dump-core too.

I failed to get a BSOD with WinXP or Win2k3; I will find a different AMD processor and try again.

Comment 31 Paolo Bonzini 2011-08-30 15:09:18 UTC
*** Bug 723854 has been marked as a duplicate of this bug. ***

