Bug 289841 - Xen Live Migration of x86_64 HVM guests causes target dom0/Hypervisor to crash
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.1
Hardware: All
OS: Linux
Priority: urgent
Severity: medium
Target Milestone: rc
Assigned To: Chris Lalancette
QA Contact: Martin Jenner
Keywords: GSSApproved, ZStream
Blocks: 424241
Reported: 2007-09-13 15:14 EDT by Jan Mark Holzer
Modified: 2008-08-18 17:02 EDT
CC: 5 users

Doc Type: Bug Fix
Last Closed: 2008-08-18 17:02:38 EDT


Attachments: None
Description Jan Mark Holzer 2007-09-13 15:14:25 EDT
Description of problem:

Using Xen's (live) migration to move an x86_64 HVM guest to another dom0/HV will
cause the target dom0/HV to crash as soon as the guest has been migrated.

Version-Release number of selected component (if applicable):

Guest:

RHEL4u4 x86_64
RHEL3u9 x86_64

Host :
------
2.6.18-45.el5xen #1 SMP Tue Sep 4 17:16:05 EDT 2007 x86_64

xen-libs-3.0.3-38.el5
python-virtinst-0.103.0-3.el5
libvirt-0.2.3-9.el5
virt-top-0.3.2.3-1
kernel-xen-2.6.18-45.el5
kmod-gfs-xen-0.1.19-4.el5
kmod-gfs-xen-0.1.18-2.el5
xen-libs-3.0.3-38.el5
virt-manager-0.4.0-3.el5
libvirt-0.2.3-9.el5
libvirt-python-0.2.3-9.el5
kernel-xen-2.6.18-44.el5
xen-3.0.3-38.el5


How reproducible:

Start an x86_64 HVM guest and use "xm migrate --live HVMguest Targetdom0".

Steps to Reproduce:
1. Start an HVM x86_64 guest
2. Initiate live migration (xm migrate --live ....)
3. Watch the guest appear at the new target host (xm li/virsh list)
4. As soon as the guest has ballooned up to its memory footprint the
   target host / HV will crash

Actual results:

Target hypervisor/dom0 crashes upon migration of a x86_64 guest

Expected results:

Successfully migrate an x86_64 HVM guest between hosts/hypervisors

Additional info:

Tried the same migration with a number of i386 guests (rhel3u9/rhel4u4/Windows
2003/AS2.1) and they all worked fine.

Hosts used for testing (and still available) are buzz/woodie
Comment 1 Chris Lalancette 2007-09-27 11:52:43 EDT
OK, from discussions with Jan, it seems like this is limited to a subset of
machines.  It also seems to be limited to a particular direction, i.e.
migrating a guest *from* a Caneland *to* a Woodcrest.  I captured a stack
trace; it looks like this:

(XEN) ----[ Xen-3.1.0-45.el5  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    2
(XEN) RIP:    e008:[<ffff83000016b97f>] shadow_set_l1e+0x2f/0x1b0
(XEN) RFLAGS: 0000000000010246   CONTEXT: hypervisor
(XEN) rax: 0000000000000000   rbx: ffff8180801bf198   rcx: 000000000011422c
(XEN) rdx: 00000000788f9067   rsi: ffff8180801bf198   rdi: ffff8300002de080
(XEN) rbp: ffff8300002de080   rsp: ffff830000fefc58   r8:  ffff830000fefe38
(XEN) r9:  0000000000000006   r10: 0000000000000001   r11: ffff8300002fc080
(XEN) r12: 0000000000000000   r13: ffff830000feff28   r14: 000000000011422c
(XEN) r15: ffff8180801bf198   cr0: 000000008005003b   cr4: 00000000000026b0
(XEN) cr3: 000000000114d000   cr2: ffff8180801bf198
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen stack trace from rsp=ffff830000fefc58:
(XEN)    00000000788f9067 ffff8180801bf198 ffff8300002de080 ffff8300002fc080
(XEN)    ffff830000feff28 0000010037e33f38 0000000000037e00 ffff8300001708f9
(XEN)    0000000000000000 0000000000000000 00002aaaae2b4000 ffff8803d929d400
(XEN)    ffff830000fefdb8 00002aaaae2b3000 00000000000788f9 000000000011422c
(XEN)    0000000600000000 ffff8300002fd1c0 000000000011422d ffff8140a0602000
(XEN)    000000000000114e ffff8300002fcd00 ffff830000fdc0d8 0000000000000008
(XEN)    ffff8300002fcd00 0000002000000001 ffff8300001c0180 0000000000000003
(XEN)    ffff8300001c5180 ffff8300002a3098 ffff8300002a3080 000000f900000000
(XEN)    ffff83000011acfc 000000000000e008 0000000000000293 ffff830000fefd80
(XEN)    0000000000000202 0000000000002000 ffff8300002fcd78 0000000000000003
(XEN)    ffff8300002de080 ffff8300001f0800 0000000000000000 0000000000000000
(XEN)    0000010037e33f38 ffff830078a6c010 ffff830112041000 ffff830111e42df8
(XEN)    0000000000000000 0000000037e33067 0000000000078a6c 0000000000112041
(XEN)    0000000000111e42 ffffffffffffffff 0000000000002000 ffff8300002de080
(XEN)    ffff83000012a1a9 ffff83000011a6d0 800000011422c063 0000000000000002
(XEN)    00000000788f9067 ffff8300001be180 0000012d0851b708 0000010037e33f38
(XEN)    0000000080000b0e ffff8300002de080 ffff830000feff28 0000000000000000
(XEN)    0000000000000000 ffff83000016437a ffff8300001f1800 ffff830000155805
(XEN)    ffff8300002df778 00000000000000ef 0000000000000000 ffff830000155805
(XEN)    ffff8300002df830 ffff8300002de080 ffff8300002fc080 ffff83000014f14c
(XEN) Xen call trace:
(XEN)    [<ffff83000016b97f>] shadow_set_l1e+0x2f/0x1b0
(XEN)    [<ffff8300001708f9>] sh_page_fault__shadow_4_guest_4+0x8d9/0xeb0
(XEN)    [<ffff83000011acfc>] migrate_timer+0x17c/0x1a0
(XEN)    [<ffff83000012a1a9>] context_switch+0xb19/0xb60
(XEN)    [<ffff83000011a6d0>] add_entry+0x100/0x130
(XEN)    [<ffff83000016437a>] vmx_vmexit_handler+0x32a/0x16f0
(XEN)    [<ffff830000155805>] vlapic_has_interrupt+0x35/0x60
(XEN)    [<ffff830000155805>] vlapic_has_interrupt+0x35/0x60
(XEN)    [<ffff83000014f14c>] cpu_has_pending_irq+0x2c/0x60
(XEN)    [<ffff83000016575f>] vmx_asm_vmexit_handler+0x1f/0x30
(XEN)    
(XEN) Pagetable walk from ffff8180801bf198:
(XEN)  L4[0x103] = 000000000114e063 5555555555555555
(XEN)  L3[0x002] = 000000011422e063 000000000031dc2e
(XEN)  L2[0x000] = 000000011422d063 000000000031dc2d 
(XEN)  L1[0x1bf] = 800000011422c063 000000000031dc2c
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 2:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0009]
(XEN) Faulting linear address: ffff8180801bf198
(XEN) ****************************************
(XEN) 
(XEN) Reboot in five seconds...
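
For reference, the L4/L3/L2/L1 indices in the pagetable walk above fall
straight out of the faulting linear address.  A minimal standalone sketch
(nothing Xen-specific here, just the standard x86_64 4-level paging shifts;
the program itself is illustrative):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t va = 0xffff8180801bf198ULL;  /* faulting address from the trace */

    /* x86_64 4-level paging: 9 index bits per level, 12-bit page offset */
    printf("L4[0x%03lx]\n", (unsigned long)((va >> 39) & 0x1ff));  /* 0x103 */
    printf("L3[0x%03lx]\n", (unsigned long)((va >> 30) & 0x1ff));  /* 0x002 */
    printf("L2[0x%03lx]\n", (unsigned long)((va >> 21) & 0x1ff));  /* 0x000 */
    printf("L1[0x%03lx]\n", (unsigned long)((va >> 12) & 0x1ff));  /* 0x1bf */
    return 0;
}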

Chris Lalancette
Comment 2 Chris Lalancette 2007-10-16 19:44:15 EDT
OK, I did some more digging on this.  It looks like the crash is at
shadow_set_l1e+0x2f, which corresponds to:

arch/x86/mm/shadow/multi.c:shadow_set_l1e():

(gdb) list *(shadow_set_l1e+0x2f)
0xffff8300001763ef is in shadow_set_l1e (multi.c:1115).
1110        int flags = 0;
1111        struct domain *d = v->domain;
1112        shadow_l1e_t old_sl1e;
1113        ASSERT(sl1e != NULL);
1114        
1115        old_sl1e = *sl1e;
1116
1117        if ( old_sl1e.l1 == new_sl1e.l1 ) return 0; /* Nothing to do */
1118        
1119        if ( (shadow_l1e_get_flags(new_sl1e) & _PAGE_PRESENT)

It's crashing on the dereference of sl1e, which is passed in.  It was called
from sh_page_fault(), via the ptr_sl1e variable.  %rsi holds the faulting
address, which also shows up in the pagetable walk at the end, so sl1e was
clearly filled in with a bogus address.
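
To make the failure mode concrete, here is a minimal standalone sketch
(illustrative only, not the Xen source: shadow_l1e_t is stubbed, the
_sketch names are made up, and ASSERT is modeled with assert(3)).  It shows
why the NULL check at multi.c:1113 can't catch this crash: the pointer is
non-NULL, it just points at an unmapped shadow-linear-map address.

#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t l1; } shadow_l1e_t;   /* stand-in for Xen's type */

static int shadow_set_l1e_sketch(shadow_l1e_t *sl1e, shadow_l1e_t new_sl1e)
{
    shadow_l1e_t old_sl1e;

    assert(sl1e != NULL);   /* passes: ptr_sl1e is non-NULL... */
    old_sl1e = *sl1e;       /* ...but if it points at an unmapped slot
                             * (here, ffff8180801bf198), this read is the
                             * FATAL PAGE FAULT from comment 1. */

    if (old_sl1e.l1 == new_sl1e.l1)
        return 0;           /* nothing to do */
    return 1;
}

int main(void)
{
    shadow_l1e_t entry = { .l1 = 0x788f9067 };  /* new_sl1e value, per %rdx */
    /* With a valid pointer this is harmless; in the crash, sl1e was the
     * bogus shadow-linear-map address instead of something like &entry. */
    return shadow_set_l1e_sketch(&entry, entry);
}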

Chris Lalancette
Comment 3 RHEL Product and Program Management 2007-10-16 22:25:01 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 7 Bill Burns 2007-12-07 11:07:46 EST
Changed component to kernel-xen.
Comment 10 Chris Lalancette 2007-12-18 17:09:11 EST
Heh.  Well, there is good news and bad news on this bug.  The good news is that
with the 3.1.2 hypervisor, the dom0 crash no longer happens.  The bad news is
that the migration fails completely, and the guest is sort of left in limbo (on
the source machine, the qemu-dm process dies, but the guest is still alive in
the hypervisor's eyes).  This will need further debugging.

Chris Lalancette
Comment 11 Bill Burns 2008-04-01 10:08:58 EDT
Clearing blocker flag as we do not feel this should block RHEL 5.2.
Comment 13 RHEL Product and Program Management 2008-06-09 18:00:34 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 14 Chris Lalancette 2008-08-18 09:08:14 EDT
Jan,
    Do you know if this is still an issue?  I'm thinking this is probably fixed in 5.2, but I'm not certain.

Thanks,
Chris Lalancette
Comment 15 Jan Mark Holzer 2008-08-18 17:01:17 EDT
Hi Chris,

I just re-ran my tests between Woodcrest and Caneland systems, and indeed the guest now migrates without a problem.
So I'd say it's fixed :)

- Jan
Comment 16 Chris Lalancette 2008-08-18 17:02:38 EDT
Jan,
    Great, thanks.  I'll mark this as CURRENTRELEASE.

Chris Lalancette
