Bug 289841

Summary: Xen live migration of x86_64 HVM guests causes target dom0/hypervisor to crash

Product: Red Hat Enterprise Linux 5
Component: kernel-xen
Version: 5.1
Hardware: All
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: urgent
Target Milestone: rc
Target Release: ---
Reporter: Jan Mark Holzer <jmh>
Assignee: Chris Lalancette <clalance>
QA Contact: Martin Jenner <mjenner>
CC: anton, dhoward, jvillalo, k.georgiou, xen-maint
Keywords: ZStream
Whiteboard: GSSApproved
Doc Type: Bug Fix
Last Closed: 2008-08-18 21:02:38 UTC
Bug Blocks: 424241

Description Jan Mark Holzer 2007-09-13 19:14:25 UTC
Description of problem:

Using Xen's (live) migration to move an x86_64 HVM guest to another dom0/HV
causes the target dom0/HV to crash as soon as the guest has been migrated.

Version-Release number of selected component (if applicable):

Guest:

RHEL4u4 x86_64
RHEL3u9 x86_64

Host :
------
2.6.18-45.el5xen #1 SMP Tue Sep 4 17:16:05 EDT 2007 x86_64

xen-libs-3.0.3-38.el5
python-virtinst-0.103.0-3.el5
libvirt-0.2.3-9.el5
virt-top-0.3.2.3-1
kernel-xen-2.6.18-45.el5
kmod-gfs-xen-0.1.19-4.el5
kmod-gfs-xen-0.1.18-2.el5
xen-libs-3.0.3-38.el5
virt-manager-0.4.0-3.el5
libvirt-0.2.3-9.el5
libvirt-python-0.2.3-9.el5
kernel-xen-2.6.18-44.el5
xen-3.0.3-38.el5


How reproducible:

Start an x86_64 HVM guest and use "xm migrate --live HVMguest Targetdom0".

Steps to Reproduce:
1. Start an HVM x86_64 guest
2. Initiate live migration (xm migrate --live ....); a libvirt-based sketch
   of the same operation follows this list
3. Watch the guest appear at the new target host (xm li/virsh list)
4. As soon as the guest has ballooned up to its memory footprint, the
   target host/HV will crash
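
For reference, the same live migration can also be driven through the libvirt
C API instead of xm. A minimal sketch follows; note that virDomainMigrate()
and VIR_MIGRATE_LIVE come from later libvirt releases than the 0.2.3 packages
listed above, and "HVMguest" and the connection URIs are placeholders, so
treat this as illustrative only:

/* Illustrative only: live-migrate a guest with the libvirt C API.
 * virDomainMigrate() and VIR_MIGRATE_LIVE postdate the libvirt-0.2.3
 * builds listed in this report; "HVMguest" and the connection URIs
 * are placeholders, not names taken from the report. */
#include <stdio.h>
#include <libvirt/libvirt.h>

int main(void)
{
    virConnectPtr src = virConnectOpen("xen:///");        /* source dom0 */
    virConnectPtr dst = virConnectOpen("xen://target/");  /* target dom0 */
    if (src == NULL || dst == NULL)
        return 1;

    virDomainPtr dom = virDomainLookupByName(src, "HVMguest");
    if (dom == NULL)
        return 1;

    /* Rough equivalent of "xm migrate --live HVMguest target". */
    virDomainPtr migrated =
        virDomainMigrate(dom, dst, VIR_MIGRATE_LIVE, NULL, NULL, 0);
    if (migrated == NULL) {
        fprintf(stderr, "live migration failed\n");
        return 1;
    }

    virDomainFree(migrated);
    virDomainFree(dom);
    virConnectClose(dst);
    virConnectClose(src);
    return 0;
}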

Actual results:

Target hypervisor/dom0 crashes upon migration of an x86_64 HVM guest

Expected results:

Successful migration of an x86_64 HVM guest between hosts/hypervisors

Additional info:

Tried the same migration with a number of i386 guests (rhel3u9/rhel4u4/Windows
2003/AS2.1) and they all worked fine.

Hosts used for testing (and still available) are buzz/woodie.

Comment 1 Chris Lalancette 2007-09-27 15:52:43 UTC
OK, from discussions with Jan, it seems like this is limited to a subset of
machines.  It also seems to be limited to a particular direction, i.e.
migrating a guest *from* a Caneland *to* a Woodcrest machine.  I captured a
stack trace; it looks like this:

(XEN) ----[ Xen-3.1.0-45.el5  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    2
(XEN) RIP:    e008:[<ffff83000016b97f>] shadow_set_l1e+0x2f/0x1b0
(XEN) RFLAGS: 0000000000010246   CONTEXT: hypervisor
(XEN) rax: 0000000000000000   rbx: ffff8180801bf198   rcx: 000000000011422c
(XEN) rdx: 00000000788f9067   rsi: ffff8180801bf198   rdi: ffff8300002de080
(XEN) rbp: ffff8300002de080   rsp: ffff830000fefc58   r8:  ffff830000fefe38
(XEN) r9:  0000000000000006   r10: 0000000000000001   r11: ffff8300002fc080
(XEN) r12: 0000000000000000   r13: ffff830000feff28   r14: 000000000011422c
(XEN) r15: ffff8180801bf198   cr0: 000000008005003b   cr4: 00000000000026b0
(XEN) cr3: 000000000114d000   cr2: ffff8180801bf198
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen stack trace from rsp=ffff830000fefc58:
(XEN)    00000000788f9067 ffff8180801bf198 ffff8300002de080 ffff8300002fc080
(XEN)    ffff830000feff28 0000010037e33f38 0000000000037e00 ffff8300001708f9
(XEN)    0000000000000000 0000000000000000 00002aaaae2b4000 ffff8803d929d400
(XEN)    ffff830000fefdb8 00002aaaae2b3000 00000000000788f9 000000000011422c
(XEN)    0000000600000000 ffff8300002fd1c0 000000000011422d ffff8140a0602000
(XEN)    000000000000114e ffff8300002fcd00 ffff830000fdc0d8 0000000000000008
(XEN)    ffff8300002fcd00 0000002000000001 ffff8300001c0180 0000000000000003
(XEN)    ffff8300001c5180 ffff8300002a3098 ffff8300002a3080 000000f900000000
(XEN)    ffff83000011acfc 000000000000e008 0000000000000293 ffff830000fefd80
(XEN)    0000000000000202 0000000000002000 ffff8300002fcd78 0000000000000003
(XEN)    ffff8300002de080 ffff8300001f0800 0000000000000000 0000000000000000
(XEN)    0000010037e33f38 ffff830078a6c010 ffff830112041000 ffff830111e42df8
(XEN)    0000000000000000 0000000037e33067 0000000000078a6c 0000000000112041
(XEN)    0000000000111e42 ffffffffffffffff 0000000000002000 ffff8300002de080
(XEN)    ffff83000012a1a9 ffff83000011a6d0 800000011422c063 0000000000000002
(XEN)    00000000788f9067 ffff8300001be180 0000012d0851b708 0000010037e33f38
(XEN)    0000000080000b0e ffff8300002de080 ffff830000feff28 0000000000000000
(XEN)    0000000000000000 ffff83000016437a ffff8300001f1800 ffff830000155805
(XEN)    ffff8300002df778 00000000000000ef 0000000000000000 ffff830000155805
(XEN)    ffff8300002df830 ffff8300002de080 ffff8300002fc080 ffff83000014f14c
(XEN) Xen call trace:
(XEN)    [<ffff83000016b97f>] shadow_set_l1e+0x2f/0x1b0
(XEN)    [<ffff8300001708f9>] sh_page_fault__shadow_4_guest_4+0x8d9/0xeb0
(XEN)    [<ffff83000011acfc>] migrate_timer+0x17c/0x1a0
(XEN)    [<ffff83000012a1a9>] context_switch+0xb19/0xb60
(XEN)    [<ffff83000011a6d0>] add_entry+0x100/0x130
(XEN)    [<ffff83000016437a>] vmx_vmexit_handler+0x32a/0x16f0
(XEN)    [<ffff830000155805>] vlapic_has_interrupt+0x35/0x60
(XEN)    [<ffff830000155805>] vlapic_has_interrupt+0x35/0x60
(XEN)    [<ffff83000014f14c>] cpu_has_pending_irq+0x2c/0x60
(XEN)    [<ffff83000016575f>] vmx_asm_vmexit_handler+0x1f/0x30
(XEN)    
(XEN) Pagetable walk from ffff8180801bf198:
(XEN)  L4[0x103] = 000000000114e063 5555555555555555
(XEN)  L3[0x002] = 000000011422e063 000000000031dc2e
(XEN)  L2[0x000] = 000000011422d063 000000000031dc2d 
(XEN)  L1[0x1bf] = 800000011422c063 000000000031dc2c
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 2:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0009]
(XEN) Faulting linear address: ffff8180801bf198
(XEN) ****************************************
(XEN) 
(XEN) Reboot in five seconds...
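
As an aside, the [error_code=0009] in the panic can be decoded using the
standard x86 page-fault error-code bits (these are architectural definitions,
not anything Xen-specific); a quick sketch:

/* Decode the x86 page-fault error code from the panic above.
 * 0x0009 = P | RSVD: the faulting access hit a present mapping,
 * but a reserved bit was set somewhere in the pagetable walk. */
#include <stdio.h>

int main(void)
{
    unsigned long ec = 0x0009;  /* error code from the panic message */

    printf("P    (bit 0): %lu\n", (ec >> 0) & 1);  /* 1: not a non-present fault */
    printf("W/R  (bit 1): %lu\n", (ec >> 1) & 1);  /* 0: fault on a read */
    printf("U/S  (bit 2): %lu\n", (ec >> 2) & 1);  /* 0: supervisor-mode access */
    printf("RSVD (bit 3): %lu\n", (ec >> 3) & 1);  /* 1: reserved-bit violation */
    printf("I/D  (bit 4): %lu\n", (ec >> 4) & 1);  /* 0: not an instruction fetch */
    return 0;
}

That reading is consistent with the pagetable walk above, which reaches a
populated L1 entry rather than stopping at a hole.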

Chris Lalancette

Comment 2 Chris Lalancette 2007-10-16 23:44:15 UTC
OK, I did some more digging on this.  It looks like the place it crashes is in
shadow_set_l1e+0x2f, which happens to be:

arch/x86/mm/shadow/multi.c:shadow_set_l1e():

(gdb) list *(shadow_set_l1e+0x2f)
0xffff8300001763ef is in shadow_set_l1e (multi.c:1115).
1110        int flags = 0;
1111        struct domain *d = v->domain;
1112        shadow_l1e_t old_sl1e;
1113        ASSERT(sl1e != NULL);
1114        
1115        old_sl1e = *sl1e;
1116
1117        if ( old_sl1e.l1 == new_sl1e.l1 ) return 0; /* Nothing to do */
1118        
1119        if ( (shadow_l1e_get_flags(new_sl1e) & _PAGE_PRESENT)

The crash is on the dereference of sl1e, which is passed in.  shadow_set_l1e()
was called from sh_page_fault(), with sl1e coming from the ptr_sl1e variable.
%rsi contains the faulting address, which also shows up in the pagetable walk
at the end, so ptr_sl1e was evidently filled in with a bogus address.
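
To make the failure mode concrete, here is a minimal sketch of the pattern
(simplified types and logic, not the actual Xen source; note also that the
walk above starts from L4 slot 0x103, which I believe is the shadow linear
map region in this layout):

/* Minimal sketch of the crashing pattern (simplified; not the real
 * Xen code).  shadow_set_l1e() trusts the sl1e pointer handed to it
 * by sh_page_fault(), so a bogus ptr_sl1e faults on the very first
 * dereference -- exactly where RIP points in the trace above. */

typedef struct { unsigned long l1; } shadow_l1e_t;

int shadow_set_l1e(shadow_l1e_t *sl1e, shadow_l1e_t new_sl1e)
{
    shadow_l1e_t old_sl1e;

    /* The ASSERT in the real code only catches NULL; a non-NULL
     * pointer into an unmapped region (ffff8180801bf198 here) still
     * faults on the load below (multi.c:1115 in the gdb listing). */
    old_sl1e = *sl1e;

    if (old_sl1e.l1 == new_sl1e.l1)
        return 0;  /* nothing to do */

    /* ... flag checks and installation of new_sl1e elided ... */
    return 1;
}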

Chris Lalancette

Comment 3 RHEL Program Management 2007-10-17 02:25:01 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 7 Bill Burns 2007-12-07 16:07:46 UTC
Changed component to kernel-xen.


Comment 10 Chris Lalancette 2007-12-18 22:09:11 UTC
Heh.  Well, there is good news and bad news on this bug.  The good news is that
with the 3.1.2 hypervisor, the dom0 crash no longer happens.  The bad news is
that the migration fails completely, and the guest is sort of left in limbo (on
the source machine, the qemu-dm process dies, but the guest is still alive in
the hypervisor's eyes).  This will need further debugging.

Chris Lalancette

Comment 11 Bill Burns 2008-04-01 14:08:58 UTC
Clearing blocker flag as we do not feel this should block RHEL 5.2.


Comment 13 RHEL Program Management 2008-06-09 22:00:34 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 14 Chris Lalancette 2008-08-18 13:08:14 UTC
Jan,
    Do you know if this is still an issue?  I'm thinking this is probably fixed in 5.2, but I'm not certain.

Thanks,
Chris Lalancette

Comment 15 Jan Mark Holzer 2008-08-18 21:01:17 UTC
Hi Chris,

I just re-ran my tests between Woodcrest and Caneland systems, and indeed the
guest now migrates without a problem.  So I'd say it's fixed :)

- Jan

Comment 16 Chris Lalancette 2008-08-18 21:02:38 UTC
Jan,
    Great, thanks.  I'll mark this as CURRENTRELEASE.

Chris Lalancette