Bug 289841
| Summary: | Xen Live Migration of x86_64 HVM guests causes target dom0/hypervisor to crash | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Jan Mark Holzer <jmh> |
| Component: | kernel-xen | Assignee: | Chris Lalancette <clalance> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Martin Jenner <mjenner> |
| Severity: | medium | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 5.1 | CC: | anton, dhoward, jvillalo, k.georgiou, xen-maint |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | GSSApproved | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2008-08-18 21:02:38 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 424241 | | |
Description

**Jan Mark Holzer**, 2007-09-13 19:14:25 UTC
OK, from discussions with Jan, it seems this is limited to a subset of machines, and to a particular direction, i.e. migrating a guest *from* a Caneland *to* a Woodcrest. I captured a stack trace; it looks like this:

```
(XEN) ----[ Xen-3.1.0-45.el5  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    2
(XEN) RIP:    e008:[<ffff83000016b97f>] shadow_set_l1e+0x2f/0x1b0
(XEN) RFLAGS: 0000000000010246   CONTEXT: hypervisor
(XEN) rax: 0000000000000000   rbx: ffff8180801bf198   rcx: 000000000011422c
(XEN) rdx: 00000000788f9067   rsi: ffff8180801bf198   rdi: ffff8300002de080
(XEN) rbp: ffff8300002de080   rsp: ffff830000fefc58   r8:  ffff830000fefe38
(XEN) r9:  0000000000000006   r10: 0000000000000001   r11: ffff8300002fc080
(XEN) r12: 0000000000000000   r13: ffff830000feff28   r14: 000000000011422c
(XEN) r15: ffff8180801bf198   cr0: 000000008005003b   cr4: 00000000000026b0
(XEN) cr3: 000000000114d000   cr2: ffff8180801bf198
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen stack trace from rsp=ffff830000fefc58:
(XEN)    00000000788f9067 ffff8180801bf198 ffff8300002de080 ffff8300002fc080
(XEN)    ffff830000feff28 0000010037e33f38 0000000000037e00 ffff8300001708f9
(XEN)    0000000000000000 0000000000000000 00002aaaae2b4000 ffff8803d929d400
(XEN)    ffff830000fefdb8 00002aaaae2b3000 00000000000788f9 000000000011422c
(XEN)    0000000600000000 ffff8300002fd1c0 000000000011422d ffff8140a0602000
(XEN)    000000000000114e ffff8300002fcd00 ffff830000fdc0d8 0000000000000008
(XEN)    ffff8300002fcd00 0000002000000001 ffff8300001c0180 0000000000000003
(XEN)    ffff8300001c5180 ffff8300002a3098 ffff8300002a3080 000000f900000000
(XEN)    ffff83000011acfc 000000000000e008 0000000000000293 ffff830000fefd80
(XEN)    0000000000000202 0000000000002000 ffff8300002fcd78 0000000000000003
(XEN)    ffff8300002de080 ffff8300001f0800 0000000000000000 0000000000000000
(XEN)    0000010037e33f38 ffff830078a6c010 ffff830112041000 ffff830111e42df8
(XEN)    0000000000000000 0000000037e33067 0000000000078a6c 0000000000112041
(XEN)    0000000000111e42 ffffffffffffffff 0000000000002000 ffff8300002de080
(XEN)    ffff83000012a1a9 ffff83000011a6d0 800000011422c063 0000000000000002
(XEN)    00000000788f9067 ffff8300001be180 0000012d0851b708 0000010037e33f38
(XEN)    0000000080000b0e ffff8300002de080 ffff830000feff28 0000000000000000
(XEN)    0000000000000000 ffff83000016437a ffff8300001f1800 ffff830000155805
(XEN)    ffff8300002df778 00000000000000ef 0000000000000000 ffff830000155805
(XEN)    ffff8300002df830 ffff8300002de080 ffff8300002fc080 ffff83000014f14c
(XEN) Xen call trace:
(XEN)    [<ffff83000016b97f>] shadow_set_l1e+0x2f/0x1b0
(XEN)    [<ffff8300001708f9>] sh_page_fault__shadow_4_guest_4+0x8d9/0xeb0
(XEN)    [<ffff83000011acfc>] migrate_timer+0x17c/0x1a0
(XEN)    [<ffff83000012a1a9>] context_switch+0xb19/0xb60
(XEN)    [<ffff83000011a6d0>] add_entry+0x100/0x130
(XEN)    [<ffff83000016437a>] vmx_vmexit_handler+0x32a/0x16f0
(XEN)    [<ffff830000155805>] vlapic_has_interrupt+0x35/0x60
(XEN)    [<ffff830000155805>] vlapic_has_interrupt+0x35/0x60
(XEN)    [<ffff83000014f14c>] cpu_has_pending_irq+0x2c/0x60
(XEN)    [<ffff83000016575f>] vmx_asm_vmexit_handler+0x1f/0x30
(XEN)
(XEN) Pagetable walk from ffff8180801bf198:
(XEN)  L4[0x103] = 000000000114e063 5555555555555555
(XEN)  L3[0x002] = 000000011422e063 000000000031dc2e
(XEN)  L2[0x000] = 000000011422d063 000000000031dc2d
(XEN)  L1[0x1bf] = 800000011422c063 000000000031dc2c
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 2:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0009]
(XEN) Faulting linear address: ffff8180801bf198
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...
```

**Chris Lalancette:**

OK, I did some more digging on this. The crash site is shadow_set_l1e+0x2f, which falls in arch/x86/mm/shadow/multi.c:shadow_set_l1e():

```
(gdb) list *(shadow_set_l1e+0x2f)
0xffff8300001763ef is in shadow_set_l1e (multi.c:1115).
```
```
1110          int flags = 0;
1111          struct domain *d = v->domain;
1112          shadow_l1e_t old_sl1e;
1113          ASSERT(sl1e != NULL);
1114
1115          old_sl1e = *sl1e;
1116
1117          if ( old_sl1e.l1 == new_sl1e.l1 ) return 0; /* Nothing to do */
1118
1119          if ( (shadow_l1e_get_flags(new_sl1e) & _PAGE_PRESENT)
```

It faults on the dereference of sl1e, which is passed in. This was called from sh_page_fault(), via the ptr_sl1e variable. %rsi contains the faulting address, which also shows up in the pagetable walk at the end of the panic, so ptr_sl1e was evidently filled in with a bogus address.

**Chris Lalancette:**

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

**Chris Lalancette:**

Changed component to kernel-xen.

**Chris Lalancette:**

Heh. Well, there is good news and bad news on this bug. The good news is that with the 3.1.2 hypervisor, the dom0 crash no longer happens. The bad news is that the migration fails completely and the guest is left in limbo: on the source machine the qemu-dm process dies, but the guest is still alive in the hypervisor's eyes. This will need further debugging.

**Chris Lalancette:**

Clearing the blocker flag, as we do not feel this should block RHEL 5.2.

**Chris Lalancette:**

Jan, do you know if this is still an issue? I'm thinking this is probably fixed in 5.2, but I'm not certain.
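To make the failure mode described in the gdb analysis concrete, here is a toy model of the entry path of shadow_set_l1e(). The types and the write-back at the end are simplified stand-ins, not the real Xen structures; the point is only that the very first use of sl1e is a plain dereference, so a bogus ptr_sl1e from sh_page_fault() faults before any later check can run.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for Xen's shadow_l1e_t: a single pagetable word. */
typedef struct { unsigned long l1; } shadow_l1e_t;

/*
 * Simplified sketch (not the real Xen code) of the entry path of
 * shadow_set_l1e() from arch/x86/mm/shadow/multi.c.  The first use of
 * sl1e is a dereference, so a bogus pointer faults right here.
 */
static int toy_shadow_set_l1e(shadow_l1e_t *sl1e, shadow_l1e_t new_sl1e)
{
    shadow_l1e_t old_sl1e;

    assert(sl1e != NULL);       /* mirrors the ASSERT(sl1e != NULL) */
    old_sl1e = *sl1e;           /* the dereference that took the page fault */

    if (old_sl1e.l1 == new_sl1e.l1)
        return 0;               /* nothing to do */

    *sl1e = new_sl1e;           /* hypothetical write-back for the toy */
    return 1;                   /* entry changed */
}
```

Note that the ASSERT cannot catch this crash: the pointer handed in was non-NULL, it simply did not refer to a validly mapped shadow entry.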
Thanks,
Chris Lalancette

**Jan Mark Holzer:**

Hi Chris, I just re-ran my tests between the Woodcrest and Caneland systems, and indeed the guest now migrates without a problem. So I'd say it's fixed :)

- Jan

**Chris Lalancette:**

Jan, great, thanks. I'll mark this as CURRENTRELEASE.
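As a closing sanity check on the panic message above: the per-level indices the hypervisor printed in its pagetable walk (L4[0x103], L3[0x002], L2[0x000], L1[0x1bf]) follow directly from the faulting linear address ffff8180801bf198 under the standard x86_64 4-level split of 9 index bits per level above a 12-bit page offset. A minimal sketch:

```c
#include <assert.h>

/*
 * Standard x86_64 4-level paging: bits 47..39 index the L4 table,
 * 38..30 the L3, 29..21 the L2, and 20..12 the L1 table; bits 11..0
 * are the offset within the 4 KiB page.
 */
static unsigned int pt_index(unsigned long va, int level)
{
    int shift = 12 + 9 * (level - 1);  /* L1 -> 12, L2 -> 21, L3 -> 30, L4 -> 39 */
    return (unsigned int)((va >> shift) & 0x1ff);
}

/*
 * For the faulting address ffff8180801bf198 from the panic, this yields
 * L4[0x103], L3[0x002], L2[0x000], L1[0x1bf], matching the hypervisor's
 * pagetable walk line for line.
 */
```

This confirms the walk and the faulting address in the dump are self-consistent; the bug was in how the shadow code computed that address, not in the fault reporting.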