Bug 224227 - [RHEL5] Fully virt install of RHEL-4 can reboot dom0
Summary: [RHEL5] Fully virt install of RHEL-4 can reboot dom0
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Rik van Riel
 
Reported: 2007-01-24 18:12 UTC by Chris Lalancette
Modified: 2007-11-30 22:07 UTC
CC List: 3 users

Fixed In Version: 5.0.0
Doc Type: Bug Fix
Last Closed: 2007-02-13 17:03:02 UTC


Attachments
[XEN] Stricter TLB-flush discipline when unshadowing pagetables (4.81 KB, patch)
2007-01-25 13:32 UTC, Herbert Xu

Description Chris Lalancette 2007-01-24 18:12:39 UTC
Description of problem:
I'm currently running the x86_64 kernel version 2.6.18-4, along with xen 3.0.3-21.
I've been testing installs of RHEL-4 U4 x86_64 as a fully virtualized guest, and
sometimes the install fails: the hypervisor (HV) dumps a stack trace and the
HV/dom0 reboots.  The stack trace from the serial console looks like:

(XEN) ----[ Xen-3.0.3-rc5-4.el5  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    0
(XEN) RIP:    e010:[<ffff830000158728>] sh_page_fault__shadow_4_guest_4+0x5f8/0x1080
(XEN) RFLAGS: 0000000000010282   CONTEXT: hypervisor
(XEN) rax: ffff8140c0400040   rbx: ffff8300001a2080   rcx: 000000001f056000
(XEN) rdx: ffff8140c0000000   rsi: ffff8140a0503010   rdi: ffff8300001a2080
(XEN) rbp: ffff8300001af080   rsp: ffff8300001bfcb8   r8:  0000000000000002
(XEN) r9:  0000000000000000   r10: 000000001f056010   r11: 0000000000000000
(XEN) r12: 0000000000000002   r13: 00000000000270b5   r14: ffff8300001bff28
(XEN) r15: ffff8140a0602000   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 000000001f057000   cr2: ffff8140c0400040
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e010
(XEN) Xen stack trace from rsp=ffff8300001bfcb8:
(XEN)    00000100010aabf8 ffff81808003a098 00000000001c5f80 ffff8140c0400040
(XEN)    000000000003430d 000000000001f055 000000000001f056 0000000000007400
(XEN)    00000001008483e8 ffff8300001af080 0000000080000000 0000000080000000
(XEN)    0000000080000000 0000000080000000 0000002a9dd55fb8 0000010013ec5640
(XEN)    0000010012643aa8 0000010001269880 000001001c7bb1c0 0000000000000001
(XEN)    0000000000000246 0000007fbfffd260 800000000b070065 0000010017c4e770
(XEN)    800000000b070067 0000000000000037 ffffffff8043b680 0000010013ec5640
(XEN)    00000100016d9240 000000f000000003 0000000000100000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    00000100010aabf8 ffff830001c5e010 ffff830032a6b000 ffff830034973040
(XEN)    0000000000000000 00000000010aa067 0000000000001c5e 0000000000032a6b
(XEN)    0000000000034973 ffffffffffffffff ffff8300001bfd28 0000010012643aa8
(XEN)    ffff830000000008 0000000000400048 000000001f055663 0000000100000001
(XEN)    ffffffffff600071 ffff8300001bff28 ffff83000019c640 ffff8300001a2080
(XEN)    ffff8300001bff28 0000010017cd3d98 0000000000001000 ffff83000014c03b
(XEN)    00000100010aabf8 ffff830000150b49 ffff830000142daa ffff8300001af080
(XEN)    ffff8300001a2080 ffff83000014a6e8 ffff8300001bff28 ffff8300001af080
(XEN)    000001000eb20940 0000000000000000 000000000000000c 0000000000000000
(XEN)    0000010017cd3d98 ffff830000151738 0000000000001000 0000010017cd3d98
(XEN) Xen call trace:
(XEN)    [<ffff830000158728>] sh_page_fault__shadow_4_guest_4+0x5f8/0x1080
(XEN)    [<ffff83000014c03b>] vmx_do_page_fault+0x2b/0x50
(XEN)    [<ffff830000150b49>] vmx_vmexit_handler+0x339/0xf00
(XEN)    [<ffff830000142daa>] cpu_has_pending_irq+0x2a/0x50
(XEN)    [<ffff83000014a6e8>] vmx_intr_assist+0xf8/0x400
(XEN)    [<ffff830000151738>] vmx_asm_vmexit_handler+0x28/0x30
(XEN)
(XEN) Pagetable walk from ffff8140c0400040:
(XEN)  L4[0x102] = 000000001f057063 000000000004c457
(XEN)  L3[0x103] = 000000001f056063 000000000004c456
(XEN)  L2[0x002] = 000000001f055663 000000000004c455
(XEN)  L1[0x000] = 0000000000000000 ffffffffffffffff
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) CPU0 FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: ffff8140c0400040
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...


I have a machine here that can fairly reliably reproduce this (not every time,
but often enough).  However, I have also seen it from time to time on two other
machines.  I'm doing the install via NFS, using the following command line:

virt-install -n rhel4fvtest -r 500 -f /mnt/xen/rhel4fvtest.dsk -s 5 --vnc -v -c
/mnt/xen/rhel4-u4-x86_64-boot.iso

When the guest initially boots from the CDROM, I type in:

boot: linux ks=http://server/ks.cfg

to automatically grab a kickstart file.  SELinux is disabled.
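
(For context: ks.cfg here is a standard Anaconda kickstart file fetched over
HTTP.  The actual file is not attached to this bug; a hypothetical minimal
kickstart for an unattended NFS install of this sort might look like the
following, where the server name, export path, and password hash are all made
up for illustration.)

# hypothetical minimal ks.cfg -- not the actual file from this report
install
nfs --server=nfs.example.com --dir=/exports/rhel4-u4
lang en_US.UTF-8
langsupport --default=en_US.UTF-8 en_US.UTF-8
keyboard us
network --bootproto=dhcp
rootpw --iscrypted $1$replaceme$XXXXXXXXXXXXXXXXXXXXXX
timezone America/New_York
bootloader --location=mbr
clearpart --all --initlabel
autopart
reboot

%packages
@ base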

Comment 1 Stephen Tweedie 2007-01-24 20:58:08 UTC
Do we know if this is a recent regression?  It's important that we find out
whether things were reliable on other recent kernels.  Thanks!


Comment 2 Chris Lalancette 2007-01-24 21:14:52 UTC
Stephen,
     Well, I was having good luck with the 2961 build earlier, but when I tried
it again with 2961 I got a different stack trace:

(XEN) ----[ Xen-3.0.3-rc5-1.2961.el5  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    0
(XEN) RIP:    e010:[<ffff830000158528>] sh_page_fault__shadow_4_guest_4+0x5f8/0x1080
(XEN) RFLAGS: 0000000000010286   CONTEXT: hypervisor
(XEN) rax: ffff8140c00aa788   rbx: ffff8300001a2080   rcx: 000000003416e000
(XEN) rdx: ffff8140c0000000   rsi: ffff8140a0503000   rdi: ffff8300001a2080
(XEN) rbp: ffff8300001af080   rsp: ffff8300001bfcb8   r8:  0000000000000002
(XEN) r9:  0000000000000000   r10: 000000003416e000   r11: 00000000000007e7
(XEN) r12: 0000000000000006   r13: 000000000001c85e   r14: ffff8300001bff28
(XEN) r15: ffff8140a0600550   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 000000003416f000   cr2: ffff8140c00aa788
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e010
(XEN) Xen stack trace from rsp=ffff8300001bfcb8:
(XEN)    0000002a9e2c0000 ffff81800002c6d8 00000000001a2080 ffff8140c00aa788
(XEN)    0000000000001c95 000000000003416d 000000000003416e 0000000000005800
(XEN)    00000001001a2080 ffff8300001af080 0000000080000000 0000000080000000
(XEN)    0000000080000000 0000000080000000 0000009070441b8b ffffffff804bca20
(XEN)    000000008f0f0048 ffff83000013bd10 ffff8300001a36b0 ffff83000014e7d1
(XEN)    0000000000000000 ffff83000013bd8f 0000000000000002 ffff83000012f69a
(XEN)    ffff8300001a36b0 ffff83000014e7d1 0000000000000002 0000000000000001
(XEN)    0000000000000002 ffff830000140348 0000000000000001 ffff830000116ba1
(XEN)    ffff8300001b4080 ffff830000116ba1 0000000000000001 ffff830010d60000
(XEN)    0000000000000001 ffff83000013fbbc 0000009094189ab7 0000000000000292
(XEN)    0000002a9e2c0000 ffff830007c5e000 ffff8300384e0550 ffff8300384b9788
(XEN)    ffff83001aa88600 8000000003901067 0000000000007c5e 00000000000384e0
(XEN)    00000000000384b9 000000000001aa88 ffff8300001a2080 ffff83000014b1f5
(XEN)    0000000000000000 ffff83000011e949 000000003416d667 0000000001c9c380
(XEN)    0000000000000000 ffff8300001bff28 ffff83000019c640 ffff8300001a2080
(XEN)    ffff8300001bff28 00000000000003e4 0000000000000c1c ffff83000014be4b
(XEN)    0000002a9e2c0000 ffff830000150959 ffff830000142baa ffff8300001af080
(XEN)    ffff8300001a2080 ffff83000014a4f8 ffff8300001bff28 ffff8300001af080
(XEN)    00000100010c8520 0000000000000c1c 0000010017ce3d68 0000000000008000
(XEN)    00000000000003e4 ffff830000151548 0000000000000c1c 00000000000003e4
(XEN) Xen call trace:
(XEN)    [<ffff830000158528>] sh_page_fault__shadow_4_guest_4+0x5f8/0x1080
(XEN)    [<ffff83000013bd10>] pit_get_count+0x30/0xa0
(XEN)    [<ffff83000014e7d1>] vmx_load_cpu_guest_regs+0x11/0x300
(XEN)    [<ffff83000013bd8f>] pit_latch_count+0xf/0x20
(XEN)    [<ffff83000012f69a>] smp_send_event_check_mask+0x3a/0x40
(XEN)    [<ffff83000014e7d1>] vmx_load_cpu_guest_regs+0x11/0x300
(XEN)    [<ffff830000140348>] send_pio_req+0x1c8/0x240
(XEN)    [<ffff830000116ba1>] add_entry+0xe1/0x110
(XEN)    [<ffff830000116ba1>] add_entry+0xe1/0x110
(XEN)    [<ffff83000013fbbc>] hvm_io_assist+0x89c/0x960
(XEN)    [<ffff83000014b1f5>] arch_vmx_do_resume+0x55/0x70
(XEN)    [<ffff83000011e949>] context_switch+0x639/0x650
(XEN)    [<ffff83000014be4b>] vmx_do_page_fault+0x2b/0x50
(XEN)    [<ffff830000150959>] vmx_vmexit_handler+0x339/0xf00
(XEN)    [<ffff830000142baa>] cpu_has_pending_irq+0x2a/0x50
(XEN)    [<ffff83000014a4f8>] vmx_intr_assist+0xf8/0x400
(XEN)    [<ffff830000151548>] vmx_asm_vmexit_handler+0x28/0x30
(XEN)
(XEN) Pagetable walk from ffff8140c00aa788:

Chris Lalancette

Comment 5 Stephen Tweedie 2007-01-24 23:36:04 UTC
I've done a couple of installs of x86_64 RHEL-4-AS U4 with 500MB myself now, one
manual, one kickstart, and both have worked fine.  Will try a file-backed disk
next; it's been on LVM so far.

What sort of hardware are you running on, btw?


Comment 8 Herbert Xu 2007-01-25 13:32:28 UTC
Created attachment 146536 [details]
[XEN] Stricter TLB-flush discipline when unshadowing pagetables

This is a backport of upstream changeset 11852, which fixes missing TLB flushes
that may cause issues like this.  Please check whether it makes the problem go
away.  Thanks!
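
(To illustrate what "stricter TLB-flush discipline when unshadowing pagetables"
means, here is a minimal C sketch of the idea.  The types and names below are
illustrative stubs, not the code from the attached patch.  The invariant is
that once a shadow pagetable is torn down, every CPU that may still cache
translations through it must flush its TLB before the frame is reused;
otherwise a CPU can follow a stale translation and take a fatal page fault
inside the hypervisor, as in the traces above.)

/* Illustrative sketch only: stub types and names, not Xen's real code. */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t cpumask_t;            /* one bit per CPU (stub) */

struct shadow_page {
    uint64_t entries[4];               /* tiny stand-in for a shadow pagetable */
    cpumask_t dirty_cpus;              /* CPUs that may hold stale TLB entries */
};

/* Stub: a real hypervisor would IPI every CPU in the mask to flush its TLB. */
static void flush_tlb_cpumask(cpumask_t mask)
{
    printf("flush TLBs on CPU mask 0x%llx\n", (unsigned long long)mask);
}

/* Unshadow a pagetable: clear the entries, then flush TLBs *before* the frame
 * can be recycled.  The bug class being fixed here is reuse of the frame while
 * some CPU still holds a stale translation into it. */
static void unshadow(struct shadow_page *sp)
{
    for (int i = 0; i < 4; i++)
        sp->entries[i] = 0;
    flush_tlb_cpumask(sp->dirty_cpus); /* the "stricter discipline" */
    /* ...only now is it safe to free or reuse the frame... */
}

int main(void)
{
    struct shadow_page sp = { .entries = {1, 2, 3, 4}, .dirty_cpus = 0x3 };
    unshadow(&sp);
    return 0;
}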

Comment 9 Stephen Tweedie 2007-01-25 14:11:03 UTC
Good, I can finally reproduce this problem.  I haven't been able to see it using
a blkback virtual disk, but using a blktap file-backed disk I was able to
reproduce it on the first try.  And that does make sense given the suggested
patch --- blkback is in-kernel, so it doesn't task-switch and is less likely to
be disturbed by missing TLB flushes.

Will try again with the patch.


Comment 10 Stephen Tweedie 2007-01-25 14:19:21 UTC
Except, of course, only PV uses blktap for file-backed domains; FV does not.  So
that theory goes out the window.  FV file-backed domains will still put more
pressure on the VM, though, which might well make a difference in this case
(O_DIRECT still uses the page cache for metadata and for filling in holes in
the backing file).
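
(On the O_DIRECT point: opening a file with O_DIRECT bypasses the page cache
only for the aligned data transfers themselves; the filesystem still consults
cached metadata to locate blocks, and filling holes in a sparse backing file
still allocates blocks through the kernel.  A minimal C sketch follows; the
disk-image path is reused from this report purely for illustration.)

/* Minimal O_DIRECT read sketch; path and sizes are illustrative. */
#define _GNU_SOURCE                    /* exposes O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    /* O_DIRECT requires sector-aligned buffers, offsets, and lengths. */
    if (posix_memalign(&buf, 512, 4096) != 0)
        return 1;

    int fd = open("/mnt/xen/rhel4fvtest.dsk", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* This transfer bypasses the page cache, but block lookup still goes
     * through cached filesystem metadata. */
    ssize_t n = read(fd, buf, 4096);
    printf("read %zd bytes\n", n);

    close(fd);
    free(buf);
    return 0;
}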


Comment 11 Stephen Tweedie 2007-01-25 16:07:18 UTC
Three successful installs in a row --- initial testing with this patch looks
good.  Will continue testing.

Comment 12 Chris Lalancette 2007-01-25 16:45:39 UTC
I've now had 10 successful FV installs in a row on 3 different machines with the
patch, including 5 successful installs on a machine that, without the patch,
reproduced the crash 50% of the time.  I'm building a kernel now for further
testing by partners.

Chris Lalancette

Comment 14 RHEL Program Management 2007-01-25 19:20:54 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux major release.  This request is not yet committed for
inclusion.

Comment 16 Jay Turner 2007-01-25 19:31:30 UTC
QE ack for RHEL5.

Comment 18 Don Zickus 2007-01-25 22:30:41 UTC
Fixed in 2.6.18-7.el5.

Comment 22 Jay Turner 2007-02-13 17:03:02 UTC
Closing out.

