Bug 412691
Summary: | kernel-xen panic when X shuts down | |
---|---|---|---
Product: | Red Hat Enterprise Linux 5 | Reporter: | Prarit Bhargava <prarit>
Component: | kernel-xen | Assignee: | Rik van Riel <riel>
Status: | CLOSED ERRATA | QA Contact: | Martin Jenner <mjenner>
Severity: | high | Priority: | high
Version: | 5.2 | Target Milestone: | rc
Hardware: | All | OS: | Linux
Doc Type: | Bug Fix | Last Closed: | 2009-01-20 20:03:54 UTC
CC: | ajax, akarlsso, clalance, donald.d.dugger, fal.diabate, gcase, jane.lv, jvillalo, keve.a.gabbert, riel, syeghiay, xen-maint, youquan.song, yunhong.jiang | |
Description
Prarit Bhargava
2007-12-05 19:49:22 UTC
A workaround appears to be:
1. open the /etc/X11/xorg.conf file
2. replace Driver "i810" with Driver "vesa"
This results in no panic during X shutdown.

Do you know if this is still an issue with later kernels? Some fixes that affect X have been made that might have addressed this. Can you possibly try the -91 kernel? Thanks.

It's still broken. I installed my 8GB DQ35JO system using the snap7 tree (-91) and the i386 arch gets stuck in an endless boot loop because of the panic.

It appears that we no longer load the 'i810' driver. We've switched to 'intel', but the same problem occurs. The workaround of switching to 'vesa' still works.

Ok, thanks for the info.

Per Ron's request - the latest BIOS version for JO and MP is 0942. Please see this URL for download info.
http://downloadcenter.intel.com/Filter_Results.aspx?strTypes=all&ProductID=2784&OSFullname=OS+Independent&strOSs=38
Regards, Fal

static int __change_page_attr(struct page *page, pgprot_t prot)
{
    pte_t *kpte;
    unsigned long address;
    struct page *kpte_page;

    BUG_ON(PageHighMem(page));
    address = (unsigned long)page_address(page);

This is the BUG being triggered in comment #1.

On the system in question, does the i810 DRM driver by chance mmap physical memory at addresses higher than 1GB into the frame buffer?

I spent some time reading this code and I don't understand some things:
- why is the i915 driver using the agp memory code?
- if it is using the agp memory code, why did it never call map_page_into_agp(), which should have also run into the BUG_ON in __change_page_attr() ?

On another Johannesburg system (the one I tried to reproduce the X shutdown bug on), X manages to crash the system very badly at startup. Every time I start up X, I get a different hypervisor panic. This leads me to believe that X, with help from the i915 driver, is corrupting hypervisor memory.

I tried running X straced (over serial console), but the last few thousand lines of strace output are all gettimeofday syscalls. Presumably X is prodding the hardware through mmaps in-between the gettimeofday calls and one of the memory writes is causing the hypervisor to panic.

Another data point: when the hypervisor is limited to 3GB memory, things work normally. Going to 4GB or more causes things to break.

Created attachment 321212 [details]
x86 numa: Fix the overflow of physical addresses.
First potentially relevant changeset I found while combing through upstream.
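The attachment title and the 3GB-versus-4GB data point both hint at addresses being truncated to 32 bits. The following is only a minimal sketch of that class of bug, not the contents of attachment 321212. On i386 an `unsigned long` is 32 bits wide, so shifting a page frame number left by PAGE_SHIFT wraps for any page at or above 4GB unless the value is widened first. The helper names are hypothetical, and `uint32_t` stands in for the 32-bit `unsigned long`.

```c
#include <stdint.h>

#define PAGE_SHIFT 12

/* Hypothetical helper: buggy conversion. The shift is done in 32 bits,
 * so the upper bits of any address >= 4GB are silently lost. */
static uint64_t pfn_to_paddr_truncating(uint32_t pfn)
{
    uint32_t paddr = pfn << PAGE_SHIFT;   /* wraps modulo 2^32 */
    return paddr;
}

/* Hypothetical helper: widen to 64 bits before shifting. */
static uint64_t pfn_to_paddr_wide(uint32_t pfn)
{
    return (uint64_t)pfn << PAGE_SHIFT;
}
```

In this sketch the first page above 4GB (pfn 0x100000) comes out as address 0 from the truncating version and as 0x100000000 from the widened one, while every pfn below 0x100000 gives the same result in both. That pattern would be consistent with a system that behaves with 3GB and misbehaves with 4GB or more.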
Progress! With the hypervisor patch, my test system no longer crashes on X startup. Instead, I get the same dom0 kernel panic as in comment #0 when X exits.

I posted the patches for review this morning.

Umm, wrong browser tab. Moving the *other* bug to POST now :)

Don, please add this to your agenda for the RH-Intel Virtualization meeting. This issue is blocking the Certification of DQ35MP.

I'm not sure if the following changeset in Xen's linux tree is related to this bug, but I'm sure this changeset is needed for the intel platform to work, if not for this bug.

BTW, for comment #13, "if it is using the agp memory code, why did it never call map_page_into_agp(), which should have also run into the BUG_ON in __change_page_attr() ?": that is not always correct. In map_page_into_agp, the page has just been allocated, and is ok. However, in unmap_page_from_agp, the page is obtained from gart_to_virt, which is not always correct. If that is the case for this page, then I suspect it is because in agp_allocate_memory() in drivers/char/agp/generic.c, at "new->memory[i] = virt_to_gart(addr);", the memory is defined as unsigned long *, which may lose some information from virt_to_gart(). I will attach the details of the patch also.

# HG changeset patch
# User kfraser
# Date 1182429682 -3600
# Node ID 02a46885bd90a4d936338c135023b511318c7aa2
# Parent c8c9bc0b7e29e804c09d4375a0e655cda826a9e4
linux: fix agp address handling, namely intel-agp

Make sure machine addresses are in fact constrained to 32 bits, and assumptions about multi-page extents being contiguous are being met.

Generic parts of the patch are in 2.6.22-rc4.

Signed-off-by: Jan Beulich <jbeulich>

Seems I can't attach the patch, so I'll give the URL for this patch.
http://xenbits.xensource.com/linux-2.6.18-xen.hg?rev/02a46885bd90

Thank you for finding that patch, Yunhong! Together with the hypervisor patch, that may make things work again. Of course, I will have to change the patch a bit so the kABI stays the same (we cannot get rid of two exported symbols in-between RHEL updates), but that looks doable.

So, Rik, waiting for your try. Also, please check the following URL: http://lkml.org/lkml/2007/4/2/186
Most of the other parts are similar to the patch in comment 32, but I'm not sure if the following chunk is needed also.

@@ -206,7 +207,7 @@ static void i8xx_destroy_pages(void *add
     global_flush_tlb();
     put_page(page);
     unlock_page(page);
-    free_pages((unsigned long)addr, 2);
+    __free_pages(page, 2);
     atomic_dec(&agp_bridge->current_memory_agp);
 }

Any update on this issue? Is it working now?

We tried the patch on our side, and it seems the issue is caused by a wrong E820 table. When we populated 4G of memory, the E820 table was reported as shown below, which means only 512M of memory is usable by the OS. As xen will only reserve min(memory/16, 128M) for the DMA buffer, it means only 32M is reserved (please check compute_dom0_nr_pages() in arch/x86/domain_build.c for details).

One potential improvement for the patch in comment #23 is to add some error handling to the map_page_into_agp(page) macro defined in include/asm-i386/mach-xen/asm/agp.h, to return failure if xen_create_contiguous_region() failed, and also update agp_generic_alloc_page() to handle such a failure. But even with that enhancement, the system still can't start X Window.
Nov 11 16:01:45 localhost kernel: BIOS-provided physical RAM map:
Nov 11 16:01:45 localhost kernel: BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
Nov 11 16:01:45 localhost kernel: BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
Nov 11 16:01:45 localhost kernel: BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
Nov 11 16:01:45 localhost kernel: BIOS-e820: 0000000000100000 - 0000000020d73000 (usable)
Nov 11 16:01:45 localhost kernel: BIOS-e820: 0000000020d73000 - 00000000cd174000 (reserved)
Nov 11 16:01:45 localhost kernel: BIOS-e820: 00000000cd174000 - 00000000cdbfd000 (usable)
Nov 11 16:01:45 localhost kernel: BIOS-e820: 00000000cdbfd000 - 00000000cdca2000 (ACPI NVS)
Nov 11 16:01:45 localhost kernel: BIOS-e820: 00000000cdca2000 - 00000000ceeca000 (usable)
Nov 11 16:01:45 localhost kernel: BIOS-e820: 00000000ceeca000 - 00000000ceecc000 (reserved)
Nov 11 16:01:45 localhost kernel: BIOS-e820: 00000000ceecc000 - 00000000cef84000 (usable)
Nov 11 16:01:45 localhost kernel: BIOS-e820: 00000000cef84000 - 00000000cefe5000 (ACPI NVS)
Nov 11 16:01:45 localhost kernel: BIOS-e820: 00000000cefe5000 - 00000000cefea000 (usable)
Nov 11 16:01:45 localhost kernel: BIOS-e820: 00000000cefea000 - 00000000ceff3000 (ACPI data)
Nov 11 16:01:45 localhost kernel: BIOS-e820: 00000000ceff3000 - 00000000ceff4000 (usable)
Nov 11 16:01:45 localhost kernel: BIOS-e820: 00000000ceff4000 - 00000000cefff000 (ACPI data)
Nov 11 16:01:45 localhost kernel: BIOS-e820: 00000000cefff000 - 00000000cf000000 (usable)
Nov 11 16:01:45 localhost kernel: BIOS-e820: 00000000cf000000 - 00000000d0000000 (reserved)
Nov 11 16:01:45 localhost kernel: BIOS-e820: 00000000f0000000 - 00000000f8000000 (reserved)
Nov 11 16:01:45 localhost kernel: BIOS-e820: 00000000ffc00000 - 0000000100000000 (reserved)
Nov 11 16:01:45 localhost kernel: BIOS-e820: 0000000100000000 - 000000012c000000 (usable)

Apparently, there are more bugs lurking somewhere :(

------------[ cut here ]------------
kernel BUG at arch/i386/mm/pageattr.c:156!
invalid opcode: 0000 [#1]
SMP
last sysfs file: /class/drm/card0/dev
Modules linked in: i915 drm netloop netbk blktap blkbk ipt_MASQUERADE iptable_nat ip_nat bridge autofs4 hidp nfs lockd fscache nfs_acl rfcomm l2cap bluetooth sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 xfrm_nalgo crypto_api cpufreq_ondemand acpi_cpufreq dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi ac parport_pc lp parport sr_mod cdrom sg snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm i2c_i801 e1000e snd_timer snd_page_alloc snd_hwdep snd i2c_core soundcore serio_raw serial_core pcspkr dm_snapshot dm_zero dm_mirror dm_log dm_mod pata_marvell ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
CPU: 0
EIP: 0061:[<c0416981>] Not tainted VLI
EFLAGS: 00210046 (2.6.18-124.el5.bz412691.2xen #1)
EIP is at change_page_attr+0x571/0x7e0
eax: 00000000 ebx: 06881063 ecx: 80000002 edx: 80000002
esi: 06881063 edi: c00412e8 ebp: c1672820 esp: ebbf5eb0
ds: 007b es: 007b ss: 0069
Process X (pid: 7339, ti=ebbf5000 task=ecd85000 task.ti=ebbf5000)
Stack: 00000000 80000002 00000001 c17a2ba0 00000000 00000000 c1672820 00000000
       00000000 c985d000 c00412e8 00000000 00000000 c1672000 00000004 00026174
       00000063 80000000 00000001 00000000 00000000 06881063 80000002 00000001
Call Trace:
 [<c053f493>] unmap_page_from_agp+0x27/0x2b
 [<c053f4b8>] agp_generic_destroy_page+0x21/0x44
 [<c053f39f>] agp_free_memory+0x9e/0xd4
 [<c053f497>] agp_generic_destroy_page+0x0/0x44
 [<c053e6e6>] agp_release+0x7e/0x143
 [<c046fc1b>] __fput+0x9c/0x167
 [<c046d609>] filp_close+0x4e/0x54
 [<c046e835>] sys_close+0x71/0xa8
 [<c0405413>] syscall_call+0x7/0xb
=======================
Code: 89 54 24 04 89 f3 8b 4c 24 04 89 04 24 89 74 24 54 89 54 24 58 8b 07 8b 57 04 f0 0f c7 0f 75 f5 8b 6c 24 18 8b 45 0c 85 c0 75 08 <0f> 0b 9c 00 f2 cf 62 c0 8b 54 24 18 48 89 42 0c eb 08 0f 0b 9f
EIP: [<c0416981>] change_page_attr+0x571/0x7e0 SS:ESP 0069:ebbf5eb0
<0>Kernel panic - not syncing: Fatal exception
BUG: warning at arch/i386/kernel/smp-xen.c:529/smp_call_function() (Not tainted)
 [<c0410983>] smp_call_function+0x59/0xfe
 [<c0410a3b>] smp_send_stop+0x13/0x1e
 [<c041f68b>] panic+0x4c/0x171
 [<c04060a5>] die+0x262/0x296
 [<c04065f5>] do_invalid_op+0x0/0x9d
 [<c0406686>] do_invalid_op+0x91/0x9d
 [<c0416981>] change_page_attr+0x571/0x7e0
 [<c04517af>] __generic_file_aio_write_nolock+0x4a6/0x52a
 [<c04be286>] avc_has_perm+0x3a/0x44
 [<c0405597>] error_code+0x2b/0x30
 [<c0416981>] change_page_attr+0x571/0x7e0
 [<c053f493>] unmap_page_from_agp+0x27/0x2b
 [<c053f4b8>] agp_generic_destroy_page+0x21/0x44
 [<c053f39f>] agp_free_memory+0x9e/0xd4
 [<c053f497>] agp_generic_destroy_page+0x0/0x44
 [<c053e6e6>] agp_release+0x7e/0x143
 [<c046fc1b>] __fput+0x9c/0x167
 [<c046d609>] filp_close+0x4e/0x54
 [<c046e835>] sys_close+0x71/0xa8
 [<c0405413>] syscall_call+0x7/0xb
=======================
(XEN) Domain 0 crashed: rebooting machine in 5 seconds.

Created attachment 323960 [details]
kernel patch with the fixes
The kernel side patch I used, in addition to the Xen hypervisor patch from the other attachment.
With both of these patches, I still get the oops.
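For readers following the oops: assuming the BUG at arch/i386/mm/pageattr.c:156 is the `BUG_ON(page_private(kpte_page) == 0)` check in __change_page_attr() (the code is quoted elsewhere in this thread), it would fire when a page is switched back to PAGE_KERNEL without a matching earlier switch away from it. Below is a toy model of that counter in plain C. All names are hypothetical and none of the real page-table handling is reproduced.

```c
#include <stdbool.h>

/* Toy stand-in for the per-page-table counter kept in page_private(). */
struct kpte_page_model {
    unsigned long changed_ptes;
};

/* Models the "prot != PAGE_KERNEL" branch: one more pte is non-default. */
static void model_set_nondefault(struct kpte_page_model *kp)
{
    kp->changed_ptes++;
}

/* Models the "restore PAGE_KERNEL" branch. Returns false at the point
 * where the kernel would hit BUG_ON(page_private(kpte_page) == 0). */
static bool model_restore_default(struct kpte_page_model *kp)
{
    if (kp->changed_ptes == 0)
        return false;
    kp->changed_ptes--;
    return true;
}
```

In this model, a restore that has no matching earlier change is the failing call. In the thread's terms that might correspond to unmap_page_from_agp() running on a page that was never successfully mapped, or whose address did not survive the round trip through the GART table, though that reading is an assumption.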
Heh, this may have been a logic inversion. Let me try again with this:
diff -r1.1.2.1 linux-2.6-xen-agp-paddr-overflow.patch
50c50
< + if (xen_create_contiguous_region((unsigned long)page_address(page), 0, 32))
---
> + if (!xen_create_contiguous_region((unsigned long)page_address(page), 0, 32))
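A later comment in this thread states that xen_create_contiguous_region() returns 0 on success, which would make the original (non-negated) test the failure branch. Here is a sketch of the error-propagating shape under that assumption. `create_contiguous_region_stub()` and `alloc_agp_page_sketch()` are hypothetical stand-ins, not the RHEL or Xen code, and `malloc()` stands in for the page allocator.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for xen_create_contiguous_region():
 * returns 0 on success, nonzero on failure (assumed convention). */
static int create_contiguous_region_stub(void *vaddr, unsigned int order,
                                         unsigned int address_bits,
                                         int simulate_failure)
{
    (void)vaddr;
    (void)order;
    (void)address_bits;
    return simulate_failure ? -1 : 0;
}

/* Hypothetical allocation path: fail the allocation if the region
 * could not be made contiguous below 2^address_bits. */
static void *alloc_agp_page_sketch(int simulate_failure)
{
    void *page = malloc(4096);
    if (page == NULL)
        return NULL;

    /* Nonzero means failure, so there is no '!' in this test. */
    if (create_contiguous_region_stub(page, 0, 32, simulate_failure)) {
        free(page);
        return NULL;
    }
    return page;
}
```

The design point is only that the caller sees NULL when the region cannot be created, instead of carrying on with a page that does not meet the 32-bit constraint.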
No luck, still the same bug :(

------------[ cut here ]------------
kernel BUG at arch/i386/mm/pageattr.c:156!
invalid opcode: 0000 [#1]
SMP
last sysfs file: /class/drm/card0/dev
Modules linked in: i915 drm netloop netbk blktap blkbk ipt_MASQUERADE iptable_nat ip_nat bridge autofs4 hidp nfs lockd fscache nfs_acl rfcomm l2cap bluetooth sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 xfrm_nalgo crypto_api cpufreq_ondemand acpi_cpufreq dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi ac parport_pc lp parport sr_mod cdrom sg snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss serial_core snd_pcm snd_timer i2c_i801 snd_page_alloc snd_hwdep snd soundcore e1000e serio_raw i2c_core pcspkr dm_snapshot dm_zero dm_mirror dm_log dm_mod pata_marvell ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
CPU: 0
EIP: 0061:[<c0416981>] Not tainted VLI
EFLAGS: 00210046 (2.6.18-124.el5.bz412691.3xen #1)
EIP is at change_page_attr+0x571/0x7e0
eax: 00000000 ebx: 06be3063 ecx: 80000002 edx: 80000002
esi: 06be3063 edi: c0041a20 ebp: c1672820 esp: eb4d0eb0
ds: 007b es: 007b ss: 0069
Process X (pid: 7350, ti=eb4d0000 task=ecd63550 task.ti=eb4d0000)
Stack: 00000000 80000002 00000001 c17a4880 00000000 00000000 c1672820 00000000
       00000000 c9944000 c0041a20 00000000 00000000 c1672000 00000004 00026510
       00000063 80000000 00000001 00000000 00000000 06be3063 80000002 00000001
Call Trace:
 [<c053f493>] unmap_page_from_agp+0x27/0x2b
 [<c053f4b8>] agp_generic_destroy_page+0x21/0x44
 [<c053f39f>] agp_free_memory+0x9e/0xd4
 [<c053f497>] agp_generic_destroy_page+0x0/0x44
 [<c053e6e6>] agp_release+0x7e/0x143
 [<c046fc1b>] __fput+0x9c/0x167
 [<c046d609>] filp_close+0x4e/0x54
 [<c046e835>] sys_close+0x71/0xa8
 [<c0405413>] syscall_call+0x7/0xb
=======================
Code: 89 54 24 04 89 f3 8b 4c 24 04 89 04 24 89 74 24 54 89 54 24 58 8b 07 8b 57 04 f0 0f c7 0f 75 f5 8b 6c 24 18 8b 45 0c 85 c0 75 08 <0f> 0b 9c 00 f2 cf 62 c0 8b 54 24 18 48 89 42 0c eb 08 0f 0b 9f
EIP: [<c0416981>] change_page_attr+0x571/0x7e0 SS:ESP 0069:eb4d0eb0
<0>Kernel panic - not syncing: Fatal exception
BUG: warning at arch/i386/kernel/smp-xen.c:529/smp_call_function() (Not tainted)
 [<c0410983>] smp_call_function+0x59/0xfe
 [<c0410a3b>] smp_send_stop+0x13/0x1e
 [<c041f68b>] panic+0x4c/0x171
 [<c04060a5>] die+0x262/0x296
 [<c04065f5>] do_invalid_op+0x0/0x9d
 [<c0406686>] do_invalid_op+0x91/0x9d
 [<c0416981>] change_page_attr+0x571/0x7e0
 [<c040e7e9>] generic_get_mtrr+0x21/0x42
 [<c04517af>] __generic_file_aio_write_nolock+0x4a6/0x52a
 [<c04be286>] avc_has_perm+0x3a/0x44
 [<c0405597>] error_code+0x2b/0x30
 [<c0416981>] change_page_attr+0x571/0x7e0
 [<c053f493>] unmap_page_from_agp+0x27/0x2b
 [<c053f4b8>] agp_generic_destroy_page+0x21/0x44
 [<c053f39f>] agp_free_memory+0x9e/0xd4
 [<c053f497>] agp_generic_destroy_page+0x0/0x44
 [<c053e6e6>] agp_release+0x7e/0x143
 [<c046fc1b>] __fput+0x9c/0x167
 [<c046d609>] filp_close+0x4e/0x54
 [<c046e835>] sys_close+0x71/0xa8
 [<c0405413>] syscall_call+0x7/0xb
=======================
(XEN) Domain 0 crashed: rebooting machine in 5 seconds.

Can you please try adding dom0_mem=512M to grub's xen entry when the memory is populated to 4G? That should work around this issue. Also, I think the patch in comment 31 is not needed, since xen_create_contiguous_region() will return 0 for success, but maybe we need to add some check, so that if xen_create_contiguous_region() failed, we fail agp_generic_alloc_page() also.
Thanks
Yunhong Jiang

Booting with dom0_mem=512M does indeed avoid the bug. Of course, that is probably not an acceptable thing to do for RHEL :)

Can you boot native Linux on that machine and attach the E820 memory map reported to the BZ entry? I'd like to verify that you are seeing the same BIOS issue we are.
With 4G of RAM in your machine the E820 map should be showing <1G available.

Here is the e820 map as printed out by the Xen hypervisor:

(XEN) Xen-e820 RAM map:
(XEN) 0000000000000000 - 000000000009dc00 (usable)
(XEN) 000000000009dc00 - 00000000000a0000 (reserved)
(XEN) 00000000000e0000 - 0000000000100000 (reserved)
(XEN) 0000000000100000 - 00000000cdc90000 (usable)
(XEN) 00000000cdc90000 - 00000000cdcf6000 (ACPI NVS)
(XEN) 00000000cdcf6000 - 00000000ceec6000 (usable)
(XEN) 00000000ceec6000 - 00000000ceec8000 (reserved)
(XEN) 00000000ceec8000 - 00000000cef7a000 (usable)
(XEN) 00000000cef7a000 - 00000000cefe5000 (ACPI NVS)
(XEN) 00000000cefe5000 - 00000000cefe7000 (usable)
(XEN) 00000000cefe7000 - 00000000ceff3000 (ACPI data)
(XEN) 00000000ceff3000 - 00000000ceff4000 (usable)
(XEN) 00000000ceff4000 - 00000000cefff000 (ACPI data)
(XEN) 00000000cefff000 - 00000000cf000000 (usable)
(XEN) 00000000cf000000 - 00000000d0000000 (reserved)
(XEN) 00000000f0000000 - 00000000f8000000 (reserved)
(XEN) 00000000ffc00000 - 0000000100000000 (reserved)
(XEN) 0000000100000000 - 000000022c000000 (usable)
(XEN) System RAM: 8110MB (8305356kB)

The e820 map seen by the non-xen kernel is the same:

BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009dc00 (usable)
BIOS-e820: 000000000009dc00 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 00000000cdc90000 (usable)
BIOS-e820: 00000000cdc90000 - 00000000cdcf6000 (ACPI NVS)
BIOS-e820: 00000000cdcf6000 - 00000000ceec6000 (usable)
BIOS-e820: 00000000ceec6000 - 00000000ceec8000 (reserved)
BIOS-e820: 00000000ceec8000 - 00000000cef7a000 (usable)
BIOS-e820: 00000000cef7a000 - 00000000cefe5000 (ACPI NVS)
BIOS-e820: 00000000cefe5000 - 00000000cefe7000 (usable)
BIOS-e820: 00000000cefe7000 - 00000000ceff3000 (ACPI data)
BIOS-e820: 00000000ceff3000 - 00000000ceff4000 (usable)
BIOS-e820: 00000000ceff4000 - 00000000cefff000 (ACPI data)
BIOS-e820: 00000000cefff000 - 00000000cf000000 (usable)
BIOS-e820: 00000000cf000000 - 00000000d0000000 (reserved)
BIOS-e820: 00000000f0000000 - 00000000f8000000 (reserved)
BIOS-e820: 00000000ffc00000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 000000022c000000 (usable)
8000MB HIGHMEM available.
896MB LOWMEM available.

FYI, here is the dmidecode info on the BIOS:

Handle 0x0005, DMI type 0, 24 bytes.
BIOS Information
    Vendor: Intel Corp.
    Version: JOQ3510J.86A.0942.2008.0807.1958
    Release Date: 08/07/2008
    Address: 0xF0000
    Runtime Size: 64 kB
    ROM Size: 4096 kB
    Characteristics:
        PCI is supported
        BIOS is upgradeable
        BIOS shadowing is allowed
        Boot from CD is supported
        Selectable boot is supported
        EDD is supported
        8042 keyboard services are supported (int 9h)
        Serial services are supported (int 14h)
        Printer services are supported (int 17h)
        CGA/mono video services are supported (int 10h)
        ACPI is supported
        USB legacy is supported
        ATAPI Zip drive boot is supported
        BIOS boot specification is supported
        Function key-initiated network boot is supported
        Targeted content distribution is supported
    BIOS Revision: 0.0
    Firmware Revision: 0.0

Rik, according to comment 32, it seems the BUG is at line 156 of arch/i386/mm/pageattr.c, while originally it was at line 130. I had a look at the code and it is a bit strange that it hits line 156. The pgprot_val should have been changed to PAGE_KERNEL_NOCACHE, so we should be in the first "if" statement instead of the BUG_ON in the "else if" statement. Anyway, can you add some changes to agp_generic_alloc_page(), so that it will fail if map_page_into_agp() failed?

if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) {
    if ((pte_val(*kpte) & _PAGE_PSE) == 0) {
        set_pte_atomic(kpte, mk_pte(page, prot));
    } else {
        pgprot_t ref_prot;
        struct page *split;

        ref_prot =
        ((address & LARGE_PAGE_MASK) < (unsigned long)&_etext)
            ? PAGE_KERNEL_EXEC : PAGE_KERNEL;
        split = split_large_page(address, prot, ref_prot);
        if (!split)
            return -ENOMEM;
        set_pmd_pte(kpte,address,mk_pte(split, ref_prot));
        kpte_page = split;
    }
    page_private(kpte_page)++;
} else if ((pte_val(*kpte) & _PAGE_PSE) == 0) {
    set_pte_atomic(kpte, mk_pte(page, PAGE_KERNEL));
    BUG_ON(page_private(kpte_page) == 0);
    page_private(kpte_page)--;
} else
    BUG();

Rik, after more investigation, we have found the reason for the panic. Currently xen reserves a 128M DMA buffer at most, while the on-board graphics card requires 256M of memory. With the following patch + the xen patch + your patch in comments 30+31, everything works quite well.

diff -r b90893077a90 xen/arch/x86/domain_build.c
--- a/xen/arch/x86/domain_build.c Thu Nov 20 07:29:20 2008 +0800
+++ b/xen/arch/x86/domain_build.c Thu Nov 20 07:29:39 2008 +0800
@@ -139,7 +139,7 @@ static unsigned long __init compute_dom0
     if ( dom0_nrpages == 0 )
     {
         dom0_nrpages = avail;
-        dom0_nrpages = min(dom0_nrpages / 16, 128L << (20 - PAGE_SHIFT));
+        dom0_nrpages = min(dom0_nrpages / 8, 384L << (20 - PAGE_SHIFT));
         dom0_nrpages = -dom0_nrpages;
     }

There are some alternative methods to achieve this:
a) Update xen, so that when dom0 allocates a page with GFP_DMA32, it will get memory below 4G, instead of >4G memory. This requires changing how Xen sets up the mapping for dom0; however, it seems upstream does not want to accept this solution. See http://article.gmane.org/gmane.comp.emulators.xen.devel/58160 for more discussion.

With all three patches, the oops no longer happens. Of course, X still does not actually work right, but the oops seems to be gone.

Doh ... X was just misdetecting the monitor now. Everything is working now with the 3 patches above.

in kernel-2.6.18-125.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Jiang, Yunhong,
Have you been able to validate the kernel mentioned in comment #45?
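To make the numbers in the domain_build.c patch concrete, here is a small sketch of the two expressions from that diff, computed in 4kB pages. `reserve_old()`, `reserve_new()`, `min_long()` and the MB helpers are hypothetical names introduced for illustration. Only the two formulas are taken from the diff, and the result is the amount the thread describes as reserved for the DMA buffer.

```c
#define PAGE_SHIFT 12

static long min_long(long a, long b) { return a < b ? a : b; }

/* Convert between megabytes and 4kB pages. */
static long mb_to_pages(long mb)    { return mb << (20 - PAGE_SHIFT); }
static long pages_to_mb(long pages) { return pages >> (20 - PAGE_SHIFT); }

/* Before the patch: min(avail / 16, 128MB). */
static long reserve_old(long avail_pages)
{
    return min_long(avail_pages / 16, 128L << (20 - PAGE_SHIFT));
}

/* After the patch: min(avail / 8, 384MB). */
static long reserve_new(long avail_pages)
{
    return min_long(avail_pages / 8, 384L << (20 - PAGE_SHIFT));
}
```

For 512MB of usable memory the old rule gives 32MB, which matches the figure quoted earlier in the thread. For the 8110MB reported by the hypervisor it caps at 128MB, short of the 256M the graphics card is said to need, while the patched rule yields 384MB.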
I emailed Yunhong and they said that the motherboard they were using for testing is currently broken, so they are unable to test this issue.

I've received confirmation that this issue has definitely been fixed, based on testing with our DQ35JO system.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2009-0225.html