Description of problem:
Kernel 2.6.18-92.el5 crashes when run as a guest on a machine with >= 64G of RAM. This is caused by the patches in bug 294811. The issue was fixed upstream with http://xenbits.xensource.com/xen-unstable.hg?rev/f36700819453 -- my apologies for not including this in bug 294811.

Guest output:

Using IPI No-Shortcut mode
XENBUS: Device with no driver: device/vbd/51712
XENBUS: Device with no driver: device/vif/0
Freeing unused kernel memory: 176k freed
Write protecting the kernel read-only data: 379k
BUG: unable to handle kernel paging request at virtual address e100e160
 printing eip:
c0457e5a
00713000 -> *pde = 00000010:1fec1027
BUG: unable to handle kernel paging request at virtual address 15555840
 printing eip:
c060a8f3
00713000 -> *pde = 00000010:1febe027
BUG: unable to handle kernel paging request at virtual address 15555550
 printing eip:
c060a8f3
etc...

Version-Release number of selected component (if applicable):
2.6.18-92.el5

How reproducible:
Every boot on a host with >= 64GiB RAM
*** Bug 472290 has been marked as a duplicate of this bug. ***
I've uploaded a test kernel that contains this fix (along with several others) to this location: http://people.redhat.com/clalance/virttest Could the original reporter try out the test kernels there, and report back if it fixes the problem? Thanks, Chris Lalancette
I can confirm that kernel-xen-2.6.18-128.el5virttest4.i686.rpm fixes the issue in a RHEL 5.2 guest on a 128G host. (Installed with --nodeps due to an ecryptfs dependency.) (I also initially tested on 4.7 for some confused reason, so FWIW it worked there too ;-))
OK, that's great to hear. Thanks for the testing! Chris Lalancette
*** Bug 486863 has been marked as a duplicate of this bug. ***
Created attachment 333646 [details] Backport of upstream xen-unstable c/s 13549
in kernel-2.6.18-133.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.
Chris,

Does this bug affect systems with 64GB of RAM that already have some guests running? We have a customer that booted a guest on a server with 64GB of RAM with no other guests running yet, and they immediately hit this bug. However, they then tried booting the same guest (same virtual disk and configuration) on another identical server with 64GB of RAM that already had 4 guests booted, with a total of 14GB of memory allocated and in use between the 4 guests (3 x 4GB and 1 x 2GB), and did not hit the panic. The only difference between the two servers is that one has guests booted and running and the other does not. Would that affect this bug in some way?
Shawn, your situation is probably due to sheer luck - the top bit of memory (the memory the BIOS remapped above 64GB to make space for IO memory) was already allocated to other guests (presumably fully virtualized ones) so the newly started guest only got memory below 64GB.
You also have to be careful about what you are asking for. This bug specifically affects 32-bit PV guests running on 64-bit dom0. If that is *not* their situation, then their bug is something else. If that is their situation, it is possible that this caused it. It's hard to say, though; do you have stack traces and xm dmesg information from the problem domains? Chris Lalancette
The customer is running a 32-bit guest on a 64-bit capable hypervisor. The stack trace is:

Red Hat nash version 5.1.19.6 starting
------------[ cut here ]------------
kernel BUG at include/linux/mm.h:310!
invalid opcode: 0000 [#1]
SMP
last sysfs file: /block/ram0/dev
Modules linked in:
CPU:    2
EIP:    0061:[<c0454a77>]    Not tainted VLI
EFLAGS: 00010246   (2.6.18-92.1.10.el5xen #1)
EIP is at release_pages+0x4e/0x137
eax: 00000000   ebx: e12423e0   ecx: 00000000   edx: 00000000
esi: c100988c   edi: c1009878   ebp: 00000000   esp: c36c7dbc
ds: 007b   es: 007b   ss: 0069
Process init (pid: 271, ti=c36c7000 task=c36bf000 task.ti=c36c7000)
Stack: 00000005 00000000 00000000 c3217480 c3217500 c32174e0 c3217520 c123f000
       c0199fe8 bfa3bfff bfa3c000 ed79cee4 00000000 00000000 bfa27000 c045e00c
       00000000 e12423c0 c100988c 00000005 c1009878 c04644f0 00000005 00000005
Call Trace:
 [<c045e00c>] free_pgtables+0x69/0x76
 [<c04644f0>] free_pages_and_swap_cache+0x6b/0x7f
 [<c045f23f>] exit_mmap+0xb0/0xe4
 [<c041f6ee>] mmput+0x25/0x69
 [<c0476ecd>] flush_old_exec+0x629/0x8af
 [<c046c1b7>] get_unused_fd+0x54/0xb5
 [<c0493860>] load_elf_binary+0x494/0x15e7
 [<c06094b8>] _spin_lock_irqsave+0x8/0x28
 [<c04582b6>] page_address+0x7a/0x81
 [<c04589bb>] kmap_high+0x1c/0x2b1
 [<c06094b8>] _spin_lock_irqsave+0x8/0x28
 [<c04582b6>] page_address+0x7a/0x81
 [<c0476072>] search_binary_handler+0x99/0x219
 [<c0477a4f>] do_execve+0x158/0x1f5
 [<c040337d>] sys_execve+0x2a/0x4a
 [<c0405413>] syscall_call+0x7/0xb
=======================
Code: 8b 03 f6 c4 40 74 1d 85 d2 74 0d b0 01 86 82 80 11 00 00 e8 50 5e fc ff 89 d8 e8 8b ff ff ff e9 b8 00 00 00 8b 43 04 85 c0 75 08 <0f> 0b 36 01 b0 b6 62 c0 f0 ff 4b 04 0f 94 c0 84 c0 0f 84 9c 00
EIP: [<c0454a77>] release_pages+0x4e/0x137 SS:ESP 0069:c36c7dbc
<0>Kernel panic - not syncing: Fatal exception

I do not currently have the dmesg output from the hypervisor/domains. This may be hard to get.
At this point, I think the information provided thus far is enough to help us understand this specific edge case. If you want further information, you can chat with Chris Tatman, as he is our Red Hat TAM and he can provide you with more information on this issue. Thank you, Shawn
~~ Attention - RHEL 5.4 Beta Released! ~~ RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner! If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity. Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value. Questions can be posted to this bug or your customer or partner representative.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1243.html