Red Hat Bugzilla – Bug 448115
Guest crash when host has >= 64G RAM
Last modified: 2011-01-24 18:02:50 EST
Description of problem:
Kernel 2.6.18-92.el5 crashes when run as a guest on a machine with >= 64G of
RAM. This is caused by the patches in 294811. The issue was fixed with
http://xenbits.xensource.com/xen-unstable.hg?rev/f36700819453. My apologies for
not including this in 294811.
Using IPI No-Shortcut mode
XENBUS: Device with no driver: device/vbd/51712
XENBUS: Device with no driver: device/vif/0
Freeing unused kernel memory: 176k freed
Write protecting the kernel read-only data: 379k
BUG: unable to handle kernel paging request at virtual address e100e160
00713000 -> *pde = 00000010:1fec1027
BUG: unable to handle kernel paging request at virtual address 15555840
00713000 -> *pde = 00000010:1febe027
BUG: unable to handle kernel paging request at virtual address 15555550
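The `*pde` values in the log above already hint at the 64G connection. As a rough sketch, assuming the two colon-separated fields are the high and low 32-bit halves of a 64-bit PAE page directory entry (the helper name `decode_pde` is made up for this illustration), the entries can be decoded like this:

```python
# Sketch only: assumes "*pde = HHHHHHHH:LLLLLLLL" is the high:low halves
# of one 64-bit PAE entry, as printed by the guest kernel's fault path.
PAE_LIMIT = 1 << 36                      # 64 GiB, the limit of 36-bit addressing
ADDR_MASK = ((1 << 52) - 1) & ~0xFFF     # bits 12..51 hold the frame address
FLAG_MASK = 0xFFF                        # low 12 bits hold the flags


def decode_pde(text):
    """Split a 'high:low' hex pair into (machine address, flags)."""
    high, low = (int(part, 16) for part in text.split(":"))
    entry = (high << 32) | low
    return entry & ADDR_MASK, entry & FLAG_MASK


if __name__ == "__main__":
    for raw in ("00000010:1fec1027", "00000010:1febe027"):
        addr, flags = decode_pde(raw)
        print(f"{raw}: address {addr:#x}, flags {flags:#05x}, "
              f"at or above 64 GiB: {addr >= PAE_LIMIT}")
```

Under that assumption both entries point at machine addresses just past the 64 GiB mark (0x101fec1000 and 0x101febe000), which is consistent with the crash only showing up on hosts with >= 64G of RAM.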
Version-Release number of selected component (if applicable):
How reproducible:
Every boot on a host with >= 64GiB RAM
*** Bug 472290 has been marked as a duplicate of this bug. ***
I've uploaded a test kernel that contains this fix (along with several others)
to this location:
Could the original reporter try out the test kernels there, and report back if
it fixes the problem?
I can confirm that kernel-xen-2.6.18-128.el5virttest4.i686.rpm fixes the issue in a RHEL 5.2 guest on a 128G host. (installed with --nodeps due to an ecryptfs dependency)
(I also initially tested on 4.7 for some confused reason, so FWIW it worked there too ;-))
OK, that's great to hear. Thanks for the testing!
*** Bug 486863 has been marked as a duplicate of this bug. ***
Created attachment 333646 [details]
Backport of upstream xen-unstable c/s 13549
You can download this test kernel from http://people.redhat.com/dzickus/el5
Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so. However feel free
to provide a comment indicating that this fix has been verified.
Does this bug also affect systems with 64GB of RAM that already have some guests running? We have a customer who booted a guest on a server with 64GB of RAM, with no other guests running yet, and immediately hit this bug. However, they then tried booting the same guest (same virtual disk and configuration) on another, identical server with 64GB of RAM that already had 4 guests booted, with a total of 14GBytes of memory allocated and in use between them (3 x 4GBytes and 1 x 2GBytes), and didn't hit the panic. The only difference between the two servers is that one has guests booted and running and the other does not. Would that affect this bug in some way?
Shawn, your situation is probably due to sheer luck - the top bit of memory (the memory the BIOS remapped above 64GB to make space for IO memory) was already allocated to other guests (presumably fully virtualized ones) so the newly started guest only got memory below 64GB.
You also have to be careful about what you are asking for. This bug specifically affects 32-bit PV guests running on 64-bit dom0. If that is *not* their situation, then their bug is something else. If that is their situation, it is possible that this caused it. It's hard to say, though; do you have stack traces and xm dmesg information from the problem domains?
The customer is running a 32bit guest on a 64bit capable hypervisor. The stack trace is:
Red Hat nash version 184.108.40.206 starting
------------[ cut here ]------------
kernel BUG at include/linux/mm.h:310!
invalid opcode: 0000 [#1]
last sysfs file: /block/ram0/dev
Modules linked in:
EIP: 0061:[<c0454a77>] Not tainted VLI
EFLAGS: 00010246 (2.6.18-92.1.10.el5xen #1)
EIP is at release_pages+0x4e/0x137
eax: 00000000 ebx: e12423e0 ecx: 00000000 edx: 00000000
esi: c100988c edi: c1009878 ebp: 00000000 esp: c36c7dbc
ds: 007b es: 007b ss: 0069
Process init (pid: 271, ti=c36c7000 task=c36bf000 task.ti=c36c7000)
Stack: 00000005 00000000 00000000 c3217480 c3217500 c32174e0 c3217520 c123f000
c0199fe8 bfa3bfff bfa3c000 ed79cee4 00000000 00000000 bfa27000 c045e00c
00000000 e12423c0 c100988c 00000005 c1009878 c04644f0 00000005 00000005
Code: 8b 03 f6 c4 40 74 1d 85 d2 74 0d b0 01 86 82 80 11 00 00 e8 50 5e fc ff 89 d8 e8 8b ff ff ff e9 b8 00 00 00 8b 43 04 85 c0 75 08 <0f> 0b 36 01 b0 b6 62 c0 f0 ff 4b 04 0f 94 c0 84 c0 0f 84 9c 00
EIP: [<c0454a77>] release_pages+0x4e/0x137 SS:ESP 0069:c36c7dbc
<0>Kernel panic - not syncing: Fatal exception
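As a cross-check on the trace above, the bytes at the faulting EIP (marked `<0f>` in the `Code:` line) can be decoded by hand. The sketch below assumes the i386 `BUG()` layout used by kernels of this era when verbose BUG reporting is enabled: a two-byte `ud2` opcode, followed by a 16-bit line number and a 32-bit pointer to the file name string. The function name `decode_bug_record` is invented for this example, and only the tail of the `Code:` line is pasted in.

```python
import struct

# Tail of the "Code:" line from the trace; <0f> marks the byte at EIP.
CODE_TAIL = "75 08 <0f> 0b 36 01 b0 b6 62 c0 f0 ff 4b 04"


def decode_bug_record(code_line):
    """Return (opcode bytes, line number, file-name pointer) found at EIP.

    Assumes the layout: ud2 (2 bytes), .word line, .long file pointer,
    all little-endian.
    """
    tokens = code_line.split()
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    raw = bytes(int(tok.strip("<>"), 16) for tok in tokens[start:start + 8])
    return struct.unpack("<2sHI", raw)


if __name__ == "__main__":
    opcode, line, file_ptr = decode_bug_record(CODE_TAIL)
    print(f"opcode={opcode.hex()} line={line} file_ptr={file_ptr:#x}")
```

With that layout, the opcode is `0f 0b` (`ud2`) and the following word is 0x0136, i.e. 310, which matches the "kernel BUG at include/linux/mm.h:310!" message at the top of the trace.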
I do not currently have the dmesg output from the hypervisor/domains, and it may be hard to get. At this point, I think the information provided so far is enough to help us understand this specific edge case. If you want further information, you can chat with Chris Tatman; he is our Red Hat TAM and can give you more information on this issue.
~~ Attention - RHEL 5.4 Beta Released! ~~
RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!
If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.
Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.
Questions can be posted to this bug or your customer or partner representative.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.