448115 – Guest crash when host has >= 64G RAM

Summary: Guest crash when host has >= 64G RAM

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel-xen
Sub Component:
Version:	5.2
Hardware:	All
OS:	Linux
Priority:	low
Severity:	low
Target Milestone:	rc
Target Release:	---
Assignee:	Rik van Riel
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	472290 486863
Depends On:
Blocks:	504988 664784

Reported:	2008-05-23 15:25 UTC by Ian Campbell
Modified:	2018-10-19 18:50 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	504988 664784
Environment:
Last Closed:	2009-09-02 09:01:35 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
Backport of upstream xen-unstable c/s 13549 (1021 bytes, patch) 2009-03-01 14:16 UTC, Chris Lalancette	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2009:1243	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 5.4 kernel security and bug fix update	2009-09-01 08:53:34 UTC

Description Ian Campbell 2008-05-23 15:25:52 UTC

Description of problem:

Kernel 2.6.18-92.el5 crashes when run as a guest on a machine with >= 64G of
RAM. This is caused by the patches in 294811. The issue was fixed upstream by
http://xenbits.xensource.com/xen-unstable.hg?rev/f36700819453. My apologies for
not including this in 294811.

Guest output:
Using IPI No-Shortcut mode
XENBUS: Device with no driver: device/vbd/51712
XENBUS: Device with no driver: device/vif/0
Freeing unused kernel memory: 176k freed
Write protecting the kernel read-only data: 379k
BUG: unable to handle kernel paging request at virtual address e100e160
 printing eip:
c0457e5a
00713000 -> *pde = 00000010:1fec1027
BUG: unable to handle kernel paging request at virtual address 15555840
 printing eip:
c060a8f3
00713000 -> *pde = 00000010:1febe027
BUG: unable to handle kernel paging request at virtual address 15555550
 printing eip:
c060a8f31
etc...

Version-Release number of selected component (if applicable):

2.6.18-92.el5

How reproducible:

Every boot on a host with >= 64GiB RAM

Comment 1 Chris Lalancette 2009-01-15 10:01:47 UTC

*** Bug 472290 has been marked as a duplicate of this bug. ***

Comment 2 Chris Lalancette 2009-01-15 14:46:41 UTC

I've uploaded a test kernel that contains this fix (along with several others)
to this location:

http://people.redhat.com/clalance/virttest

Could the original reporter try out the test kernels there, and report back if
it fixes the problem?

Thanks,
Chris Lalancette

Comment 3 Ian Campbell 2009-01-20 11:25:30 UTC

I can confirm that kernel-xen-2.6.18-128.el5virttest4.i686.rpm fixes the issue in a RHEL 5.2 guest on a 128G host. (installed with --nodeps due to an ecryptfs dependency)

(I also initially tested on 4.7 for some confused reason, so FWIW it worked there too ;-))

Comment 4 Chris Lalancette 2009-01-20 11:34:28 UTC

OK, that's great to hear.  Thanks for the testing!

Chris Lalancette

Comment 8 Chris Lalancette 2009-02-22 20:24:45 UTC

*** Bug 486863 has been marked as a duplicate of this bug. ***

Comment 9 Chris Lalancette 2009-03-01 14:16:52 UTC

Created attachment 333646 [details]
Backport of upstream xen-unstable c/s 13549

Comment 10 Don Zickus 2009-03-04 19:59:17 UTC

in kernel-2.6.18-133.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 13 Shawn Rhode 2009-04-06 20:27:53 UTC

Chris,
  Does this bug affect systems with 64GB of RAM that already have some guests running in the same manner?  We have a customer that booted a guest on a server with 64GB of RAM with no other guests running yet, and they hit this bug immediately.  However, they then tried booting the same guest (same virtual disk and configuration) on another identical server with 64GB of RAM that already had 4 guests booted, with a total of 14GBytes of memory allocated and in use between the 4 guests (3 x 4GBytes and 1 x 2GBytes), and didn't hit the panic.  The only difference between the two servers is that one has guests booted and running and the other one does not.  Would that affect this bug in some way?

Comment 15 Rik van Riel 2009-04-06 21:39:58 UTC

Shawn, your situation is probably due to sheer luck - the top bit of memory (the memory the BIOS remapped above 64GB to make space for IO memory) was already allocated to other guests (presumably fully virtualized ones) so the newly started guest only got memory below 64GB.
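
The arithmetic behind this explanation can be sketched as follows. This is a simplified model, not the hypervisor's actual accounting: the function name, the flat "RAM plus relocated hole" layout, and the 1 GiB IO hole in the usage note are all assumptions made for illustration. Only the 64 GiB boundary comes from this report.

```python
GIB = 1 << 30
PAGE_SIZE = 4096  # 4 KiB pages on x86

# Boundary discussed in this bug: machine memory at or above 64 GiB.
LIMIT_BYTES = 64 * GIB


def frames_above_limit(ram_bytes, io_hole_bytes):
    """Count page frames that end up above 64 GiB.

    Assumes (hypothetically) that the BIOS moves `io_hole_bytes` of RAM
    out of the low address range and appends it at the top of memory,
    so the highest RAM address is ram_bytes + io_hole_bytes.
    """
    top_of_ram = ram_bytes + io_hole_bytes
    overflow = max(0, top_of_ram - LIMIT_BYTES)
    return overflow // PAGE_SIZE
```

Under these assumptions, `frames_above_limit(64 * GIB, 1 * GIB)` gives 262144 frames (1 GiB worth of 4 KiB pages) that sit above the boundary, while a 32 GiB host gives 0. If other guests already own those frames, a newly started guest never receives them, which matches the "sheer luck" case described above.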

Comment 16 Chris Lalancette 2009-04-07 06:40:25 UTC

You also have to be careful about what you are asking for.  This bug specifically affects 32-bit PV guests running on 64-bit dom0.  If that is *not* their situation, then their bug is something else.  If that is their situation, it is possible that this caused it.  It's hard to say, though; do you have stack traces and xm dmesg information from the problem domains?

Chris Lalancette
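
The conditions named in this comment and in the bug summary can be written down as a small triage check. This is only a sketch: the function and parameter names are hypothetical, and a match means the configuration is a candidate for this bug, not that the crash will occur (see the previous comment on memory placement).

```python
def may_match_bug_448115(guest_bits, guest_is_pv, dom0_bits, host_ram_gib):
    """Return True when a setup meets the preconditions stated in this bug.

    Preconditions taken from the report:
      - 32-bit paravirtualized (PV) guest
      - 64-bit dom0
      - host with >= 64G of RAM
    """
    return (
        guest_bits == 32
        and guest_is_pv
        and dom0_bits == 64
        and host_ram_gib >= 64
    )
```

For example, `may_match_bug_448115(32, True, 64, 128)` is True for the 128G host used in Comment 3, while a fully virtualized guest or a host with 32G of RAM returns False and points to some other problem.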

Comment 17 Shawn Rhode 2009-04-07 13:07:42 UTC

The customer is running a 32bit guest on a 64bit capable hypervisor.  The stack trace is:
Red Hat nash version 5.1.19.6 starting
------------[ cut here ]------------
kernel BUG at include/linux/mm.h:310!
invalid opcode: 0000 [#1]
SMP
last sysfs file: /block/ram0/dev
Modules linked in:
CPU: 2
EIP: 0061:[<c0454a77>] Not tainted VLI
EFLAGS: 00010246 (2.6.18-92.1.10.el5xen #1)
EIP is at release_pages+0x4e/0x137
eax: 00000000 ebx: e12423e0 ecx: 00000000 edx: 00000000
esi: c100988c edi: c1009878 ebp: 00000000 esp: c36c7dbc
ds: 007b es: 007b ss: 0069
Process init (pid: 271, ti=c36c7000 task=c36bf000 task.ti=c36c7000)
Stack: 00000005 00000000 00000000 c3217480 c3217500 c32174e0 c3217520 c123f000
c0199fe8 bfa3bfff bfa3c000 ed79cee4 00000000 00000000 bfa27000 c045e00c
00000000 e12423c0 c100988c 00000005 c1009878 c04644f0 00000005 00000005
Call Trace:
[<c045e00c>] free_pgtables+0x69/0x76
[<c04644f0>] free_pages_and_swap_cache+0x6b/0x7f
[<c045f23f>] exit_mmap+0xb0/0xe4
[<c041f6ee>] mmput+0x25/0x69
[<c0476ecd>] flush_old_exec+0x629/0x8af
[<c046c1b7>] get_unused_fd+0x54/0xb5
[<c0493860>] load_elf_binary+0x494/0x15e7
[<c06094b8>] _spin_lock_irqsave+0x8/0x28
[<c04582b6>] page_address+0x7a/0x81
[<c04589bb>] kmap_high+0x1c/0x2b1
[<c06094b8>] _spin_lock_irqsave+0x8/0x28
[<c04582b6>] page_address+0x7a/0x81
[<c0476072>] search_binary_handler+0x99/0x219
[<c0477a4f>] do_execve+0x158/0x1f5
[<c040337d>] sys_execve+0x2a/0x4a
[<c0405413>] syscall_call+0x7/0xb
=======================
Code: 8b 03 f6 c4 40 74 1d 85 d2 74 0d b0 01 86 82 80 11 00 00 e8 50 5e fc ff 89 d8 e8 8b ff ff ff e9 b8 00 00 00 8b 43 04 85 c0 75 08 <0f> 0b 36 01 b0 b6 62 c0 f0 ff 4b 04 0f 94 c0 84 c0 0f 84 9c 00
EIP: [<c0454a77>] release_pages+0x4e/0x137 SS:ESP 0069:c36c7dbc
<0>Kernel panic - not syncing: Fatal exception

I do not currently have the dmesg output from the hypervisor/domains.  This may be hard to get.  At this point, I think that the information provided thus far is enough to help us understand this specific edge case.  If you want further information, you can chat with Chris Tatman, as he is our Red Hat TAM and he can provide you with more information on this issue.

Thank you,
Shawn
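
When comparing traces like the one above across duplicate reports, it can help to pull out the `symbol+offset/size` frames mechanically. The following is a minimal sketch that assumes the 32-bit frame format shown in this comment (`[<address>] symbol+0xOFF/0xSIZE`). The helper name and regular expression are illustrative and not part of any kernel tool.

```python
import re

# Matches frames such as "[<c045e00c>] free_pgtables+0x69/0x76".
FRAME_RE = re.compile(
    r"\[<([0-9a-f]{8})>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)


def parse_frames(oops_text):
    """Return (address, symbol, offset, size) tuples for each frame found."""
    frames = []
    for match in FRAME_RE.finditer(oops_text):
        addr, symbol, offset, size = match.groups()
        frames.append((int(addr, 16), symbol, int(offset, 16), int(size, 16)))
    return frames


# Two frames copied from the call trace in this comment.
sample = """[<c045e00c>] free_pgtables+0x69/0x76
[<c04644f0>] free_pages_and_swap_cache+0x6b/0x7f"""
frames = parse_frames(sample)
```

Lines without a `symbol+0x...` part, such as `EIP: 0061:[<c0454a77>] Not tainted VLI`, are skipped by this pattern.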

Comment 18 Chris Ward 2009-07-03 18:02:59 UTC

~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.

Comment 20 errata-xmlrpc 2009-09-02 09:01:35 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html
