234375 – PV guests can crash at boot w/ >4GB memory

Summary: PV guests can crash at boot w/ >4GB memory

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel-xen
Sub Component:
Version:	5.0
Hardware:	All
OS:	Linux
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Chris Lalancette
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-03-28 19:03 UTC by Stephen Tweedie
Modified:	2018-10-19 23:13 UTC
CC List:	3 users
Fixed In Version:	RHBA-2007-0959
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-11-07 19:45:39 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2007:0959	0	normal	SHIPPED_LIVE	Updated kernel packages for Red Hat Enterprise Linux 5 Update 1	2007-11-08 00:47:37 UTC

Description Stephen Tweedie 2007-03-28 19:03:25 UTC

XenSource reports:

"We're seeing some issues with the RHEL5 32b xen kernel that are leading
to frequent XenRT failures. Guests crash during boot when the host has
>4GB of RAM with alarmingly high probability.

The problem is that trying to clear a pte by setting it to PFN 0 can
potentially cause the entry to be temporarily invalid since it writes
the upper word first (i.e. the PTE remains present). It is also not correct
to launder the 0 through p2m, which set_pte will do. The combination of
these causes a crash when swapper_pg_dir and PFN 0 have MFNs on opposite
sides of the 4G boundary."
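The write-ordering problem described above can be sketched in C. This is a minimal model, not the RHEL or Xen code: the helper names (torn_upper_first, torn_lower_first, pae_pte_clear) and the struct are hypothetical, and it assumes only the standard PAE layout in which the PRESENT bit is bit 0 of the lower 32-bit word. It shows what an observer sees between the two 32-bit stores, and a clear routine that stores the lower word first so the entry is never present with a mixed frame number.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT 0x1ull

/* Value visible after only the upper 32 bits of new_pte have been stored. */
static uint64_t torn_upper_first(uint64_t old_pte, uint64_t new_pte)
{
    return (new_pte & 0xffffffff00000000ull) | (old_pte & 0x00000000ffffffffull);
}

/* Value visible after only the lower 32 bits of new_pte have been stored. */
static uint64_t torn_lower_first(uint64_t old_pte, uint64_t new_pte)
{
    return (old_pte & 0xffffffff00000000ull) | (new_pte & 0x00000000ffffffffull);
}

static bool pte_present(uint64_t pte)
{
    return (pte & PTE_PRESENT) != 0;
}

/* Hypothetical view of a PAE PTE as the two words a 32-bit guest stores. */
typedef struct {
    volatile uint32_t low;   /* flags, including PRESENT, and frame bits 12..31 */
    volatile uint32_t high;  /* frame bits 32 and up */
} pae_pte_words;

/* Clear by zeroing the word that holds PRESENT first, then the upper word. */
static void pae_pte_clear(pae_pte_words *p)
{
    p->low = 0;
    atomic_thread_fence(memory_order_release);  /* keep the two stores in order */
    p->high = 0;
}
```

With made-up values, an old entry 0x000000002b7e1067 (present, frame below 4G) and a new non-present entry 0x000000013a5c4000 (frame above 4G), the upper-first intermediate is 0x000000012b7e1067: still present, but naming a frame that neither entry referred to. The lower-first intermediate is already non-present.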

Version-Release number of selected component (if applicable):
RHEL-5.0.0

Steps to Reproduce:

"while true; do xm reboot <name>; done" on a machine with plenty of RAM
will result in a guest crash fairly soon. Of course, you'll need to set
preserve on crash to easily see that it has happened.

Additional info:

Should be fixed by xen-unstable cset 12381:

http://xenbits.xensource.com/xen-unstable.hg?cs=c6efd6c2feaa

Comment 2 RHEL Program Management 2007-04-25 20:17:45 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 4 Ian Campbell 2007-07-03 08:16:21 UTC

It turns out that there has been a hypervisor-side workaround for this issue for
a while now, but it was broken in xen-unstable.hg between
13392:0fd65225e4c6 (17 Jan 2007) and 15433:a5360bf18668 (28 June 2007).
The workaround is in xen/arch/x86/mm.c with the comment (line 3297 in current
xen-unstable):
        /*
         * If this is an upper-half write to a PAE PTE then we assume that
         * the guest has simply got the two writes the wrong way round. We
         * zap the PRESENT bit on the assumption that the bottom half will
         * be written immediately after we return to the guest.
         */

I suspect that the RHEL5 hypervisor has the workaround but doesn't have the
breakage, in which case this can be closed.
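The idea in that comment can be modelled roughly as follows. This is a simplified sketch and not the code in xen/arch/x86/mm.c: emulate_half_write, frame_ok_fn and frame_below_4g are hypothetical names, and frame_below_4g is a toy stand-in for whatever ownership check the hypervisor really performs. The sketch only handles 4-byte writes to one half of an 8-byte PTE.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT    0x1ull
#define PTE_FRAME_MASK 0x000ffffffffff000ull

/* Hypothetical validation hook: may the guest map the frame named by pte? */
typedef bool (*frame_ok_fn)(uint64_t pte);

/* Toy policy for the example only: frames below 4 GiB are acceptable. */
static bool frame_below_4g(uint64_t pte)
{
    return (pte & PTE_FRAME_MASK) < 0x100000000ull;
}

/*
 * Apply a 32-bit write at byte offset 0 (lower half) or 4 (upper half).
 * Returns false and leaves *pte untouched if the write is refused.
 */
static bool emulate_half_write(uint64_t *pte, unsigned offset, uint32_t val,
                               frame_ok_fn frame_ok)
{
    uint64_t next;

    if (offset == 4)
        next = (*pte & 0x00000000ffffffffull) | ((uint64_t)val << 32);
    else
        next = (*pte & 0xffffffff00000000ull) | val;

    if ((next & PTE_PRESENT) && !frame_ok(next)) {
        if (offset != 4)
            return false;        /* not an upper-half write: refuse it */
        /* Upper half written first: zap PRESENT and expect the lower
         * half to be written immediately afterwards. */
        next &= ~PTE_PRESENT;
    }

    *pte = next;
    return true;
}
```

Under this model, an upper-half write that would leave a present entry pointing at an unacceptable frame is accepted with PRESENT cleared, while the same bad value arriving through a lower-half write is still rejected.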

Comment 8 Red Hat Bugzilla 2007-07-25 00:44:41 UTC

change QA contact

Comment 9 Chris Lalancette 2007-07-25 18:03:15 UTC

Crap.  I finally was able to reproduce this, just not in the way specified
originally.  I have a 2 CPU Intel SDV here, running the RHEL 5.1 dom0 bits,
i386.  Then I have 1 RHEL-5.0 i386 PV guest running an FTP test that is
saturating the networking.  Finally, I have a 2nd RHEL-5.0 i386 PV guest just
rebooting in a loop (init 6 in /etc/rc.local).  Very often, that 2nd guest will
fail to boot, with this in xm dmesg:

(XEN) mm.c:3267:d6 ptwr_emulate: could not get_page_from_l1e()

After applying c/s 15433 to the HV from xen-unstable, however, I now see this:

(XEN) mm.c:3263:d14 ptwr_emulate: fixing up invalid PAE PTE 0000000149f12025

and the domain successfully reboots.  I believe we are going to need that c/s
for our HV, to support older 5.0 guests.

Chris Lalancette
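
For readers decoding the value in the second log line, here is a small sketch that splits a 64-bit PAE PTE into its frame number and low flag bits. It assumes the standard x86 PAE layout (bit 0 present, bit 1 writable, bit 2 user, bit 5 accessed, bit 6 dirty, frame address in bits 12..51); the macro and function names are made up for the example and are not taken from Xen.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PTE_P          (1ull << 0)   /* present */
#define PTE_RW         (1ull << 1)   /* writable */
#define PTE_US         (1ull << 2)   /* user accessible */
#define PTE_A          (1ull << 5)   /* accessed */
#define PTE_D          (1ull << 6)   /* dirty */
#define PTE_FRAME_MASK 0x000ffffffffff000ull

/* Machine frame number named by the entry. */
static uint64_t pte_mfn(uint64_t pte)
{
    return (pte & PTE_FRAME_MASK) >> 12;
}

/* True if the frame lies at or above the 4 GiB boundary. */
static bool pte_above_4g(uint64_t pte)
{
    return (pte & PTE_FRAME_MASK) >= 0x100000000ull;
}

/* Print the frame number and the flag bits that are set. */
static void describe_pte(uint64_t pte)
{
    printf("pte %016" PRIx64 ": mfn %" PRIx64 "%s%s%s%s%s%s\n",
           pte, pte_mfn(pte),
           (pte & PTE_P)  ? " present"  : "",
           (pte & PTE_RW) ? " writable" : "",
           (pte & PTE_US) ? " user"     : "",
           (pte & PTE_A)  ? " accessed" : "",
           (pte & PTE_D)  ? " dirty"    : "",
           pte_above_4g(pte) ? " (above 4G)" : "");
}
```

Calling describe_pte(0x0000000149f12025ull) on the logged value gives frame number 149f12, above the 4G boundary, with the present, user and accessed bits set and the writable bit clear.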

Comment 10 Don Zickus 2007-07-31 21:14:50 UTC

The fix is in 2.6.18-37.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 11 Issue Tracker 2007-08-17 07:26:04 UTC

Fujitsu tested with 5.1 beta, and it worked fine.
----------------------------------
We tested this issue with kernel-xen-2.6.18-37.el5,
the result is OK. We could boot 4 domains.


This event sent from IssueTracker by mmatsuya 
 issue 128803

Comment 14 errata-xmlrpc 2007-11-07 19:45:39 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html
