Bug 234375 - PV guests can crash at boot w/ >4GB memory
Summary: PV guests can crash at boot w/ >4GB memory
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.0
Hardware: All
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Chris Lalancette
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2007-03-28 19:03 UTC by Stephen Tweedie
Modified: 2018-10-19 23:13 UTC

Fixed In Version: RHBA-2007-0959
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-11-07 19:45:39 UTC
Target Upstream Version:
Embargoed:


Links
System: Red Hat Product Errata
ID: RHBA-2007:0959
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Updated kernel packages for Red Hat Enterprise Linux 5 Update 1
Last Updated: 2007-11-08 00:47:37 UTC

Description Stephen Tweedie 2007-03-28 19:03:25 UTC
XenSource reports:

"We're seeing some issues with the RHEL5 32b xen kernel that are leading
to frequent XenRT failures. Guests crash during boot when the host has
>4GB of RAM with alarmingly high probability.

The problem is that trying to clear a pte by setting it to PFN 0 can
potentially cause the entry to be temporarily invalid, since the upper
word is written first (i.e. the PTE remains present). It is also not correct
to launder the 0 through p2m, which set_pte will do. The combination of
these causes a crash when swapper_pg_dir and PFN 0 have MFNs on opposite
sides of the 4G boundary."
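
To illustrate the ordering problem described above, here is a minimal sketch in C that models a PAE PTE as two 32-bit words and records the value the entry holds between the two stores. The struct, the function names, and the sample PTE value are hypothetical and are not taken from the RHEL or Xen sources; a real kernel would also need volatile stores and a write barrier between the halves.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PRESENT 0x1u  /* bit 0 of the low word on x86 */

/* A PAE PTE is 64 bits wide, but a 32-bit guest updates it
 * with two separate 32-bit stores. */
struct pae_pte {
    uint32_t low;   /* flags plus frame address bits 12..31 */
    uint32_t high;  /* frame address bits 32 and up */
};

static uint64_t pte_value(const struct pae_pte *p)
{
    assert(p);
    return ((uint64_t)p->high << 32) | p->low;
}

/* Problematic order: the upper word is cleared first, so between the two
 * stores the entry is still present but names a different frame.
 * Returns the intermediate value that the hypervisor could observe. */
static uint64_t clear_high_first(struct pae_pte *p)
{
    p->high = 0;
    uint64_t mid = pte_value(p);
    p->low = 0;
    return mid;
}

/* Safe order: the low word (which holds the present bit) is cleared first,
 * so the intermediate entry is not present and is never validated. */
static uint64_t clear_low_first(struct pae_pte *p)
{
    p->low = 0;
    uint64_t mid = pte_value(p);
    p->high = 0;
    return mid;
}
```

With a made-up starting entry of 0x0000000123456067 (frame above 4G, present), clear_high_first() passes through 0x0000000023456067, which is still present but refers to a frame below 4G, whereas clear_low_first() passes through 0x0000000100000000, which is not present.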

Version-Release number of selected component (if applicable):
RHEL-5.0.0

Steps to Reproduce:

"while true; do xm reboot <name>; done" on a machine with plenty of RAM
will result in a guest crash fairly soon. Of course, you'll need to set
on_crash to preserve to easily see that it has happened.

Additional info:

Should be fixed by xen-unstable cset 12381:

http://xenbits.xensource.com/xen-unstable.hg?cs=c6efd6c2feaa

Comment 2 RHEL Program Management 2007-04-25 20:17:45 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 4 Ian Campbell 2007-07-03 08:16:21 UTC
It turns out that there has been a hypervisor-side workaround for this issue for
a while now, but it was broken in xen-unstable.hg between
13392:0fd65225e4c6 (17 Jan 2007) and 15433:a5360bf18668 (28 June 2007).
The workaround is in xen/arch/x86/mm.c with the comment (line 3297 in current
xen-unstable):
        /*
         * If this is an upper-half write to a PAE PTE then we assume that
         * the guest has simply got the two writes the wrong way round. We
         * zap the PRESENT bit on the assumption that the bottom half will
         * be written immediately after we return to the guest.
         */

I suspect that the RHEL5 hypervisor has the workaround but doesn't have the
breakage, in which case this can be closed.
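
The behaviour that comment describes can be sketched roughly as follows. This is a simplified illustration in C, not the actual code from xen/arch/x86/mm.c: emulate_pte_store, frame_check_fn, and demo_frame_ok are hypothetical names, and the ownership check is a stand-in that, for demonstration only, treats frames below 4G as not belonging to the guest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT 0x1ull

/* Hypothetical stand-in for the hypervisor's check that the guest
 * is allowed to map the frame named by this PTE. */
typedef bool (*frame_check_fn)(uint64_t pte);

/* Demo policy only (an assumption): frames at or above 4G are "owned". */
static bool demo_frame_ok(uint64_t pte)
{
    return (pte >> 32) != 0;
}

/* Emulate a 32-bit guest store of val into one half of a 64-bit PAE PTE.
 * offset is 0 for the low half and 4 for the high half.
 * Returns true and writes the resulting entry to *out if the store is
 * accepted; returns false and leaves *out untouched otherwise. */
static bool emulate_pte_store(uint64_t old, unsigned offset, uint32_t val,
                              frame_check_fn ok, uint64_t *out)
{
    uint64_t new_pte;

    assert(offset == 0 || offset == 4);
    if (offset == 4)
        new_pte = (old & 0xffffffffull) | ((uint64_t)val << 32);
    else
        new_pte = (old & ~0xffffffffull) | val;

    if ((new_pte & PTE_PRESENT) && !ok(new_pte)) {
        if (offset != 4)
            return false;  /* low-half write of a bad entry: reject */
        /* Upper-half write: assume the guest wrote the halves in the
         * wrong order and the low half follows, so zap PRESENT. */
        new_pte &= ~PTE_PRESENT;
    }
    *out = new_pte;
    return true;
}
```

Under these assumptions, a guest that clears the high half of 0x0000000123456067 first ends up with 0x0000000023456066 (present bit zapped) instead of having the write refused, and its following low-half store of 0 then completes the clear normally.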

Comment 8 Red Hat Bugzilla 2007-07-25 00:44:41 UTC
change QA contact

Comment 9 Chris Lalancette 2007-07-25 18:03:15 UTC
Crap.  I finally was able to reproduce this, just not in the way specified
originally.  I have a 2 CPU Intel SDV here, running the RHEL 5.1 dom0 bits,
i386.  Then I have 1 RHEL-5.0 i386 PV guest running an FTP test that is
saturating the networking.  Finally, I have a 2nd RHEL-5.0 i386 PV guest just
rebooting in a loop (init 6 in /etc/rc.local).  Very often, that 2nd guest will
fail to boot, with this in xm dmesg:

(XEN) mm.c:3267:d6 ptwr_emulate: could not get_page_from_l1e()

After applying c/s 15433 to the HV from xen-unstable, however, I now see this:

(XEN) mm.c:3263:d14 ptwr_emulate: fixing up invalid PAE PTE 0000000149f12025

and the domain successfully reboots.  I believe we are going to need that c/s
for our HV, to support older 5.0 guests.

Chris Lalancette

Comment 10 Don Zickus 2007-07-31 21:14:50 UTC
in 2.6.18-37.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 11 Issue Tracker 2007-08-17 07:26:04 UTC
Fujitsu tested with 5.1 beta, and it worked fine.
----------------------------------
We tested this issue with kernel-xen-2.6.18-37.el5,
the result is OK. We could boot 4 domains.


This event sent from IssueTracker by mmatsuya 
 issue 128803

Comment 14 errata-xmlrpc 2007-11-07 19:45:39 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html


