Bug 234375 - PV guests can crash at boot w/ >4GB memory
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.0
Hardware: All Linux
Priority: high
Severity: medium
Assigned To: Chris Lalancette
QA Contact: Martin Jenner
Keywords: OtherQA, Regression
Reported: 2007-03-28 15:03 EDT by Stephen Tweedie
Modified: 2010-10-22 10:05 EDT
Fixed In Version: RHBA-2007-0959
Doc Type: Bug Fix
Last Closed: 2007-11-07 14:45:39 EST


Description Stephen Tweedie 2007-03-28 15:03:25 EDT
XenSource reports:

"We're seeing some issues with the RHEL5 32b xen kernel that are leading
to frequent XenRT failures. Guests crash during boot when the host has
>4GB of RAM with alarmingly high probability.

The problem is that trying to clear a pte by setting it to PFN 0 can
potentially cause the entry to be temporarily invalid since it writes
the upper word first (i.e. the PTE remains present). It is also not correct
to launder the 0 through p2m which set_pte will do. The combination of
these causes a crash when swapper_pg_dir and PFN 0 have MFNs on opposite
sides of the 4G boundary."
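
To make the failure mode above concrete, here is a minimal C sketch of a 64-bit PAE PTE being updated as two 32-bit words. It is not the kernel's set_pte code; the type pae_pte, the helper names, and the example values are all hypothetical, chosen only to show that writing the upper word first leaves an intermediate entry that is still present but names a different frame.

```c
#include <stdint.h>

#define PTE_PRESENT 0x1ull

/* A PAE PTE is 64 bits wide, but a 32-bit guest writes it as two words. */
typedef struct {
    uint32_t low;   /* flags plus the low frame-number bits */
    uint32_t high;  /* the high frame-number bits (frames above 4GB) */
} pae_pte;

/* Combine both halves into the value a validator would see. */
static uint64_t pte_value(const pae_pte *p)
{
    return ((uint64_t)p->high << 32) | p->low;
}

/*
 * Problematic order: the upper word is written first, so between the two
 * stores the entry holds the NEW upper word and the OLD lower word.
 * Returns that intermediate ("torn") value.
 */
static uint64_t set_pte_high_first(pae_pte *p, uint64_t newval)
{
    p->high = (uint32_t)(newval >> 32);   /* store 1: upper word */
    uint64_t torn = pte_value(p);         /* old low word, still present */
    p->low = (uint32_t)newval;            /* store 2: lower word */
    return torn;
}

/*
 * Safer order for clearing: drop the lower word (and with it the present
 * bit) first, then the upper word. Returns the intermediate value.
 */
static uint64_t clear_pte_low_first(pae_pte *p)
{
    p->low = 0;                           /* store 1: entry is now not present */
    uint64_t mid = pte_value(p);
    p->high = 0;                          /* store 2: upper word */
    return mid;
}
```

With a hypothetical old entry of 0x0000000012345025 (a present mapping below 4GB) and a hypothetical new value of 0x0000000100000000 (frame bits above 4GB, not present), set_pte_high_first passes through 0x0000000112345025: a present entry for a frame the guest never mapped. clear_pte_low_first never exposes a present intermediate value.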

Version-Release number of selected component (if applicable):
RHEL-5.0.0

Steps to Reproduce:

"while true; do xm reboot <name>; done" on a machine with plenty of RAM
will result in a guest crash fairly soon. Of course, you'll need to set
preserve on crash to easily see that it has happened.

Additional info:

Should be fixed by xen-unstable cset 12381:

http://xenbits.xensource.com/xen-unstable.hg?cs=c6efd6c2feaa
Comment 2 RHEL Product and Program Management 2007-04-25 16:17:45 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 4 Ian Campbell 2007-07-03 04:16:21 EDT
It turns out that there has been a hypervisor-side workaround for this issue for
a while now, but it was broken in xen-unstable.hg between
13392:0fd65225e4c6 (17 Jan 2007) and 15433:a5360bf18668 (28 June 2007).
The workaround is in xen/arch/x86/mm.c with the comment (line 3297 in current
xen-unstable):
        /*
         * If this is an upper-half write to a PAE PTE then we assume that
         * the guest has simply got the two writes the wrong way round. We
         * zap the PRESENT bit on the assumption that the bottom half will
         * be written immediately after we return to the guest.
         */

I suspect that the RHEL5 hypervisor has the workaround but doesn't have the
breakage, in which case this can be closed.
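
The behaviour that source comment describes can be sketched in C as follows. This is not the code from xen/arch/x86/mm.c; the function emulate_pte_write, the callback type frame_ok_fn, and the ownership policy owns_low_frames_only are hypothetical stand-ins, assumed here only to show the idea: when a 4-byte write to the upper half of a PTE would produce a present entry that fails validation, clear PRESENT instead of rejecting the write.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT    0x1ull
#define PTE_FRAME_MASK 0x000ffffffffff000ull

/* Hypothetical stand-in for the hypervisor's "may this guest map it?" check. */
typedef bool (*frame_ok_fn)(uint64_t pte);

/* Hypothetical policy for the example: the guest owns only frames below 4GB. */
static bool owns_low_frames_only(uint64_t pte)
{
    return (pte & PTE_FRAME_MASK) < (1ull << 32);
}

/*
 * Emulate a 32-bit write of `val` at byte `offset` (0 or 4) into *pte.
 * Returns false if the write is rejected, leaving *pte unchanged.
 */
static bool emulate_pte_write(uint64_t *pte, unsigned offset, uint32_t val,
                              frame_ok_fn frame_ok)
{
    bool upper_half = (offset & 4) != 0;
    uint64_t new_pte;

    if (upper_half)
        new_pte = (*pte & 0xffffffffull) | ((uint64_t)val << 32);
    else
        new_pte = (*pte & ~0xffffffffull) | val;

    if ((new_pte & PTE_PRESENT) && !frame_ok(new_pte)) {
        if (!upper_half)
            return false;            /* genuinely bad mapping: reject */
        /*
         * Upper-half write: assume the guest wrote the halves in the wrong
         * order and will write the lower half next, so zap PRESENT.
         */
        new_pte &= ~PTE_PRESENT;
    }

    *pte = new_pte;
    return true;
}
```

Under these assumptions, an upper-half write that turns 0x0000000012345025 into a present entry above 4GB is accepted with PRESENT cleared, while the same bad value arriving through a lower-half write is still rejected.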
Comment 8 Red Hat Bugzilla 2007-07-24 20:44:41 EDT
change QA contact
Comment 9 Chris Lalancette 2007-07-25 14:03:15 EDT
Crap.  I finally was able to reproduce this, just not in the way specified
originally.  I have a 2 CPU Intel SDV here, running the RHEL 5.1 dom0 bits,
i386.  Then I have 1 RHEL-5.0 i386 PV guest running an FTP test that is
saturating the networking.  Finally, I have a 2nd RHEL-5.0 i386 PV guest just
rebooting in a loop (init 6 in /etc/rc.local).  Very often, that 2nd guest will
fail to boot, with this in xm dmesg:

(XEN) mm.c:3267:d6 ptwr_emulate: could not get_page_from_l1e()

After applying c/s 15433 to the HV from xen-unstable, however, I now see this:

(XEN) mm.c:3263:d14 ptwr_emulate: fixing up invalid PAE PTE 0000000149f12025

and the domain successfully reboots.  I believe we are going to need that c/s
for our HV, to support older 5.0 guests.

Chris Lalancette
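
For reference, the PTE value in the "fixing up" message can be decoded with a few lines of C. This is only a sketch using the standard x86 PAE entry layout (flags in bits 0-11, frame number from bit 12 up); the macro and function names are hypothetical and not taken from the Xen source.

```c
#include <stdint.h>

#define PTE_PRESENT    (1ull << 0)
#define PTE_USER       (1ull << 2)
#define PTE_ACCESSED   (1ull << 5)
#define PTE_FRAME_MASK 0x000ffffffffff000ull

/* Frame number held in the entry. */
static uint64_t pte_frame(uint64_t pte)
{
    return (pte & PTE_FRAME_MASK) >> 12;
}

/* The upper 32-bit word, i.e. the half that is written separately. */
static uint32_t pte_high_word(uint64_t pte)
{
    return (uint32_t)(pte >> 32);
}

/* The low 12 flag bits. */
static uint32_t pte_flags(uint64_t pte)
{
    return (uint32_t)(pte & 0xfffull);
}

/* Nonzero if the entry's frame lies at or above the 4GB boundary. */
static int pte_above_4g(uint64_t pte)
{
    return (pte & PTE_FRAME_MASK) >= (1ull << 32);
}
```

Applied to 0000000149f12025, this gives frame 0x149f12, an upper word of 1, and flags 0x025 (present, user, accessed), so the logged entry refers to a frame above the 4GB boundary.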
Comment 10 Don Zickus 2007-07-31 17:14:50 EDT
in 2.6.18-37.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 11 Issue Tracker 2007-08-17 03:26:04 EDT
Fujitsu tested with 5.1 beta, and it worked fine.
----------------------------------
We tested this issue with kernel-xen-2.6.18-37.el5,
the result is OK. We could boot 4 domains.


This event sent from IssueTracker by mmatsuya 
 issue 128803
Comment 14 errata-xmlrpc 2007-11-07 14:45:39 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html
