Red Hat Bugzilla – Bug 234375
PV guests can crash at boot w/ >4GB memory
Last modified: 2010-10-22 10:05:47 EDT
"We're seeing some issues with the RHEL5 32b xen kernel that are leading
to frequent XenRT failures. Guests crash during boot when the host has
>4GB of RAM with alarmingly high probability.
The problem is that trying to clear a pte by setting it to PFN 0 can
potentially cause the entry to be temporarily invalid since it writes
the upper word first (i.e. the PTE remains present). It is also not correct
to launder the 0 through p2m, which set_pte will do. The combination of
these causes a crash when swapper_pg_dir and PFN 0 have MFNs on opposite
sides of the 4G boundary."
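A minimal sketch of the write-ordering problem described above, assuming a simplified model of a PAE PTE as two 32-bit halves. The struct and function names are hypothetical and are not taken from the kernel source. The point is only that the value visible between the two writes differs with the order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTE_PRESENT 0x1u

/* Hypothetical model: a 64-bit PAE PTE stored as two 32-bit words. */
struct pae_pte {
    volatile uint32_t low;   /* flags plus frame address bits 12..31 */
    volatile uint32_t high;  /* frame address bits 32 and above */
};

/* Combine both halves into the 64-bit value the MMU would see. */
static uint64_t pte_value(const struct pae_pte *p)
{
    assert(p != NULL);
    return ((uint64_t)p->high << 32) | p->low;
}

/*
 * Problematic order: upper word first. Returns the intermediate value
 * visible between the two writes. The PRESENT bit is still set, but the
 * entry now names a different frame.
 */
static uint64_t clear_high_first(struct pae_pte *p)
{
    uint64_t mid;

    assert(p != NULL);
    p->high = 0;
    mid = pte_value(p);
    p->low = 0;
    return mid;
}

/*
 * Safe order: lower word first. The intermediate value is already
 * not-present, so its stale upper bits are never interpreted.
 */
static uint64_t clear_low_first(struct pae_pte *p)
{
    uint64_t mid;

    assert(p != NULL);
    p->low = 0;
    mid = pte_value(p);
    p->high = 0;
    return mid;
}
```

With a PTE whose frame lies above 4GB, clearing the upper word first leaves a present entry pointing below 4GB for the duration of the window, while clearing the lower word first leaves a not-present entry.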
Steps to Reproduce:
"while true; do xm reboot <name>; done" on a machine with plenty of RAM
will result in a guest crash fairly soon. Of course, you'll need to set
preserve on crash to easily see that it has happened.
Should be fixed by xen-unstable cset 12381:
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update release.
It turns out that there has been a hypervisor-side workaround for this issue for
a while now, but it was broken in xen-unstable.hg between
13392:0fd65225e4c6 (17 Jan 2007) and 15433:a5360bf18668 (28 June 2007).
The workaround is in xen/arch/x86/mm.c with the comment (line 3297 in current
* If this is an upper-half write to a PAE PTE then we assume that
* the guest has simply got the two writes the wrong way round. We
* zap the PRESENT bit on the assumption that the bottom half will
* be written immediately after we return to the guest.
I suspect that the RHEL5 hypervisor has the workaround but doesn't have the
breakage, in which case this can be closed.
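For illustration, the behaviour that comment describes can be sketched roughly as follows. This is a simplified model and not the code in xen/arch/x86/mm.c: the function names, the callback type and the toy ownership check are all assumptions made for the example.

```c
#include <stdint.h>

#define PTE_PRESENT 0x1ull

/* Hypothetical stand-in for the hypervisor's check that the guest may map
 * the frame named by a PTE. */
typedef int (*frame_check_fn)(uint64_t pte);

/* Toy check, assumed for this example only: the guest owns just those
 * frames whose address is at or above 4GB. */
static int frame_at_or_above_4g(uint64_t pte)
{
    return (pte >> 32) != 0;
}

/*
 * Model of emulating a 4-byte guest write to the upper half of an 8-byte
 * PAE PTE. If the combined entry is present but fails validation, assume
 * the guest wrote the halves in the wrong order and clear PRESENT, on the
 * expectation that the lower half will be written next.
 */
static uint64_t emulate_upper_half_write(uint64_t old_pte, uint32_t new_high,
                                         frame_check_fn frame_ok)
{
    uint64_t new_pte = ((uint64_t)new_high << 32) | (old_pte & 0xffffffffull);

    if ((new_pte & PTE_PRESENT) && !frame_ok(new_pte))
        new_pte &= ~PTE_PRESENT;  /* fix up the transiently invalid entry */

    return new_pte;
}
```

In this model a guest that zeroes the upper word of a present entry first gets back a not-present entry instead of a failed update, which is the effect the workaround is meant to have.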
Crap. I finally was able to reproduce this, just not in the way specified
originally. I have a 2 CPU Intel SDV here, running the RHEL 5.1 dom0 bits,
i386. Then I have 1 RHEL-5.0 i386 PV guest running an FTP test that is
saturating the networking. Finally, I have a 2nd RHEL-5.0 i386 PV guest just
rebooting in a loop (init 6 in /etc/rc.local). Very often, that 2nd guest will
fail to boot, with this in xm dmesg:
(XEN) mm.c:3267:d6 ptwr_emulate: could not get_page_from_l1e()
After applying c/s 15433 to the HV from xen-unstable, however, I now see this:
(XEN) mm.c:3263:d14 ptwr_emulate: fixing up invalid PAE PTE 0000000149f12025
and the domain successfully reboots. I believe we are going to need that c/s
for our HV, to support older 5.0 guests.
You can download this test kernel from http://people.redhat.com/dzickus/el5
Fujitsu tested with 5.1 beta, and it worked fine.
We tested this issue with kernel-xen-2.6.18-37.el5,
the result is OK. We could boot 4 domains.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.