XenSource reports:

"We're seeing some issues with the RHEL5 32b xen kernel that are leading to frequent XenRT failures. Guests crash during boot when the host has >4GB of RAM with alarmingly high probability. The problem is that trying to clear a pte by setting it to PFN 0 can potentially cause the entry to be temporarily invalid since it writes the upper word first (i.e. the PTE remains present). It is also not correct to launder the 0 through p2m, which set_pte will do. The combination of these causes a crash when swapper_pg_dir and PFN 0 have MFNs on opposite sides of the 4G boundary."

Version-Release number of selected component (if applicable):
RHEL-5.0.0

Steps to Reproduce:
"while true; do xm reboot <name>; done" on a machine with plenty of RAM will result in a guest crash fairly soon. Of course, you'll need to set preserve on crash to easily see that it's happened.

Additional info:
Should be fixed by xen-unstable cset 12381: http://xenbits.xensource.com/xen-unstable.hg?cs=c6efd6c2feaa
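To illustrate the ordering problem described in the report, here is a minimal user-space sketch in C. It is not the kernel code and not the content of cset 12381; the `pae_pte_t` type, the function names, and the sample values are all hypothetical. It models a 64-bit PAE PTE as two 32-bit words and returns what an observer (such as the hypervisor validating the write) would see between the two stores.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PRESENT 0x1u

/* Hypothetical model of a PAE PTE: two 32-bit words written separately.
 * volatile keeps the compiler from reordering or merging the stores;
 * real kernel code would also need a write barrier between them. */
typedef struct {
    volatile uint32_t low;   /* flag bits and low part of the frame number */
    volatile uint32_t high;  /* upper part of the frame number */
} pae_pte_t;

/* Problematic order: upper word first. Returns the value visible between
 * the two stores: still present, but now naming a different frame. */
uint64_t clear_high_first(pae_pte_t *pte)
{
    pte->high = 0;
    uint64_t seen = ((uint64_t)pte->high << 32) | pte->low;
    pte->low = 0;
    return seen;
}

/* Safer order: lower word first, so the PRESENT bit is already gone
 * by the time the upper word changes. */
uint64_t clear_low_first(pae_pte_t *pte)
{
    pte->low = 0;
    uint64_t seen = ((uint64_t)pte->high << 32) | pte->low;
    pte->high = 0;
    return seen;
}
```

With an entry whose frame sits above 4G (non-zero upper word), `clear_high_first` exposes an intermediate entry that is present but points below 4G, which is the kind of temporarily invalid entry the report describes. `clear_low_first` only ever exposes a not-present entry.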
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
It turns out that there has been a hypervisor side workaround for this issue for a while now but it was broken in xen-unstable.hg between 13392:0fd65225e4c6 (17 Jan 2007) and 15433:a5360bf18668 (28 June 2007).

The workaround is in xen/arch/x86/mm.c with the comment (line 3297 in current xen-unstable):

    /*
     * If this is an upper-half write to a PAE PTE then we assume that
     * the guest has simply got the two writes the wrong way round. We
     * zap the PRESENT bit on the assumption that the bottom half will
     * be written immediately after we return to the guest.
     */

I suspect that the RHEL5 hypervisor has the workaround but doesn't have the breakage, in which case this can be closed.
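The logic that comment describes can be sketched roughly as follows. This is a simplified illustration, not the code from xen/arch/x86/mm.c: `emulate_pte_write`, `toy_frame_ok`, and the return convention are hypothetical, and the validity check is a toy stand-in for the hypervisor's real page ownership check.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT 0x1ULL

/* Toy validity check (an assumption for this sketch): pretend the guest
 * only owns frames below 4G, so any non-zero upper word is invalid. */
bool toy_frame_ok(uint64_t pte)
{
    return (pte >> 32) == 0;
}

/* Emulate a 32-bit guest write of val at byte offset 0 (lower half) or
 * 4 (upper half) of a 64-bit PAE PTE whose current value is old.
 * Returns 0 and stores the resulting PTE in *out, or -1 if rejected. */
int emulate_pte_write(uint64_t old, uint32_t val, unsigned offset,
                      uint64_t *out)
{
    uint64_t nw;

    if (offset == 4)
        nw = (old & 0xffffffffULL) | ((uint64_t)val << 32);
    else
        nw = (old & ~0xffffffffULL) | val;

    if ((nw & PTE_PRESENT) && !toy_frame_ok(nw)) {
        if (offset != 4)
            return -1;          /* invalid lower-half write: reject it */
        /* Upper-half write: assume the guest did the two writes the
         * wrong way round and zap PRESENT until the lower half lands. */
        nw &= ~PTE_PRESENT;
    }

    *out = nw;
    return 0;
}
```

In this model an upper-half write that would leave a present but invalid entry is accepted with the PRESENT bit cleared, while the same situation on a lower-half write is still refused.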
change QA contact
Crap. I finally was able to reproduce this, just not in the way specified originally. I have a 2 CPU Intel SDV here, running the RHEL 5.1 dom0 bits, i386. Then I have 1 RHEL-5.0 i386 PV guest running an FTP test that is saturating the networking. Finally, I have a 2nd RHEL-5.0 i386 PV guest just rebooting in a loop (init 6 in /etc/rc.local). Very often, that 2nd guest will fail to boot, with this in xm dmesg:

(XEN) mm.c:3267:d6 ptwr_emulate: could not get_page_from_l1e()

After applying c/s 15433 to the HV from xen-unstable, however, I now see this:

(XEN) mm.c:3263:d14 ptwr_emulate: fixing up invalid PAE PTE 0000000149f12025

and the domain successfully reboots. I believe we are going to need that c/s for our HV, to support older 5.0 guests.

Chris Lalancette
in 2.6.18-37.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5
Fujitsu tested with 5.1 beta, and it worked fine.

----------------------------------
We tested this issue with kernel-xen-2.6.18-37.el5, the result is OK. We could boot 4 domains.

This event sent from IssueTracker by mmatsuya
issue 128803
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0959.html