The kernel needs to ensure that not-present PTEs contain a PFN and not an MFN. This is because the suspend/resume code does not canonicalize not-present PTEs, since they may contain values which are neither PFNs nor MFNs.

This was observed with 2.6.18-8.1.10.el5, but I think it might apply to the rhel4u5 Xen kernel as well.

The problem was solved upstream with:
http://xenbits.xensource.com/xen-unstable.hg?rev/d2dff286994d
http://xenbits.xensource.com/kernels/rhel4x.hg?rev/4fd6832bb54f

Reproducible by running the attached main.c (gcc -O2 main.c) over a save/restore iteration.
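To illustrate the PFN-versus-MFN point above, here is a toy C model. It is a sketch under stated assumptions, not the kernel or upstream code: every name (toy_pte_t, toy_p2m, toy_mk_pte, toy_pte_pfn, toy_restore_remap), the bit layout, and the table contents are hypothetical. It only shows why a not-present PTE that stores a PFN survives a change of machine frames without being canonicalized.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only: names, bit layout and table contents are hypothetical,
 * not the real Xen or Linux definitions. */
#define TOY_PAGE_PRESENT 0x1ULL
#define TOY_FRAME_SHIFT  12
#define TOY_NR_FRAMES    4

typedef uint64_t toy_pte_t;

/* Pseudo-physical (PFN) to machine (MFN) frame map for the guest. */
static uint64_t toy_p2m[TOY_NR_FRAMES] = { 40, 41, 42, 43 };

/* Reverse lookup; returns (uint64_t)-1 if the MFN is not ours. */
static uint64_t toy_mfn_to_pfn(uint64_t mfn)
{
    for (uint64_t pfn = 0; pfn < TOY_NR_FRAMES; pfn++)
        if (toy_p2m[pfn] == mfn)
            return pfn;
    return (uint64_t)-1;
}

/* Present PTEs carry an MFN; not-present PTEs keep the PFN, so nothing
 * in them depends on which machine frames back the guest. */
static toy_pte_t toy_mk_pte(uint64_t pfn, uint64_t flags)
{
    assert(pfn < TOY_NR_FRAMES);
    uint64_t frame = (flags & TOY_PAGE_PRESENT) ? toy_p2m[pfn] : pfn;
    return (frame << TOY_FRAME_SHIFT) | flags;
}

/* Recover the PFN: translate only when the present bit is set. */
static uint64_t toy_pte_pfn(toy_pte_t pte)
{
    uint64_t frame = pte >> TOY_FRAME_SHIFT;
    return (pte & TOY_PAGE_PRESENT) ? toy_mfn_to_pfn(frame) : frame;
}

/* Stand-in for a restore: the guest lands on different machine frames.
 * Not-present PTEs are deliberately left untouched here. */
static void toy_restore_remap(void)
{
    for (uint64_t pfn = 0; pfn < TOY_NR_FRAMES; pfn++)
        toy_p2m[pfn] += 100;
}
```

In this model, a PTE built with toy_mk_pte(2, 0) still yields PFN 2 from toy_pte_pfn() after toy_restore_remap(), whereas a not-present PTE holding the old MFN 42 would no longer map back to any frame the guest owns.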
Created attachment 198411 [details] test case to reproduce issue
Created attachment 288051 [details] Patch to fix protnone problem during PV save/restore

OK, I was able to easily reproduce the problem, and come up with a backport. This backport is really a combination of 3 changesets from upstream Xen-unstable: 12402, 13998, 14006. It's a pretty straightforward backport; I only had to remove one chunk (that removed a line we didn't have to begin with), and to adjust line numbers appropriately.

Before the patch, running the test program in a loop would end up in a reliable oops when doing a save/restore cycle on the domain. I'm still testing, but at least on i686 I've gone 20 save/restore cycles already without oopsing.

Chris Lalancette
Created attachment 288751 [details] xen-unstable 12402:ade94aa072c5 ported to 2.6.9-67.EL
Created attachment 288761 [details] xen-unstable 12545:50467f56ed65 ported to 2.6.9-67.EL
Created attachment 288771 [details] xen-unstable 13998:d2dff286994d ported to 2.6.9-67.EL
Created attachment 288781 [details] xen-unstable 12402:ade94aa072c5 ported to 2.6.18-53.el5
Created attachment 288791 [details] xen-unstable 12545:50467f56ed65 ported to 2.6.18-53.el5
Created attachment 288801 [details] xen-unstable 13998:d2dff286994d ported to 2.6.18-53.el5
We recently stopped using the rhel4x.hg port from xenbits and switched to using a set of targeted fixes to your kernels. I have attached the patches from our queue relevant to this issue. I'm not sure why it never occurred to me to attach our rhel5x version of the patches here (it seems a pretty obvious thing to do now that I think about it!). I see you've got a patch of your own now, but I've attached our versions in case they are of use.
in 2.6.18-66.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5
Is there a fix for this in the PAE kernel as well? I am seeing the problem in 2.6.18-53PAE. Thanks
No, that doesn't make sense. This is specifically for a RHEL-5 Xen PV kernel, not for a bare-metal kernel. You seem to be having a different issue; please open a different BZ about it.

Chris Lalancette
Don, do your more recent kernel builds still contain the fix? I'm seeing this bug and would be happy to test, just wanted to make sure they're patched appropriately.
Yep, these fixes should be in later 5.2 kernel builds. Note that there were additional patches that went into -89 to fix up a regression caused by this patchset, so you'll want to test later than that. Any testing is welcome!

Thanks,
Chris Lalancette
Tested with -91, no failures. This appears to fix the crash.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0314.html