Bug 503737 - [RHEL5.4 Xen]: Trying to boot a FV -PAE kernel crashes
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.4
Hardware: All Linux
Priority: high Severity: high
Target Milestone: rc
Target Release: 5.4
Assigned To: Bhavna Sarathy
QA Contact: Red Hat Kernel QE team
Keywords: Regression
Depends On:
Blocks: 438405
 
Reported: 2009-06-02 09:20 EDT by Chris Lalancette
Modified: 2009-09-02 04:18 EDT (History)
Doc Type: Bug Fix
Last Closed: 2009-09-02 04:18:09 EDT

Attachments
Debugging patch that reverts the 2MB page stuff (13.63 KB, patch)
2009-06-10 03:42 EDT, Chris Lalancette
Disable 2MB support for PAE (909 bytes, patch)
2009-06-17 22:00 EDT, Bhavna Sarathy

Description Chris Lalancette 2009-06-02 09:20:24 EDT
Description of problem:
I was testing out an i686 dom0, with kernel-xen 2.6.18-151.el5 and xen-3.0.3-86.el5.  When trying to boot a RHEL-5.4 FV i686 guest with the -PAE kernel, the guest kept hanging.  I determined that the guest was hanging because the QEMU device model was crashing.  The last thing I see in the qemu-dm logs is:

qemu: hardware error: Invalid ioreq type 0xff

This is not *always* reproducible; sometimes it works, and sometimes it doesn't.  Also, reverting back to a 5.3 kernel and userland showed about the same frequency of failure, so I don't think this is technically a regression.

Jirka also tested this out and found that a 4.6 -smp kernel could also show this behavior, so it's not specific to the RHEL-5 -PAE kernel.
Comment 1 Michal Novotny 2009-06-02 09:32:26 EDT
I've been investigating this and discussed it with Jirka. The code here calls xc_evtchn_bind_interdomain() with the ioreq shared page as an argument, and that page gets filled in by the function's ioctl call. It appears that the kernel somehow sets shared_page->vcpu_iodata[cpu].vp_ioreq to type 0xff, a value the qemu-dm process does not know, so this could be a kernel-side issue where the shared page for this vcpu ends up with an invalid ioreq type.

The call in question issues the IOCTL_EVTCHN_BIND_INTERDOMAIN ioctl, so I am not sure this is a user-side problem.
Comment 2 Jiri Denemark 2009-06-03 03:54:20 EDT
Just a note: on my machine, once I stop seeing that error, rebooting the host helps and I start seeing it again.
Comment 3 Michal Novotny 2009-06-03 07:37:46 EDT
OK, thanks. The investigation makes me think that kernel-xen somehow sets shared_page->vcpu_iodata[cpu].vp_ioreq->type to 0xff through the ioctl() issued inside xc_evtchn_bind_interdomain(), which is called from the qemu-dm process itself. Kernel-xen gets into some strange state in which vp_ioreq->type is set to 0xff, and that makes qemu-dm call hw_error() and stop with this error message. The problem is that we can't know which ioreq operation should have been there instead; it looks as if the type was accidentally set to 0xff rather than to a proper value.
Comment 4 Chris Lalancette 2009-06-09 07:44:09 EDT
OK.  I did some poking around here.  The "hardware error" reported by qemu comes from tools/ioemu/target-i386-dm/helper2.c:__handle_ioreq().  The call chain to get there is actually:

cpu_handle_ioreq() -> handle_buffered_io() -> __handle_buffered_iopage() -> __handle_ioreq()

We get to that hardware error iff the type of the request is invalid (in this case, the type is 0xff).  Looking up the call chain, the req variable is figured out in __handle_buffered_iopage().  Basically, the buffered_io_page is a page shared between the hypervisor and the qemu-dm process, and is mapped into the qemu-dm process on first start.  It acts as a ring; the hypervisor places data into it, updates the ->write_pointer, and qemu-dm pulls data out and updates the ->read_pointer accordingly.
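
The ring behaviour described above can be sketched roughly as follows. This is a simplified model, not the qemu-dm or Xen code: the struct, the function names and the two "valid" type constants are invented for illustration, and the slot count of 80 is only inferred from the debug output later in this comment.

```c
#include <stdint.h>

/* Hypothetical, simplified model of the buffered ioreq ring. The slot
 * count of 80 is inferred from the debug output in this report; the
 * type, field and constant names are illustrative assumptions. */
#define SLOT_COUNT 80u
#define TYPE_PIO   0u   /* assumed valid request types for this sketch */
#define TYPE_COPY  1u

struct sketch_req { uint8_t type; uint32_t data; };

struct sketch_ring {
    uint32_t read_pointer;   /* advanced by the consumer (qemu-dm) */
    uint32_t write_pointer;  /* advanced by the producer (hypervisor) */
    struct sketch_req slot[SLOT_COUNT];
};

/* Producer side: store a request in the next slot, then publish it. */
static void ring_put(struct sketch_ring *r, struct sketch_req req)
{
    r->slot[r->write_pointer % SLOT_COUNT] = req;
    r->write_pointer++;
}

/* Consumer side: drain until the pointers meet.
 * Returns the number of requests handled, or -1 when a request has an
 * unknown type (the situation in which qemu-dm reports the hardware error). */
static int ring_drain(struct sketch_ring *r)
{
    int handled = 0;
    while (r->read_pointer != r->write_pointer) {
        const struct sketch_req *req = &r->slot[r->read_pointer % SLOT_COUNT];
        if (req->type != TYPE_PIO && req->type != TYPE_COPY)
            return -1;
        r->read_pointer++;
        handled++;
    }
    return handled;
}
```

In this model the consumer trusts both pointers completely, so anything that overwrites the shared page can steer it to a slot holding an arbitrary type byte.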

So, with that in mind, I added debugging into the qemu-dm device model.  What I'm seeing is that initially, the read_pointer and write_pointer (from the point of view of qemu-dm) all look sane, and things are proceeding.  But at some point, the read_pointer goes off its rocker:

read_pointer is 954, write_pointer is 955, req slot is 74
read_pointer is 955, write_pointer is 956, req slot is 75
read_pointer is 956, write_pointer is 957, req slot is 76
read_pointer is 957, write_pointer is 958, req slot is 77
read_pointer is 958, write_pointer is 959, req slot is 78
read_pointer is 959, write_pointer is 960, req slot is 79
read_pointer is -2147482624, write_pointer is 1, req slot is 32
qemu: hardware error: Invalid ioreq type 0xff

And, of course, that's when things explode.
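
The numbers in that log are consistent with the slot being the pointer value, taken as unsigned, modulo 80. A small sketch of that arithmetic follows; the helper names are invented here, and the slot count is an inference from the log rather than something read from the headers.

```c
#include <stdint.h>

#define SLOT_COUNT 80u  /* inferred from the log: pointer 959 maps to slot 79 */

/* Slot index for a pointer value as printed in the log. The log shows
 * the pointer as a signed number, so reinterpret it as unsigned first. */
static uint32_t slot_of(int32_t logged_pointer)
{
    return (uint32_t)logged_pointer % SLOT_COUNT;
}

/* Number of pending requests as a consumer comparing the two pointers
 * would see it (unsigned wrap-around subtraction). */
static uint32_t pending(int32_t logged_read, int32_t logged_write)
{
    return (uint32_t)logged_write - (uint32_t)logged_read;
}
```

With these helpers, slot_of(954) is 74 and slot_of(-2147482624) is 32, matching the log. Read as unsigned, the bad read_pointer is 0x80000400, which looks more like unrelated data written over the page than like a counter that advanced by one.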

Now, booting back to the -128 HV, with a -152 dom0 kernel, I can't seem to reproduce the problem anymore.  That would seem to suggest that something in the HV is tromping over the buffered_io_page.  I'm going to try to find out what's doing that.  However, just to be certain this isn't a "heisenbug", I'm also having Jirka try to reproduce the problem with the -128 HV.

Chris Lalancette
Comment 5 Chris Lalancette 2009-06-09 10:36:07 EDT
OK.  After quite a few iterations of testing and bisecting, I think I've narrowed this down to commit 7b81d2426a20a80658af6b2e91dd6913f2ac4af2 in the xen tree, which corresponds to:

AMD 2MB backing pages support

Which was added in the 2.6.18-111.el5 kernel.  Previous to 111, everything was fine, but in 111 and after, I can reproduce the issue.  I'm still reading through the patch to try to better understand where the problem might be.

Chris Lalancette
Comment 6 Chris Lalancette 2009-06-10 03:42:07 EDT
Created attachment 347166 [details]
Debugging patch that reverts the 2MB page stuff

The attached patch applies to the -152 hypervisor and is the minimal patch to back out the 2MB page functionality.  With this applied, I no longer see the corruption, but obviously we also don't have 1GB/2MB page support.  Bhavna, can you and AMD take a look and see if you can see any problems with this code?  I can reproduce the problem fairly reliably (not quite 100% of the time, but close enough that I can test fixes), so let me know if you have patches you want me to test.

Chris Lalancette
Comment 7 Bhavna Sarathy 2009-06-10 10:32:20 EDT
Chris, we are looking into this bug immediately.  What is the dom0 size?   Using dom0_mem (I think that is the option) can you shrink the dom0 size to see if that makes a difference?   

Bhavna
Comment 8 Chris Lalancette 2009-06-10 11:10:28 EDT
(In reply to comment #7)
> Chris, we are looking into this bug immediately.  What is the dom0 size?  

By size, I guess you mean memory size?  This is a barcelona machine (2x4 core) with 16GB of memory, running the i686 dom0.  By default, we give all memory to the dom0 kernel, and balloon it down, so dom0 to start with has the full amount (16GB of memory).

As far as the guest is concerned, it probably shouldn't matter, but my guest is configured to boot with 1GB of memory, and the -152PAE kernel.

> Using dom0_mem (I think that is the option) can you shrink the dom0 size to see
> if that makes a difference?

For this particular test, I went back to the -128 kernel in the dom0.  I first verified that booting *without* any dom0_mem parameter still reproduced the bug (which it did).  Then I booted the hypervisor with parameter "dom0_mem=1G", and tried the test again.  The guest booted.  Then I tried exactly the same thing 4 more times (for a total of 5 runs), and in all 5 runs, adding the dom0_mem=1G parameter worked around the bug.  That seems to point to a contributing factor here, which is good.  Jirka, if you have a few moments, could you repeat this experiment on your test rig, just to ensure I didn't get flukey results?

Thanks,
Chris Lalancette
Comment 9 Bhavna Sarathy 2009-06-10 12:25:15 EDT
Yeah, I meant memory size.  With the full memory of 16GB, you lose disk access, right?   Do you know what the cut off point is?  8GB, 4GB?   Also I notice that you are using 32-bit kernel and hypervisor.   Same issue with 64-bit?  

Bhavna
Comment 10 WEI HUANG 2009-06-10 18:18:17 EDT
According to Chris, the problem can be reproduced back to 2.6.18-111 el5 kernel. So I decided to use RHEL 5.3 to debug it. 

I just installed RHEL 5.3 32bit Xen on my box and ran a freshly-installed RHEL 5.3 PAE guest. Here is the configuration:

=======================
1. Host
* Xen hypervisor
xen_major              : 3
xen_minor              : 1
xen_extra              : .2-128.el5
xen_caps               : xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p 

* Dom0
2.6.18-128.el5xen #1 SMP Wed Dec 17 12:22:24 EST 2008 i686 athlon i386 GNU/Linux

2. Guest
RHEL 5.3 (2.6.18-128.el5PAE SMP i686 athlon i386)
=======================

The system had 8GB memory and I did not specify dom0_mem option in Xen kernel. On this platform, the guest can boot without any issue.  By comparing my configuration with failed ones, I think the major difference is the guest kernel and system memory sizes. 

Chris, could you verify my configuration and compare with yours? Also could you use older guest kernel to run test again? I don't have RHEL 5.4 kernel at hand for now. 

Thanks,
Comment 11 Jiri Denemark 2009-06-11 04:01:15 EDT
I have an Intel box (2x4 core) with 10GB of memory where I can reproduce this bug. It doesn't seem there is any magic cut off point as even dom0_mem=9600M, which results in 549MB of free memory reported by xm info, helps to work around this bug. Going above those 9600M makes it fail again.

Jirka
Comment 12 Chris Lalancette 2009-06-11 05:22:03 EDT
(In reply to comment #9)
> Yeah, I meant memory size.  With the full memory of 16GB, you lose disk access,
> right?   Do you know what the cut off point is?  8GB, 4GB?   Also I notice that
> you are using 32-bit kernel and hypervisor.   Same issue with 64-bit?  

Yes, that's correct.  I've only seen the issue with a 32-bit dom0, I have not seen it with a 64-bit dom0.  Given the nature of the problem (seeming data corruption in the HV), I would suspect the problem is a difference in data type sizes, but I'm not very familiar with this code.  I think Jiri's reply in Comment #11 answers the "cut-off" question; there doesn't seem to be one, or at least, we can't find it.

(In reply to comment #10)
> According to Chris, the problem can be reproduced back to 2.6.18-111 el5
> kernel. So I decided to use RHEL 5.3 to debug it. 

Right, I've now mostly been testing with 5.3.

> 
> I just installed RHEL 5.3 32bit Xen on my box and ran a freshly-installed RHEL
> 5.3 PAE guest. Here is the configuration:
> 
> =======================
> 1. Host
> * Xen hypervisor
> xen_major              : 3
> xen_minor              : 1
> xen_extra              : .2-128.el5
> xen_caps               : xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p 
> 
> * Dom0
> 2.6.18-128.el5xen #1 SMP Wed Dec 17 12:22:24 EST 2008 i686 athlon i386
> GNU/Linux
> 
> 2. Guest
> RHEL 5.3 (2.6.18-128.el5PAE SMP i686 athlon i386)
> =======================
> 
> The system had 8GB memory and I did not specify dom0_mem option in Xen kernel.
> On this platform, the guest can boot without any issue.  By comparing my
> configuration with failed ones, I think the major difference is the guest
> kernel and system memory sizes. 
> 
> Chris, could you verify my configuration and compare with yours? Also could you
> use older guest kernel to run test again? I don't have RHEL 5.4 kernel at hand
> for now. 

Right.  My test box is a barcelona with 16GB of memory.  I've gone back and tested with a -128 dom0 kernel and a -128 PAE FV guest kernel, and I could reproduce the problem.  Interestingly, I also can never reproduce the problem with a normal (non-PAE) kernel either.  It may just be how the HV allocates memory and lays it out, so that the corruption goes unnoticed in certain circumstances.  But it's all just speculation, since I don't really know for sure where the bug is.

Let me know if there are more tests or patches you want us to try out, I'm happy to do so.

Chris Lalancette
Comment 13 Bhavna Sarathy 2009-06-11 09:02:32 EDT
(In reply to comment #11)
> I have an Intel box (2x4 core) with 10GB of memory where I can reproduce this
> bug. It doesn't seem there is any magic cut off point as even dom0_mem=9600M,
> which results in 549MB of free memory reported by xm info, helps to work around
> this bug. Going above those 9600M makes it fail again.
> Jirka  

That's good info. Let's take note that the issue affects both Intel and AMD and is in a common code path.  At least we have a workaround of limiting dom0 memory.  I'm not sure whether 9600M is a hard limit or whether there is some play between assigned memory and free memory.

Wei, latest source is here for RHEL5.4 dev.
http://people.redhat.com/dzickus/el5/152.el5/src/

Bhavna
Comment 14 WEI HUANG 2009-06-12 11:21:59 EDT
I was out of office and wasn't able to look at it. 

With 16G memory installed on my system, I can reproduce this issue immediately: RHEL5.3 PAE guest failed, but 32bit seemed working fine. I am currently looking at it now.

-Wei
Comment 15 Chris Lalancette 2009-06-15 05:21:14 EDT
(In reply to comment #14)
> I was out of office and wasn't able to look at it. 
> 
> With 16G memory installed on my system, I can reproduce this issue immediately:
> RHEL5.3 PAE guest failed, but 32bit seemed working fine. I am currently looking
> at it now.

Excellent!  Let me know if you need any other information or testing help, I'm glad to provide it.

Chris Lalancette
Comment 16 WEI HUANG 2009-06-15 11:46:45 EDT
Looking at the HAP and 2MB code, I found the following snippet in hvm.c:

/*
 * Xen command-line option to allow/disallow hardware-assisted paging.
 * Since the phys-to-machine table of AMD NPT is in host format, 32-bit Xen
 * can only support guests using NPT with up to a 4GB memory map. Therefore
 * we disallow HAP by default on PAE Xen (by default we want to support an
 * 8GB pseudophysical memory map for HVM guests on a PAE host).
 */
static int opt_hap_permitted = (CONFIG_PAGING_LEVELS != 3);
boolean_param("hap", opt_hap_permitted);


In other words, HAP is disabled by default in PAE mode. Since 2MB pages were intended for HAP only, it is odd to see 2MB enabled while HAP is disabled.

I purposely enabled HAP in kernel Xen (hap=1) and re-ran the test. The PAE guest did boot. So I want to ask Red Hat to test this quickly. If hap=1 fixes the problem, that should be the direction for debugging.
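
To make the default explicit, here is a small sketch of how the initializer quoted from hvm.c evaluates and how a boot option would override it. The helper names are invented for illustration, and paging_levels stands in for CONFIG_PAGING_LEVELS; only the `!= 3` test comes from the snippet itself.

```c
#include <stdbool.h>

/* Mirrors the quoted initializer: paging_levels stands in for
 * CONFIG_PAGING_LEVELS, which is 3 on a PAE build and 4 on a 64-bit build. */
static bool hap_default(int paging_levels)
{
    return paging_levels != 3;
}

/* Hypothetical helper: an explicit boot option (for example hap=1),
 * when given, overrides the build-time default. */
static bool hap_effective(int paging_levels, bool option_given, bool option_value)
{
    return option_given ? option_value : hap_default(paging_levels);
}
```

So a PAE hypervisor booted without the option runs with HAP off, which is the configuration in which the failures were seen, while hap=1 switches it on.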

Thanks,

-Wei
Comment 17 Chris Lalancette 2009-06-16 04:46:37 EDT
Yes, that also seems to fix the problem in my testing.  If I remember correctly, we disable HAP by default for 32-bit because you can't create FV guests > 4GB with it on, due to a silicon limitation.  Since you are saying that 2MB pages are only intended to work with HAP, then that is probably an untested code path.

Chris Lalancette
Comment 19 Bhavna Sarathy 2009-06-17 22:00:56 EDT
Created attachment 348369 [details]
Disable 2MB support for PAE
Comment 20 Bhavna Sarathy 2009-06-17 22:05:05 EDT
Can you review the patch and let me know if this is acceptable and works for you?
If this works, then I'd like to request the appropriate BZ flags be set.   I can then submit the patch tomorrow morning.

The following test matrix has been tested.

Guests:
     Windows XP 32bit 1G, 6G (guest will see 4G only)
     RHEL 5.3 32bit 1G, 6G (guest will see 4G only)
     RHEL 5.3 PAE 1G, 6G
     RHEL 5.4 64bit 1G, 6G (won't work on kernel Xen PAE)
Kernel Xen PAE
    - hap=0, hap_1gb=0
    - hap=1, hap_1gb=0
    - hap=0, hap_1gb=1
    - hap=1, hap_1gb=1
Kernel Xen 64bit
    - hap=0, hap_1gb=0
    - hap=1, hap_1gb=0
    - hap=0, hap_1gb=1
    - hap=1, hap_1gb=1
Comment 21 Chris Lalancette 2009-06-18 05:57:57 EDT
Bhavna,
     I will indeed give this patch a try.  It looks like this also should fix the hap=0 case on x86_64, which was another case that I was worried about.  I'll report back my testing results soon.  Are you planning to push a patch similar to this to upstream Xen?

Chris Lalancette
Comment 22 Chris Lalancette 2009-06-18 07:10:17 EDT
Bhavna,
    Oh, one other question.  Do you mind if I run this by Intel quickly, just to make sure they don't object?

Chris Lalancette
Comment 23 Bhavna Sarathy 2009-06-18 09:30:15 EDT
(In reply to comment #22)
> Bhavna,
>     Oh, one other question.  Do you mind if I run this by Intel quickly, just
> to make sure they don't object?
> Chris Lalancette  

No problem, go ahead.  They'll get to see it when the patch is posted on RHML, but might as well get the agreement now.   As long as no silicon issues are revealed. :)

Bhavna
Comment 24 Bhavna Sarathy 2009-06-18 09:34:47 EDT
"disable 2MB support when hap is turned off"
I have a correction on the description, sorry about that.
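
As a rough illustration of what "disable 2MB support when hap is turned off" means in practice, a check of the following shape would do it. This is only a sketch: the attached patch is not reproduced in this report, and the function name, parameters and constants below are hypothetical.

```c
#include <stdbool.h>

#define ORDER_4K 0u   /* a single 4KB page */
#define ORDER_2M 9u   /* 512 contiguous 4KB pages, i.e. 2MB */

/* Hypothetical helper: choose the allocation order for the next chunk of
 * guest memory. 2MB backing pages are only attempted when HAP is enabled
 * and at least 2MB worth of 4KB pages is still to be populated. */
static unsigned int populate_order(bool hap_enabled, unsigned long pages_left)
{
    if (hap_enabled && pages_left >= (1ul << ORDER_2M))
        return ORDER_2M;
    return ORDER_4K;
}
```

With HAP off this always falls back to 4KB pages, regardless of paging mode, which is why the same kind of change would also cover the hap=0 case on x86_64.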
Comment 25 Bhavna Sarathy 2009-06-18 10:48:21 EDT
(In reply to comment #21)
> Bhavna,
>      I will indeed give this patch a try.  It looks like this also should fix
> the hap=0 case on x86_64, which was another case that I was worried about. 
> I'll report back my testing results soon.  Are you planning to push a patch
> similar to this to upstream Xen?
> Chris Lalancette  

Thanks Chris, please let me know when I can submit patch to the mailing list and get it in the queue.  Wei will check, and submit patch upstream if needed.
Bhavna
Comment 26 Chris Lalancette 2009-06-18 16:13:10 EDT
(In reply to comment #25)
> (In reply to comment #21)
> > Bhavna,
> >      I will indeed give this patch a try.  It looks like this also should fix
> > the hap=0 case on x86_64, which was another case that I was worried about. 
> > I'll report back my testing results soon.  Are you planning to push a patch
> > similar to this to upstream Xen?
> > Chris Lalancette  
> 
> Thanks Chris, please let me know when I can submit patch to the mailing list
> and get it in the queue.  Wei will check, and submit patch upstream if needed.

OK, seems to fix the problems in my testing, at least.  I only tested out a few situations, but it looks good to me so far.  I'll do a bit more testing tomorrow.  Thanks for looking at this and the quick turnaround!

Chris Lalancette
Comment 27 Bhavna Sarathy 2009-06-19 13:43:55 EDT
Posted to RHML list, please set all BZ ACKs.
Comment 28 Bhavna Sarathy 2009-06-19 13:45:13 EDT
> OK, seems to fix the problems in my testing, at least.  I only tested out a few
> situations, but it looks good to me so far.  I'll do a bit more testing
> tomorrow.  Thanks for looking at this and the quick turnaround!
> 
> Chris Lalancette  

No problem, thanks for the testing at your end.

Bhavna
Comment 31 Don Zickus 2009-06-30 16:22:27 EDT
in kernel-2.6.18-156.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
Comment 33 Chris Ward 2009-07-03 14:44:52 EDT
~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.
Comment 34 Chris Ward 2009-07-10 15:13:46 EDT
~~ Attention Partners - RHEL 5.4 Snapshot 1 Released! ~~

RHEL 5.4 Snapshot 1 has been released on partners.redhat.com. If you have already reported your test results, you can safely ignore this request. Otherwise, please notice that there should be a fix available now that addresses this particular request. Please test and report back your results here, at your earliest convenience. The RHEL 5.4 exception freeze is quickly approaching.

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Do not flip the bug status to VERIFIED. Instead, please set your Partner ID in the Verified field above if you have successfully verified the resolution of this issue. 

Further questions can be directed to your Red Hat Partner Manager or other appropriate customer representative.
Comment 35 Frank Arnold 2009-08-07 07:08:33 EDT
Verification from AMD's side


All tests were performed on the following machine:
  Hostname: schwertleite.osrc.amd.com
  Platform: Toonie 2
  CPUs    : 2x AMD Family 16, Model 4, Stepping 2 (Shanghai)
  Memory  : 16 GB


Basic functionality tests:
  - Installed RHEL 5.4 i386/x86_64
  - Changed hypervisor parameters to match the matrix mentioned in comment #20
  - Started different guests and verified the memory seen inside the guests
  - For the guests run on 32-bit Xen without HAP it wasn't possible to
    start guests with 8 GB assigned, thus reduced it to 7GB
  - For guests started on 32-bit Xen with HAP enabled it is not possible
    to start them with more than 3840 MB (hardware limitation in conjunction
    with HAP, correctly errors out while trying to create the guest)

  1. kernel-xen-2.6.18-160.el5.i686, hap=0 hap_1gb=0
     - Windows 2008 32-bit guest, 1024MB  [ PASS ]
     - Windows 2008 32-bit guest, 7168MB  [ PASS ]
     - RHEL 5.3 32-bit guest,     1024MB  [ PASS ]
     - RHEL 5.3 32-bit guest,     7168MB  [ PASS ]
     - RHEL 5.3 32-bit PAE guest, 1024MB  [ PASS ]
     - RHEL 5.3 32-bit PAE guest, 7168MB  [ PASS ]

  2. kernel-xen-2.6.18-160.el5.i686, hap=1 hap_1gb=0
     - Windows 2008 32-bit guest, 1024MB  [ PASS ]
     - Windows 2008 32-bit guest, 3840MB  [ PASS ]
     - RHEL 5.3 32-bit guest,     1024MB  [ PASS ]
     - RHEL 5.3 32-bit guest,     3840MB  [ PASS ]
     - RHEL 5.3 32-bit PAE guest, 1024MB  [ PASS ]
     - RHEL 5.3 32-bit PAE guest, 3840MB  [ PASS ]

  3. kernel-xen-2.6.18-160.el5.i686, hap=1 hap_1gb=1
     - Windows 2008 32-bit guest, 1024MB  [ PASS ]
     - Windows 2008 32-bit guest, 3840MB  [ PASS ]
     - RHEL 5.3 32-bit guest,     1024MB  [ PASS ]
     - RHEL 5.3 32-bit guest,     3840MB  [ PASS ]
     - RHEL 5.3 32-bit PAE guest, 1024MB  [ PASS ]
     - RHEL 5.3 32-bit PAE guest, 3840MB  [ PASS ]

  4. kernel-xen-2.6.18-161.el5.x86_64, hap=0 hap_1gb=0
     - Windows 2008 32-bit guest, 1024MB  [ PASS ]
     - Windows 2008 32-bit guest, 8192MB  [ PASS ]
     - Windows 2008 64-bit guest, 1024MB  [ PASS ]
     - Windows 2008 64-bit guest, 8192MB  [ PASS ]
     - RHEL 5.3 32-bit guest,     1024MB  [ PASS ]
     - RHEL 5.3 32-bit guest,     8192MB  [ PASS ]
     - RHEL 5.3 32-bit PAE guest, 1024MB  [ PASS ]
     - RHEL 5.3 32-bit PAE guest, 8192MB  [ PASS ]
     - RHEL 5.3 64-bit guest,     1024MB  [ PASS ]
     - RHEL 5.3 64-bit guest,     8192MB  [ PASS ]

  5. kernel-xen-2.6.18-161.el5.x86_64, hap=1 hap_1gb=0
     - Windows 2008 32-bit guest, 1024MB  [ PASS ]
     - Windows 2008 32-bit guest, 8192MB  [ PASS ]
     - Windows 2008 64-bit guest, 1024MB  [ PASS ]
     - Windows 2008 64-bit guest, 8192MB  [ PASS ]
     - RHEL 5.3 32-bit guest,     1024MB  [ PASS ]
     - RHEL 5.3 32-bit guest,     8192MB  [ PASS ]
     - RHEL 5.3 32-bit PAE guest, 1024MB  [ PASS ]
     - RHEL 5.3 32-bit PAE guest, 8192MB  [ PASS ]
     - RHEL 5.3 64-bit guest,     1024MB  [ PASS ]
     - RHEL 5.3 64-bit guest,     8192MB  [ PASS ]

  6. kernel-xen-2.6.18-161.el5.x86_64, hap=1 hap_1gb=1
     - Windows 2008 32-bit guest, 1024MB  [ PASS ]
     - Windows 2008 32-bit guest, 8192MB  [ PASS ]
     - Windows 2008 64-bit guest, 1024MB  [ PASS ]
     - Windows 2008 64-bit guest, 8192MB  [ PASS ]
     - RHEL 5.3 32-bit guest,     1024MB  [ PASS ]
     - RHEL 5.3 32-bit guest,     8192MB  [ PASS ]
     - RHEL 5.3 32-bit PAE guest, 1024MB  [ PASS ]
     - RHEL 5.3 32-bit PAE guest, 8192MB  [ PASS ]
     - RHEL 5.3 64-bit guest,     1024MB  [ PASS ]
     - RHEL 5.3 64-bit guest,     8192MB  [ PASS ]


Reproducing this bug on the above mentioned hardware
  - Installed a RHEL 5.3 32-bit Xen server
  - Explicitly turned off HAP
  - Started a RHEL 5.3 32-bit PAE guest with 1024MB memory and 1 VCPU
  - It took 3 cycles (rebooting host, starting guest) to reproduce the issue


Overnight verification
  - Installed RHEL 5.4 Snapshot 5 32-bit Xen server
  - Explicitly turned off HAP and IOMMU
  - Set up the guest for immediate poweroff after startup
    (added /sbin/poweroff to /etc/rc.local)
  - Invoked the following script from the host's /etc/rc.local:

    #!/bin/bash
    echo "$(/bin/date)  Starting guest" >>/root/503737.log
    /usr/sbin/xm create /xen/images/guest.svm
    while /bin/true; do
        /bin/sleep 30
        if [[ $(/usr/sbin/xm list | /usr/bin/wc -l) -eq 2 ]]; then
            echo "$(/bin/date)  Restarting system" >>/root/503737.log
            /usr/bin/reboot
        fi
    done

  - So basically the host was rebooted, the guest was started, the host
    waited for the guest to power down again, and then rebooted itself
  - This procedure ran for about 15 hours, resulting in 150 successful cycles
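
For a feel of what 150 clean cycles mean, the arithmetic can be sketched as below. The per-cycle failure probability is an assumption: a value of 1/3 is only a loose reading of "it took 3 cycles to reproduce" on the unfixed system, and the cycles are treated as independent. The helper names are made up for this example.

```c
/* Average minutes per reboot/boot/poweroff cycle. */
static double minutes_per_cycle(double hours, int cycles)
{
    return hours * 60.0 / cycles;
}

/* Probability of seeing `cycles` consecutive passes if every cycle
 * fails independently with probability p_fail (an assumed value). */
static double chance_of_clean_run(double p_fail, int cycles)
{
    double p = 1.0;
    for (int i = 0; i < cycles; i++)
        p *= 1.0 - p_fail;
    return p;
}
```

15 hours over 150 cycles works out to 6 minutes per cycle. Under the assumed 1/3 failure rate, the chance of 150 passes in a row on an unfixed system is far below one in a billion, so the overnight run is strong evidence even if the true failure rate were several times lower.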
Comment 37 errata-xmlrpc 2009-09-02 04:18:09 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html
