Description of problem:
We see random kernel panics after installing RHEL 4.5 as a paravirtualized guest
of RHEL 5.0. The RHEL 5.0 kernel is 2.6.18-8.1.1.el5xen. The RHEL 5.0 system
is up to date as of yesterday 3/22/07.
Version-Release number of selected component (if applicable):
The RHEL 4.5 kernel is 2.6.9-48.ELxenU.
We can't reliably reproduce it yet. It has happened during the first reboot the
installer does. It has also happened after the system has been up for a little
while (tens of minutes at least). We also have guests on which it has not
happened.
Steps to Reproduce:
Here's the output from one panic. It's what was on the screen after we ran 'xm
create -c rhel45_1'. We didn't have the serial console set up, so presumably
there's a lot missing between "No controller found" and "cut here". The serial
console is set up now, so hopefully we'll get another panic and be able to
provide more details.
tc: IRQ 8 is not free.
i8042.c: No controller found.
------------[ cut here ]------------
kernel BUG at arch/i386/mm/pgtable-xen.c:306!
invalid operand: 0000 [#1]
Modules linked in: dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod xenblk sd_mod
EIP: 0061:[<c011163a>] Not tainted VLI
EFLAGS: 00010282 (2.6.9-48.ELxenU)
EIP is at pgd_ctor+0x1d/0x26
eax: fffffff4 ebx: 00000000 ecx: f5392000 edx: 00000000
esi: c2202d80 edi: ed785860 ebp: 00000001 esp: ec4cfde4
ds: 007b es: 007b ss: 0068
Process hotplug (pid: 465, threadinfo=ec4cf000 task=ec4d57f0)
Stack: c0141a69 ecbe0000 c2202d80 00000001 ecbe0000 ed785860 c2202d80 c2202e40
c0141beb c2202d80 ed785860 00000001 c2202d80 ed785860 ecbe0000 00000010
00000001 000000d0 c2229080 0000000c c2202e08 c2202d80 c0141dda c2202d80
Code: 74 02 66 a5 a8 01 74 01 a4 5e 5b 5e 5f c3 80 3d 04 f7 2e c0 00 75 1c 6a 20
6a 00 ff 74 24 0c e8 ce 37 00 00 83 c4 0c 85 c0 74 08 <0f> 0b 32 01 8e 2a 27
c0 c3 80 3d 04 f7 2e c0 00 75 0d c7 44 24
<0>Fatal exception: panic in 5 seconds
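A couple of details can be read straight out of the dump above. The sketch below is only an illustration in Python, under two assumptions that the report itself does not confirm: that eax holds the return value of the call made just before the BUG (the code bytes show "85 c0 74 08", a test of eax, immediately before the trapping "0f 0b"), and that the two bytes following the "0f 0b" opcode are the source line number stored little-endian, as the i386 BUG() macro of that era does with verbose BUG reporting enabled. The helper names are made up for this example.

```python
import errno


def to_signed32(value):
    """Interpret a 32-bit register value as a signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def bug_line_from_code(code_bytes):
    """Return the 16-bit little-endian value stored right after ud2 (0f 0b)."""
    idx = code_bytes.index(bytes([0x0F, 0x0B]))
    return int.from_bytes(code_bytes[idx + 2:idx + 4], "little")


# eax as printed in the register dump.
eax = to_signed32(0xFFFFFFF4)

# The bytes around the trapping instruction, copied from the "Code:" line.
code = bytes.fromhex("85c07408" "0f0b" "3201")

# Print the signed value and the symbolic errno name, if the platform knows it.
print(eax, errno.errorcode.get(-eax))
# Print the line number embedded after ud2.
print(bug_line_from_code(code))
```

If those assumptions hold, eax is -12 (ENOMEM on Linux) and the embedded line number is 306, which matches the "pgtable-xen.c:306" in the BUG message. That would mean the allocation made in pgd_ctor failed for lack of suitable memory.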
Please provide the following info:
-- xen config file (from /etc/xen/)
-- 32-bit guest on 32-bit hypervisor/kernel, or
   64-bit guest on 64-bit hypervisor/kernel
-- total memory in system (config file should show guest mem allocation)
We moved on long ago.
When this happened I talked to Red Hat support and they said "probably fixed in
the soon-to-be-released 4.5 but you can't have that to test it out." So I gave up.
I'd resolve the bug but I'm not sure what status is appropriate.
Please leave the bug open; I think I now have a fix, and I will need it for
I believe we need a combination of this c/s:
Along with fixing up "xen_pfn_to_cr3" in drivers/xen/core/smpboot.c to fix this
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
release.
OK. I was able to figure out how to reliably reproduce it here:
1) 16GB box
2) Create one guest that is large (say, 7200MB)
3) "xm info | grep free_memory"
4) Create a second guest that is exactly the size from the last command
Now that I can do it reliably, I'll try out a few things to see whether this is
truly fixed already.
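Steps 3 and 4 above can be scripted. This is a minimal sketch in Python, assuming "xm info" prints one "key : value" pair per line with free_memory given in MB. The sample text and the helper name parse_free_memory_mb are made up for illustration; the sample is not captured output.

```python
def parse_free_memory_mb(xm_info_text):
    """Return the free_memory value (in MB) from 'xm info'-style text."""
    for line in xm_info_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "free_memory":
            return int(value.strip())
    raise ValueError("free_memory not found in xm info output")


# Illustrative sample only; real output has many more fields.
sample = """\
total_memory           : 15359
free_memory            : 7524
"""

second_guest_mb = parse_free_memory_mb(sample)

# Line to put in the second guest's config file so it takes all free memory.
print("memory = %d" % second_guest_mb)
```

On a real dom0 the text would come from running "xm info" (for example through subprocess) rather than from the hard-coded sample.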
Tested so far on RHEL 5.0 dom0, xm info reports 15359MB total memory, dom0
clamped to 512MB of memory:
a) 2 RHEL 4.5 guests, one 7200MB, the second 7524MB to make sure free_memory ==
0. Result: first domain starts properly, second panics with the stack trace from
earlier in this BZ.
b) 1 RHEL 4.5, 1 4.6 guest, same sizes as above. Result: first one starts
properly, second panics when trying to execve() init.
c) 2 RHEL 5 guests, same sizes as above. Result: both boot OK.
So it seems that, although we are running up against limits in the HV, this may
still be a problem with the RHEL-4 kernel. I'm tracking down the latest failure
in the 4.6 kernel; I'll update when I have more.
OK. I've narrowed this one down to some of the start-of-day code for the guest.
In particular, it's not always telling the HV the correct address of
startup_32; I think this manifests itself on large memory because of some
wraparound or something like that, but I haven't confirmed 100% yet.
Regardless, even with a fix like 5.0 has, I'm still having minor problems. The
patch should end up being fairly simple; I just have to work through the
remainder of the problem.
Created attachment 159394 [details]
Fix for the > 4GB issue
My last update was kind of correct, but I now have a much better idea about
what is going on. Basically there are two bugs here:
1) We are not telling the hypervisor that it is allowed to put our pagetable
stuff over 4GB. I believe this is restricting the amount of low memory it has
available for this.
2) We are not correctly saving and restoring the entire cr3 value on task
switch. This causes some bits to be lost and bad things to happen.
Both of these problems should be fixed by the attached patch.
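To illustrate the second bug: as far as I recall, the 32-bit PAE Xen interface packs a machine frame number into the 32-bit cr3 by rotating it left by 12 bits (the xen_pfn_to_cr3 / xen_cr3_to_pfn macros), so the frame bits that sit above 4GB end up in the low 12 bits of cr3. The Python sketch below mimics that encoding under that assumption and shows what is lost if the low bits are dropped, for example by treating cr3 as a plain page-aligned address. The example frame number and the masking step are hypothetical; they are not taken from the attached patch.

```python
MASK32 = 0xFFFFFFFF


def xen_pfn_to_cr3(pfn):
    """Rotate a frame number left by 12 bits within a 32-bit word."""
    return ((pfn << 12) | (pfn >> 20)) & MASK32


def xen_cr3_to_pfn(cr3):
    """Inverse rotation: recover the frame number from the cr3 encoding."""
    return ((cr3 >> 12) | (cr3 << 20)) & MASK32


# Hypothetical frame above the 4GB boundary (4GB / 4KB pages = 0x100000).
pfn = 0x123456

cr3 = xen_pfn_to_cr3(pfn)
# Full value round-trips back to the original frame number.
print(hex(cr3), hex(xen_cr3_to_pfn(cr3)))

# Dropping the low 12 bits discards the part of the frame number above 4GB.
truncated = cr3 & 0xFFFFF000
print(hex(truncated), hex(xen_cr3_to_pfn(truncated)))
```

With the full value the frame number round-trips; with the low bits masked off, the decoded frame points somewhere below 4GB instead, which fits the "some bits lost" description above.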
*** Bug 247545 has been marked as a duplicate of this bug. ***
committed in stream U6 build 55.23. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/
(In reply to comment #17)
> committed in stream U6 build 55.23. A test kernel with this patch is available
> from http://people.redhat.com/~jbaron/rhel4/
That solved the problem!
Thanks a million!
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.
*** Bug 246702 has been marked as a duplicate of this bug. ***