Bug 233543

Summary: Random panics running as a paravirtualized guest of RHEL 5.0
Product: Red Hat Enterprise Linux 4
Reporter: Mark Plaksin <happy>
Component: kernel-xen
Assignee: Chris Lalancette <clalance>
Status: CLOSED ERRATA
Severity: high
Priority: high
Version: 4.5
CC: clalance, daniel.fosselius, ddutile, larsaj, xen-maint
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: RHBA-2007-0791
Doc Type: Bug Fix
Last Closed: 2007-11-15 16:22:45 UTC
Bug Blocks: 234251    
Attachments:
Fix for the > 4GB issue (flags: none)

Description Mark Plaksin 2007-03-23 01:12:53 UTC
Description of problem:
We see random kernel panics after installing RHEL 4.5 as a paravirtualized guest
of RHEL 5.0.  The RHEL 5.0 kernel is 2.6.18-8.1.1.el5xen.  The RHEL 5.0 system
is up to date as of yesterday 3/22/07.

Version-Release number of selected component (if applicable):
The RHEL 4.5 kernel is 2.6.9-48.ELxenU.

How reproducible:
We can't reliably reproduce it yet.  It has happened during the first reboot the
installer does.  It has also happened after the system has been up for a little
while (tens of minutes at least).  We also have guests on which it has not
happened yet.

Steps to Reproduce:
1.
2.
3.
  
Actual results:
Here's the output from one panic.  It's what was on the screen after we ran 'xm
create -c rhel45_1'.  We didn't have the serial console set up, so presumably
there's a lot missing between "No controller found" and "cut here".  The serial
console is set up now, so hopefully we'll get another panic and be able to
provide more details.

tc: IRQ 8 is not free.
i8042.c: No controller found.
------------[ cut here ]------------
kernel BUG at arch/i386/mm/pgtable-xen.c:306!
invalid operand: 0000 [#1]
SMP
Modules linked in: dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod xenblk sd_mod
scsi_mod
CPU:    0
EIP:    0061:[<c011163a>]    Not tainted VLI
EFLAGS: 00010282   (2.6.9-48.ELxenU)
EIP is at pgd_ctor+0x1d/0x26
eax: fffffff4   ebx: 00000000   ecx: f5392000   edx: 00000000
esi: c2202d80   edi: ed785860   ebp: 00000001   esp: ec4cfde4
ds: 007b   es: 007b   ss: 0068
Process hotplug (pid: 465, threadinfo=ec4cf000 task=ec4d57f0)
Stack: c0141a69 ecbe0000 c2202d80 00000001 ecbe0000 ed785860 c2202d80 c2202e40
       c0141beb c2202d80 ed785860 00000001 c2202d80 ed785860 ecbe0000 00000010
       00000001 000000d0 c2229080 0000000c c2202e08 c2202d80 c0141dda c2202d80
Call Trace:        
 [<c0141a69>] cache_init_objs+0x35/0x56
 [<c0141beb>] cache_grow+0xfb/0x187
 [<c0141dda>] cache_alloc_refill+0x163/0x19c
 [<c0141ff5>] kmem_cache_alloc+0x67/0x97
 [<c0111671>] pgd_alloc+0x17/0x336
 [<c01199d4>] mm_init+0xd7/0x116
 [<c01199e4>] mm_init+0xe7/0x116
 [<c0119a3d>] mm_alloc+0x2a/0x31
 [<c0162ae9>] do_execve+0x82/0x210
 [<c0105d79>] sys_execve+0x2c/0x8e
 [<c010737f>] syscall_call+0x7/0xb
Code: 74 02 66 a5 a8 01 74 01 a4 5e 5b 5e 5f c3 80 3d 04 f7 2e c0 00 75 1c 6a 20 6a 00 ff 74 24 0c e8 ce 37 00 00 83 c4 0c 85 c0 74 08 <0f> 0b 32 01 8e 2a 27 c0 c3 80 3d 04 f7 2e c0 00 75 0d c7 44 24
 <0>Fatal exception: panic in 5 seconds


Expected results:


Additional info:

Comment 1 Don Dutile (Red Hat) 2007-06-20 14:52:05 UTC
Please provide the following info:
-- xen config file (from /etc/xen/)
-- /var/log/xen/xend.log
-- /var/log/xen/xend-debug.log
-- whether this is a 32-bit guest on a 32-bit hypervisor/kernel
   or a 64-bit guest on a 64-bit hypervisor/kernel
-- total memory in the system (the config file should show the guest mem allocation)
-- output of "cat /proc/cpuinfo"

TIA... Don

Comment 2 Mark Plaksin 2007-06-20 16:15:59 UTC
We moved on long ago.

When this happened I talked to Red Hat support and they said "probably fixed in
the soon-to-be-released 4.5 but you can't have that to test it out."  So I gave up.

I'd resolve the bug but I'm not sure what status is appropriate.


Comment 3 Chris Lalancette 2007-06-20 16:55:04 UTC
Please leave the bug open; I think I now have a fix, and I will need it for
tracking.

Thanks!
Chris Lalancette

Comment 4 Chris Lalancette 2007-06-20 17:45:56 UTC
I believe that to fix this properly we need a combination of this changeset (c/s):

http://xenbits.xensource.com/xen-unstable.hg?cs=c6efd6c2feaa

along with fixing up "xen_pfn_to_cr3" in drivers/xen/core/smpboot.c.

Chris Lalancette

Comment 8 RHEL Program Management 2007-07-10 17:24:24 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 9 Chris Lalancette 2007-07-10 19:10:40 UTC
OK.  I was able to figure out how to reliably reproduce it here:

1)  16GB box
2)  Create one guest that is large (say, 7200MB)
3)  "xm info | grep free_memory"
4)  Create a second guest that is exactly the size from the last command
5)  Oops!

Now that I can do it reliably, I'll try out a few things to see whether this is
truly fixed already.

Chris Lalancette
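
Steps 3 and 4 of the reproduction above hinge on reading the free_memory value out of "xm info" and using it as the size of the second guest. A minimal sketch of that parsing step, assuming "xm info" prints one "key : value" pair per line with free_memory given in MB; the function name and the sample text are illustrative only and are not taken from this report's logs:

```python
def parse_free_memory_mb(xm_info_output: str) -> int:
    """Return the free_memory value (assumed to be in MB) from `xm info` text."""
    for line in xm_info_output.splitlines():
        # Split "key : value" on the first colon only.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "free_memory":
            return int(value.strip())
    raise ValueError("free_memory not found in xm info output")


# Hypothetical sample; real `xm info` output contains many more fields.
sample = """\
total_memory           : 15359
free_memory            : 7524
"""

# The second guest would be created with exactly this much memory,
# leaving free_memory at 0 on the host.
second_guest_mb = parse_free_memory_mb(sample)
print(second_guest_mb)
```

On a real dom0 the input would be the captured output of "xm info", and the returned value would go into the memory setting of the second guest's config file.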

Comment 10 Chris Lalancette 2007-07-10 21:39:31 UTC
Tested so far on RHEL 5.0 dom0, xm info reports 15359MB total memory, dom0
clamped to 512MB of memory:

a)  2 RHEL 4.5 guests, one 7200MB, the second 7524MB to make sure free_memory ==
0.  Result: first domain starts properly, second panics with the stack trace from
earlier in this BZ.

b)  1 RHEL 4.5, 1 4.6 guest, same sizes as above.  Result: first one starts
properly, second panics when trying to execve() init.

c)  2 RHEL 5 guests, same sizes as above.  Result: both boot OK.

So it seems that, although we are running up against limits in the HV, this may
still be a problem in the RHEL-4 kernel.  I'm tracking down the latest failure in
the 4.6 kernel; I'll update when I have more.

Chris Lalancette

Comment 11 Chris Lalancette 2007-07-13 21:16:19 UTC
OK.  I've narrowed this one down to some of the start-of-day code for the guest.
In particular, it's not always telling the HV the correct address of
startup_32; I think this manifests itself on large-memory systems because of
some kind of wraparound, but I haven't confirmed that 100% yet.
Regardless, even with a fix like the one 5.0 has, I'm still having minor
problems.  The patch should end up being fairly simple; I just have to work
through the remainder of the problem.

Chris Lalancette


Comment 12 Chris Lalancette 2007-07-17 01:18:03 UTC
Created attachment 159394 [details]
Fix for the > 4GB issue

My last update was kind of correct, but I now have a much better idea about
what is going on.  Basically there are two bugs here:

1)  We are not telling the hypervisor that it is allowed to put our pagetable
stuff over 4GB.  I believe this is restricting the amount of low memory it has
available for this.

2)  We are not correctly saving and restoring the entire cr3 value on task
switch.  This causes some bits to be lost and bad things to happen.

Both of these problems should be fixed by the attached patch.

Chris Lalancette
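
To illustrate point 2 in the comment above: on a 32-bit PAE guest the page directory can live in a frame above 4GB, so its frame number does not fit in the page-aligned part of a 32-bit cr3. The sketch below is modelled on the rotate-by-12 packing used for xen_pfn_to_cr3 in Xen's public 32-bit x86 headers; treat the exact definitions as an assumption. drop_low_bits is a hypothetical stand-in for any save/restore path that keeps only the page-aligned part of cr3; the report does not say exactly how the bits were lost in the RHEL 4 kernel.

```python
MASK32 = 0xFFFFFFFF


def xen_pfn_to_cr3(pfn: int) -> int:
    """Pack a frame number into 32 bits by rotating it left by 12 (assumed encoding)."""
    return ((pfn << 12) | (pfn >> 20)) & MASK32


def xen_cr3_to_pfn(cr3: int) -> int:
    """Inverse of xen_pfn_to_cr3: rotate right by 12."""
    return ((cr3 >> 12) | (cr3 << 20)) & MASK32


def drop_low_bits(cr3: int) -> int:
    """Hypothetical lossy path: keep only the page-aligned part of cr3."""
    return cr3 & 0xFFFFF000


low_pfn = 0x0ABCDE   # page directory frame below 4GB
high_pfn = 0x1ABCDE  # page directory frame above 4GB

for pfn in (low_pfn, high_pfn):
    cr3 = xen_pfn_to_cr3(pfn)
    intact = xen_cr3_to_pfn(cr3)
    damaged = xen_cr3_to_pfn(drop_low_bits(cr3))
    print(f"pfn={pfn:#x} cr3={cr3:#010x} intact={intact:#x} damaged={damaged:#x}")
```

Under these assumptions the frame below 4GB survives the lossy path unchanged, while the frame above 4GB comes back as a different frame. That would be consistent with the failure only showing up once the host is full enough for guest page tables to land in high memory.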

Comment 16 Chris Lalancette 2007-07-25 17:47:40 UTC
*** Bug 247545 has been marked as a duplicate of this bug. ***

Comment 17 Jason Baron 2007-07-26 14:19:27 UTC
Committed in stream U6 build 55.23. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/


Comment 22 Lars Jonsson 2007-08-06 13:02:28 UTC
(In reply to comment #17)
> committed in stream U6 build 55.23. A test kernel with this patch is available
> from http://people.redhat.com/~jbaron/rhel4/
> 

That solved the problem! 
Thanks a million!

Comment 28 errata-xmlrpc 2007-11-15 16:22:45 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0791.html


Comment 33 Chris Lalancette 2008-02-27 02:55:18 UTC
*** Bug 246702 has been marked as a duplicate of this bug. ***