Bug 220592 - kernel boot panic with >=64GB memory
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.0
Hardware: x86_64 Linux
Priority: medium   Severity: high
Assigned To: Chris Lalancette
Duplicates: 223104
Blocks: 197865 200812 222082 223104 223985 227613 230117
Reported: 2006-12-22 07:02 EST by Prasanna
Modified: 2009-06-19 06:51 EDT
CC: 17 users

Fixed In Version: RHBA-2007-0959
Doc Type: Bug Fix
Last Closed: 2007-11-07 14:17:56 EST


Attachments
Screenshot of the kernel panic (80.65 KB, image/pjpeg)
  2006-12-22 07:02 EST, Prasanna
Kernel panic log file (15.98 KB, text/plain)
  2006-12-26 04:12 EST, Prasanna
Screenshots of memory, cpu, process state of the working non-xen kernel (153.28 KB, application/x-zip-compressed)
  2006-12-26 04:17 EST, Prasanna
dmesg log file of the working non-xen kernel (30.21 KB, text/plain)
  2006-12-26 04:25 EST, Prasanna
Output of sysreport command on a working non-xen kernel (290.29 KB, application/octet-stream)
  2006-12-27 03:54 EST, Prasanna
'xm dmesg' from a functioning 64G system (14.07 KB, text/plain)
  2007-01-17 21:26 EST, David Aquilina
dmesg log file of the working boot with memory from the same cell (29.47 KB, text/plain)
  2007-01-18 08:05 EST, Prasanna
Output of 'xm dmesg' command (11.78 KB, text/plain)
  2007-01-19 00:50 EST, Prasanna
Serial log with the noreboot option (10.43 KB, text/plain)
  2007-01-19 02:03 EST, Prasanna
Patch posted upstream to let the hypervisor/dom0 kernel boot on large-memory machines (2.88 KB, patch)
  2007-02-21 19:41 EST, Chris Lalancette
RHEL5 Patch: Move the frametable to the top of memory on x86_64 (2.93 KB, patch)
  2007-02-26 10:51 EST, Chris Lalancette
RHEL5 Patch: auto-limit dom0 memory allocation to 32G (1.81 KB, patch)
  2007-02-26 10:54 EST, Chris Lalancette

Description Prasanna 2006-12-22 07:02:39 EST
Description of problem:

OS: RHEL5-RC-snapshot5 (x86_64)

A kernel panic is seen during the reboot after the installation.
This panic is only seen when we enter the Server+Virtualization serial number.

The normal install works fine.

Version-Release number of selected component (if applicable):
RHEL5-RC-snapshot5 (x86_64)

***Screenshot attached***

How reproducible:
Always

Steps to Reproduce:
1. Start the installation
2. Enter the (Server+Virtualization) serial number when prompted
3. Continue with the installation
4. Click on reboot when prompted
5. Kernel panic is seen during reboot

  
Actual results:
A kernel panic is seen during reboot.

Expected results:
The kernel should boot up.

Additional info:
Comment 1 Prasanna 2006-12-22 07:02:40 EST
Created attachment 144270 [details]
Screenshot of the kernel panic
Comment 2 Prasanna 2006-12-26 04:12:22 EST
Created attachment 144362 [details]
Kernel panic log file
Comment 3 Prasanna 2006-12-26 04:17:05 EST
Created attachment 144363 [details]
This zip file contains screenshots of memory, cpu, process state of the working non-xen kernel.
Comment 4 Prasanna 2006-12-26 04:25:41 EST
Created attachment 144364 [details]
This is the dmesg log file of the working non-xen kernel.
Comment 5 Prasanna 2006-12-27 03:54:27 EST
Created attachment 144387 [details]
Output of sysreport command on a working non-xen kernel.
Comment 6 Prasanna 2006-12-28 04:28:56 EST
If the following kernel parameter is added at the boot prompt, the kernel
boots up without any errors:

dom0_mem=512M

Not sure if this is expected behavior.
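
For reference, dom0_mem is a hypervisor option, so in /boot/grub/grub.conf it
goes on the xen.gz line rather than on the dom0 kernel's module line. A
minimal sketch (kernel versions and paths here are illustrative):

title Red Hat Enterprise Linux Server (2.6.18-8.el5xen)
    root (hd0,0)
    kernel /xen.gz-2.6.18-8.el5 dom0_mem=512M
    module /vmlinuz-2.6.18-8.el5xen ro root=/dev/VolGroup00/LogVol00
    module /initrd-2.6.18-8.el5xen.img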
Comment 9 Rik van Riel 2007-01-17 20:05:26 EST
How high can the dom0_mem go without there being problems?

We have a few 8GB and 16GB x86-64 systems for Xen testing and have not noticed
such problems here.  Does dom0_mem=32000m (just as an example) give problems? 
What is the limit where things break?
Comment 12 Rik van Riel 2007-01-17 20:30:28 EST
One other piece of debugging info that would be useful to have:

1) boot the hypervisor with "noreboot" to disable rebooting

2) once it decides it cannot start dom0, please break into the hypervisor
console (ESC ESC ESC on serial console) and type "m" to display information on
the memory pools
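
Note that "noreboot" is a hypervisor option, so it goes on the xen.gz line of
grub.conf, and breaking into the hypervisor console requires that Xen's own
console is on the serial port. A sketch, with illustrative versions, paths,
and baud rate:

kernel /xen.gz-2.6.18-8.el5 noreboot console=com1,vga com1=115200,8n1
module /vmlinuz-2.6.18-8.el5xen ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200
module /initrd-2.6.18-8.el5xen.img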
Comment 14 David Aquilina 2007-01-17 21:23:27 EST
I've performed some tests tonight on the only 64G system I have available, a
4-node AMD Opteron system.  RC 5 Snapshot 7 booted into dom0 properly without
any parameters needed.

Unisys, please perform some testing with different dom0_mem settings: how high
can you push it before it breaks?
Comment 15 David Aquilina 2007-01-17 21:26:09 EST
Created attachment 145875 [details]
'xm dmesg' from a functioning 64G system
Comment 16 Prasanna 2007-01-18 01:18:59 EST
One more important observation.
===============================
The kernel panic is seen when memory from different cells is selected.
But everything works fine if the memory is from the same cell.

Comment 17 Prasanna 2007-01-18 01:21:55 EST
The same kernel panic is seen in snapshot7 also.
Comment 18 Prasanna 2007-01-18 05:47:11 EST
When dom0_mem is set to 138M or below, the kernel panic is seen.

Everything works fine when dom0_mem is set to 139M or above.
Comment 19 Stephen Tweedie 2007-01-18 07:05:37 EST
"But everything works fine if the memory is from the same cell."  Is that with
exactly the same amount of visible memory?  Can you please attach a boot log
file from the working boot with memory on the same cell?  That will give us an
idea of which differences are causing the problem.
Comment 20 Prasanna 2007-01-18 08:05:13 EST
Created attachment 145914 [details]
This is the dmesg_log file of the working boot with memory from the same cell
Comment 22 David Aquilina 2007-01-18 10:36:32 EST
(In reply to comment #20)
> Created an attachment (id=145914)
> This is the dmesg_log file of the working boot with memory from the same cell

Can you please also attach the output of 'xm dmesg'? Thanks! 
Comment 23 Rik van Riel 2007-01-18 10:46:51 EST
OK, I think we have narrowed down the issue to Xen being unable to start up a
domain 0 that's too big.

Would it be an acceptable workaround for Unisys if we simply limited the
maximum size of domain 0 to 16 or 32GB?

After all, if somebody is going to use virtualization, they'll do so to carve
the machine up into multiple virtual machines anyway. The rest of the memory
will still be available for guests.
Comment 24 Bruce Vessey 2007-01-18 13:56:32 EST
I'll answer on behalf of Prasanna.  Limiting the size of dom0 like you propose
should be fine.  In fact, we're looking into whether we'll recommend to our
customers that they include a "dom0_mem=512M" directive anyway.  We're concerned
that if a huge amount of memory is initially assigned to dom0, and we then need
to balloon the memory down to reassign it to other domains, the time required to
do the ballooning may be excessive.
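
The cost of that ballooning can be measured directly with the standard xm
tooling; a rough sketch (domain name and target value are illustrative):

# ask the balloon driver to shrink dom0 to a 512 MB target (value is in MB)
time xm mem-set Domain-0 512
# then watch the memory actually drain back to the hypervisor
xm list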
Comment 26 Prasanna 2007-01-19 00:50:17 EST
Created attachment 145966 [details]
Output of 'xm dmesg' command
Comment 27 Prasanna 2007-01-19 02:03:41 EST
Created attachment 145968 [details]
Serial log with the noreboot option
Comment 28 Prasanna 2007-01-19 02:10:07 EST
> 1) boot the hypervisor with "noreboot" to disable rebooting
> 2) once it decides it cannot start dom0, please break into the hypervisor
> console (ESC ESC ESC on serial console) and type "m" to display information
> on the memory pools

The above file ser2.log is the log from the serial terminal after specifying
the noreboot option.

I'm not able to break into the hypervisor console by pressing ESC ESC ESC on
the serial console.

The last thing I see is the following message, and nothing happens after that
(even after pressing ESC ESC ESC or m):

(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Not enough RAM for domain 0 allocation.
(XEN) ****************************************
(XEN)
(XEN) Manual reset required ('noreboot' specified)
Comment 29 Prasanna 2007-01-19 02:13:48 EST
> When booting with a dom0_mem parameter, please ensure that you can use the
> remainder of the memory in the system to create Xen guests.

Yes, we are able to create Xen guests with the remainder of the memory in the
system.
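
For completeness, guest creation on a RHEL5 Xen host goes through a domain
config file plus "xm create"; a minimal illustrative sketch (the name, sizes,
and disk path are made up):

# /etc/xen/guest1
name       = "guest1"
memory     = 4096
disk       = [ 'phy:/dev/VolGroup00/guest1,xvda,w' ]
vif        = [ '' ]
bootloader = "/usr/bin/pygrub"

The guest is then started with "xm create guest1".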
Comment 32 Chris Lalancette 2007-01-22 16:01:42 EST
Can you please try booting with "mem=64G" (and possibly lower) on the hypervisor
command line?  This will limit, of course, the amount of memory the hypervisor
can hand out to guest domains, but I am curious as to whether this will actually
allow dom0 to boot (it helped in some other cases).
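
As with dom0_mem, this mem= option is parsed by the hypervisor, so it goes on
the xen.gz line of grub.conf; an illustrative line (version string is an
example only):

kernel /xen.gz-2.6.18-8.el5 mem=64G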

Thanks,
Chris Lalancette
Comment 33 Chris Lalancette 2007-01-22 16:13:44 EST
*** Bug 223104 has been marked as a duplicate of this bug. ***
Comment 34 Chris Lalancette 2007-01-22 17:13:08 EST
I've uploaded a test package here:

http://people.redhat.com/clalance/rpms/kernel-xen-2.6.18-4.el532gtest.x86_64.rpm

It has a patch that automatically limits the dom0 memory to 32G at boot time. 
Could everyone associated with this please give this kernel a whirl and see if
it helps out?  We are under a little time pressure, so the sooner we get test
results the better.

Thanks,
Chris Lalancette
Comment 36 Larry Newitt 2007-01-23 08:54:03 EST
Chris,

I tested with 256GB of memory.  Your test kernel failed to boot.  I did not
use any dom0_mem or mem parameters.  I got the usual "not enough memory for
dom0" panic.  However, looking at the memory available that was displayed
during the boot, I am not sure the kernel actually restricted the memory.

I then tested using the xen 2.6.18-2 kernel, which I received to correct a
problem installing fully virtualized VMs.

Using the mem= parameter in combination with the dom0_mem parameter always set
to 512M, I was able to boot the system with up to a mem= setting of 92GB.  Any
higher setting gave me a kernel panic or the "not enough memory for dom0"
error.  This is consistent with my previous testing using physical memory,
because I have to increment the physical memory in 32GB amounts, so 64GB was
OK and 96GB failed.

Comment 37 Chris Lalancette 2007-01-23 09:06:16 EST
Right, the kernel was going to restrict the memory available to dom0, not the
hypervisor (it was the equivalent of automatically doing dom0_mem=32G).  We were
working on the (mistaken) assumption that that would help; we were obviously
wrong :).  Thanks for the testing, though; we will have to do this another way.

Chris Lalancette
Comment 38 Tim Burke 2007-01-23 09:41:55 EST
The RHEL5 RC game plan for this is to release-note it, which is represented by
bug 223985.  Hence this issue should not be a RHEL5 RC blocker.

Removing the 5.0.0 flag and replacing it for this issue with the 5.1.0 flag.
Comment 43 Chris Lalancette 2007-01-23 11:55:04 EST
Can both IBM and Unisys confirm that booting these large memory machines with
"dom0_mem=512M mem=64G" works?  Assuming that this is the case, I will change
the associated release notes to match the above.

Thanks,
Chris Lalancette
Comment 46 Prasanna 2007-01-24 02:08:27 EST
This is to confirm that the Unisys system boots up with "dom0_mem=512M
mem=64G" and that virtual machine creation also succeeds.
Comment 49 Samuel Benjamin 2007-01-24 17:19:45 EST
Adding Dell to this bug.
Comment 51 Chris Lalancette 2007-01-25 11:41:22 EST
Unisys,
     Can you do a couple of tests for us?

1)  boot one of the boxes with loads of memory with just "mem=64G"?
2)  boot one of the boxes with loads of memory with "mem=64G dom0_mem=4G"?

Some people have suggested that dom0_mem=512M is too low for such a large
system.  Assuming that one or both of the above work, we will change the release
note to suggest them.

Chris Lalancette
Comment 53 Larry Newitt 2007-01-28 16:11:05 EST
I tested with a system that had 128GB of memory.

With the mem=64G parameter and no dom0_mem parameter, the system would not
boot; it got the "Panic on CPU 0:" error.  I found the largest value I could
successfully use was mem=58G.

Using both the mem=64G and dom0_mem=4G parameters, I was able to boot
successfully.


Comment 58 Chris Lalancette 2007-02-08 17:12:37 EST
In response to riel's comment #12, and also to provide what XenSource asked
for, here is the output from the hypervisor just before panicking (on a 96GB
box):

(XEN) Physical memory information:
(XEN)     Xen heap: 11116kB free
(XEN)     DMA heap: 35864kB free
(XEN)     Dom heap: 99246024kB free
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Not enough RAM for domain 0 allocation.
(XEN) ****************************************
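
(For scale: 99246024kB is roughly 95GB free in the Dom heap, while the DMA
heap is down to about 35MB.  The dom0 allocation that fails is evidently being
made from the nearly exhausted DMA zone rather than the huge Dom heap;
comment #59 below works out why.)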

Chris Lalancette
Comment 59 Chris Lalancette 2007-02-21 19:39:54 EST
OK, I think I understand this enough now.  Here is what is happening:

The hypervisor starts up with 1GB in the DMA zone.  Two large allocations come
out of the DMA zone: the frame table (in init_frametable()) and the memory for
dom0 (in construct_dom0()).  With a lot of memory in the box, most of the DMA
zone gets allocated during init_frametable; so much so, in fact, that there is
no room left for the allocation in construct_dom0, and dom0 fails to boot with:

(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Not enough RAM for domain 0 allocation.
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...
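
To make the arithmetic concrete, here is a toy model of the failure.  This is
not the Xen code; the pool size and the per-page frame-table entry size are
illustrative assumptions:

#include <stdio.h>

/* Toy model: both big boot-time allocations draw on the same ~1GB
 * DMA pool, and the frame table grows with total machine memory. */
#define DMA_POOL_MB       1024   /* assumed size of the DMA zone */
#define FRAME_ENTRY_BYTES 40     /* assumed frame-table bytes per 4K page */

static long dma_free_mb = DMA_POOL_MB;

static int dma_alloc(const char *who, long mb)
{
    if (mb > dma_free_mb) {
        printf("Panic: %s wants %ldMB, only %ldMB left in DMA zone\n",
               who, mb, dma_free_mb);
        return -1;
    }
    dma_free_mb -= mb;
    printf("%s: took %ldMB from DMA zone, %ldMB left\n", who, mb, dma_free_mb);
    return 0;
}

int main(void)
{
    long ram_gb = 96;                        /* total machine memory */
    long pages = ram_gb * 1024L * 1024 / 4;  /* number of 4K pages */
    long frametable_mb = pages * FRAME_ENTRY_BYTES / (1024 * 1024);

    if (dma_alloc("init_frametable", frametable_mb))  /* ~960MB at 96GB */
        return 1;
    return dma_alloc("construct_dom0", 512) ? 1 : 0;  /* only ~64MB left */
}

At 96GB the model's frame table alone eats roughly 960MB of the ~1GB pool,
leaving nothing usable for the dom0 allocation, which matches the panic above.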

The solution (suggested by Keir Fraser) is to allocate the frame table out of
high memory instead of the DMA zone, which is what the patch here does.  I
tested this on a 96GB machine; without the patch, the machine would reboot as
described above; with the patch, I was able to boot dom0 and create a PV guest
with 92GB of memory.
     I only compile-tested this on ia64, but I don't see anything in it that
should cause problems there.
     Note that an x86_64 RPM with this patch built in is here:

http://people.redhat.com/clalance/rpms/kernel-xen-2.6.18-8.el564Gpatch.x86_64.rpm

Could everyone interested in this bug please download this kernel, test it out,
and report back the results?

Thanks,
Chris Lalancette
Comment 60 Chris Lalancette 2007-02-21 19:41:47 EST
Created attachment 148554 [details]
Patch posted upstream to let the hypervisor/dom0 kernel boot on large-memory machines
Comment 70 Chris Lalancette 2007-02-26 10:51:13 EST
Created attachment 148800 [details]
RHEL5 Patch: Move the frametable to the top of memory on x86_64

This is the RHEL5 version of the patch posted upstream.  It is functionally
equivalent to the last patch, but renames things to be more consistent with
upstream Xen.
Comment 71 Chris Lalancette 2007-02-26 10:54:09 EST
Created attachment 148801 [details]
RHEL5 Patch: auto-limit dom0 memory allocation to 32G

The RHEL5 patch to automatically limit dom0 to 32G, to avoid exhausting the DMA
zone.  Can be overridden by the knowledgeable user with dom0_mem=64G (or
similar).
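
In rough pseudocode terms, the clamping logic looks like the following sketch
(names are illustrative, not the actual patch):

/* 32GB expressed in 4K pages */
#define DOM0_DEFAULT_MAX_PAGES (32UL << (30 - 12))

unsigned long dom0_compute_pages(unsigned long avail_pages,
                                 unsigned long dom0_mem_pages /* 0 if unset */)
{
    if (dom0_mem_pages)                      /* explicit dom0_mem= wins */
        return dom0_mem_pages;
    if (avail_pages > DOM0_DEFAULT_MAX_PAGES)
        return DOM0_DEFAULT_MAX_PAGES;       /* auto-limit to 32GB */
    return avail_pages;
}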
Comment 72 Chris Lalancette 2007-02-26 11:04:34 EST
I've just uploaded the two patches we are proposing for RHEL5.  I also built a
kernel with these two patches here:

http://people.redhat.com/clalance/rpms/kernel-xen-2.6.18-9.el564Gpatch2.x86_64.rpm

Can everyone please download this kernel and make sure it does the right thing
for them?  It should boot on machines of all memory sizes without much issue,
but I'd like confirmation of that.

Chris Lalancette
Comment 74 Prasanna 2007-02-27 05:40:45 EST
Test results for kernel-xen-2.6.18-9.el564Gpatch2.x86_64
=======================================================
All of the scenarios below pass without error/panic on an ES7000 x64 machine:

1) dom0_mem=512M mem=64G
2) dom0_mem=512M mem=93G
3) dom0_mem=512M mem=100G
4) dom0_mem=512M and no mem parameter
5) mem=100G and no dom0_mem parameter
6) no dom0_mem parameter and no mem parameter


Also tried to create a new virtual machine and found no problems.
Comment 75 Prasanna 2007-02-27 05:43:23 EST
All the above scenarios were tried on a machine with 128GB of physical memory.
Comment 76 Chris Lalancette 2007-02-27 09:41:16 EST
Excellent!  Thanks for the testing, I appreciate it.
Comment 77 Jay Turner 2007-03-01 13:49:30 EST
Reluctant QE ack.  This issue doesn't fit the criteria for a day-0 advisory,
since it was known before RC and we declared gold anyway.
Comment 82 Don Zickus 2007-03-27 16:47:06 EDT
in 2.6.18-12.el5
Comment 91 John Poelstra 2007-08-27 14:20:57 EDT
A fix for this issue should have been included in the packages contained in
RHEL5.1-Snapshot3 on partners.redhat.com.

Requested action: Please verify that your issue is fixed as soon as possible to
ensure that it is included in this update release.

After you (Red Hat Partner) have verified that this issue has been addressed,
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent
symptoms of the problem you are having and change the status of the bug to FAILS_QA.

More assistance: If you cannot access bugzilla, please reply with a message to
Issue Tracker and I will change the status for you.  If you need assistance
accessing ftp://partners.redhat.com, please contact your Partner Manager.
Comment 92 Prasanna 2007-08-28 02:01:29 EDT
The scenarios below pass without error/panic on an ES7000 x86_64 machine with
RHEL 5.1 snapshot 3:

1) no dom0_mem parameter and no mem parameter
2) dom0_mem=512M mem=64G


Comment 94 errata-xmlrpc 2007-11-07 14:17:56 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html
