Bug 220592
Description
Prasanna
2006-12-22 12:02:39 UTC
Created attachment 144270 [details]
Screenshot of the kernel panic
Created attachment 144362 [details]
Kernel panic log file
Created attachment 144363 [details]
This zip file contains screenshots of memory, cpu, process state of the working non-xen kernel.
Created attachment 144364 [details]
This is the dmesg log file of the working non-xen kernel.
Created attachment 144387 [details]
Output of sysreport command on a working non-xen kernel.
If the following kernel parameter is added at the boot prompt, the kernel boots up without any errors: dom0_mem=512M

Not sure if this is expected behavior. How high can dom0_mem go without there being problems? We have a few 8GB and 16GB x86-64 systems for Xen testing and have not noticed such problems here. Does dom0_mem=32000m (just as an example) give problems? What is the limit where things break?

One other piece of debugging info that would be useful to have:
1) boot the hypervisor with "noreboot" to disable rebooting
2) once it decides it cannot start dom0, please break into the hypervisor console (ESC ESC ESC on the serial console) and type "m" to display information on the memory pools

I've performed some tests tonight on the only 64G system I have available - a 4-node AMD Opteron system. RC5 Snapshot 7 booted into dom0 properly without any parameters needed. Unisys, please perform some testing with different dom0_mem settings - how high can you push it until it breaks?

Created attachment 145875 [details]
'xm dmesg' from a functioning 64G system
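For anyone reproducing these tests: the dom0_mem, mem, and noreboot options discussed in this bug go on the hypervisor (xen.gz) line of the GRUB entry, not on the dom0 kernel's module line. A sketch of such an entry - the version strings and root device below are placeholders, not values taken from this report:

```
title Red Hat Enterprise Linux Server (xen)
        root (hd0,0)
        kernel /xen.gz-2.6.18-8.el5 dom0_mem=512M mem=64G noreboot
        module /vmlinuz-2.6.18-8.el5xen ro root=/dev/VolGroup00/LogVol00
        module /initrd-2.6.18-8.el5xen.img
```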
One more important observation:
===============================
The kernel panic is seen when memory from different cells is selected, but everything works fine if the memory is from the same cell. The same kernel panic is seen in Snapshot 7 also. When dom0_mem is set to 138M and below, the kernel panic is seen; everything works fine when dom0_mem is set to 139M and above.

"But everything works fine if the memory is from the same cell." Is that with exactly the same amount of visible memory? Can you please attach a boot log file from the working boot with memory on the same cell? That will give us an idea of which differences are causing the problem.

Created attachment 145914 [details]
This is the dmesg_log file of the working boot with memory from the same cell
(In reply to comment #20)
> Created an attachment (id=145914) [edit]
> This is the dmesg_log file of the working boot with memory from the same cell

Can you please also attach the output of 'xm dmesg'? Thanks!

OK, I think we have narrowed down the issue to Xen being unable to start up a domain 0 that's too big. Would it be an acceptable workaround for Unisys if we simply limited the maximum size of domain 0 to 16 or 32GB? After all, if somebody is going to use virtualization, they'll do so to carve the machine up into multiple virtual machines anyway. The rest of the memory will still be available for guests.

I'll answer on behalf of Prasanna. Limiting the size of dom0 as you propose should be fine. In fact, we're looking into whether we'll recommend that our customers include a "dom0_mem=512M" directive anyway. We're concerned that if a huge amount of memory is initially assigned to dom0, and we then need to balloon the memory down to reassign it to other domains, the time required for the ballooning may be excessive.

Created attachment 145966 [details]
Output of 'xm dmesg' command
Created attachment 145968 [details]
Serial log with the 'noreboot' option
<<<<<<<1) boot the hypervisor with "noreboot" to disable rebooting
<<<<<<<2) once it decides it cannot start dom0, please break into the
<<<<<<<hypervisor console (ESC ESC ESC on serial console) and type "m" to
<<<<<<<display information on
<<<<<<<the memory pools
The above file ser2.log is the log from the serial terminal after specifying
the noreboot option.
I'm not able to break into the hypervisor console by pressing ESC ESC ESC on
the serial console...
The last thing I see is the following message and nothing happens after that
(even after pressing ESC ESC ESC or m)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Not enough RAM for domain 0 allocation.
(XEN) ****************************************
(XEN)
(XEN) Manual reset required ('noreboot' specified)
<<<<<<<<<- When booting with a dom0_mem parameter, please ensure that you can
<<<<<<<<<use the remainder of the memory in the system to create Xen guests.

Yes, we are able to create Xen guests with the remainder of the memory in the system.

Can you please try booting with "mem=64G" (and possibly lower) on the hypervisor command line? This will limit, of course, the amount of memory the hypervisor can hand out to guest domains, but I am curious as to whether this will actually allow dom0 to boot (it helped in some other cases).
Thanks,
Chris Lalancette

*** Bug 223104 has been marked as a duplicate of this bug. ***

I've uploaded a test package here:
http://people.redhat.com/clalance/rpms/kernel-xen-2.6.18-4.el532gtest.x86_64.rpm
It has a patch that automatically limits the dom0 memory to 32G at boot time. Could everyone associated with this please give this kernel a whirl and see if it helps out? We are under a little time pressure, so the sooner we get test results the better.
Thanks,
Chris Lalancette

Chris, I tested with 256GB of memory. Your test kernel failed to boot. I did not use any dom0_mem or mem parameters. I got the usual panic, the "not enough memory for dom0" error. However, looking at the memory available displayed during boot, I am not sure the kernel actually restricted the memory.

I then tested using the xen 2.6.18-2 kernel, which I received to correct a problem installing fully virtualized VMs. Using the mem= parameter in combination with the dom0_mem parameter always set to 512M, I was able to boot the system with up to a mem= setting of 92GB. Any higher setting and I got a kernel panic or the "not enough memory for dom0" error. This is consistent with my previous testing using physical memory, because I have to increment the physical memory in 32GB amounts, so 64GB was OK and 96GB failed.

Right, the kernel was going to restrict the memory available to dom0, not the hypervisor (it was the equivalent of automatically doing dom0_mem=32G).
We were working on the (mistaken) assumption that that would help; we were obviously wrong :). Thanks for the testing, though; we will have to do this another way.
Chris Lalancette

The RHEL5 RC gameplan for this is to release note it, which is represented by bug 223985. Hence this issue should not be a RHEL5 RC blocker. Removing the 5.0.0 flag and replacing it for this issue with the 5.1.0 flag.

Can both IBM and Unisys confirm that booting these large memory machines with "dom0_mem=512M mem=64G" works? Assuming that this is the case, I will change the associated release notes to match the above.
Thanks,
Chris Lalancette

This is to confirm that the Unisys system boots up with "dom0_mem=512M mem=64G" and also that virtual machine creation succeeds.

Adding Dell to this bug.

Unisys,
Can you do a couple of tests for us?
1) boot one of the boxes with loads of memory with just "mem=64G"?
2) boot one of the boxes with loads of memory with "mem=64G dom0_mem=4G"?
Some people have suggested that dom0_mem=512M is too low for such a large
system. Assuming that one or both of the above work, we will change the release
note to suggest them.
Chris Lalancette
I tested with a system that had 128GB of memory. With the mem=64G parameter and no dom0_mem parameter, the system would not boot; it got the "Panic on CPU 0:" error. I found the largest value I could successfully use was mem=58G. Using both the mem=64G and dom0_mem=4G parameters, I was able to boot successfully.

In response to riel's comment #12, and also for what XenSource asked for, here is the output from the hypervisor just before panicking (on a 96GB box):

(XEN) Physical memory information:
(XEN) Xen heap: 11116kB free
(XEN) DMA heap: 35864kB free
(XEN) Dom heap: 99246024kB free
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Not enough RAM for domain 0 allocation.
(XEN) ****************************************
Chris Lalancette

OK, I think I understand this enough now. Here is what is happening:
The hypervisor starts up with 1GB in the DMA zone. Two large allocations come
out of the DMA zone; the frame table (in init_frametable()), and the memory for
dom0 (in construct_dom0()). With a lot of memory in the box, most of the DMA
zone gets allocated during init_frametable; so much so, in fact, that there is
no room to make the allocation in construct_dom0, and the dom0 fails to boot with:
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Not enough RAM for domain 0 allocation.
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...
The solution (suggested by Keir Fraser) is to allocate the frame table out
of high memory instead of the DMA zone. The patch here does this. I tested
this out on a 96GB machine; without the patch, the machine would reboot as
described above; with the patch, I was able to boot dom0 and create a PV guest
with 92GB of memory.
I only compile-tested this on ia64, but I don't see anything in it that
should cause problems there.
Note that an x86_64 RPM with this patch built in is here:
http://people.redhat.com/clalance/rpms/kernel-xen-2.6.18-8.el564Gpatch.x86_64.rpm
Could everyone interested in this bug please download this kernel, test it out,
and report back the results?
Thanks,
Chris Lalancette
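The failure mode described in the root-cause analysis above can be illustrated with a small back-of-the-envelope model. This is not Xen code: the pool size (1 GiB DMA zone) and per-page frame-table overhead (32 bytes) below are round-number assumptions chosen for illustration, and the function names only echo the hypervisor routines mentioned in the comment.

```python
# Toy model: both the frame table (init_frametable) and dom0's initial
# memory (construct_dom0) are carved out of one fixed DMA pool; the frame
# table grows with total RAM, starving dom0 on big boxes.

DMA_POOL = 1 << 30      # assumed 1 GiB DMA zone
PAGE_SIZE = 4096        # 4 KiB pages
FRAME_ENTRY = 32        # assumed bytes of frame-table entry per page

def can_start_dom0(total_ram, dom0_mem, frametable_in_dma=True):
    """Return True if dom0's initial allocation fits in the model.

    frametable_in_dma=True mirrors the pre-patch behaviour (frame table
    taken from the DMA zone); False mirrors the post-patch behaviour
    (frame table moved to high memory).
    """
    free = DMA_POOL
    frametable = (total_ram // PAGE_SIZE) * FRAME_ENTRY
    if frametable_in_dma:
        if frametable > free:
            return False
        free -= frametable          # init_frametable() step
    return dom0_mem <= free         # construct_dom0() step

GiB = 1 << 30
# 96 GiB of RAM implies a ~768 MiB frame table under these assumptions,
# leaving too little DMA memory even for a 512 MiB dom0.
print(can_start_dom0(96 * GiB, 512 * 2**20))         # pre-patch: False
print(can_start_dom0(96 * GiB, 512 * 2**20, False))  # post-patch: True
```

With these assumed constants the model reproduces the shape of the reports above: a 64 GiB box squeaks by, while a 96 GiB box panics until the frame table is moved out of the DMA zone.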
Created attachment 148554 [details]
Patch posted to upstream to let the hypervisor/dom0 kernel boot on large memory machines
Created attachment 148800 [details]
RHEL5 Patch: Move the frame table to the top of memory on x86_64
This is the RHEL5 version of the patch posted upstream. It is functionally
equivalent to the last patch, but renames things to be more consistent with
upstream Xen.
Created attachment 148801 [details]
RHEL5 Patch: auto-limit dom0 memory allocation to 32G
The RHEL5 patch to automatically limit dom0 to 32G, to avoid exhausting the DMA
zone. Can be overridden by the knowledgeable user with dom0_mem=64G (or
similar).
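The intent of this second patch can be sketched in a few lines. These are hypothetical names for illustration only - the real change lives in the hypervisor's dom0 construction path - but the behaviour matches the description: the 32G cap applies only when the administrator has not passed dom0_mem= explicitly.

```python
# Sketch of the auto-limit behaviour: dom0 defaults to all of RAM but is
# clamped to 32 GiB; an explicit dom0_mem= overrides the cap entirely.

DOM0_DEFAULT_MAX = 32 << 30   # 32 GiB cap when no dom0_mem= is given

def initial_dom0_size(total_ram, dom0_mem_param=None):
    """Return the initial memory allocation for dom0, in bytes."""
    if dom0_mem_param is not None:
        return dom0_mem_param              # knowledgeable user wins
    return min(total_ram, DOM0_DEFAULT_MAX)
```

So on a 128 GiB machine dom0 would get 32 GiB by default, the remainder staying available for guests, while dom0_mem=64G (or similar) still allows a larger dom0.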
I've just uploaded the two patches we are proposing for RHEL5. I also built a kernel with these two patches here:
http://people.redhat.com/clalance/rpms/kernel-xen-2.6.18-9.el564Gpatch2.x86_64.rpm
Can everyone please download this kernel and make sure it does the right thing for them? It should boot on machines of all memory sizes without too much issue, but I'd like confirmation of that.
Chris Lalancette

Test results for kernel-xen-2.6.18-9.el564Gpatch2.x86_64
========================================================
All the below scenarios pass without error/panic on an ES7000 x64 machine:
1) dom0_mem=512M mem=64G
2) dom0_mem=512M mem=93G
3) dom0_mem=512M mem=100G
4) dom0_mem=512M and no mem parameter
5) mem=100G and no dom0_mem parameter
6) no dom0_mem parameter and no mem parameter
Also tried to create a new virtual machine and found no problems. All the above scenarios were tried on a machine with 128GB of physical memory.

Excellent! Thanks for the testing, I appreciate it.

Reluctant QE ack. This issue doesn't fit the criteria for a day0 advisory since it was known before RC and we declared gold anyway.

in 2.6.18-12.el5

A fix for this issue should have been included in the packages contained in RHEL5.1-Snapshot3 on partners.redhat.com.

Requested action: Please verify that your issue is fixed as soon as possible to ensure that it is included in this update release. After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add a *keyword* of PartnerVerified (leaving the existing keywords unmodified)
If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to FAILS_QA.

More assistance: If you cannot access bugzilla, please reply with a message to Issue Tracker and I will change the status for you.
If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager.

The below scenarios pass without error/panic on an ES7000 x86_64 machine for RHEL 5.1 Snapshot 3:
1) no dom0_mem parameter and no mem parameter
2) dom0_mem=512M mem=64G

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html