Bug 233543
| Summary: | Random panics running as a paravirtualized guest of RHEL 5.0 | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 4 | Reporter: | Mark Plaksin <happy> |
| Component: | kernel-xen | Assignee: | Chris Lalancette <clalance> |
| Status: | CLOSED ERRATA | | |
| Severity: | high | | |
| Priority: | high | | |
| Version: | 4.5 | CC: | clalance, daniel.fosselius, ddutile, larsaj, xen-maint |
| Hardware: | All | | |
| OS: | Linux | | |
| Fixed In Version: | RHBA-2007-0791 | Doc Type: | Bug Fix |
| Last Closed: | 2007-11-15 16:22:45 UTC | | |
| Bug Blocks: | 234251 | | |

Please provide the following info:
-- xen config file (from /etc/xen/)
-- /var/log/xen/xend.log
-- /var/log/xen/xend-debug.log
-- 32-bit guest on 32-bit hypervisor/kernel or 64-bit guest on 64-bit hypervisor/kernel
-- total memory in system (config file should show guest mem allocation)
-- cat /proc/cpuinfo
TIA... Don

We moved on long ago. When this happened I talked to Red Hat support and they said "probably fixed in the soon-to-be-released 4.5 but you can't have that to test it out." So I gave up. I'd resolve the bug but I'm not sure what status is appropriate.

Please leave the bug open; I think I now have a fix, and I will need it for tracking. Thanks!
Chris Lalancette

I believe we need a combination of this c/s:
http://xenbits.xensource.com/xen-unstable.hg?cs=c6efd6c2feaa
along with fixing up "xen_pfn_to_cr3" in drivers/xen/core/smpboot.c to fix this properly.
Chris Lalancette

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

OK. I was able to figure out how to reliably reproduce it here:
1) 16GB box
2) Create one guest that is large (say, 7200MB)
3) "xm info | grep free_memory"
4) Create a second guest that is exactly the size from the last command
5) Oops!
Now that I can do it reliably, I'll try out a few things to see whether this is truly fixed already.
Chris Lalancette

Tested so far on RHEL 5.0 dom0; xm info reports 15359MB total memory, dom0 clamped to 512MB of memory:
a) 2 RHEL 4.5 guests, one 7200MB, the second 7524MB to make sure free_memory == 0. Result: first domain starts properly, second panics with the stack trace from earlier in this BZ.
b) 1 RHEL 4.5 guest, 1 RHEL 4.6 guest, same sizes as above. Result: first one starts properly, second panics when trying to execve() init.
c) 2 RHEL 5 guests, same sizes as above. Result: both boot OK.
So it seems that while we are coming up towards limits in the HV, this may still be a problem with the RHEL-4 kernel. I'm tracking down the latest failure in the 4.6 kernel; I'll update when I have more.
Chris Lalancette

OK. I've narrowed this one down to some of the start-of-day code for the guest. In particular, it's not always telling the HV the correct address of startup_32; I think this manifests itself on large memory because of some wraparound or something like that, but I haven't confirmed 100% yet. Regardless, even with a fix like 5.0 has, I'm still having minor problems. The patch should end up being fairly simple; I just have to work through the remainder of the problem.
Chris Lalancette

Created attachment 159394 [details]
Fix for the > 4GB issue
My last update was kind of correct, but I now have a much better idea of
what is going on. Basically there are two bugs here:
1) We are not telling the hypervisor that it is allowed to put our pagetable
stuff over 4GB. I believe this is restricting the amount of low memory it has
available for this.
2) We are not correctly saving and restoring the entire cr3 value on task
switch. This causes some bits to be lost and bad things to happen.
Both of these problems should be fixed by the attached patch.
Chris Lalancette
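
To illustrate the second bug, here is a minimal Python sketch of how a 32-bit cr3 value can carry a page-table frame that lives above 4GB, and what happens when some of its bits are dropped. The `pfn_to_cr3` and `cr3_to_pfn` helpers are modeled on the rotate-by-12 packing that, as far as I understand it, the 32-bit PAE Xen interface uses for `xen_pfn_to_cr3`; treat that as an assumption and check the Xen headers. `lossy_save` is purely hypothetical and is not the actual RHEL 4 code path; it only stands in for "a save that keeps part of the value".

```python
MASK32 = 0xFFFFFFFF  # emulate a 32-bit register


def pfn_to_cr3(mfn):
    """Pack a machine frame number into 32 bits by rotating left 12 bits.

    Frame bits 20 and up (frames above 4GB) land in the low 12 bits.
    Assumed encoding, modeled on Xen's 32-bit PAE interface.
    """
    return ((mfn << 12) | (mfn >> 20)) & MASK32


def cr3_to_pfn(cr3):
    """Inverse of pfn_to_cr3: rotate right 12 bits within 32 bits."""
    return ((cr3 >> 12) | (cr3 << 20)) & MASK32


def lossy_save(cr3):
    """Hypothetical buggy save: keeps only the page-aligned part of cr3."""
    return cr3 & 0xFFFFF000


if __name__ == "__main__":
    low_mfn = 0x0ABCDE   # frame below 4GB (mfn < 2**20)
    high_mfn = 0x123456  # frame above 4GB (mfn >= 2**20)
    for mfn in (low_mfn, high_mfn):
        cr3 = pfn_to_cr3(mfn)
        restored = cr3_to_pfn(lossy_save(cr3))
        print(hex(mfn), hex(cr3), hex(restored), restored == mfn)
```

Under these assumptions the low frame survives the lossy save unchanged, while the high frame comes back as a different frame number. That matches the observation that the problem only shows up once the hypervisor starts handing out memory above 4GB.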

*** Bug 247545 has been marked as a duplicate of this bug. ***

committed in stream U6 build 55.23. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/

(In reply to comment #17)
> committed in stream U6 build 55.23. A test kernel with this patch is available
> from http://people.redhat.com/~jbaron/rhel4/
That solved the problem! Thanks a million!

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHBA-2007-0791.html

*** Bug 246702 has been marked as a duplicate of this bug. ***

Description of problem:
We see random kernel panics after installing RHEL 4.5 as a paravirtualized guest of RHEL 5.0. The RHEL 5.0 kernel is 2.6.18-8.1.1.el5xen. The RHEL 5.0 system is up to date as of yesterday 3/22/07.

Version-Release number of selected component (if applicable):
The RHEL 4.5 kernel is 2.6.9-48.ELxenU.

How reproducible:
We can't reliably reproduce it yet. It has happened during the first reboot the installer does. It has also happened after the system has been up for a little while (10s of minutes at least). We also have guests on which it has not happened yet.

Steps to Reproduce:
1.
2.
3.

Actual results:
Here's the output from one panic. It's what was on the screen after we ran 'xm create -c rhel45_1'. We didn't have the serial console set up, so presumably there's a lot missing between "No controller found" and "cut here". The serial console is set up now, so hopefully we'll get another panic and be able to provide more details.

tc: IRQ 8 is not free.
i8042.c: No controller found.
------------[ cut here ]------------
kernel BUG at arch/i386/mm/pgtable-xen.c:306!
invalid operand: 0000 [#1]
SMP
Modules linked in: dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod xenblk sd_mod scsi_mod
CPU: 0
EIP: 0061:[<c011163a>] Not tainted VLI
EFLAGS: 00010282 (2.6.9-48.ELxenU)
EIP is at pgd_ctor+0x1d/0x26
eax: fffffff4 ebx: 00000000 ecx: f5392000 edx: 00000000
esi: c2202d80 edi: ed785860 ebp: 00000001 esp: ec4cfde4
ds: 007b es: 007b ss: 0068
Process hotplug (pid: 465, threadinfo=ec4cf000 task=ec4d57f0)
Stack: c0141a69 ecbe0000 c2202d80 00000001 ecbe0000 ed785860 c2202d80 c2202e40
       c0141beb c2202d80 ed785860 00000001 c2202d80 ed785860 ecbe0000 00000010
       00000001 000000d0 c2229080 0000000c c2202e08 c2202d80 c0141dda c2202d80
Call Trace:
[<c0141a69>] cache_init_objs+0x35/0x56
[<c0141beb>] cache_grow+0xfb/0x187
[<c0141dda>] cache_alloc_refill+0x163/0x19c
[<c0141ff5>] kmem_cache_alloc+0x67/0x97
[<c0111671>] pgd_alloc+0x17/0x336
[<c01199d4>] mm_init+0xd7/0x116
[<c01199e4>] mm_init+0xe7/0x116
[<c0119a3d>] mm_alloc+0x2a/0x31
[<c0162ae9>] do_execve+0x82/0x210
[<c0105d79>] sys_execve+0x2c/0x8e
[<c010737f>] syscall_call+0x7/0xb
Code: 74 02 66 a5 a8 01 74 01 a4 5e 5b 5e 5f c3 80 3d 04 f7 2e c0 00 75 1c 6a 20 6a 00 ff 74 24 0c e8 ce 37 00 00 83 c4 0c 85 c0 74 08 <0f> 0b 32 01 8e 2a 27 c0 c3 80 3d 04 f7 2e c0 00 75 0d c7 44 24
<0>Fatal exception: panic in 5 seconds

Expected results:

Additional info:
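
For anyone reading the trace above, the following Python sketch decodes two details from it. It assumes that a BUG() in a 2.6-era i386 kernel is emitted as a `ud2` instruction followed by a 16-bit line number and a 32-bit pointer to the file name, and it takes the eight bytes starting at the marked `<0f>` in the Code: line, reading the file-name pointer bytes as `8e 2a 27 c0`. It also interprets `eax` as a signed 32-bit return value, which is an inference from the `test eax,eax` pattern just before the trap and is not something the report states. The function names are made up for this example.

```python
import errno
import struct


def decode_bug_record(raw):
    """Split 8 bytes into (line, file_ptr).

    Assumed layout: ud2 (0f 0b), little-endian 16-bit line number,
    little-endian 32-bit pointer to the file name string.
    """
    opcode, line, file_ptr = struct.unpack("<2sHI", raw)
    if opcode != b"\x0f\x0b":
        raise ValueError("not a ud2 instruction")
    return line, file_ptr


def to_signed32(value):
    """Reinterpret an unsigned 32-bit register value as signed."""
    return value - (1 << 32) if value & 0x80000000 else value


if __name__ == "__main__":
    # Bytes from the Code: line, starting at the faulting <0f>.
    line, file_ptr = decode_bug_record(bytes.fromhex("0f0b32018e2a27c0"))
    print("line:", line, "file name pointer:", hex(file_ptr))

    # eax from the register dump, read as a negative errno.
    err = to_signed32(0xFFFFFFF4)
    print("eax:", err, errno.errorcode.get(-err, "unknown"))
```

The decoded line number is 306, which agrees with the "kernel BUG at arch/i386/mm/pgtable-xen.c:306!" message. If the signed reading of `eax` is right, the call made in pgd_ctor failed with an out-of-memory error, which would be consistent with the hypervisor running out of memory below 4GB for the guest's page tables.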