Bug 249867 - Kernel can BUG() in low memory conditions
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel-xen
Version: 4.5
Platform: All Linux
Priority: medium
Severity: medium
Assigned To: Chris Lalancette
QA Contact: Martin Jenner
Reported: 2007-07-27 11:30 EDT by Ian Campbell
Modified: 2009-05-18 15:27 EDT
CC: 5 users

Doc Type: Bug Fix
Last Closed: 2009-05-18 15:27:54 EDT

Attachments
xen-unstable 10353:bd1a0b2bb2d4 ported to linux-2.6.9-67.EL (983 bytes, patch)
2007-12-14 03:57 EST, Ian Campbell
xen-unstable 10361:2ac74e1df3d7 ported to 2.6.9-67.EL (8.44 KB, patch)
2007-12-14 03:58 EST, Ian Campbell
Combined patch, rebased against the latest RHEL-4 HEAD (8.18 KB, patch)
2008-02-24 11:57 EST, Chris Lalancette
New version of the patch, including batched hypercalls (10.33 KB, patch)
2008-02-24 13:09 EST, Chris Lalancette
Patch to fix the PV BUG in low memory condition (21.73 KB, patch)
2009-01-03 14:32 EST, Chris Lalancette

Description Ian Campbell 2007-07-27 11:30:00 EDT
We have observed this crash with 2.6.9-55.ELxenU:
<pre>
kernel BUG at arch/i386/mm/hypervisor.c:390!
invalid operand: 0000 [#1]
SMP
Modules linked in: md5 ipv6 autofs4 sunrpc ipt_REJECT ipt_state ip_conntrack
iptable_filter ip_tables loop xennet dm_snapshot dm_zero dm_mirror ext3 jbd
dm_mod xenblk sd_mod scsi_mod
CPU:    0
EIP:    0061:[<c0115453>]    Not tainted VLI
EFLAGS: 00010096   (2.6.9-55.ELxenU)
EIP is at xen_destroy_contiguous_region+0x232/0x2eb
eax: ffffffff   ebx: 00000006   ecx: c1aa6ef0   edx: 00000000
esi: 00000000   edi: ec8cd000   ebp: 0002c8cd   esp: c1aa6edc
ds: 007b   es: 007b   ss: 0068
Process events/0 (pid: 6, threadinfo=c1aa6000 task=c1ac5160)
Stack: 00000000 00000000 00000000 00000000 0002c8cd c1aa6eec 00000001 00000000
       00000000 00007ff0 00000001 c19fdd80 ec7f6000 c19fdd80 ec84b6c0 ec8cd000
       00000001 c0141150 ec8cd000 00000000 00000000 c19fde40 c19fdd80 ec84b6c0
Call Trace:
 [<c0141150>] slab_destroy+0x3c/0x8e
 [<c0142911>] cache_reap+0x14b/0x1aa
 [<c012a95f>] worker_thread+0x170/0x1de
 [<c01427c6>] cache_reap+0x0/0x1aa
 [<c0117461>] default_wake_function+0x0/0x12
 [<c0117461>] default_wake_function+0x0/0x12
 [<c012a7ef>] worker_thread+0x0/0x1de
 [<c012e683>] kthread+0x7c/0xa6
 [<c012e607>] kthread+0x0/0xa6
 [<c0105341>] kernel_thread_helper+0x5/0xb
Code: 7c 24 48 8b 44 24 48 bb 06 00 00 00 8d 4c 24 14 8b 54 24 0c 05 00 00 00
40 c1 e8 0c 8d 2c 10 89 6c 24 10 e8 30 bd fe ff 48 74 08 <0f> 0b 86 01 93 2d 27
c0 8b 44 24 10 31 f6 89 fb 8b 0d 2c 98 29
 <0>Fatal exception: panic in 5 seconds
Kernel panic - not syncing: Fatal exception
</pre>

This corresponds to the call to XENMEM_populate_physmap in
xen_destroy_contiguous_region(), which can unfortunately fail if the guest has
reached its allocation or some other memory allocation failure occurs. If this
call fails we BUG() because we cannot get the original memory back.

Upstream we have fixed this by introducing the XENMEM_memory_exchange hypercall
which gives you back the original allocation on failure. The upstream patch to
use this is http://xenbits.xensource.com/xen-unstable.hg?rev/10361.

The hypervisor end is http://xenbits.xensource.com/xen-unstable.hg?rev/10360
Comment 1 Ian Campbell 2007-12-14 03:57:29 EST
Created attachment 288721 [details]
xen-unstable 10353:bd1a0b2bb2d4 ported to linux-2.6.9-67.EL
Comment 2 Ian Campbell 2007-12-14 03:58:19 EST
Created attachment 288731 [details]
xen-unstable 10361:2ac74e1df3d7 ported to 2.6.9-67.EL
Comment 3 Ian Campbell 2007-12-14 03:59:29 EST
We recently stopped using the rhel4x.hg port from xenbits and switched to using
a set of targeted fixes against your kernels. I have attached the patches from
our queue relevant to this issue.
Comment 4 Don Dutile 2007-12-14 14:45:52 EST
Could you provide a test case that causes this failure?

The slightly scary part: the hypervisor-side changes pointed to by
http://xenbits.xensource.com/xen-unstable.hg?rev/10360
are not the same as in rhel5.
(a) shadow changes not in rhel5, but that's ok, shadow isn't used
(b) calls like 'guest_handle_add_offset()' are in
    the hg's memory_exchange() fcn, but not in rhel5's.

Can you confirm that rhel5's implementation of memory_exchange() is
sufficient to support this fix?
Comment 5 Ian Campbell 2007-12-17 07:07:02 EST
It looks as if your rhel5 hypervisor has
http://xenbits.xensource.com/xen-unstable.hg?rev/12360 in addition to 10360
which explains the differences (your basic hypervisor version seems to be based
on 15042).

The test case is to get host memory very low, for example by starting a second
domain, in addition to the domain under test, which uses all remaining host
memory, or by ballooning domain 0 to achieve the same (verify with "xm info"
-> free_memory). Once you are in this state, a few live migrations should be
enough to trigger the problem.
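The steps above can be sketched as a small shell helper. The domain and peer
host names in the usage line are placeholders (not from this report), and the
XM variable lets you dry-run the loop with XM=echo instead of the real xm tool:

```shell
# Reproducer sketch: with free_memory already near zero, live-migrate the
# domain under test repeatedly until the BUG() fires.
repro_migrate() {
    dom=$1; peer=$2; n=${3:-10}
    # Sanity-check how much free host memory is left before starting.
    ${XM:-xm} info | grep free_memory
    i=1
    while [ "$i" -le "$n" ]; do
        echo "migration $i of $n"
        ${XM:-xm} migrate --live "$dom" "$peer" || return 1
        i=$((i + 1))
    done
}
```

For example, `repro_migrate rhel4-guest peer-host 10` would attempt ten live
migrations; on an unpatched 2.6.9-55.ELxenU guest the crash should appear well
within that count.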
Comment 6 Bill Burns 2008-01-05 09:10:52 EST
Set dev ack for Chris Lalancette.
Comment 7 Chris Lalancette 2008-02-24 11:57:49 EST
Created attachment 295742 [details]
Combined patch, rebased against the latest RHEL-4 HEAD

This is just a combined patch for the two previous patches that Ian uploaded,
rebased against the current RHEL-4 CVS HEAD.  I'm still testing it.

Chris Lalancette
Comment 8 Chris Lalancette 2008-02-24 13:09:11 EST
Created attachment 295745 [details]
New version of the patch, including batched hypercalls

A new version of the patch against RHEL-4 CVS HEAD.  This version includes the
stuff from the previous rebased patch, plus has batched hypercalls, and changes
us from having separate arch/i386/mm/hypervisor.c and
arch/x86_64/mm/hypervisor.c to having a single one in i386 which the x86_64 one
links to.

Chris Lalancette
Comment 10 Joe Jin 2008-12-30 19:27:29 EST
I also hit this bug. After applying the patch, it looks like it works fine for me.
Will you include the patch in the next release kernel?
Comment 11 Chris Lalancette 2008-12-31 06:22:08 EST
Well, the thing was, I was never able to reproduce the bug myself, so we decided not to put the patches in unless/until we got a reproducer.  Do you have a reproducer I could use to prove that the patch makes a difference?

Chris Lalancette
Comment 12 Joe Jin 2009-01-02 18:18:17 EST
We have a reproducer, but it is complicated.

- Install 2 servers on Dell 2850 8 GB machines; dom0 has 512 MB of memory.
- On each, start a 5.4 GB guest, roughly 2 GB for the two guests.
- Start an el5 64-bit guest and an el4u7 32-bit guest.
- Migrate them with SSL about 5 to 10 times. You will get a crash every time,
  within 5 to 10 migrations, most likely before 5.
- With the patched kernel, I migrated 50 times and did not see any crash.

BTW: our hypervisor version is 3.1.4 for x86_64, Domain0 is 32-bit, a
hypervisor-based Oracle VM server.

If you want to reproduce, we can help you.

Thanks,
Joe
Comment 13 Chris Lalancette 2009-01-03 03:41:12 EST
Ah, OK, great.  Actually, it's not strictly necessary for me to reproduce it; just the fact that we have a reporter who can reproduce and confirm the fix should be sufficient to get it into the tree.  I'll work on getting this into our RHEL-4 tree; once I have some test packages, I'll pass them over to you for testing.  Thanks for the information!

Chris Lalancette
Comment 14 Chris Lalancette 2009-01-03 14:27:46 EST
OK.  I've cleaned up the patch a bit (I'll attach it), and done some very basic testing that seems to work OK.  I've uploaded the test kernels to http://people.redhat.com/clalance/bz249867.  Could you download these and give them a whirl to make sure that they still fix your problem?

Thanks,
Chris Lalancette
Comment 15 Chris Lalancette 2009-01-03 14:32:35 EST
Created attachment 328115 [details]
Patch to fix the PV BUG in low memory condition
Comment 16 Deepak Patel 2009-01-05 19:39:01 EST
I have tested the same test case with which I was able to reproduce this bug. I did not hit the same issue with the patched kernel provided by Red Hat in the comment above.

I can reliably reproduce the crash on the 2.6.9-78.0.5.0.1.ELxenU kernel within 30 minutes or so.

With the patched kernel 2.6.9-78.23.ELmemex5xenU I have not been able to reproduce the crash after a day and a half or so (must have done a couple of hundred migrations back and forth). The test is still running without any crash. This patch fixes the issue described in this bug.
Comment 17 Joe Jin 2009-01-05 20:02:08 EST
Deepak, thanks for testing!
Chris, will you include the patch in the next release?
Comment 18 Chris Lalancette 2009-01-06 02:38:00 EST
Deepak, Joe,
    Excellent, thanks for all of the testing.  That's exactly what we needed.  Assuming there are no regressions found in internal QA, this patch should go into the next release.

Chris Lalancette
Comment 19 Vivek Goyal 2009-01-09 08:54:50 EST
Committed in 78.26.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/
Comment 21 Jan Tluka 2009-05-05 11:52:51 EDT
Patch is in -89.EL kernel.
Comment 23 errata-xmlrpc 2009-05-18 15:27:54 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html
