Red Hat Bugzilla – Bug 249867
Kernel can BUG() in low memory conditions
Last modified: 2009-05-18 15:27:54 EDT
We have observed this crash with 2.6.9-55.ELxenU:
kernel BUG at arch/i386/mm/hypervisor.c:390!
invalid operand: 0000 [#1]
Modules linked in: md5 ipv6 autofs4 sunrpc ipt_REJECT ipt_state ip_conntrack
iptable_filter ip_tables loop xennet dm_snapshot dm_zero dm_mirror ext3 jbd
dm_mod xenblk sd_mod scsi_mod
EIP: 0061:[<c0115453>] Not tainted VLI
EFLAGS: 00010096 (2.6.9-55.ELxenU)
EIP is at xen_destroy_contiguous_region+0x232/0x2eb
eax: ffffffff ebx: 00000006 ecx: c1aa6ef0 edx: 00000000
esi: 00000000 edi: ec8cd000 ebp: 0002c8cd esp: c1aa6edc
ds: 007b es: 007b ss: 0068
Process events/0 (pid: 6, threadinfo=c1aa6000 task=c1ac5160)
Stack: 00000000 00000000 00000000 00000000 0002c8cd c1aa6eec 00000001 00000000
00000000 00007ff0 00000001 c19fdd80 ec7f6000 c19fdd80 ec84b6c0 ec8cd000
00000001 c0141150 ec8cd000 00000000 00000000 c19fde40 c19fdd80 ec84b6c0
Code: 7c 24 48 8b 44 24 48 bb 06 00 00 00 8d 4c 24 14 8b 54 24 0c 05 00 00 00
40 c1 e8 0c 8d 2c 10 89 6c 24 10 e8 30 bd fe ff 48 74 08 <0f> 0b 86 01 93 2d 27
c0 8b 44 24 10 31 f6 89 fb 8b 0d 2c 98 29
<0>Fatal exception: panic in 5 seconds
Kernel panic - not syncing: Fatal exception
This corresponds to the call to XENMEM_populate_physmap in
xen_destroy_contiguous_region(), which unfortunately can fail if the guest has
reached its allocation limit or some other failure to allocate memory occurs. If
this call fails we BUG because we cannot get the original memory back.
Upstream we have fixed this by introducing the XENMEM_memory_exchange hypercall
which gives you back the original allocation on failure. The upstream patch to
use this is http://xenbits.xensource.com/xen-unstable.hg?rev/10361.
The hypervisor end is http://xenbits.xensource.com/xen-unstable.hg?rev/10360
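To make the difference between the two approaches concrete, here is a toy userspace model. It is only a sketch: every name in it (toy_host, toy_guest, toy_release, toy_populate, toy_destroy_two_step, toy_exchange) is hypothetical, it counts pages instead of tracking real frames, and it is not the code from either changeset.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: all names are hypothetical, not kernel or Xen APIs. */
struct toy_host  { unsigned long free_pages; }; /* pages the host can hand out */
struct toy_guest { unsigned long pages; };      /* pages the guest owns */

/* The guest hands n pages back to the host. */
static void toy_release(struct toy_host *h, struct toy_guest *g, unsigned long n)
{
    assert(g->pages >= n);
    g->pages -= n;
    h->free_pages += n;
}

/* Stand-in for XENMEM_populate_physmap: fails when the host is short. */
static bool toy_populate(struct toy_host *h, struct toy_guest *g, unsigned long n)
{
    if (h->free_pages < n)
        return false;
    h->free_pages -= n;
    g->pages += n;
    return true;
}

/*
 * Old path: release the region, then repopulate it. 'stolen' models pages
 * that another domain grabs in between. A false return is where the real
 * code hits BUG(), because the guest is now n pages short.
 */
static bool toy_destroy_two_step(struct toy_host *h, struct toy_guest *g,
                                 unsigned long n, unsigned long stolen)
{
    toy_release(h, g, n);
    h->free_pages -= (stolen < h->free_pages) ? stolen : h->free_pages;
    return toy_populate(h, g, n);
}

/*
 * Exchange-style path: the originals are only given up once replacements
 * have been allocated; on failure the guest gets its originals back.
 */
static bool toy_exchange(struct toy_host *h, struct toy_guest *g, unsigned long n)
{
    assert(g->pages >= n);
    g->pages -= n;                 /* input pages are taken from the guest */
    if (!toy_populate(h, g, n)) {  /* try to allocate the replacements */
        g->pages += n;             /* failure: hand the originals back */
        return false;
    }
    h->free_pages += n;            /* success: originals go to the host */
    return true;
}
```

In this model, a 16-page guest releasing 8 pages on a host with 4 free pages, while another domain takes 12 in between, ends up with 8 pages on the two-step path. On the exchange path with no free host pages the call fails, but the guest still has all 16.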
Created attachment 288721 [details]
xen-unstable 10353:bd1a0b2bb2d4 ported to linux-2.6.9-67.EL
Created attachment 288731 [details]
xen-unstable 10361:2ac74e1df3d7 ported to 2.6.9-67.EL
We recently stopped using the rhel4x.hg port from xenbits and switched to using
a set of targeted fixes to your kernels. I have attached the patches from our
queue relevant to this issue.
Could you provide a test case that causes this failure?
The slightly scary part: the hypervisor end pt'd to by
are not the same as in rhel5.
(a) shadow changes not in rhel5, but that's ok, shadow isn't used
(b) calls like 'guest_handle_add_offset()' are in
the hg's memory_exchange() function, but not in rhel5's.
Can you confirm that rhel5's implementation of memory_exchange() is
sufficient to support this fix?
It looks as if your rhel5 hypervisor has
http://xenbits.xensource.com/xen-unstable.hg?rev/12360 in addition to 10360
which explains the differences (your basic hypervisor version seems to be based
The test case is to ensure that host memory is very low, for example by starting
a second domain in addition to the domain under test which uses all remaining
host memory or ballooning domain 0 to cause this to happen (verified with "xm
info" -> free_memory). Once you are in this state a few live migrations should
be enough to trigger the problem.
Setdev ack for Chris Lalancette.
Created attachment 295742 [details]
Combined patch, rebased against the latest RHEL-4 HEAD
This is just a combined patch for the two previous patches that Ian uploaded,
rebased against the current RHEL-4 CVS HEAD. I'm still testing it.
Created attachment 295745 [details]
New version of the patch, including batched hypercalls
A new version of the patch against RHEL-4 CVS HEAD. This version includes the
stuff from the previous rebased patch, plus has batched hypercalls, and changes
us from having separate arch/i386/mm/hypervisor.c and
arch/x86_64/mm/hypervisor.c to having a single one in i386 which the x86_64 one
I also hit this bug. After applying the patch, it looks like it works fine for me.
Will the patch be included in the next kernel release?
Well, the thing was, I was never able to reproduce the bug myself, so we decided not to put the patches in unless/until we got a reproducer. Do you have a reproducer I could use to prove that the patch makes a difference?
We have a reproducer, but it is complicated.
- Install 2 servers on Dell 2850 8 GB machine. dom0 has 512 MB memory
- On each start 5.4 GB guest, roughly 2 GB for two guest.
- Start an el5 64-bit guest and an el4u7 32-bit guest.
- Migrate them over SSL about 5 to 10 times. You will get a crash every time, within
5 to 10 migrations. Most likely before 5 migrations.
- With patched kernel, I migrated 50 times and did not see any crash.
BTW: our hypervisor version is 3.1.4 for x86_64, Domain0 is 32-bit, and the hypervisor is based
on Oracle VM Server.
If you want to reproduce, we can help you.
Ah, OK, great. Actually, it's not strictly necessary for me to reproduce it; just the fact that we have a reporter who can reproduce and confirm the fix should be sufficient to get it into the tree. I'll work on getting this into our RHEL-4 tree; once I have some test packages, I'll pass them over to you for testing. Thanks for the information!
OK. I've cleaned up the patch a bit (I'll attach it), and done some very basic testing that seems to work OK. I've uploaded the test kernels to http://people.redhat.com/clalance/bz249867. Could you download these and give them a whirl to make sure that they still fix your problem?
Created attachment 328115 [details]
Patch to fix the PV BUG in low memory condition
I have run the same test case with which I was able to reproduce this bug. I did not hit the issue with the patched kernel provided by Red Hat in the comment above.
I can successfully reproduce the same crash on 2.6.9-18.104.22.168.1.ELxenU kernel within 30 minutes or so.
With the patched kernel 2.6.9-78.23.ELmemex5xenU I have not been able to reproduce this crash after a day and a half or so (I must have done a couple of hundred migrations back and forth). The test is still running without any crash. This patch fixes the issue described in this bug.
Deepak, thanks for testing!
Chris, will you include the patch in the next release?
Excellent, thanks for all of the testing. That's exactly what we needed. Assuming there are no regressions found in internal QA, this patch should go into the next release.
Committed in 78.26.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/
Patch is in -89.EL kernel.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.