We have observed this crash with 2.6.9-55.ELxenU:
<pre>
kernel BUG at arch/i386/mm/hypervisor.c:390!
invalid operand: 0000 [#1]
SMP
Modules linked in: md5 ipv6 autofs4 sunrpc ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables loop xennet dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod xenblk sd_mod scsi_mod
CPU:    0
EIP:    0061:[<c0115453>]    Not tainted VLI
EFLAGS: 00010096   (2.6.9-55.ELxenU)
EIP is at xen_destroy_contiguous_region+0x232/0x2eb
eax: ffffffff   ebx: 00000006   ecx: c1aa6ef0   edx: 00000000
esi: 00000000   edi: ec8cd000   ebp: 0002c8cd   esp: c1aa6edc
ds: 007b   es: 007b   ss: 0068
Process events/0 (pid: 6, threadinfo=c1aa6000 task=c1ac5160)
Stack: 00000000 00000000 00000000 00000000 0002c8cd c1aa6eec 00000001 00000000
       00000000 00007ff0 00000001 c19fdd80 ec7f6000 c19fdd80 ec84b6c0 ec8cd000
       00000001 c0141150 ec8cd000 00000000 00000000 c19fde40 c19fdd80 ec84b6c0
Call Trace:
 [<c0141150>] slab_destroy+0x3c/0x8e
 [<c0142911>] cache_reap+0x14b/0x1aa
 [<c012a95f>] worker_thread+0x170/0x1de
 [<c01427c6>] cache_reap+0x0/0x1aa
 [<c0117461>] default_wake_function+0x0/0x12
 [<c0117461>] default_wake_function+0x0/0x12
 [<c012a7ef>] worker_thread+0x0/0x1de
 [<c012e683>] kthread+0x7c/0xa6
 [<c012e607>] kthread+0x0/0xa6
 [<c0105341>] kernel_thread_helper+0x5/0xb
Code: 7c 24 48 8b 44 24 48 bb 06 00 00 00 8d 4c 24 14 8b 54 24 0c 05 00 00 00 40 c1 e8 0c 8d 2c 10 89 6c 24 10 e8 30 bd fe ff 48 74 08 <0f> 0b 86 01 93 2d 27 c0 8b 44 24 10 31 f6 89 fb 8b 0d 2c 98 29
<0>Fatal exception: panic in 5 seconds
Kernel panic - not syncing: Fatal exception
</pre>
This corresponds to the call to XENMEM_populate_physmap in xen_destroy_contiguous_region(), which unfortunately can fail if the guest has reached its allocation or some other memory allocation failure occurs. If this call fails we BUG() because we cannot get the original memory back. Upstream we have fixed this by introducing the XENMEM_memory_exchange hypercall, which hands you back the original allocation on failure. The upstream patch to use this is http://xenbits.xensource.com/xen-unstable.hg?rev/10361. The hypervisor end is http://xenbits.xensource.com/xen-unstable.hg?rev/10360
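To illustrate the fix: the exchange-based path (XENMEM_exchange in the public headers) trades the contiguous extent for individual frames in a single hypercall, so on failure the hypervisor hands the input extent back and the guest never loses memory it cannot recover. Here is a minimal sketch of the guest side, assuming the interface added in changeset 10360; the helper name and the out_frames buffer are illustrative, and header locations vary between trees:
<pre>
#include <xen/interface/memory.h>   /* struct xen_memory_exchange, XENMEM_exchange */
#include <asm/hypervisor.h>         /* HYPERVISOR_memory_op(), set_xen_guest_handle() */

/*
 * Trade one order-N contiguous machine extent for (1 << order)
 * order-0 frames.  Unlike the decrease_reservation/populate_physmap
 * pair, this is atomic from the guest's point of view: if the
 * hypervisor cannot complete the exchange, the input extent is
 * returned intact, so there is nothing left to BUG() about.
 */
static int exchange_contig_region(unsigned long in_frame,
                                  unsigned long *out_frames,
                                  unsigned int order)
{
        struct xen_memory_exchange exchange = {
                .in = {
                        .nr_extents   = 1,
                        .extent_order = order,
                        .domid        = DOMID_SELF
                },
                .out = {
                        .nr_extents   = 1UL << order,
                        .extent_order = 0,
                        .domid        = DOMID_SELF
                }
        };

        set_xen_guest_handle(exchange.in.extent_start, &in_frame);
        set_xen_guest_handle(exchange.out.extent_start, out_frames);

        /* 0 on success; exchange.nr_exchanged reports partial progress. */
        return HYPERVISOR_memory_op(XENMEM_exchange, &exchange);
}
</pre>
On hypervisors that predate the new call this returns -ENOSYS, so the guest code still needs the old two-step sequence as a fallback.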
Created attachment 288721 [details] xen-unstable 10353:bd1a0b2bb2d4 ported to linux-2.6.9-67.EL
Created attachment 288731 [details] xen-unstable 10361:2ac74e1df3d7 ported to 2.6.9-67.EL
We recently stopped using the rhel4x.hg port from xenbits and switched to a set of targeted fixes against your kernels. I have attached the patches from our queue that are relevant to this issue.
Could you provide a test case that causes this failure? The slightly scary part: the hypervisor end pointed to by http://xenbits.xensource.com/xen-unstable.hg?rev/10360 is not the same as in rhel5. (a) There are shadow changes not in rhel5, but that's OK, shadow isn't used. (b) Calls like guest_handle_add_offset() are in the hg tree's memory_exchange() function, but not in rhel5's. Can you confirm that rhel5's implementation of memory_exchange() is sufficient to support this fix?
It looks as if your rhel5 hypervisor has http://xenbits.xensource.com/xen-unstable.hg?rev/12360 in addition to 10360, which explains the differences (your base hypervisor seems to be 15042). The test case is to ensure that host memory is very low, for example by starting a second domain, in addition to the domain under test, that uses all remaining host memory, or by ballooning domain 0 until this is the case (verified with "xm info" -> free_memory). Once you are in this state, a few live migrations should be enough to trigger the problem.
Set dev ack for Chris Lalancette.
Created attachment 295742 [details] Combined patch, rebased against the latest RHEL-4 HEAD This is just a combined version of the two patches that Ian uploaded previously, rebased against the current RHEL-4 CVS HEAD. I'm still testing it. Chris Lalancette
Created attachment 295745 [details] New version of the patch, including batched hypercalls A new version of the patch against RHEL-4 CVS HEAD. This version includes everything from the previous rebased patch, adds batched hypercalls (a sketch of the batching pattern is below), and replaces the separate arch/i386/mm/hypervisor.c and arch/x86_64/mm/hypervisor.c with a single copy in i386 to which the x86_64 build links. Chris Lalancette
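For reference, the batching works by queueing page-table updates as multicall entries and submitting them with one trap into the hypervisor. A minimal sketch of the pattern, assuming the standard Xen multicall interface; remap_batch and the array bound are illustrative and not taken from the patch:
<pre>
#include <linux/kernel.h>           /* ARRAY_SIZE() */
#include <xen/interface/xen.h>      /* multicall_entry_t, UVMF_* */
#include <asm/hypervisor.h>         /* MULTI_update_va_mapping(), HYPERVISOR_multicall() */

/*
 * Remap a run of virtual pages onto new machine frames with a single
 * hypercall.  MULTI_update_va_mapping() only fills in a
 * multicall_entry_t; nothing reaches the hypervisor until
 * HYPERVISOR_multicall(), so N remaps cost one trap instead of N.
 */
static void remap_batch(unsigned long vstart, unsigned long *mfns,
                        unsigned int nr_pages)
{
        multicall_entry_t mcl[64];      /* illustrative bound */
        unsigned int i;

        BUG_ON(nr_pages == 0 || nr_pages > ARRAY_SIZE(mcl));

        for (i = 0; i < nr_pages; i++)
                MULTI_update_va_mapping(mcl + i,
                                        vstart + (i << PAGE_SHIFT),
                                        pfn_pte_ma(mfns[i], PAGE_KERNEL),
                                        0);

        /* Fold the TLB flush into the final entry of the batch. */
        mcl[nr_pages - 1].args[MULTI_UVMFLAGS_INDEX] = UVMF_TLB_FLUSH | UVMF_ALL;

        if (HYPERVISOR_multicall(mcl, nr_pages) != 0)
                BUG();
}
</pre>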
I also hit this bug. After applying the patch, it looks like it works fine for me. Will the patch be included in the next release kernel?
Well, the thing was, I was never able to reproduce the bug myself, so we decided not to put the patches in unless/until we got a reproducer. Do you have a reproducer I could use to prove that the patch makes a difference? Chris Lalancette
We have a reproducer, but it is complicated:
- Install 2 servers on Dell 2850 machines with 8 GB of memory; dom0 has 512 MB of memory.
- On each, start a 5.4 GB guest, leaving roughly 2 GB for two more guests.
- Start an el5 64-bit guest and an el4u7 32-bit guest.
- Migrate them with SSL about 5 to 10 times. You will get the crash every time, within 5 to 10 migrations, most likely before 5.
- With the patched kernel, I migrated 50 times and did not see any crash.
BTW: our hypervisor's version is 3.1.4 for x86_64, Domain0 is 32-bit, and the hypervisor is based on Oracle VM Server. If you want to reproduce this, we can help you. Thanks, Joe
Ah, OK, great. Actually, it's not strictly necessary for me to reproduce it; just the fact that we have a reporter who can reproduce and confirm the fix should be sufficient to get it into the tree. I'll work on getting this into our RHEL-4 tree; once I have some test packages, I'll pass them over to you for testing. Thanks for the information! Chris Lalancette
OK. I've cleaned up the patch a bit (I'll attach it), and done some very basic testing that seems to work OK. I've uploaded the test kernels to http://people.redhat.com/clalance/bz249867. Could you download these and give them a whirl to make sure that they still fix your problem? Thanks, Chris Lalancette
Created attachment 328115 [details] Patch to fix the PV BUG in low memory condition
I have run the same test case with which I was able to reproduce this bug, and did not hit the issue with the patched kernel provided by Red Hat in the above comment. I can reliably reproduce the crash on the 2.6.9-78.0.5.0.1.ELxenU kernel within 30 minutes or so. With the patched kernel 2.6.9-78.23.ELmemex5xenU I have not been able to reproduce the crash after a day and a half or so (must have done a couple of hundred migrations back and forth). The test is still running without any crash. This patch fixes the issue mentioned in this bug.
Deepak, thanks for testing! Chris, will you include the patch in the next release?
Deepak, Joe, Excellent, thanks for all of the testing. That's exactly what we needed. Assuming there are no regressions found in internal QA, this patch should go into the next release. Chris Lalancette
Committed in 78.26.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/
Patch is in -89.EL kernel.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1024.html