Bug 249867
Summary: | Kernel can BUG() in low memory conditions | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 4 | Reporter: | Ian Campbell <ijc> |
Component: | kernel-xen | Assignee: | Chris Lalancette <clalance> |
Status: | CLOSED ERRATA | QA Contact: | Martin Jenner <mjenner> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 4.5 | CC: | deepak.patel, joe.jin, jtluka, qcai, xen-maint |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2009-05-18 19:27:54 UTC | Type: | --- |
Description
Ian Campbell
2007-07-27 15:30:00 UTC
Created attachment 288721 [details]
xen-unstable 10353:bd1a0b2bb2d4 ported to linux-2.6.9-67.EL
Created attachment 288731 [details]
xen-unstable 10361:2ac74e1df3d7 ported to 2.6.9-67.EL
We recently stopped using the rhel4x.hg port from xenbits and switched to using a set of targeted fixes to your kernels. I have attached the patches from our queue relevant to this issue.
Could you provide a test case that causes this failure? The slightly scary part: the hypervisor side of the change, pointed to by http://xenbits.xensource.com/xen-unstable.hg?rev/10360, is not the same as in rhel5.
(a) shadow changes are not in rhel5, but that's ok, shadow isn't used
(b) calls like 'guest_handle_add_offset()' are in the hg's memory_exchange() function, but not in rhel5's.
Can you confirm that rhel5's implementation of memory_exchange() is sufficient to support this fix?
It looks as if your rhel5 hypervisor has http://xenbits.xensource.com/xen-unstable.hg?rev/12360 in addition to 10360, which explains the differences (your basic hypervisor version seems to be based on 15042).
The test case is to ensure that host memory is very low, for example by starting a second domain, in addition to the domain under test, which uses all remaining host memory, or by ballooning domain 0 to cause this to happen (verified with "xm info" -> free_memory). Once you are in this state, a few live migrations should be enough to trigger the problem.
Setdev ack for Chris Lalancette.
Created attachment 295742 [details]
Combined patch, rebased against the latest RHEL-4 HEAD
This is just a combined patch for the two previous patches that Ian uploaded,
rebased against the current RHEL-4 CVS HEAD. I'm still testing it.
Chris Lalancette
Created attachment 295745 [details]
New version of the patch, including batched hypercalls
A new version of the patch against RHEL-4 CVS HEAD. This version includes the
stuff from the previous rebased patch, plus has batched hypercalls, and changes
us from having separate arch/i386/mm/hypervisor.c and
arch/x86_64/mm/hypervisor.c to having a single one in i386 which the x86_64 one
links to.
Chris Lalancette
I also hit this bug. After applying the patch, it looks to work fine for me. Will the patch be included in the next kernel release?
Well, the thing was, I was never able to reproduce the bug myself, so we decided not to put the patches in unless/until we got a reproducer. Do you have a reproducer I could use to prove that the patch makes a difference?
Chris Lalancette
We have a reproducer, but it is complicated.
- Install 2 servers on Dell 2850 8 GB machines. dom0 has 512 MB memory.
- On each, start 5.4 GB guest, roughly 2 GB for two guests.
- Start an el5 64-bit guest and an el4u7 32-bit guest.
- Migrate them with ssl about 5 to 10 times. You will get a crash every time, within 5 to 10 migrations, most likely before 5 migrations.
- With the patched kernel, I migrated 50 times and did not see any crash.
BTW: our hypervisor version is 3.1.4 for x86_64, Domain0 is 32-bit, and the hypervisor is based on Oracle VM Server. If you want to reproduce, we can help you.
Thanks,
Joe
Ah, OK, great. Actually, it's not strictly necessary for me to reproduce it; just the fact that we have a reporter who can reproduce and confirm the fix should be sufficient to get it into the tree. I'll work on getting this into our RHEL-4 tree; once I have some test packages, I'll pass them over to you for testing. Thanks for the information!
Chris Lalancette
OK. I've cleaned up the patch a bit (I'll attach it), and done some very basic testing that seems to work OK. I've uploaded the test kernels to http://people.redhat.com/clalance/bz249867. Could you download these and give them a whirl to make sure that they still fix your problem?
Thanks,
Chris Lalancette
Created attachment 328115 [details]
Patch to fix the PV BUG in low memory condition
I have tested the same test case where I was able to reproduce this bug. I did not hit the same issue with the patched kernel provided by Red Hat in the above comment. I can successfully reproduce the same crash on the 2.6.9-78.0.5.0.1.ELxenU kernel within 30 minutes or so. With the patched kernel 2.6.9-78.23.ELmemex5xenU I am not able to reproduce this crash after a day and a half or so (must have done a couple of hundred migrations back and forth). The test is still going on without any crash. This patch fixes the issue mentioned in this bug.
Deepak, thanks for testing! Chris, will you include the patch in the next release?
Deepak, Joe,
Excellent, thanks for all of the testing. That's exactly what we needed. Assuming there are no regressions found in internal QA, this patch should go into the next release.
Chris Lalancette
Committed in 78.26.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/
Patch is in -89.EL kernel.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2009-1024.html
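For anyone wanting to script the reproduction described in this thread (drive host free memory very low as reported by "xm info", then live-migrate the guest back and forth), the following is a minimal sketch only. The host names, the domain name, and the 64 MB threshold are hypothetical placeholders, not values from this bug. The sketch only parses "xm info" text and builds "xm migrate --live" command lines; it does not run them, and it does not set up the ssl relocation mentioned in the reproducer.

```python
import re


def parse_free_memory_mb(xm_info_output):
    """Return the free_memory value (MB) found in `xm info` text, or None."""
    match = re.search(r"^free_memory\s*:\s*(\d+)\s*$", xm_info_output, re.MULTILINE)
    return int(match.group(1)) if match else None


def host_is_low_on_memory(xm_info_output, threshold_mb=64):
    """True if free_memory is at or below threshold_mb.

    The 64 MB default is an assumption; the thread only says "very low".
    """
    free_mb = parse_free_memory_mb(xm_info_output)
    return free_mb is not None and free_mb <= threshold_mb


def migration_commands(domain, hosts, round_trips):
    """Build (run_on_host, argv) pairs that bounce a domain between two hosts."""
    first, second = hosts
    commands = []
    for _ in range(round_trips):
        # Each command is meant to be run on the host that currently owns the domain.
        commands.append((first, ["xm", "migrate", "--live", domain, second]))
        commands.append((second, ["xm", "migrate", "--live", domain, first]))
    return commands


if __name__ == "__main__":
    # Hypothetical sample text; real input would come from running `xm info`.
    sample = "total_memory           : 8191\nfree_memory            : 12\n"
    if host_is_low_on_memory(sample):
        for run_on, argv in migration_commands("guest1", ("hostA", "hostB"), 5):
            print(run_on + ": " + " ".join(argv))
```

Five round trips gives ten migrations, which matches the upper end of the "5 to 10 migrations" range in which the reporters saw the unpatched kernel crash; the patched kernel was reported to survive 50 or more.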