Bug 425471 - [RHEL5.2]: Under load, an i386 PV guest on i386 HV will hang during save/restore
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.2
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Target Release: ---
Assignee: Don Dutile (Red Hat)
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks: 431081
Reported: 2007-12-14 21:24 UTC by Chris Lalancette
Modified: 2008-05-21 15:04 UTC

Fixed In Version: RHBA-2008-0314
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-05-21 15:04:20 UTC
Target Upstream Version:
Embargoed:


Attachments
Full stack trace when the crash occurs (3.68 KB, text/plain)
2007-12-14 21:24 UTC, Chris Lalancette
Another similar stack trace (6.79 KB, text/plain)
2008-01-18 16:37 UTC, Ian Campbell
vcpu-init stack overflow patch (609 bytes, text/plain)
2008-01-30 20:46 UTC, Don Dutile (Red Hat)
Updated/new patch based on LKML version; note: used kmalloc & memset so it works on older linux versions. (4.14 KB, text/plain)
2008-01-30 22:09 UTC, Don Dutile (Red Hat)
oops... last patch had incorrect memset; caught it before starting overnight test..... nothing like exchanging one stack overflow for a stack trashing! ;-) (4.17 KB, text/x-patch)
2008-01-30 22:25 UTC, Don Dutile (Red Hat)
a fix to the last fix! (4.24 KB, text/plain)
2008-01-31 15:05 UTC, Don Dutile (Red Hat)


Links
System: Red Hat Product Errata
ID: RHBA-2008:0314
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Updated kernel packages for Red Hat Enterprise Linux 5.2
Last Updated: 2008-05-20 18:43:34 UTC

Description Chris Lalancette 2007-12-14 21:24:26 UTC
Description of problem:
My test environment is the following:

dom0: RHEL5.2 i686 kernel, 2.6.18-59.el5xen, 8GB memory, 4 physical CPUs, AMD Rev. F
domU: RHEL5.2 i686 kernel, 2.6.18-59.el5xen, 1GB memory, 4 vCPUs

Inside the domU, I run a kernel compile in a loop with "make -j5".
On the dom0, I run
"xm save <dom> /var/lib/xen/save/<dom>-save ; xm restore /var/lib/xen/save/<dom>-save"
in a loop.

After about 100 iterations of the above save/restore loop, the save will hang,
with the following messages in the domU console:

do_IRQ: stack overflow: 488
 [<c0406e3c>] <1>BUG: unable to handle kernel paging request at virtual address 35416001

(full stack trace is attached).  Note that with the same load, I do not see this
problem on x86_64.

Comment 1 Chris Lalancette 2007-12-14 21:24:26 UTC
Created attachment 289491
Full stack trace when the crash occurs

Comment 2 Ian Campbell 2008-01-18 16:36:31 UTC
I think we might be seeing something similar on 2.6.18-53.1.4.el5 (plus fixes for
the bugs which we have reported against 5.1) when doing migration under load.

Unfortunately, either the dump_trace() code is buggy or our attempt to call it
really does overflow the stack, so our backtrace is less useful than yours. I will
attach it anyway.

Comment 3 Ian Campbell 2008-01-18 16:37:30 UTC
Created attachment 292173
Another similar stack trace

Comment 4 Don Dutile (Red Hat) 2008-01-30 20:46:05 UTC
Created attachment 293483
vcpu-init stack overflow patch

Comment 5 Don Dutile (Red Hat) 2008-01-30 20:47:57 UTC
Ian,

Can you give the patch in comment #4 a try in your test scenario?
It seems to work for our test case (so far; the test loop has exceeded 140
iterations; I will let it continue to run overnight...)

- Don

Comment 6 Chris Lalancette 2008-01-30 21:35:38 UTC
The patch attached here looks like it will work, because __cpu_up() is
serialized in the common kernel/cpu.c code.  That being said, it looks like
upstream LKML xen went a different way, and just used some kzalloc() memory in
arch/x86/xen/smp.c (their equivalent to drivers/xen/core/smpboot.c).  There's no
real changeset to point to; it was merged into Linus's tree that way.  Anyway,
just thought I'd point this out, since this way has some additional upstream
testing behind it.

Chris Lalancette
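
A minimal userspace sketch of the two allocation styles discussed in this comment and the next one, assuming calloc() as a stand-in for kzalloc() and malloc() plus memset() as a stand-in for kmalloc() plus memset(). The struct demo_vcpu_ctxt and the demo_* helpers are hypothetical; the layout and sizes are invented for illustration and are not taken from the Xen headers or from the attached patches.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a large per-vCPU context. The field names
 * and sizes are assumptions made for this sketch only. */
struct demo_vcpu_ctxt {
    unsigned char fpu_ctxt[512];
    unsigned long trap_ctxt[512];
    unsigned long flags;
};

/* One-call zeroed allocation; userspace analogue of kzalloc(). */
struct demo_vcpu_ctxt *demo_ctxt_alloc_zeroed(void)
{
    return calloc(1, sizeof(struct demo_vcpu_ctxt));
}

/* Two-step form; userspace analogue of kmalloc() followed by memset(). */
struct demo_vcpu_ctxt *demo_ctxt_alloc_two_step(void)
{
    struct demo_vcpu_ctxt *ctxt = malloc(sizeof(*ctxt));

    if (ctxt == NULL)
        return NULL;
    /* Clear the whole structure: sizeof(*ctxt), not sizeof(ctxt). */
    memset(ctxt, 0, sizeof(*ctxt));
    return ctxt;
}

/* Returns 1 if every byte of the context is zero, 0 otherwise. */
int demo_ctxt_is_zero(const struct demo_vcpu_ctxt *ctxt)
{
    const unsigned char *p = (const unsigned char *)ctxt;
    size_t i;

    for (i = 0; i < sizeof(*ctxt); i++)
        if (p[i] != 0)
            return 0;
    return 1;
}
```

Either form keeps the large structure off the caller's stack; the two-step form only matters when the one-call allocator is not available.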

Comment 7 Don Dutile (Red Hat) 2008-01-30 22:09:11 UTC
Created attachment 293499
Updated/new patch based on LKML version; note: used kmalloc & memset so it works on older linux versions.

Comment 8 Don Dutile (Red Hat) 2008-01-30 22:25:51 UTC
Created attachment 293504
oops... last patch had incorrect memset;  caught it before starting overnight test..... nothing like exchanging one stack overflow for a stack trashing! ;-)

Comment 9 Ian Campbell 2008-01-31 14:20:07 UTC
Latest patch crashes on boot for me.

I thought at first it was because:
  BUG_ON(HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, &ctxt));
should be
  BUG_ON(HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, ctxt));

But changing that didn't fix it.

Also, I think the "memset(&ctxt->fpu_ctxt, 0, sizeof(ctxt->fpu_ctxt));" is
unnecessary, since this will be cleared as part of the memset over the entire ctxt.
Shouldn't hurt to do it though.
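
The two points above can be illustrated with a small userspace sketch. It assumes that ctxt has become a pointer to heap memory in the patch under discussion; struct demo_ctxt, demo_op() and the other demo_* names are hypothetical stand-ins, not the real Xen interface.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical context structure; sizes are invented for this sketch. */
struct demo_ctxt {
    unsigned char fpu_ctxt[64];
    unsigned long flags;
};

/* Hypothetical stand-in for a call that takes an untyped pointer.
 * Returns 0 only if it was handed the address of the context itself. */
int demo_op(const struct demo_ctxt *expected, const void *arg)
{
    return arg == (const void *)expected ? 0 : -1;
}

/* Correct once ctxt is a pointer: pass the pointer value. */
int demo_pass_pointer(struct demo_ctxt *ctxt)
{
    return demo_op(ctxt, ctxt);
}

/* Wrong once ctxt is a pointer: &ctxt is the address of the pointer
 * variable, not of the context. It still compiles because the
 * parameter is untyped. */
int demo_pass_address_of_pointer(struct demo_ctxt *ctxt)
{
    return demo_op(ctxt, &ctxt);
}

/* Returns 1 if a single memset over the whole structure also clears
 * the fpu_ctxt member, which makes a second memset redundant. */
int demo_fpu_clear_after_full_memset(void)
{
    struct demo_ctxt local;
    size_t i;

    memset(&local, 0xff, sizeof(local)); /* dirty every byte first */
    memset(&local, 0, sizeof(local));    /* one memset over the whole struct */

    for (i = 0; i < sizeof(local.fpu_ctxt); i++)
        if (local.fpu_ctxt[i] != 0)
            return 0;
    return 1;
}
```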

Comment 10 Don Dutile (Red Hat) 2008-01-31 15:05:50 UTC
Created attachment 293587
a fix to the last fix!

Ian,

Yes, the hypervisor call bugs out... and this piece of code isn't much
better either:
	kmalloc(sizeof(*ctxt), GFP_KERNEL);  <-- ctxt isn't set equal to the return value!

It turns out I did an overnight test with the static version, and that ran for
over 600 iterations, so the concept of reducing stack usage definitely solves
the problem.
Now, to implement the patch correctly! ;-)

So, sorry for the late-day mess-up.
Try this updated patch. If that runs long today, I'll post it on xen-devel,
as well as dupe it for RHEL4.

- Don
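
As a rough illustration of the corrected pattern described in this comment (store the allocator's return value, check it, pass the pointer itself, then free it), here is a userspace sketch. It assumes calloc() and free() in place of the kernel allocator; demo_vcpu_op() is a hypothetical mock and not the real HYPERVISOR_vcpu_op() interface, and struct demo_ctxt is invented for the example.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical context; too large to want on a small kernel stack. */
struct demo_ctxt {
    unsigned long regs[256];
};

/* Hypothetical mock of the initialise call: succeeds for any
 * non-NULL context. */
static int demo_vcpu_op(int cpu, struct demo_ctxt *ctxt)
{
    (void)cpu;
    return ctxt != NULL ? 0 : -EINVAL;
}

/* Allocates a zeroed context on the heap, uses it, and frees it.
 * Returns 0 on success or a negative errno value on failure. */
int demo_init_vcpu(int cpu)
{
    struct demo_ctxt *ctxt;
    int rc;

    /* The allocator's return value must be stored; calling it and
     * discarding the result leaves ctxt uninitialised. */
    ctxt = calloc(1, sizeof(*ctxt));
    if (ctxt == NULL)
        return -ENOMEM;

    ctxt->regs[0] = (unsigned long)cpu;

    rc = demo_vcpu_op(cpu, ctxt); /* pass ctxt, not &ctxt */
    free(ctxt);
    return rc;
}
```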

Comment 11 Bill Burns 2008-01-31 18:28:34 UTC
Setting flags and assigning.


Comment 13 Don Dutile (Red Hat) 2008-01-31 20:46:17 UTC
Patch in #10 posted.

Comment 16 Don Zickus 2008-02-06 20:55:57 UTC
in 2.6.18-78.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 19 errata-xmlrpc 2008-05-21 15:04:20 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html


