Bug 425471 - [RHEL5.2]: Under load, an i386 PV guest on i386 HV will hang during save/restore
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.2
Hardware: All Linux
Priority: low, Severity: low
Target Milestone: rc
Target Release: ---
Assigned To: Don Dutile
QA Contact: Martin Jenner
Depends On:
Blocks: 431081
Reported: 2007-12-14 16:24 EST by Chris Lalancette
Modified: 2008-05-21 11:04 EDT
Fixed In Version: RHBA-2008-0314
Doc Type: Bug Fix
Last Closed: 2008-05-21 11:04:20 EDT


Attachments
Full stack trace when the crash occurs (3.68 KB, text/plain)
2007-12-14 16:24 EST, Chris Lalancette
Another similar stack trace (6.79 KB, text/plain)
2008-01-18 11:37 EST, Ian Campbell
vcpu-init stack overflow patch (609 bytes, text/plain)
2008-01-30 15:46 EST, Don Dutile
Updated/new patch based on LKML version; note: used kmalloc & memset so it works on older linux versions. (4.14 KB, text/plain)
2008-01-30 17:09 EST, Don Dutile
oops... last patch had incorrect memset; caught it before starting overnight test..... nothing like exchanging one stack overflow for a stack trashing! ;-) (4.17 KB, text/x-patch)
2008-01-30 17:25 EST, Don Dutile
a fix to the last fix! (4.24 KB, text/plain)
2008-01-31 10:05 EST, Don Dutile

Description Chris Lalancette 2007-12-14 16:24:26 EST
Description of problem:
My test environment is the following:

dom0: RHEL5.2 i686 kernel, 2.6.18-59.el5xen, 8GB memory, 4 physical CPUs, AMD Rev. F
domU: RHEL5.2 i686 kernel, 2.6.18-59.el5xen, 1GB memory, 4 vCPUs

Inside the domU, I run a kernel compile in a loop with "make -j5".
On the dom0, I run an "xm save <dom> /var/lib/xen/save/<dom>-save ; xm restore
/var/lib/xen/save/<dom>-save" in a loop.

After about 100 iterations of the above save/restore loop, the save will hang,
with the following messages in the domU console:

do_IRQ: stack overflow: 488
 [<c0406e3c>] <1>BUG: unable to handle kernel paging request at virtual address
35416001

(full stack trace is attached).  Note that with the same load, I do not see this
problem on x86_64.
Comment 1 Chris Lalancette 2007-12-14 16:24:26 EST
Created attachment 289491
Full stack trace when the crash occurs
Comment 2 Ian Campbell 2008-01-18 11:36:31 EST
I think we might be seeing something similar on 2.6.18-53.1.4.el5 (plus fixes for the bugs
which we have reported against 5.1) when doing migration under load.

Unfortunately either the dump_trace() code is buggy or our attempt to call it
causes us to really overflow so our backtrace is less useful than yours. I will
attach it anyway.
Comment 3 Ian Campbell 2008-01-18 11:37:30 EST
Created attachment 292173
Another similar stack trace
Comment 4 Don Dutile 2008-01-30 15:46:05 EST
Created attachment 293483
vcpu-init stack overflow patch
Comment 5 Don Dutile 2008-01-30 15:47:57 EST
Ian,

Can you give the patch in #4 a try in your test scenario?
It seems to work for our test case (so far; the test loop has exceeded 140 iterations now; will let
it continue to run overnight...)

- Don
Comment 6 Chris Lalancette 2008-01-30 16:35:38 EST
The patch attached here looks like it will work, because __cpu_up() is
serialized in the common kernel/cpu.c code.  That being said, it looks like
upstream LKML xen went a different way, and just used some kzalloc() memory in
arch/x86/xen/smp.c (their equivalent to drivers/xen/core/smpboot.c).  There's no
real changeset to point to; it was merged into Linus's tree that way.  Anyway,
just thought I'd point this out, since this way has some additional upstream
testing behind it.

Chris Lalancette
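
As a rough illustration of the heap-based approach described in this comment, here is a minimal userspace C sketch. It is not the actual patch: struct fake_vcpu_ctxt and fake_vcpu_initialise() are hypothetical stand-ins for the Xen guest context structure and the VCPUOP_initialise hypercall, and malloc() followed by memset() stands in for a zeroing kernel allocation such as kzalloc(). The only point is that the large context lives on the heap, so the caller's stack frame holds a single pointer.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the Xen vcpu guest context: a structure
 * that is large compared with a small, fixed-size kernel stack. */
struct fake_vcpu_ctxt {
    unsigned char fpu_ctxt[512];
    unsigned long user_regs[32];
    unsigned long trap_ctxt[512];
    unsigned long flags;
};

/* Hypothetical stand-in for the hypercall that consumes the context.
 * Returns 0 on success and -1 on failure. */
static int fake_vcpu_initialise(int cpu, const struct fake_vcpu_ctxt *ctxt)
{
    if (cpu < 0 || ctxt == NULL)
        return -1;
    return ctxt->flags == 1UL ? 0 : -1;
}

/* Build the context on the heap instead of in a local variable. */
int init_vcpu_on_heap(int cpu)
{
    struct fake_vcpu_ctxt *ctxt;
    int rc;

    ctxt = malloc(sizeof(*ctxt));      /* keep the returned pointer */
    if (ctxt == NULL)
        return -1;
    memset(ctxt, 0, sizeof(*ctxt));    /* zero the whole structure */

    ctxt->flags = 1UL;                 /* fill in the fields that matter */
    rc = fake_vcpu_initialise(cpu, ctxt); /* pass the pointer itself */

    free(ctxt);                        /* the context is only needed for the call */
    return rc;
}
```

In kernel code the allocation can fail or sleep depending on the flags used, so the real patch has to pick those to suit the calling context; this sketch does not model that.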
Comment 7 Don Dutile 2008-01-30 17:09:11 EST
Created attachment 293499
Updated/new patch based on LKML version; note: used kmalloc & memset so it works on older linux versions.
Comment 8 Don Dutile 2008-01-30 17:25:51 EST
Created attachment 293504
oops... last patch had incorrect memset;  caught it before starting overnight test..... nothing like exchanging one stack overflow for a stack trashing! ;-)
Comment 9 Ian Campbell 2008-01-31 09:20:07 EST
Latest patch crashes on boot for me.

I thought at first it was because:
  BUG_ON(HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, &ctxt));
should be
  BUG_ON(HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, ctxt));

But changing that didn't fix it.

Also I think the "memset(&ctxt->fpu_ctxt, 0, sizeof(ctxt->fpu_ctxt));" is
unnecessary, since this will be cleared as part of the memset over the entire ctxt.
Shouldn't hurt to do it though.
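
Both observations in this comment can be checked with a small userspace C sketch. The names struct demo_ctxt, all_zero(), fpu_cleared_by_full_memset() and pointer_differs_from_its_address() are made up for illustration; only the pointer and memset semantics are meant to carry over to the kernel code.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, much smaller stand-in for the guest context. */
struct demo_ctxt {
    unsigned char fpu_ctxt[64];
    unsigned long flags;
};

/* Returns 1 if every byte in buf is zero, 0 otherwise. */
static int all_zero(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        if (buf[i] != 0)
            return 0;
    return 1;
}

/* Returns 1 if fpu_ctxt is already zero after one memset over the
 * whole structure, i.e. a second memset of that member adds nothing. */
int fpu_cleared_by_full_memset(void)
{
    struct demo_ctxt *ctxt = malloc(sizeof(*ctxt));
    int ok;

    if (ctxt == NULL)
        return -1;
    memset(ctxt, 0xAA, sizeof(*ctxt)); /* simulate stale heap contents */
    memset(ctxt, 0, sizeof(*ctxt));    /* the single full-structure clear */
    ok = all_zero(ctxt->fpu_ctxt, sizeof(ctxt->fpu_ctxt));
    free(ctxt);
    return ok;
}

/* Returns 1 if &ctxt and ctxt are different addresses. Once ctxt is a
 * pointer, &ctxt is the address of the pointer variable, not of the
 * structure it points to. */
int pointer_differs_from_its_address(void)
{
    struct demo_ctxt *ctxt = malloc(sizeof(*ctxt));
    int differs;

    if (ctxt == NULL)
        return -1;
    differs = ((void *)&ctxt != (void *)ctxt);
    free(ctxt);
    return differs;
}
```

When ctxt was a local structure, &ctxt was the correct argument; after the switch to a pointer, the same expression hands the callee the wrong address, which is why the call needs ctxt without the ampersand.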
Comment 10 Don Dutile 2008-01-31 10:05:50 EST
Created attachment 293587
a fix to the last fix!

Ian,

yes, the hypervisor call bugs out.... and this piece of code isn't much
better either:
	 kmalloc(sizeof(*ctxt), GFP_KERNEL);  <-- ctxt isn't set equal to the return value!

turns out i did an overnight test w/the static, and that ran for over 600
iterations, so the concept of reducing stack size definitely solves the
problem.
Now, to implement the patch correctly! ;-)

So, sorry for the late day mess up. 
Try this updated patch.  if that runs long today, I'll post it on xen-devel,
as well as dupe it for rhel4.

- Don
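
For comparison, the "static" variant mentioned above can be sketched in userspace C as follows. This is an assumption about the general shape of that approach, not the attached patch: struct big_ctxt and fake_hypercall() are hypothetical names. A static local is placed in static storage rather than in the function's stack frame, and it is shared by every call, so the approach is only safe when callers are serialized, as comment 6 notes for __cpu_up().

```c
#include <string.h>

/* Hypothetical stand-in for the large guest context structure. */
struct big_ctxt {
    unsigned char fpu_ctxt[512];
    unsigned long regs[64];
    unsigned long flags;
};

/* Hypothetical consumer of the context; returns 0 on success. */
static int fake_hypercall(int cpu, const struct big_ctxt *ctxt)
{
    return (cpu >= 0 && ctxt->flags == 1UL) ? 0 : -1;
}

/* Static-storage variant: no large object in the stack frame, but the
 * buffer is shared, so concurrent callers would corrupt each other. */
int init_vcpu_static(int cpu)
{
    static struct big_ctxt ctxt;     /* static storage, not the stack */

    memset(&ctxt, 0, sizeof(ctxt));  /* clear state left by the last call */
    ctxt.flags = 1UL;
    return fake_hypercall(cpu, &ctxt); /* ctxt is a struct here, so &ctxt is right */
}
```

The heap-based version avoids the shared buffer at the cost of an allocation that can fail, which is the trade-off between the two patches discussed in this thread.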
Comment 11 Bill Burns 2008-01-31 13:28:34 EST
Setting flags and assigning.
Comment 13 Don Dutile 2008-01-31 15:46:17 EST
Patch in #10 posted.
Comment 16 Don Zickus 2008-02-06 15:55:57 EST
in 2.6.18-78.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 19 errata-xmlrpc 2008-05-21 11:04:20 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html
