Bug 431081 - [RHEL4.6]: Under load, an i386 PV guest on i386 HV will hang during save/restore
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel-xen
Version: 4.6
Hardware: i386 Linux
Priority: high Severity: high
Target Milestone: beta
Target Release: ---
Assigned To: Don Dutile
QA Contact: Martin Jenner
Keywords: Reopened
Depends On: 425471
Reported: 2008-01-31 11:58 EST by Don Dutile
Modified: 2008-07-24 15:26 EDT
Fixed In Version: RHSA-2008-0665
Doc Type: Bug Fix
Last Closed: 2008-07-24 15:26:12 EDT

Attachments:
Posted attachment for 4.7 inclusion (4.08 KB, text/plain), 2008-03-12 13:21 EDT, Don Dutile

Description Don Dutile 2008-01-31 11:58:09 EST
+++ This bug was initially created as a clone of Bug #425471 +++

Description of problem:
My test environment is the following:

dom0: RHEL5.2 i686 kernel, 2.6.18-59.el5xen, 8GB memory, 4 physical CPUs, AMD Rev. F
domU: RHEL5.2 i686 kernel, 2.6.18-59.el5xen, 1GB memory, 4 vCPUs

Inside the domU, I run a kernel compile in a loop with "make -j5".
On the dom0, I run an "xm save <dom> /var/lib/xen/save/<dom>-save ; xm restore
/var/lib/xen/save/<dom>-save" in a loop.

After about 100 iterations of the above save/restore loop, the save will hang,
with the following messages in the domU console:

do_IRQ: stack overflow: 488
 [<c0406e3c>] <1>BUG: unable to handle kernel paging request at virtual address
35416001

(full stack trace is attached).  Note that with the same load, I do not see this
problem on x86_64.

-- Additional comment from clalance@redhat.com on 2007-12-14 16:24 EST --
Created an attachment (id=289491)
Full stack trace when the crash occurs


-- Additional comment from ijc@hellion.org.uk on 2008-01-18 11:36 EST --
I think we might be seeing similar on 2.6.18-53.1.4.el5 (plus fixes for the bugs
which we have reported against 5.1) when doing migration under load.

Unfortunately either the dump_trace() code is buggy or our attempt to call it
causes us to really overflow so our backtrace is less useful than yours. I will
attach it anyway.

-- Additional comment from ijc@hellion.org.uk on 2008-01-18 11:37 EST --
Created an attachment (id=292173)
Another similar stack trace


-- Additional comment from ddutile@redhat.com on 2008-01-30 15:46 EST --
Created an attachment (id=293483)
vcpu-init stack overflow patch


-- Additional comment from ddutile@redhat.com on 2008-01-30 15:47 EST --
Ian,

Can you give the patch in #4 a try in your test scenario?
It seems to work for our test case (so far; the test loop is exceeding 140
iterations now; will let it continue to run overnight...)

- Don

-- Additional comment from clalance@redhat.com on 2008-01-30 16:35 EST --
The patch attached here looks like it will work, because __cpu_up() is
serialized in the common kernel/cpu.c code.  That being said, it looks like
upstream LKML xen went a different way, and just used some kzalloc() memory in
arch/x86/xen/smp.c (their equivalent to drivers/xen/core/smpboot.c).  There's no
real changeset to point to; it was merged into Linus's tree that way.  Anyway,
just thought I'd point this out, since this way has some additional upstream
testing behind it.

Chris Lalancette

-- Additional comment from ddutile@redhat.com on 2008-01-30 17:09 EST --
Created an attachment (id=293499)
Updated/new patch based on LKML version; note: used kmalloc & memset so it
works on older linux versions.
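
The idea described in this comment, allocating the large context structure from the heap and zeroing it by hand instead of relying on kzalloc(), can be sketched in user-space C. This is only an illustration under stated assumptions, not the attached patch: struct fake_vcpu_ctxt, its field sizes, alloc_zeroed_ctxt() and ctxt_is_zeroed() are hypothetical names, and malloc() stands in for kmalloc(..., GFP_KERNEL).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a large per-vCPU context structure.
 * The real structure and its size are not shown in this report. */
struct fake_vcpu_ctxt {
    unsigned char fpu_ctxt[512];
    unsigned long user_regs[32];
    unsigned long flags;
};

/* Allocate the context on the heap rather than on the caller's stack.
 * malloc() + memset() mirrors the kmalloc() + memset() pairing that
 * gives the same result as kzalloc() on kernels that lack it. */
struct fake_vcpu_ctxt *alloc_zeroed_ctxt(void)
{
    struct fake_vcpu_ctxt *ctxt;

    ctxt = malloc(sizeof(*ctxt));   /* the return value must be kept */
    if (ctxt == NULL)
        return NULL;
    memset(ctxt, 0, sizeof(*ctxt)); /* size of the struct, not of the pointer */
    return ctxt;
}

/* Return 1 if every byte of the context is zero, 0 otherwise. */
int ctxt_is_zeroed(const struct fake_vcpu_ctxt *ctxt)
{
    const unsigned char *p = (const unsigned char *)ctxt;
    size_t i;

    assert(ctxt != NULL);
    for (i = 0; i < sizeof(*ctxt); i++)
        if (p[i] != 0)
            return 0;
    return 1;
}
```

A caller would obtain the context with alloc_zeroed_ctxt(), fill it in, pass it on, and free() it afterwards, so only a pointer occupies stack space during the call.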


-- Additional comment from ddutile@redhat.com on 2008-01-30 17:25 EST --
Created an attachment (id=293504)
oops... last patch had incorrect memset;  caught it before starting overnight
test..... nothing like exchanging one stack overflow for a stack trashing! ;-)


-- Additional comment from ijc@hellion.org.uk on 2008-01-31 09:20 EST --
Latest patch crashes on boot for me.

I thought at first it was because:
  BUG_ON(HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, &ctxt));
should be
  BUG_ON(HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, ctxt));

But changing that didn't fix it.

Also I think the "memset(&ctxt->fpu_ctxt, 0, sizeof(ctxt->fpu_ctxt));" is
unnecessary since this will be cleared as part of the memset over the entire ctxt.
Shouldn't hurt to do it though.
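
The two points in this comment can be illustrated with a small user-space sketch. It is not the kernel code: struct fake_ctxt, FAKE_CTXT_READY, fake_vcpu_op() and init_and_call() are hypothetical, and the sketch assumes the call takes its argument as an untyped void pointer, which is why a compiler would accept either ctxt or &ctxt once ctxt has become a pointer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define FAKE_CTXT_READY 0x1UL

/* Hypothetical context structure, much smaller than a real one. */
struct fake_ctxt {
    unsigned char fpu_ctxt[64];
    unsigned long flags;
};

/* Hypothetical stand-in for a call that takes an untyped argument.
 * Returns 0 if the argument looks like a properly prepared context. */
int fake_vcpu_op(int cpu, void *extra_args)
{
    const struct fake_ctxt *c = extra_args;

    (void)cpu;
    assert(extra_args != NULL);
    if (c->flags != FAKE_CTXT_READY)
        return -1;
    if (c->fpu_ctxt[0] != 0 || c->fpu_ctxt[sizeof(c->fpu_ctxt) - 1] != 0)
        return -1;
    return 0;
}

int init_and_call(int cpu)
{
    struct fake_ctxt *ctxt = malloc(sizeof(*ctxt));
    int rc;

    if (ctxt == NULL)
        return -1;

    /* One memset over the whole structure already clears fpu_ctxt,
     * so a second memset of that member alone would be redundant. */
    memset(ctxt, 0, sizeof(*ctxt));
    ctxt->flags = FAKE_CTXT_READY;

    /* ctxt is a pointer here, so it is passed as-is. Passing &ctxt
     * would hand over the address of the pointer variable itself. */
    rc = fake_vcpu_op(cpu, ctxt);

    free(ctxt);
    return rc;
}
```

With a stack-allocated structure the call site needs &ctxt; once the variable becomes a pointer to heap memory, the & has to go.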

-- Additional comment from ddutile@redhat.com on 2008-01-31 10:05 EST --
Created an attachment (id=293587)
a fix to the last fix!

Ian,

yes, the hypervisor call bugs out.... and this piece of code isn't much
better either:
	 kmalloc(sizeof(*ctxt), GFP_KERNEL);  <-- ctxt isn't set equal to the return!

turns out i did an overnight test w/the static, and that ran for over 600
iterations, so the concept of reducing stack size definitely solves the
problem.
Now, to implement the patch correctly! ;-)

So, sorry for the late day mess up. 
Try this updated patch.  if that runs long today, I'll post it on xen-devel,
as well as dupe it for rhel4.

- Don
Comment 2 RHEL Product and Program Management 2008-02-25 14:35:07 EST
Product Management has reviewed and declined this request.  You may appeal this
decision by reopening this request. 
Comment 3 Chris Lalancette 2008-02-25 14:43:08 EST
Sigh.  The stupid bot strikes again.  Re-opening.

Chris Lalancette
Comment 4 Kevin Krafthefer 2008-02-25 16:18:21 EST
why is the bug set to priority "low"? 
Comment 5 Don Dutile 2008-02-25 17:09:32 EST
Changed priority to 'high';

Failure to save/restore also means failure to migrate.
Migration is a supported and key feature of xen-based guests.

This fix is in rhel5.2, and brings rhel4-xenU to equiv., supported, functionality.

This test case is a good load mixture (disk IO) and is not a contrived scenario.
Comment 6 Ege Turgay 2008-03-07 04:57:01 EST
We've found that the migration feature works alright with 64 bit RHEL4.

32-bit RHEL4 clearly runs but does not migrate. Under the current
circumstances, our problem is solved by installing a 64-bit RHEL4 VM.

Comment 7 Chris Lalancette 2008-03-07 07:46:25 EST
Yeah, this bug strictly affects 32-bit guests, so moving to 64-bit would
alleviate this.

Chris Lalancette
Comment 8 Don Dutile 2008-03-12 13:21:00 EDT
Created attachment 297808 [details]
Posted attachment for 4.7 inclusion

Posted patch for rhel4.7 inclusion.
Comment 9 Bill Burns 2008-03-12 13:54:22 EDT
Set dev ack for Don.
Comment 11 Vivek Goyal 2008-03-18 12:01:25 EDT
Committed in 68.22.EL. Released in 68.23.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/
Comment 14 errata-xmlrpc 2008-07-24 15:26:12 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0665.html
