Bug 431081

Summary: [RHEL4.6]: Under load, an i386 PV guest on i386 HV will hang during save/restore
Product: Red Hat Enterprise Linux 4 Reporter: Don Dutile (Red Hat) <ddutile>
Component: kernel-xen Assignee: Don Dutile (Red Hat) <ddutile>
Status: CLOSED ERRATA QA Contact: Martin Jenner <mjenner>
Severity: high Docs Contact:
Priority: high    
Version: 4.6 CC: ddutile, ijc, xen-maint
Target Milestone: beta Keywords: Reopened
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: RHSA-2008-0665 Doc Type: Bug Fix
Last Closed: 2008-07-24 19:26:12 UTC Type: ---
Bug Depends On: 425471    
Bug Blocks:    
Attachments:
Description Flags
Posted attachment for 4.7 inclusion none

Description Don Dutile (Red Hat) 2008-01-31 16:58:09 UTC
+++ This bug was initially created as a clone of Bug #425471 +++

Description of problem:
My test environment is the following:

dom0: RHEL5.2 i686 kernel, 2.6.18-59.el5xen, 8GB memory, 4 physical CPUs, AMD Rev. F
domU: RHEL5.2 i686 kernel, 2.6.18-59.el5xen, 1GB memory, 4 vCPUs

Inside the domU, I run a kernel compile in a loop with "make -j5".
On the dom0, I run an "xm save <dom> /var/lib/xen/save/<dom>-save ; xm restore
/var/lib/xen/save/<dom>-save" in a loop.

After about 100 iterations of the above save/restore loop, the save will hang,
with the following messages in the domU console:

do_IRQ: stack overflow: 488
 [<c0406e3c>] <1>BUG: unable to handle kernel paging request at virtual address
35416001

(full stack trace is attached).  Note that with the same load, I do not see this
problem on x86_64.

-- Additional comment from clalance on 2007-12-14 16:24 EST --
Created an attachment (id=289491)
Full stack trace when the crash occurs


-- Additional comment from ijc.uk on 2008-01-18 11:36 EST --
I think we might be seeing similar on 2.6.18-53.1.4.el5 (plus fixes for the bugs
which we have reported against 5.1) when doing migration under load.

Unfortunately either the dump_trace() code is buggy or our attempt to call it
causes us to really overflow so our backtrace is less useful than yours. I will
attach it anyway.

-- Additional comment from ijc.uk on 2008-01-18 11:37 EST --
Created an attachment (id=292173)
Another similar stack trace


-- Additional comment from ddutile on 2008-01-30 15:46 EST --
Created an attachment (id=293483)
vcpu-init stack overflow patch


-- Additional comment from ddutile on 2008-01-30 15:47 EST --
Ian,

Can you give the patch in #4 a try in your test scenario?
Seems to work for our test case (so far; test loop exceeding 140 iterations now;
will let it continue to run overnight...)

- Don

-- Additional comment from clalance on 2008-01-30 16:35 EST --
The patch attached here looks like it will work, because __cpu_up() is
serialized in the common kernel/cpu.c code.  That being said, it looks like
upstream LKML xen went a different way, and just used some kzalloc() memory in
arch/x86/xen/smp.c (their equivalent to drivers/xen/core/smpboot.c).  There's no
real changeset to point to; it was merged into Linus's tree that way.  Anyway,
just thought I'd point this out, since this way has some additional upstream
testing behind it.

Chris Lalancette
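
For readers following the two approaches described above, here is a minimal user-space sketch of the allocation pattern. It is not the patch attached to this bug. `struct guest_ctxt` is a hypothetical stand-in for the real vcpu context structure, with made-up sizes, and `calloc()`/`malloc()` stand in for `kzalloc()`/`kmalloc()`. The point is only that a large context can live on the heap, zeroed, instead of on a small kernel stack.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the vcpu context; layout and sizes are
 * illustrative only and do not match the real Xen structure. */
struct guest_ctxt {
    unsigned char fpu_ctxt[512];
    unsigned long trap_ctxt[512];
    unsigned long flags;
};

/* User-space analog of kzalloc(): one call, memory comes back zeroed. */
struct guest_ctxt *ctxt_alloc_zeroed(void)
{
    return calloc(1, sizeof(struct guest_ctxt));
}

/* User-space analog of kmalloc() followed by memset(), the form that
 * also works where no zeroing allocator is available. */
struct guest_ctxt *ctxt_alloc_then_clear(void)
{
    struct guest_ctxt *ctxt = malloc(sizeof(*ctxt));

    if (ctxt == NULL)
        return NULL;
    /* Clear the whole object: sizeof(*ctxt), not sizeof(ctxt), which
     * would only be the size of the pointer. */
    memset(ctxt, 0, sizeof(*ctxt));
    return ctxt;
}

/* Returns 1 if every byte of the context is zero, 0 otherwise. */
int ctxt_is_clear(const struct guest_ctxt *ctxt)
{
    const unsigned char *p = (const unsigned char *)ctxt;
    size_t i;

    for (i = 0; i < sizeof(*ctxt); i++)
        if (p[i] != 0)
            return 0;
    return 1;
}
```

Both helpers return an equivalent zeroed object that the caller must release with `free()` (`kfree()` in kernel code). The `sizeof` pitfall in the comment is a common mistake in general; the thread does not say what the incorrect memset in the earlier patch actually was.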

-- Additional comment from ddutile on 2008-01-30 17:09 EST --
Created an attachment (id=293499)
Updated/new patch based on LKML version; note: used kmalloc & memset so it
works on older linux versions.


-- Additional comment from ddutile on 2008-01-30 17:25 EST --
Created an attachment (id=293504)
oops... last patch had incorrect memset;  caught it before starting overnight
test..... nothing like exchanging one stack overflow for a stack trashing! ;-)


-- Additional comment from ijc.uk on 2008-01-31 09:20 EST --
Latest patch crashes on boot for me.

I thought at first it was because:
  BUG_ON(HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, &ctxt));
should be
  BUG_ON(HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, ctxt));

But changing that didn't fix it.

Also I think the "memset(&ctxt->fpu_ctxt, 0, sizeof(ctxt->fpu_ctxt));" is
unnecessary since this will be cleared as part of the memset over the entire ctxt.
Shouldn't hurt to do it though.
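
The two points in this comment can be illustrated with a small user-space sketch. It is not code from the patch. `fake_vcpu_op()` is a hypothetical stand-in for `HYPERVISOR_vcpu_op()` that simply returns the address it was handed, and `struct guest_ctxt` is an illustrative layout. Once `ctxt` is a pointer, `ctxt` is the address of the context while `&ctxt` is the address of the pointer variable itself.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the vcpu context; sizes are illustrative. */
struct guest_ctxt {
    unsigned char fpu_ctxt[512];
    unsigned long user_regs[16];
};

/* Hypothetical stand-in for the hypercall: it only reports which address
 * it received. The real call would read the context from that address. */
const void *fake_vcpu_op(const void *extra_args)
{
    return extra_args;
}

/* Returns 1 if the fpu_ctxt member is entirely zero, 0 otherwise. */
int fpu_is_clear(const struct guest_ctxt *ctxt)
{
    size_t i;

    for (i = 0; i < sizeof(ctxt->fpu_ctxt); i++)
        if (ctxt->fpu_ctxt[i] != 0)
            return 0;
    return 1;
}
```

With `struct guest_ctxt *ctxt = malloc(sizeof(*ctxt));`, calling `fake_vcpu_op(ctxt)` hands over the heap object, whereas `fake_vcpu_op(&ctxt)` hands over the location of the local pointer, which is the wrong argument once the context is no longer a stack variable. Likewise, after `memset(ctxt, 0, sizeof(*ctxt));`, `fpu_is_clear(ctxt)` already holds, so a second memset of the member alone is redundant but harmless.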

-- Additional comment from ddutile on 2008-01-31 10:05 EST --
Created an attachment (id=293587)
a fix to the last fix!

Ian,

yes, the hypervisor call bugs out.... and this piece of code isn't much
better either:
	 kmalloc(sizeof(*ctxt), GFP_KERNEL);  <-- ctxt isn't set equal to the return value!

turns out i did an overnight test w/the static, and that ran for over 600
iterations, so the concept of reducing stack size definitely solves the
problem.
Now, to implement the patch correctly! ;-)

So, sorry for the late day mess up. 
Try this updated patch.  if that runs long today, I'll post it on xen-devel,
as well as dupe it for rhel4.

- Don

Comment 2 RHEL Program Management 2008-02-25 19:35:07 UTC
Product Management has reviewed and declined this request.  You may appeal this
decision by reopening this request. 

Comment 3 Chris Lalancette 2008-02-25 19:43:08 UTC
Sigh.  The stupid bot strikes again.  Re-opening.

Chris Lalancette

Comment 4 Kevin Krafthefer 2008-02-25 21:18:21 UTC
why is the bug set to priority "low"? 

Comment 5 Don Dutile (Red Hat) 2008-02-25 22:09:32 UTC
Changed priority to 'high';

Failure to save/restore also means failure to migrate.
Migration is a supported and key feature of xen-based guests.

This fix is in rhel5.2, and brings rhel4-xenU to equivalent, supported functionality.

This test case is a good load mixture (disk IO) and is not a contrived scenario.

Comment 6 Ege Turgay 2008-03-07 09:57:01 UTC
We've found that the migration feature works alright with 64 bit RHEL4.

32-bit RHEL4 clearly runs but does not migrate. For our current
circumstances, the problem is solved by installing a 64-bit RHEL4 VM.



Comment 7 Chris Lalancette 2008-03-07 12:46:25 UTC
Yeah, this bug strictly affects 32-bit guests, so moving to 64-bit would
alleviate this.

Chris Lalancette

Comment 8 Don Dutile (Red Hat) 2008-03-12 17:21:00 UTC
Created attachment 297808 [details]
Posted attachment for 4.7 inclusion

Posted patch for rhel4.7 inclusion.

Comment 9 Bill Burns 2008-03-12 17:54:22 UTC
Set dev ack for Don.


Comment 11 Vivek Goyal 2008-03-18 16:01:25 UTC
Committed in 68.22.EL. Released in 68.23.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/

Comment 14 errata-xmlrpc 2008-07-24 19:26:12 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0665.html