425471 – [RHEL5.2]: Under load, an i386 PV guest on i386 HV will hang during save/restore

Summary: [RHEL5.2]: Under load, an i386 PV guest on i386 HV will hang during save/restore

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel-xen
Sub Component:
Version:	5.2
Hardware:	All
OS:	Linux
Priority:	low
Severity:	low
Target Milestone:	rc
Target Release:	---
Assignee:	Don Dutile (Red Hat)
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	431081

Reported:	2007-12-14 21:24 UTC by Chris Lalancette
Modified:	2008-05-21 15:04 UTC
CC List:	3 users
Fixed In Version:	RHBA-2008-0314
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-05-21 15:04:20 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
Full stack trace when the crash occurs (3.68 KB, text/plain) 2007-12-14 21:24 UTC, Chris Lalancette	no flags	Details
Another similar stack trace (6.79 KB, text/plain) 2008-01-18 16:37 UTC, Ian Campbell	no flags	Details
vcpu-init stack overflow patch (609 bytes, text/plain) 2008-01-30 20:46 UTC, Don Dutile (Red Hat)	no flags	Details
Updated/new patch based on LKML version; note: used kmalloc & memset so it works on older linux versions. (4.14 KB, text/plain) 2008-01-30 22:09 UTC, Don Dutile (Red Hat)	no flags	Details
oops... last patch had incorrect memset; caught it before starting overnight test..... nothing like exchanging one stack overflow for a stack trashing! ;-) (4.17 KB, text/x-patch) 2008-01-30 22:25 UTC, Don Dutile (Red Hat)	no flags	Details
a fix to the last fix! (4.24 KB, text/plain) 2008-01-31 15:05 UTC, Don Dutile (Red Hat)	no flags	Details

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2008:0314	0	normal	SHIPPED_LIVE	Updated kernel packages for Red Hat Enterprise Linux 5.2	2008-05-20 18:43:34 UTC

Description Chris Lalancette 2007-12-14 21:24:26 UTC

Description of problem:
My test environment is the following:

dom0: RHEL5.2 i686 kernel, 2.6.18-59.el5xen, 8GB memory, 4 physical CPUs, AMD Rev. F
domU: RHEL5.2 i686 kernel, 2.6.18-59.el5xen, 1GB memory, 4 vCPUs

Inside the domU, I run a kernel compile in a loop with "make -j5".
On the dom0, I run an "xm save <dom> /var/lib/xen/save/<dom>-save ; xm restore
/var/lib/xen/save/<dom>-save" in a loop.

After about 100 iterations of the above save/restore loop, the save will hang,
with the following messages in the domU console:

do_IRQ: stack overflow: 488
 [<c0406e3c>] <1>BUG: unable to handle kernel paging request at virtual address 35416001

(full stack trace is attached).  Note that with the same load, I do not see this
problem on x86_64.

Comment 1 Chris Lalancette 2007-12-14 21:24:26 UTC

Created attachment 289491 [details]
Full stack trace when the crash occurs

Comment 2 Ian Campbell 2008-01-18 16:36:31 UTC

I think we might be seeing something similar on 2.6.18-53.1.4.el5 (plus fixes for
the bugs which we have reported against 5.1) when doing migration under load.

Unfortunately either the dump_trace() code is buggy or our attempt to call it
causes us to really overflow so our backtrace is less useful than yours. I will
attach it anyway.

Comment 3 Ian Campbell 2008-01-18 16:37:30 UTC

Created attachment 292173 [details]
Another similar stack trace

Comment 4 Don Dutile (Red Hat) 2008-01-30 20:46:05 UTC

Created attachment 293483 [details]
vcpu-init stack overflow patch

Comment 5 Don Dutile (Red Hat) 2008-01-30 20:47:57 UTC

Ian,

Can you give the patch in #4 a try in your test scenario?
Seems to work for our test case (so far; test loop exceeding 140 iterations now;
will let it continue to run overnight...)

- Don

Comment 6 Chris Lalancette 2008-01-30 21:35:38 UTC

The patch attached here looks like it will work, because __cpu_up() is
serialized in the common kernel/cpu.c code.  That being said, it looks like
upstream LKML xen went a different way, and just used some kzalloc() memory in
arch/x86/xen/smp.c (their equivalent to drivers/xen/core/smpboot.c).  There's no
real changeset to point to; it was merged into Linus's tree that way.  Anyway,
just thought I'd point this out, since this way has some additional upstream
testing behind it.

Chris Lalancette

Comment 7 Don Dutile (Red Hat) 2008-01-30 22:09:11 UTC

Created attachment 293499 [details]
Updated/new patch based on LKML version; note: used kmalloc & memset so it works on older linux versions.

Comment 8 Don Dutile (Red Hat) 2008-01-30 22:25:51 UTC

Created attachment 293504 [details]
oops... last patch had incorrect memset;  caught it before starting overnight test..... nothing like exchanging one stack overflow for a stack trashing! ;-)

Comment 9 Ian Campbell 2008-01-31 14:20:07 UTC

Latest patch crashes on boot for me.

I thought at first it was because:
  BUG_ON(HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, &ctxt));
should be
  BUG_ON(HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, ctxt));

But changing that didn't fix it.

Also I think the "memset(&ctxt->fpu_ctxt, 0, sizeof(ctxt->fpu_ctxt));" is
unnecessary since this will be cleared as part of the memset over the entire ctxt.
Shouldn't hurt to do it though.

Comment 10 Don Dutile (Red Hat) 2008-01-31 15:05:50 UTC

Created attachment 293587 [details]
a fix to the last fix!

Ian,

yes, the hypervisor call bugs out.... and this piece of code isn't much
better either:
	 kmalloc(sizeof(*ctxt), GFP_KERNEL);  <-- ctxt isn't set equal to the return value!

Turns out I did an overnight test with the static version, and that ran for over
600 iterations, so the concept of reducing stack usage definitely solves the
problem.
Now, to implement the patch correctly! ;-)

So, sorry for the late day mess up. 
Try this updated patch.  if that runs long today, I'll post it on xen-devel,
as well as dupe it for rhel4.

- Don

Comment 11 Bill Burns 2008-01-31 18:28:34 UTC

Setting flags and assigning.

Comment 13 Don Dutile (Red Hat) 2008-01-31 20:46:17 UTC

Patch in #10 posted.

Comment 16 Don Zickus 2008-02-06 20:55:57 UTC

in 2.6.18-78.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 19 errata-xmlrpc 2008-05-21 15:04:20 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html
