Red Hat Bugzilla – Bug 431081
[RHEL4.6]: Under load, an i386 PV guest on i386 HV will hang during save/restore
Last modified: 2008-07-24 15:26:12 EDT
+++ This bug was initially created as a clone of Bug #425471 +++
Description of problem:
My test environment is the following:
dom0: RHEL5.2 i686 kernel, 2.6.18-59.el5xen, 8GB memory, 4 physical CPUs, AMD Rev. F
domU: RHEL5.2 i686 kernel, 2.6.18-59.el5xen, 1GB memory, 4 vCPUs
Inside the domU, I run a kernel compile in a loop with "make -j5".
On the dom0, I run an "xm save <dom> /var/lib/xen/save/<dom>-save ; xm restore
/var/lib/xen/save/<dom>-save" in a loop.
After about 100 iterations of the above save/restore loop, the save will hang,
with the following messages in the domU console:
do_IRQ: stack overflow: 488
[<c0406e3c>] <1>BUG: unable to handle kernel paging request at virtual address
(full stack trace is attached). Note that with the same load, I do not see this
problem on x86_64.
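For anyone wanting to reproduce this, the dom0 side of the loop can be driven by a small program. The following is only a sketch: the two command strings come from the description above, while the helper names (`format_save_cmd`, `format_restore_cmd`, `run_cycles`, `dry_run`) and the `CMD_MAX` limit are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define CMD_MAX 512  /* arbitrary buffer size chosen for this sketch */

/* Command runner; system() has this signature, dry_run() below is a stand-in. */
typedef int (*cmd_runner)(const char *cmd);

/* Build "xm save <dom> /var/lib/xen/save/<dom>-save"; 0 if it fit, -1 if not. */
int format_save_cmd(char *buf, size_t len, const char *dom)
{
    int n = snprintf(buf, len, "xm save %s /var/lib/xen/save/%s-save", dom, dom);
    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}

/* Build "xm restore /var/lib/xen/save/<dom>-save"; 0 if it fit, -1 if not. */
int format_restore_cmd(char *buf, size_t len, const char *dom)
{
    int n = snprintf(buf, len, "xm restore /var/lib/xen/save/%s-save", dom);
    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}

/* Print the command instead of running it; always reports success. */
int dry_run(const char *cmd)
{
    puts(cmd);
    return 0;
}

/* Run up to max_iters save/restore cycles; return how many completed. */
int run_cycles(const char *dom, int max_iters, cmd_runner run)
{
    char cmd[CMD_MAX];
    int i;

    assert(dom != NULL && run != NULL);
    for (i = 0; i < max_iters; i++) {
        if (format_save_cmd(cmd, sizeof(cmd), dom) != 0 || run(cmd) != 0)
            break;
        if (format_restore_cmd(cmd, sizeof(cmd), dom) != 0 || run(cmd) != 0)
            break;
    }
    return i;
}
```

Passing `system` as the runner would execute the commands for real; `dry_run` only prints them. Note that a hung save blocks inside the runner, so this loop stops on a failing command but cannot detect the hang itself.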
-- Additional comment from email@example.com on 2007-12-14 16:24 EST --
Created an attachment (id=289491)
Full stack trace when the crash occurs
-- Additional comment from firstname.lastname@example.org on 2008-01-18 11:36 EST --
I think we might be seeing something similar on 2.6.18-53.1.4.el5 (plus fixes for
the bugs which we have reported against 5.1) when doing migration under load.
Unfortunately, either the dump_trace() code is buggy or our attempt to call it
causes us to really overflow, so our backtrace is less useful than yours. I will
attach it anyway.
-- Additional comment from email@example.com on 2008-01-18 11:37 EST --
Created an attachment (id=292173)
Another similar stack trace
-- Additional comment from firstname.lastname@example.org on 2008-01-30 15:46 EST --
Created an attachment (id=293483)
vcpu-init stack overflow patch
-- Additional comment from email@example.com on 2008-01-30 15:47 EST --
Can you give the patch in #4 a try in your test scenario?
Seems to work for our test case (so far; test loop exceeding 140 now; will let
it continue to run overnight...)
-- Additional comment from firstname.lastname@example.org on 2008-01-30 16:35 EST --
The patch attached here looks like it will work, because __cpu_up() is
serialized in the common kernel/cpu.c code. That being said, it looks like
upstream LKML xen went a different way, and just used some kzalloc() memory in
arch/x86/xen/smp.c (their equivalent to drivers/xen/core/smpboot.c). There's no
real changeset to point to; it was merged into Linus's tree that way. Anyway,
just thought I'd point this out, since this way has some additional upstream
testing behind it.
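As a rough userspace illustration of why the static variant depends on serialized callers, here is a sketch. `struct guest_ctxt` is a made-up stand-in for the real vcpu context structure (its size and fields are assumptions), and the `busy` flag is only a debugging aid for this single-threaded sketch, not something taken from the patch.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the large per-vcpu context structure. */
struct guest_ctxt {
    unsigned char fpu_ctxt[512];
    unsigned long regs[256];
    unsigned long flags;
};

/* One shared buffer: keeps the big structure off the caller's stack. */
static struct guest_ctxt shared_ctxt;
static int shared_ctxt_busy;

/* Hand out the shared buffer, zeroed. Only valid if callers never overlap. */
struct guest_ctxt *ctxt_acquire(void)
{
    assert(!shared_ctxt_busy);  /* trips if two users overlap */
    shared_ctxt_busy = 1;
    memset(&shared_ctxt, 0, sizeof(shared_ctxt));
    return &shared_ctxt;
}

/* Mark the shared buffer as free again. */
void ctxt_release(struct guest_ctxt *ctxt)
{
    assert(ctxt == &shared_ctxt && shared_ctxt_busy);
    shared_ctxt_busy = 0;
}
```

Every caller gets the same address back, so the approach is only safe when something outside this code guarantees one user at a time, which is the role the serialized __cpu_up() plays in the comment above. A per-call heap allocation avoids that dependency.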
-- Additional comment from email@example.com on 2008-01-30 17:09 EST --
Created an attachment (id=293499)
Updated/new patch based on LKML version; note: used kmalloc & memset so it
works on older linux versions.
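The allocate-then-zero pattern can be sketched in userspace with malloc() and memset() standing in for kmalloc() and memset(). This is an analogue under assumptions, not the attached patch: `struct guest_ctxt` and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the large per-vcpu context structure. */
struct guest_ctxt {
    unsigned char fpu_ctxt[512];
    unsigned long regs[256];
    unsigned long flags;
};

/* Allocate a context on the heap and zero all of it; NULL on failure. */
struct guest_ctxt *ctxt_alloc_zeroed(void)
{
    struct guest_ctxt *ctxt = malloc(sizeof(*ctxt));

    if (ctxt == NULL)
        return NULL;
    /* Size of the pointed-to object, not of the pointer. */
    memset(ctxt, 0, sizeof(*ctxt));
    return ctxt;
}

/* Return 1 if every byte of the context is zero, 0 otherwise. */
int ctxt_is_zero(const struct guest_ctxt *ctxt)
{
    const unsigned char *p = (const unsigned char *)ctxt;
    size_t i;

    assert(ctxt != NULL);
    for (i = 0; i < sizeof(*ctxt); i++)
        if (p[i] != 0)
            return 0;
    return 1;
}
```

The two details that matter are that the allocation result is actually stored in the pointer and that the memset length is `sizeof(*ctxt)`; the caller is responsible for freeing the buffer once the context has been handed over.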
-- Additional comment from firstname.lastname@example.org on 2008-01-30 17:25 EST --
Created an attachment (id=293504)
oops... last patch had incorrect memset; caught it before starting overnight
test..... nothing like exchanging one stack overflow for a stack trashing! ;-)
-- Additional comment from email@example.com on 2008-01-31 09:20 EST --
Latest patch crashes on boot for me.
I thought at first it was because:
BUG_ON(HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, &ctxt));
should be:
BUG_ON(HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, ctxt));
But changing that didn't fix it.
Also I think the "memset(&ctxt->fpu_ctxt, 0, sizeof(ctxt->fpu_ctxt));" is
unnecessary since this will be cleared as part of the memset over the entire ctxt.
Shouldn't hurt to do it though.
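Both points can be checked in a small userspace sketch. Everything here is illustrative: `struct guest_ctxt` is a made-up layout that only borrows the `fpu_ctxt` field name from the patch, and `vcpu_op_stub()` is a hypothetical stand-in that merely checks which address it was handed; it is not the hypercall.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the vcpu context; only fpu_ctxt is a real name. */
struct guest_ctxt {
    unsigned char fpu_ctxt[512];
    unsigned long regs[256];
    unsigned long flags;
};

/* Return 1 if the FPU area is all zero, 0 otherwise. */
int fpu_is_clear(const struct guest_ctxt *ctxt)
{
    size_t i;

    assert(ctxt != NULL);
    for (i = 0; i < sizeof(ctxt->fpu_ctxt); i++)
        if (ctxt->fpu_ctxt[i] != 0)
            return 0;
    return 1;
}

/*
 * Stand-in for a call that receives the context as an untyped pointer.
 * Returns 0 if arg is the address of the context itself, -1 otherwise.
 */
int vcpu_op_stub(const void *arg, const struct guest_ctxt *real)
{
    return (arg == (const void *)real) ? 0 : -1;
}
```

With `struct guest_ctxt *ctxt` pointing at a context, `vcpu_op_stub(ctxt, ctxt)` returns 0, while `vcpu_op_stub(&ctxt, ctxt)` returns -1 because `&ctxt` is the address of the pointer variable rather than of the structure. Because the argument is untyped, the compiler accepts either form silently. Likewise, after `memset(ctxt, 0, sizeof(*ctxt))`, `fpu_is_clear(ctxt)` already returns 1 without a separate memset of the field.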
-- Additional comment from firstname.lastname@example.org on 2008-01-31 10:05 EST --
Created an attachment (id=293587)
a fix to the last fix!
yes, the hypervisor call bugs out.... and this piece of code isn't much
kmalloc(sizeof(*ctxt), GFP_KERNEL); <-- ctxt isn't set equal to
turns out i did an overnight test w/the static, and that ran for over 600
iterations, so the concept of reducing stack size definitely solves the
Now, to implement the patch correctly! ;-)
So, sorry for the late day mess up.
Try this updated patch. if that runs long today, I'll post it on xen-devel,
as well as dupe it for rhel4.
Product Management has reviewed and declined this request. You may appeal this
decision by reopening this request.
Sigh. The stupid bot strikes again. Re-opening.
why is the bug set to priority "low"?
Changed priority to 'high';
Failure to save/restore also means failure to migrate.
Migration is a supported and key feature of xen-based guests.
This fix is in rhel5.2, and brings rhel4-xenU to equivalent, supported functionality.
This test case is a good load mixture (disk IO) and is not a contrived scenario.
We've found that the migration feature works alright with 64-bit RHEL4.
32-bit RHEL4 clearly runs but does not migrate; under the current
circumstances, our problem is solved by installing a 64-bit RHEL4 VM.
Yeah, this bug strictly affects 32-bit guests, so moving to 64-bit would
Created attachment 297808
Posted attachment for 4.7 inclusion
Posted patch for rhel4.7 inclusion.
Set dev ack for Don.
Committed in 68.22.EL. Released in 68.23.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.