Description of problem:
My test environment is the following:
dom0: RHEL5.2 i686 kernel, 2.6.18-59.el5xen, 8GB memory, 4 physical CPUs, AMD Rev. F
domU: RHEL5.2 i686 kernel, 2.6.18-59.el5xen, 1GB memory, 4 vCPUs
Inside the domU, I run a kernel compile in a loop with "make -j5".
On the dom0, I run "xm save <dom> /var/lib/xen/save/<dom>-save ; xm restore
/var/lib/xen/save/<dom>-save" in a loop.
After about 100 iterations of the above save/restore loop, the save will hang,
with the following messages in the domU console:
do_IRQ: stack overflow: 488
[<c0406e3c>] <1>BUG: unable to handle kernel paging request at virtual address
(full stack trace is attached). Note that with the same load, I do not see this
problem on x86_64.
Created attachment 289491
Full stack trace when the crash occurs
I think we might be seeing something similar on 2.6.18-53.1.4.el5 (plus fixes
for the bugs which we have reported against 5.1) when doing migration under load.
Unfortunately, either the dump_trace() code is buggy or our attempt to call it
causes us to really overflow, so our backtrace is less useful than yours. I will
attach it anyway.
Created attachment 292173
Another similar stack trace
Created attachment 293483
vcpu-init stack overflow patch
Can you give the patch in #4 a try in your test scenario?
Seems to work for our test case (so far; the test loop has exceeded 140
iterations now; will let it continue to run overnight...)
The patch attached here looks like it will work, because __cpu_up() is
serialized in the common kernel/cpu.c code. That being said, it looks like
upstream LKML xen went a different way, and just used some kzalloc() memory in
arch/x86/xen/smp.c (their equivalent to drivers/xen/core/smpboot.c). There's no
real changeset to point to; it was merged into Linus's tree that way. Anyway,
just thought I'd point this out, since this way has some additional upstream
testing behind it.
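As a rough userspace illustration of the two approaches discussed above (this is not the actual patch; `struct big_ctxt`, its size, and the function names are hypothetical stand-ins for the real vcpu context and setup code), a large structure can be kept off the stack either by making it static, which is only safe when callers are serialized, or by allocating it zeroed on the heap, with `calloc()` playing the role that `kzalloc()` plays in the kernel:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a large per-vcpu context structure. */
struct big_ctxt {
    unsigned long regs[512];
    unsigned long flags;
};

/* Hypothetical consumer; takes a pointer to the context, much like
 * a hypercall argument would. */
static unsigned long consume_ctxt(const struct big_ctxt *ctxt)
{
    unsigned long sum = ctxt->flags;
    size_t i;

    for (i = 0; i < sizeof(ctxt->regs) / sizeof(ctxt->regs[0]); i++)
        sum += ctxt->regs[i];
    return sum;
}

/* Approach 1: static storage. The structure no longer lives on the
 * stack, but this is only safe if callers never run concurrently. */
unsigned long init_with_static(unsigned long flags)
{
    static struct big_ctxt ctxt;

    memset(&ctxt, 0, sizeof(ctxt));
    ctxt.flags = flags;
    return consume_ctxt(&ctxt);
}

/* Approach 2: zeroed heap allocation. No serialization assumption,
 * at the cost of an allocation that can fail. */
int init_with_heap(unsigned long flags, unsigned long *out)
{
    struct big_ctxt *ctxt = calloc(1, sizeof(*ctxt));

    if (ctxt == NULL)
        return -1;
    ctxt->flags = flags;
    *out = consume_ctxt(ctxt);  /* ctxt is already a pointer: pass ctxt, not &ctxt */
    free(ctxt);
    return 0;
}
```

Either variant moves the structure out of the calling function's stack frame; the heap variant simply does not depend on the callers being serialized.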
Created attachment 293499
Updated/new patch based on LKML version; note: used kmalloc & memset so it works on older linux versions.
Created attachment 293504
oops... last patch had incorrect memset; caught it before starting overnight test..... nothing like exchanging one stack overflow for a stack trashing! ;-)
Latest patch crashes on boot for me.
I thought at first it was because:
BUG_ON(HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, &ctxt));
should now be:
BUG_ON(HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, ctxt));
But changing that didn't fix it.
Also I think the "memset(&ctxt->fpu_ctxt, 0, sizeof(ctxt->fpu_ctxt));" is
unnecessary since this will be cleared as part of the memset over the entire ctxt.
Shouldn't hurt to do it though.
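To make the memset point concrete, here is a minimal userspace sketch (the struct layouts and the function name are made up for illustration; they are not the real Xen definitions). Once the whole allocation is cleared with the size of the structure, not the size of the pointer, the embedded fpu_ctxt member is already zero and a second memset on it adds nothing:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified layouts for illustration only. */
struct fpu_state {
    unsigned char x[64];
};

struct ctxt_demo {
    struct fpu_state fpu_ctxt;
    unsigned long flags;
};

/* Returns 1 if every byte of fpu_ctxt is zero after a single memset
 * over the whole structure, 0 if not, -1 on allocation failure. */
int fpu_cleared_by_full_memset(void)
{
    struct ctxt_demo *ctxt = malloc(sizeof(*ctxt));
    int ok = 1;
    size_t i;

    if (ctxt == NULL)
        return -1;

    /* Fill with a nonzero pattern to simulate uninitialized memory. */
    memset(ctxt, 0xff, sizeof(*ctxt));

    /* Clear the whole structure: the address held in the pointer and
     * sizeof(*ctxt), the size of the structure it points to. */
    memset(ctxt, 0, sizeof(*ctxt));

    for (i = 0; i < sizeof(ctxt->fpu_ctxt.x); i++)
        if (ctxt->fpu_ctxt.x[i] != 0)
            ok = 0;

    free(ctxt);
    return ok;
}
```

Note that with a pointer, `sizeof(ctxt)` is the size of the pointer itself and `&ctxt` is the address of the pointer variable, so both are easy to get wrong when converting code from an on-stack structure to an allocated one.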
Created attachment 293587
a fix to the last fix!
yes, the hypervisor call bugs out.... and this piece of code isn't much
kmalloc(sizeof(*ctxt), GFP_KERNEL); <-- ctxt isn't set equal to the kmalloc() result
Turns out I did an overnight test w/ the static, and that ran for over 600
iterations, so the concept of reducing stack size definitely solves the problem.
Now, to implement the patch correctly! ;-)
So, sorry for the late day mess up.
Try this updated patch. If that runs long today, I'll post it on xen-devel,
as well as dupe it for rhel4.
Setting flags and assigning.
Patch in #10 posted.
You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.