Description of problem:
My test environment is the following:

dom0: RHEL5.2 i686 kernel, 2.6.18-59.el5xen, 8GB memory, 4 physical CPUs, AMD Rev. F
domU: RHEL5.2 i686 kernel, 2.6.18-59.el5xen, 1GB memory, 4 vCPUs

Inside the domU, I run a kernel compile in a loop with "make -j5". On the dom0, I run an "xm save <dom> /var/lib/xen/save/<dom>-save ; xm restore /var/lib/xen/save/<dom>-save" in a loop. After about 100 iterations of the above save/restore loop, the save will hang, with the following messages in the domU console:

do_IRQ: stack overflow: 488
 [<c0406e3c>] <1>BUG: unable to handle kernel paging request at virtual address 35416001

(full stack trace is attached). Note that with the same load, I do not see this problem on x86_64.
Created attachment 289491 [details] Full stack trace when the crash occurs
I think we might be seeing something similar on 2.6.18-53.1.4.el5 (plus fixes for the bugs which we have reported against 5.1) when doing migration under load. Unfortunately, either the dump_trace() code is buggy or our attempt to call it causes us to really overflow, so our backtrace is less useful than yours. I will attach it anyway.
Created attachment 292173 [details] Another similar stack trace
Created attachment 293483 [details] vcpu-init stack overflow patch
Ian, can you give the patch in #4 a try in your test scenario? It seems to work for our test case (so far; the test loop is past 140 iterations now; will let it continue to run overnight...) - Don
The patch attached here looks like it will work, because __cpu_up() is serialized in the common kernel/cpu.c code. That being said, it looks like upstream LKML xen went a different way, and just used some kzalloc() memory in arch/x86/xen/smp.c (their equivalent to drivers/xen/core/smpboot.c). There's no real changeset to point to; it was merged into Linus's tree that way. Anyway, just thought I'd point this out, since this way has some additional upstream testing behind it. Chris Lalancette
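For illustration only, here is a minimal userspace sketch of the two approaches contrasted above: a static buffer (safe only while callers are serialized, as __cpu_up() is) versus a zeroed heap allocation. This is not the actual patch. struct big_ctxt, fake_vcpu_initialise(), init_cpu_static() and init_cpu_heap() are hypothetical names, and calloc() stands in for the kernel's kzalloc().

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a large per-vCPU context structure.
 * Declared as a local variable, it would use several KB of stack. */
struct big_ctxt {
    unsigned char fpu_ctxt[512];
    unsigned long user_regs[32];
    unsigned long trap_ctxt[512];
    unsigned long flags;
};

/* Hypothetical stand-in for the hypercall: it only inspects its
 * arguments and returns 0 on success, -1 on bad input. */
static int fake_vcpu_initialise(int cpu, const struct big_ctxt *ctxt)
{
    return (ctxt != NULL && cpu >= 0) ? 0 : -1;
}

/* Static variant: the structure lives outside the stack, but the single
 * shared buffer is only safe if callers never run concurrently. */
int init_cpu_static(int cpu)
{
    static struct big_ctxt ctxt;

    memset(&ctxt, 0, sizeof(ctxt));
    ctxt.flags = 1;
    return fake_vcpu_initialise(cpu, &ctxt);
}

/* Heap variant: only a pointer sits on the stack. calloc() zeroes the
 * memory, playing the role kzalloc() plays in kernel code. */
int init_cpu_heap(int cpu)
{
    struct big_ctxt *ctxt = calloc(1, sizeof(*ctxt));
    int rc;

    if (ctxt == NULL)
        return -1;
    ctxt->flags = 1;
    rc = fake_vcpu_initialise(cpu, ctxt);
    free(ctxt);
    return rc;
}
```

Either variant keeps the large structure off the stack; the heap variant does not rely on callers being serialized, at the cost of an allocation that can fail.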
Created attachment 293499 [details] Updated/new patch based on LKML version; note: used kmalloc & memset so it works on older linux versions.
Created attachment 293504 [details] oops... last patch had incorrect memset; caught it before starting overnight test..... nothing like exchanging one stack overflow for a stack trashing! ;-)
Latest patch crashes on boot for me. I thought at first it was because
BUG_ON(HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, &ctxt));
should be
BUG_ON(HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, ctxt));
but changing that didn't fix it. Also, I think the "memset(&ctxt->fpu_ctxt, 0, sizeof(ctxt->fpu_ctxt));" is unnecessary since this will be cleared as part of the memset over the entire ctxt. Shouldn't hurt to do it though.
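As a small userspace sketch of the two points in this comment (not the kernel code itself): once ctxt becomes a pointer, &ctxt is the address of the pointer variable rather than of the structure, and a memset over the whole structure already clears the fpu_ctxt member. struct demo_ctxt and both function names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, much smaller stand-in for the real context structure. */
struct demo_ctxt {
    unsigned char fpu_ctxt[64];
    unsigned long flags;
};

/* Returns 1 if &ctxt and ctxt are different addresses, which is always
 * the case when ctxt is a pointer to a separately allocated object.
 * Passing &ctxt to a callee expecting the structure would hand it the
 * location of the local pointer variable instead. */
int address_of_pointer_differs(void)
{
    struct demo_ctxt *ctxt = malloc(sizeof(*ctxt));
    int differs;

    if (ctxt == NULL)
        return -1;
    differs = ((void *)&ctxt != (void *)ctxt);
    free(ctxt);
    return differs;
}

/* Returns 1 if a single memset over the whole structure leaves every
 * byte of fpu_ctxt zero, making a second per-member memset redundant. */
int whole_memset_clears_fpu(void)
{
    struct demo_ctxt *ctxt = malloc(sizeof(*ctxt));
    size_t i;
    int all_zero = 1;

    if (ctxt == NULL)
        return -1;
    memset(ctxt, 0xff, sizeof(*ctxt));  /* dirty the memory first */
    memset(ctxt, 0, sizeof(*ctxt));     /* one memset over the entire ctxt */
    for (i = 0; i < sizeof(ctxt->fpu_ctxt); i++) {
        if (ctxt->fpu_ctxt[i] != 0)
            all_zero = 0;
    }
    free(ctxt);
    return all_zero;
}
```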
Created attachment 293587 [details] a fix to the last fix!
Ian, yes, the hypervisor call bugs out... and this piece of code isn't much better either:
kmalloc(sizeof(*ctxt), GFP_KERNEL); <-- ctxt isn't assigned the return value!
It turns out I did an overnight test with the static version, and that ran for over 600 iterations, so the concept of reducing stack usage definitely solves the problem. Now, to implement the patch correctly! ;-) So, sorry for the late-day mess-up. Try this updated patch. If that runs long today, I'll post it on xen-devel, as well as dupe it for RHEL4. - Don
Setting flags and assigning.
Patch in #10 posted.
The patch is in 2.6.18-78.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0314.html