+++ This bug was initially created as a clone of Bug #425471 +++

Description of problem:
My test environment is the following:

dom0: RHEL5.2 i686 kernel, 2.6.18-59.el5xen, 8GB memory, 4 physical CPUs, AMD Rev. F
domU: RHEL5.2 i686 kernel, 2.6.18-59.el5xen, 1GB memory, 4 vCPUs

Inside the domU, I run a kernel compile in a loop with "make -j5". On the dom0, I run an "xm save <dom> /var/lib/xen/save/<dom>-save ; xm restore /var/lib/xen/save/<dom>-save" in a loop. After about 100 iterations of the above save/restore loop, the save will hang, with the following messages in the domU console:

do_IRQ: stack overflow: 488
 [<c0406e3c>]
<1>BUG: unable to handle kernel paging request at virtual address 35416001

(full stack trace is attached). Note that with the same load, I do not see this problem on x86_64.

-- Additional comment from clalance on 2007-12-14 16:24 EST --
Created an attachment (id=289491)
Full stack trace when the crash occurs

-- Additional comment from ijc.uk on 2008-01-18 11:36 EST --
I think we might be seeing something similar on 2.6.18-53.1.4.el5 (plus fixes for the bugs which we have reported against 5.1) when doing migration under load. Unfortunately, either the dump_trace() code is buggy or our attempt to call it causes us to really overflow, so our backtrace is less useful than yours. I will attach it anyway.

-- Additional comment from ijc.uk on 2008-01-18 11:37 EST --
Created an attachment (id=292173)
Another similar stack trace

-- Additional comment from ddutile on 2008-01-30 15:46 EST --
Created an attachment (id=293483)
vcpu-init stack overflow patch

-- Additional comment from ddutile on 2008-01-30 15:47 EST --
Ian,
Can you give the patch in #4 a try in your test scenario? It seems to work for our test case (so far; the test loop is exceeding 140 iterations now; will let it continue to run overnight...)
- Don

-- Additional comment from clalance on 2008-01-30 16:35 EST --
The patch attached here looks like it will work, because __cpu_up() is serialized in the common kernel/cpu.c code. That being said, it looks like upstream LKML xen went a different way, and just used some kzalloc() memory in arch/x86/xen/smp.c (their equivalent to drivers/xen/core/smpboot.c). There's no real changeset to point to; it was merged into Linus's tree that way. Anyway, just thought I'd point this out, since this way has some additional upstream testing behind it.

Chris Lalancette

-- Additional comment from ddutile on 2008-01-30 17:09 EST --
Created an attachment (id=293499)
Updated/new patch based on LKML version; note: used kmalloc & memset so it works on older linux versions.

-- Additional comment from ddutile on 2008-01-30 17:25 EST --
Created an attachment (id=293504)
oops... last patch had incorrect memset; caught it before starting overnight test..... nothing like exchanging one stack overflow for a stack trashing! ;-)

-- Additional comment from ijc.uk on 2008-01-31 09:20 EST --
Latest patch crashes on boot for me. I thought at first it was because:
    BUG_ON(HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, &ctxt));
should be
    BUG_ON(HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, ctxt));
But changing that didn't fix it.

Also, I think the "memset(&ctxt->fpu_ctxt, 0, sizeof(ctxt->fpu_ctxt));" is unnecessary, since this will be cleared as part of the memset over the entire ctxt. Shouldn't hurt to do it though.

-- Additional comment from ddutile on 2008-01-31 10:05 EST --
Created an attachment (id=293587)
a fix to the last fix!

Ian,
yes, the hypervisor call bugs out.... and this piece of code isn't much better either:
    kmalloc(sizeof(*ctxt), GFP_KERNEL);   <-- ctxt isn't set equal to return!

Turns out I did an overnight test w/the static, and that ran for over 600 iterations, so the concept of reducing stack size definitely solves the problem. Now, to implement the patch correctly! ;-)

So, sorry for the late-day mess-up. Try this updated patch. If that runs long today, I'll post it on xen-devel, as well as dupe it for rhel4.
- Don
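For readers following the patch iterations above, here is a minimal user-space sketch of the heap-allocation pattern being discussed. It is not the attached patch and not kernel code: malloc() and free() stand in for kmalloc() and kfree(), and every name prefixed with demo_ (the structure, its field sizes, the fake hypercall, and the use of ctrlreg[0] as a sanity check) is a hypothetical stand-in invented for illustration. It only shows the three points raised in the comments: assign the allocator's return value, clear the whole structure with a single memset, and pass the pointer itself rather than its address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the guest context structure. The field sizes
 * are made up; the point is only that it is too large to want on a small
 * kernel stack. */
struct demo_vcpu_ctxt {
    unsigned char fpu_ctxt[512];
    unsigned long user_regs[32];
    unsigned long ctrlreg[8];
    unsigned char pad[2048];
};

/* Hypothetical stand-in for HYPERVISOR_vcpu_op(VCPUOP_initialise, ...).
 * Returns 0 if the context looks sane, nonzero otherwise. */
static int demo_vcpu_op(int cpu, const struct demo_vcpu_ctxt *ctxt)
{
    size_t i;

    assert(ctxt != NULL);
    for (i = 0; i < sizeof(ctxt->fpu_ctxt); i++)
        if (ctxt->fpu_ctxt[i] != 0)
            return -1;          /* FPU area was not cleared */
    /* Arbitrary check for this demo: ctrlreg[0] must carry the cpu id. */
    return ctxt->ctrlreg[0] == (unsigned long)cpu ? 0 : -1;
}

/* Returns 0 on success, -1 on allocation or "hypercall" failure. */
int demo_vcpu_initialise(int cpu)
{
    struct demo_vcpu_ctxt *ctxt;
    int rc;

    /* Assign the allocator's return value; dropping it leaves ctxt
     * uninitialised. */
    ctxt = malloc(sizeof(*ctxt));
    if (ctxt == NULL)
        return -1;

    /* One memset over the whole structure also clears fpu_ctxt, so a
     * second memset of that member alone is redundant. */
    memset(ctxt, 0, sizeof(*ctxt));
    ctxt->ctrlreg[0] = (unsigned long)cpu;

    /* ctxt is already a pointer: pass ctxt, not &ctxt. */
    rc = demo_vcpu_op(cpu, ctxt);

    free(ctxt);
    return rc;
}
```

Calling demo_vcpu_initialise(0) from a small main() should return 0. The design point is that the large structure lives on the heap only for the duration of the call, so the caller's stack frame holds just one pointer.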
Product Management has reviewed and declined this request. You may appeal this decision by reopening this request.
Sigh. The stupid bot strikes again. Re-opening. Chris Lalancette
why is the bug set to priority "low"?
Changed priority to 'high'. Failure to save/restore also means failure to migrate, and migration is a supported, key feature of xen-based guests. This fix is in rhel5.2, and brings rhel4-xenU to equivalent, supported functionality. This test case is a good load mixture (disk IO) and is not a contrived scenario.
We've found that the migration feature works fine with 64-bit RHEL4. The 32-bit RHEL4 guest clearly runs but does not migrate; for our current circumstances, the problem is solved by installing a 64-bit RHEL4 VM.
Yeah, this bug strictly affects 32-bit guests, so moving to 64-bit would alleviate this. Chris Lalancette
Created attachment 297808
Posted attachment for 4.7 inclusion

Posted patch for rhel4.7 inclusion.
Set dev ack for Don.
Committed in 68.22.EL. Released in 68.23.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2008-0665.html