Bug 425471
| Summary: | [RHEL5.2]: Under load, an i386 PV guest on i386 HV will hang during save/restore | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Chris Lalancette <clalance> |
| Component: | kernel-xen | Assignee: | Don Dutile (Red Hat) <ddutile> |
| Status: | CLOSED ERRATA | QA Contact: | Martin Jenner <mjenner> |
| Severity: | low | Docs Contact: | |
| Priority: | low | ||
| Version: | 5.2 | CC: | ddutile, ijc, xen-maint |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Fixed In Version: | RHBA-2008-0314 | Doc Type: | Bug Fix |
| Last Closed: | 2008-05-21 15:04:20 UTC | Type: | --- |
| Bug Blocks: | 431081 | ||
Description
Chris Lalancette
2007-12-14 21:24:26 UTC
Created attachment 289491 [details]
Full stack trace when the crash occurs
I think we might be seeing something similar on 2.6.18-53.1.4.el5 (plus fixes for the bugs which we have reported against 5.1) when doing migration under load. Unfortunately, either the dump_trace() code is buggy or our attempt to call it causes us to really overflow, so our backtrace is less useful than yours. I will attach it anyway.
Created attachment 292173 [details]
Another similar stack trace
Created attachment 293483 [details]
vcpu-init stack overflow patch
Ian,
Can you give the patch in #4 a try in your test scenario? It seems to work for our test case (so far; test loop exceeding 140 now; will let it continue to run overnight...)
- Don
The patch attached here looks like it will work, because __cpu_up() is serialized in the common kernel/cpu.c code. That being said, it looks like upstream LKML xen went a different way, and just used some kzalloc() memory in arch/x86/xen/smp.c (their equivalent to drivers/xen/core/smpboot.c). There's no real changeset to point to; it was merged into Linus's tree that way. Anyway, just thought I'd point this out, since this way has some additional upstream testing behind it.
Chris Lalancette
Created attachment 293499 [details]
Updated/new patch based on LKML version; note: used kmalloc & memset so it works on older linux versions.
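The "kmalloc & memset" substitution for kzalloc() mentioned above can be illustrated with a small userspace analogue. This is only a sketch, not the attached patch: `struct demo_ctxt`, `zalloc_compat()` and `is_all_zero()` are hypothetical names, and malloc() stands in for kmalloc() so the snippet compiles outside the kernel.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the large per-vCPU context structure. */
struct demo_ctxt {
    unsigned char fpu_ctxt[512];
    unsigned long user_regs[32];
    unsigned long flags;
};

/*
 * Userspace analogue of kmalloc() followed by memset():
 * returns zero-filled memory, or NULL if the allocation fails.
 */
static void *zalloc_compat(size_t size)
{
    void *p = malloc(size);

    if (p != NULL)
        memset(p, 0, size);
    return p;
}

/* Returns 1 if every byte of the buffer is zero, 0 otherwise. */
static int is_all_zero(const void *buf, size_t size)
{
    const unsigned char *b = buf;
    size_t i;

    for (i = 0; i < size; i++)
        if (b[i] != 0)
            return 0;
    return 1;
}
```

A caller would write `struct demo_ctxt *c = zalloc_compat(sizeof(*c));`, check the result for NULL, and free it when done. The point is that the buffer lives on the heap rather than on the caller's stack.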
Created attachment 293504 [details]
oops... last patch had incorrect memset; caught it before starting overnight test..... nothing like exchanging one stack overflow for a stack trashing! ;-)
Latest patch crashes on boot for me. I thought at first it was because:
BUG_ON(HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, &ctxt));
should be
BUG_ON(HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, ctxt));
But changing that didn't fix it.
Also, I think the "memset(&ctxt->fpu_ctxt, 0, sizeof(ctxt->fpu_ctxt));" is unnecessary, since this will be cleared as part of the memset over the entire ctxt. Shouldn't hurt to do it though.
Created attachment 293587 [details]
a fix to the last fix!
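The `&ctxt` versus `ctxt` point raised above can be shown in userspace. When ctxt changes from a stack structure to a pointer, `&ctxt` becomes the address of the local pointer variable rather than the address of the context buffer. The sketch below assumes a hypothetical `stub_got_context()` in place of the real hypercall; it only compares addresses and does not model what the hypervisor does with the argument.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-vCPU context structure. */
struct demo_ctxt {
    unsigned char fpu_ctxt[512];
    unsigned long flags;
};

/*
 * Hypothetical stub in place of the hypercall: reports whether the
 * argument it was handed is the context buffer itself.
 */
static int stub_got_context(const void *arg, const struct demo_ctxt *expected)
{
    return arg == (const void *)expected;
}

/*
 * Returns 1 when passing the pointer reaches the buffer and passing the
 * pointer's address does not; returns -1 if the allocation fails.
 */
static int demo_pointer_vs_address(void)
{
    struct demo_ctxt *ctxt = calloc(1, sizeof(*ctxt));
    int right, wrong;

    if (ctxt == NULL)
        return -1;
    right = stub_got_context(ctxt, ctxt);   /* ctxt: address of the buffer */
    wrong = stub_got_context(&ctxt, ctxt);  /* &ctxt: address of the local pointer */
    free(ctxt);
    return right && !wrong;
}
```

As noted in the comment above, correcting this call alone did not cure the boot crash, so the sketch covers only this one mistake.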
Ian,
yes, the hypervisor call bugs out.... and this piece of code isn't much
better either:
kmalloc(sizeof(*ctxt), GFP_KERNEL); <-- ctxt isn't set equal to the return value!
turns out i did an overnight test w/the static, and that ran for over 600
iterations, so the concept of reducing stack size definitely solves the
problem.
Now, to implement the patch correctly! ;-)
So, sorry for the late day mess up.
Try this updated patch. if that runs long today, I'll post it on xen-devel,
as well as dupe it for rhel4.
- Don
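The corrected flow Don describes (assign the allocation result, clear the whole structure once, pass the pointer, then free it) can be sketched in userspace as follows. This is an illustration under assumptions, not the posted patch: `struct demo_ctxt`, `stub_vcpu_op()` and `init_vcpu_sketch()` are hypothetical names, malloc() stands in for kmalloc(), and the stub is assumed to return 0 on success, as the BUG_ON() around the real call suggests.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the large per-vCPU context structure. */
struct demo_ctxt {
    unsigned char fpu_ctxt[512];
    unsigned long user_regs[32];
    unsigned long flags;
};

/* Hypothetical stub for the hypercall; returns 0 on success, -1 otherwise. */
static int stub_vcpu_op(int cpu, const struct demo_ctxt *ctxt)
{
    return (cpu >= 0 && ctxt != NULL && ctxt->flags == 1UL) ? 0 : -1;
}

/* Initialises one vCPU using a heap buffer instead of a large stack variable. */
static int init_vcpu_sketch(int cpu)
{
    struct demo_ctxt *ctxt;
    int rc;

    ctxt = malloc(sizeof(*ctxt));      /* the result must be assigned to ctxt */
    if (ctxt == NULL)
        return -1;
    memset(ctxt, 0, sizeof(*ctxt));    /* clears every field, fpu_ctxt included */
    ctxt->flags = 1UL;                 /* placeholder for the real field setup */
    rc = stub_vcpu_op(cpu, ctxt);      /* pass ctxt, not &ctxt */
    free(ctxt);
    return rc;
}
```

Only a pointer remains on the stack in this version. The static-variable variant that ran for over 600 iterations reaches the same goal without an allocation.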
Setting flags and assigning.
Patch in #10 posted.
in 2.6.18-78.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHBA-2008-0314.html