Bug 663881
| Summary: | RHEL6 guest hangs after migration | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Mark Wu <dwu> |
| Component: | kernel | Assignee: | Red Hat Kernel Manager <kernel-mgr> |
| Status: | CLOSED DUPLICATE | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | 6.0 | CC: | drjones, mrezanin, pbonzini, xen-maint |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-01-14 07:21:04 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 653816 | | |
| Attachments: | | | |
Created attachment 469361 [details]
debugging patch

```c
for (i = 0; i < 500; i++)
    puts("I will not paste patches into Bugzilla");
```

I believe this has been fixed upstream. On machines that had similar symptoms (could migrate in one direction, but not the other) with the RHEL6 kernel, I was able to migrate in both directions with a Fedora 15 kernel. I need to figure out which patches fix it.

In bug 663755 comment 20 I've pointed to an upstream patch that I believe will fix this issue. I can make a test kernel rpm available for the customer if they would like to run it. I'll build it tomorrow morning.

Andrew,

The customer has verified that this patch fixes the problem. Many thanks!

*** This bug has been marked as a duplicate of bug 663755 ***
Migration works differently depending on whether the guest is HVM or PV (HVM with PV drivers counts as PV). For HVM guests without PV drivers, Xen does nothing special: it just copies the state, destroys the domain, and restarts it on the destination side.

This is a PV guest, though, and in this case we in fact *expect* a "suspending xenstore..." message and expect the host to write to "control/shutdown". Here, Xen will write "suspend" to "control/shutdown" to signify that the (optionally) live part of the migration is being completed. The code for handling the write of "suspend" looks like:

```c
suspend_everything();
HYPERCALL_suspend();
resume_everything();
```

suspend_everything() ensures that the PV drivers are quiescent and saves any state that it needs upon resume (this is because xenstore entries are not transmitted, so the PV frontend/backend handshake has to be redone and any pending I/O restarted). As soon as the guest issues the hypercall, Xen pauses the domain, completes the migration, and destroys the original domain. The other side restores the domain context and resumes after the hypercall, so the migration destination goes on executing resume_everything(). It seems to me that something was not suspended or resumed successfully.

====

What is not normal is seeing five "suspending xenstore..." messages. Would it be possible to add the following patch and try reproducing this again with printk_time=1?
```diff
diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index f5162e4..30712d9 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -97,9 +97,11 @@
 static void do_suspend(void)
 {
+	static int depth;
 	int err;
 	int cancelled = 1;

+	depth++;
 	shutting_down = SHUTDOWN_SUSPEND;

 	err = stop_machine_create();
@@ -125,7 +127,7 @@
 		goto out_thaw;
 	}

-	printk(KERN_DEBUG "suspending xenstore...\n");
+	printk(KERN_DEBUG "suspending xenstore %d...\n", depth);
 	xs_suspend();

 	err = dpm_suspend_noirq(PMSG_SUSPEND);
@@ -168,6 +170,7 @@
 out:
 	shutting_down = SHUTDOWN_INVALID;
+	depth--;
 }
 #endif /* CONFIG_PM_SLEEP */
@@ -178,6 +181,7 @@
 	struct xenbus_transaction xbt;
 	int err;

+	printk(KERN_DEBUG "Triggered shutdown watch\n");
 	if (shutting_down != SHUTDOWN_INVALID)
 		return;
@@ -201,6 +205,7 @@
 		goto again;
 	}

+	printk(KERN_DEBUG "Got shutdown request %s\n", str);
 	if (strcmp(str, "poweroff") == 0 ||
 	    strcmp(str, "halt") == 0) {
 		shutting_down = SHUTDOWN_POWEROFF;
```

===

Also, a couple of questions just to be sure:

> 1. guest "vmtst.uark.edu" migrated from cvprd3 to cvprd1 at 17:03 CST
> 2. guest successfully restored on cvprd3 at 17:04:35

Did you mean on cvprd1 here?

> 4. At this stage we crashed the guest (at 17:07:56) and collected the dump
> 6. After migrating back to cvprd3, the guest works fine again

So this means the problem is reproducible, and the two-step cvprd3->cvprd1->cvprd3 migration was done in another step?