Bug 663881

Summary: RHEL6 guest hangs after migration
Product: Red Hat Enterprise Linux 6
Reporter: Mark Wu <dwu>
Component: kernel
Assignee: Red Hat Kernel Manager <kernel-mgr>
Status: CLOSED DUPLICATE
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium
Priority: low
Version: 6.0
CC: drjones, mrezanin, pbonzini, xen-maint
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2011-01-14 07:21:04 UTC
Bug Blocks: 653816
Attachments:
  debugging patch (flags: none)
  kernel log on domU after the first migration (flags: none)

Comment 2 Paolo Bonzini 2010-12-17 13:29:49 UTC
Migration works differently depending on whether the guest is HVM or PV (HVM with PV drivers counts as PV).

For HVM guests without PV drivers Xen does nothing special: it just copies the state, destroys the domain, and restarts it on the destination side.

This is a PV guest though, and in that case we do *expect* a "suspending xenstore..." message and expect the host to write to "control/shutdown": Xen writes "suspend" to "control/shutdown" to signify that the (optionally) live part of the migration is being completed.  The code for handling the write of "suspend" looks like this:

    suspend_everything ();
    HYPERCALL_suspend ();
    resume_everything ();

suspend_everything will ensure that the PV drivers are quiescent and save any state that is needed upon resume (Xenstore entries are not transmitted, so the PV frontend/backend handshake has to be redone and any pending I/O restarted).  As soon as the guest calls the hypercall, Xen pauses the domain, completes the migration and destroys the original domain.

The other side restores the domain context and resumes after the hypercall, so the migration destination will go on executing resume_everything().  It seems to me that something was not suspended or resumed successfully.
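
To make the sequence above concrete, here is a rough user-space model of it. This is only a sketch: struct pv_guest, its fields, hypercall_suspend() and handle_suspend_request() are hypothetical names made up for illustration, not kernel or Xen interfaces, and the "migration" is reduced to changing a host name. It shows the one property that matters here: the code after the hypercall runs on the destination and has to rebuild whatever was torn down before it.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical model of a PV guest; not a real kernel structure. */
struct pv_guest {
    char host[16];            /* host currently running the guest */
    bool frontend_connected;  /* PV frontend/backend handshake done */
    int  pending_io;          /* requests in flight on the PV device */
    int  saved_io;            /* requests remembered across the suspend */
};

/* Quiesce the PV drivers and save what will be needed after resume. */
static void suspend_everything(struct pv_guest *g)
{
    g->saved_io = g->pending_io;
    g->pending_io = 0;
    g->frontend_connected = false;
}

/* Stand-in for the suspend hypercall: the domain is paused, moved,
 * and continues on the destination right after this call returns. */
static void hypercall_suspend(struct pv_guest *g, const char *dest)
{
    snprintf(g->host, sizeof g->host, "%s", dest);
}

/* Redo the handshake and restart the I/O that was saved. */
static void resume_everything(struct pv_guest *g)
{
    g->frontend_connected = true;
    g->pending_io = g->saved_io;
    g->saved_io = 0;
}

/* Model of the handler for a write of "suspend" to control/shutdown. */
static void handle_suspend_request(struct pv_guest *g, const char *dest)
{
    suspend_everything(g);
    hypercall_suspend(g, dest);
    resume_everything(g);     /* already running on the destination */
}
```

If either suspend_everything() or resume_everything() misses a piece of state in this model, the guest ends up on the new host with a disconnected frontend or lost I/O, which is the kind of failure suspected in this bug.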

====

What is not normal is seeing five "suspending xenstore..." messages.  Would it be possible to apply the following patch and try reproducing this again with printk timestamps enabled (printk.time=1)?

diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index f5162e4..30712d9 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -97,9 +97,11 @@
 
 static void do_suspend(void)
 {
+	static int depth;
 	int err;
 	int cancelled = 1;
 
+	depth++;
 	shutting_down = SHUTDOWN_SUSPEND;
 
 	err = stop_machine_create();
@@ -125,7 +127,7 @@
 		goto out_thaw;
 	}
 
-	printk(KERN_DEBUG "suspending xenstore...\n");
+	printk(KERN_DEBUG "suspending xenstore %d...\n", depth);
 	xs_suspend();
 
 	err = dpm_suspend_noirq(PMSG_SUSPEND);
@@ -168,6 +170,7 @@
 
 out:
 	shutting_down = SHUTDOWN_INVALID;
+	depth--;
 }
 #endif	/* CONFIG_PM_SLEEP */
 
@@ -178,6 +181,7 @@
 	struct xenbus_transaction xbt;
 	int err;
 
+	printk(KERN_DEBUG "Triggered shutdown watch\n");
 	if (shutting_down != SHUTDOWN_INVALID)
 		return;
 
@@ -201,6 +205,7 @@
 		goto again;
 	}
 
+	printk(KERN_DEBUG "Got shutdown request %s\n", str);
 	if (strcmp(str, "poweroff") == 0 ||
 	    strcmp(str, "halt") == 0) {
 		shutting_down = SHUTDOWN_POWEROFF;
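
As I read it, the point of the static depth counter in the patch is to tell nested suspends apart from repeated ones: sequential calls all print depth 1, while re-entrant calls print 2, 3, and so on. The user-space sketch below illustrates that idea only. The names guard_enabled, pending_events, max_depth, do_suspend_model() and shutdown_handler_model() are assumptions for the demo, the enum values are illustrative, and the re-entry is simulated by calling the handler from inside the suspend path.

```c
#include <stdbool.h>
#include <stdio.h>

/* Illustrative values only; not taken from the kernel headers. */
enum shutdown_state { SHUTDOWN_INVALID, SHUTDOWN_SUSPEND };

static enum shutdown_state shutting_down = SHUTDOWN_INVALID;
static bool guard_enabled = true;  /* hypothetical switch for the demo */
static int pending_events;         /* watch events firing mid-suspend */
static int max_depth;              /* deepest nesting observed */

static void shutdown_handler_model(void);

static void do_suspend_model(void)
{
    static int depth;              /* counts calls currently active */

    depth++;
    if (depth > max_depth)
        max_depth = depth;
    shutting_down = SHUTDOWN_SUSPEND;
    printf("suspending xenstore %d...\n", depth);

    /* Simulate another shutdown watch event arriving right now. */
    if (pending_events > 0) {
        pending_events--;
        shutdown_handler_model();
    }

    shutting_down = SHUTDOWN_INVALID;
    depth--;
}

static void shutdown_handler_model(void)
{
    /* Same shape as the guard in the patched handler: ignore the
     * request while a shutdown is already in progress. */
    if (guard_enabled && shutting_down != SHUTDOWN_INVALID)
        return;
    do_suspend_model();
}
```

With the guard in place the depth never exceeds 1 in this model, however many events arrive. With the guard disabled and four extra events, the messages count up to 5, which is the pattern the patched printk would expose if the suspends in the real log are nested.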

===

Also, a couple of questions just to be sure:

> 1. guest "vmtst.uark.edu" migrated from cvprd3 to cvprd1 at 17:03 CST
> 2. guest successfully restored on cvprd3 at 17:04:35

Did you mean on cvprd1 here?

> 4. At this stage we crashed the guest (at 17:07:56) and collected the dump
> 6. After migrating back to cvprd3, the guest works fine again

So this means the problem is reproducible, and the two-step cvprd3->cvprd1->cvprd3 migration was done in a separate run?

Comment 3 Paolo Bonzini 2010-12-17 13:31:20 UTC
Created attachment 469361
debugging patch

for (i = 0; i < 500; i++)
    puts("I will not paste patches into Bugzilla");

Comment 6 Andrew Jones 2011-01-10 15:22:18 UTC
I believe this has been fixed upstream. On machines that had similar symptoms (could migrate in one direction, but not the other) with the RHEL6 kernel, I was able to migrate in both directions with a Fedora 15 kernel. I need to figure out which patches fix it.

Comment 9 Andrew Jones 2011-01-11 19:37:28 UTC
In bug 663755 comment 20 I've pointed to an upstream patch that I believe will fix this issue. I can make a test kernel rpm available for the customer if they would like to run it. I'll build it tomorrow morning.

Comment 12 Mark Wu 2011-01-14 06:13:20 UTC
Andrew,
The customer has verified that this patch fixes the problem.
Many thanks!

Comment 13 Andrew Jones 2011-01-14 07:21:04 UTC

*** This bug has been marked as a duplicate of bug 663755 ***