Migration works differently depending on whether the guest is HVM or PV (an HVM guest with PV drivers counts as PV).
For HVM guests without PV drivers, Xen does nothing special: it just copies the state, destroys the domain, and restarts it on the destination side.
This is a PV guest, though, and in that case we *expect* a "suspending xenstore..." message and expect the host to write to "control/shutdown". Xen writes "suspend" to "control/shutdown" to signify that the (optional) live part of the migration is being completed. The code that handles the write of "suspend" looks like:
suspend_everything ();
HYPERCALL_suspend ();
resume_everything ();
suspend_everything() ensures that the PV drivers are quiescent and saves any state that will be needed on resume (Xenstore entries are not transmitted, so the PV frontend/backend handshake has to be redone and any pending I/O restarted). As soon as the guest makes the hypercall, Xen pauses the domain, completes the migration, and destroys the original domain.
The other side restores the domain context and resumes after the hypercall, so the migration destination goes on executing resume_everything(). It seems to me that something was not suspended or resumed successfully.
====
What is not normal is seeing five "suspending xenstore..." messages. Would it be possible to apply the following patch and try to reproduce this again with printk_time=1?
diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index f5162e4..30712d9 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -97,9 +97,11 @@
 static void do_suspend(void)
 {
+	static int depth;
 	int err;
 	int cancelled = 1;
+	depth++;
 	shutting_down = SHUTDOWN_SUSPEND;
 	err = stop_machine_create();
@@ -125,7 +127,7 @@
 		goto out_thaw;
 	}
-	printk(KERN_DEBUG "suspending xenstore...\n");
+	printk(KERN_DEBUG "suspending xenstore %d...\n", depth);
 	xs_suspend();
 	err = dpm_suspend_noirq(PMSG_SUSPEND);
@@ -168,6 +170,7 @@
 out:
 	shutting_down = SHUTDOWN_INVALID;
+	depth--;
 }
 #endif /* CONFIG_PM_SLEEP */
@@ -178,6 +181,7 @@
 	struct xenbus_transaction xbt;
 	int err;
+	printk(KERN_DEBUG "Triggered shutdown watch\n");
 	if (shutting_down != SHUTDOWN_INVALID)
 		return;
@@ -201,6 +205,7 @@
 		goto again;
 	}
+	printk(KERN_DEBUG "Got shutdown request %s\n", str);
 	if (strcmp(str, "poweroff") == 0 ||
 	    strcmp(str, "halt") == 0) {
 		shutting_down = SHUTDOWN_POWEROFF;
===
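As a side note on the printk_time=1 part: if a reboot is inconvenient, timestamps can usually also be toggled at runtime (the sysfs path below is assumed from mainline Linux; please verify it exists on the RHEL6 kernel):

```shell
# Enable printk timestamps at runtime, without rebooting:
echo 1 > /sys/module/printk/parameters/time

# Or persistently, by adding this to the kernel command line:
#   printk.time=1
```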
Also, a couple of questions just to be sure:
> 1. guest "vmtst.uark.edu" migrated from cvprd3 to cvprd1 at 17:03 CST
> 2. guest successfully restored on cvprd3 at 17:04:35
Did you mean on cvprd1 here?
> 4. At this stage we crashed the guest (at 17:07:56) and collected the dump
> 6. After migrating back to cvprd3, the guest works fine again
So this means the problem is reproducible, and the two-step cvprd3->cvprd1->cvprd3 migration was done as a separate attempt?
I believe this has been fixed upstream. On machines that had similar symptoms (could migrate one direction, but not the other) with the RHEL6 kernel, I was able to migrate both directions with a Fedora 15 kernel. I need to figure out what patches fix it.
In bug 663755 comment 20 I've pointed to an upstream patch that I believe will fix this issue. I can make a test kernel rpm available for the customer if they would like to run it. I'll build it tomorrow morning.