Migration works differently depending on whether the guest is HVM or PV (an HVM guest with PV drivers counts as PV).
For HVM guests without PV drivers, Xen does nothing special: it just copies the state, destroys the domain, and restarts it on the destination side.
This is a PV guest, though, and in that case we *expect* a "suspending xenstore..." message and expect the host to write to "control/shutdown". Xen writes "suspend" to "control/shutdown" to signify that the (optional) live part of the migration is being completed. The code that handles the write of "suspend" looks like:
suspend_everything ();
HYPERCALL_suspend ();
resume_everything ();
suspend_everything() ensures that the PV drivers are quiescent and saves any state that will be needed on resume (Xenstore entries are not transmitted, so the PV frontend/backend handshake has to be redone and any pending I/O restarted). As soon as the guest makes the hypercall, Xen pauses the domain, completes the migration, and destroys the original domain.
The other side restores the domain context and resumes after the hypercall, so the migration destination goes on executing resume_everything(). It seems to me that something was not suspended or resumed successfully.
====
What is not normal is seeing five "suspending xenstore..." messages. Would it be possible to apply the following patch and try to reproduce this again with printk_time=1?
diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index f5162e4..30712d9 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -97,9 +97,11 @@
 static void do_suspend(void)
 {
+	static int depth;
 	int err;
 	int cancelled = 1;
+	depth++;
 	shutting_down = SHUTDOWN_SUSPEND;
 	err = stop_machine_create();
@@ -125,7 +127,7 @@
 		goto out_thaw;
 	}
-	printk(KERN_DEBUG "suspending xenstore...\n");
+	printk(KERN_DEBUG "suspending xenstore %d...\n", depth);
 	xs_suspend();
 	err = dpm_suspend_noirq(PMSG_SUSPEND);
@@ -168,6 +170,7 @@
 out:
 	shutting_down = SHUTDOWN_INVALID;
+	depth--;
 }
 #endif /* CONFIG_PM_SLEEP */
@@ -178,6 +181,7 @@
 	struct xenbus_transaction xbt;
 	int err;
+	printk(KERN_DEBUG "Triggered shutdown watch\n");
 	if (shutting_down != SHUTDOWN_INVALID)
 		return;
@@ -201,6 +205,7 @@
 		goto again;
 	}
+	printk(KERN_DEBUG "Got shutdown request %s\n", str);
 	if (strcmp(str, "poweroff") == 0 ||
 	    strcmp(str, "halt") == 0) {
 		shutting_down = SHUTDOWN_POWEROFF;
===
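As a side note on the printk_time=1 part: if a reboot is inconvenient, timestamps can usually also be toggled at runtime (the sysfs path below is assumed from mainline Linux; please verify it exists on the RHEL6 kernel):

```shell
# Enable printk timestamps at runtime, without rebooting:
echo 1 > /sys/module/printk/parameters/time

# Or persistently, by adding this to the kernel command line:
#   printk.time=1
```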
Also, a couple of questions just to be sure:
> 1. guest "vmtst.uark.edu" migrated from cvprd3 to cvprd1 at 17:03 CST
> 2. guest successfully restored on cvprd3 at 17:04:35
Did you mean on cvprd1 here?
> 4. At this stage we crashed the guest (at 17:07:56) and collected the dump
> 6. After migrating back to cvprd3, the guest works fine again
So this means the problem is reproducible, and the two-step cvprd3->cvprd1->cvprd3 migration was done as a separate attempt?
I believe this has been fixed upstream. On machines that had similar symptoms (could migrate one direction, but not the other) with the RHEL6 kernel, I was able to migrate both directions with a Fedora 15 kernel. I need to figure out what patches fix it.
In bug 663755 comment 20 I've pointed to an upstream patch that I believe will fix this issue. I can make a test kernel rpm available for the customer if they would like to run it. I'll build it tomorrow morning.