+++ This bug was initially created as a clone of Bug #1475431 +++
+++ This bug was initially created as a clone of Bug #1475305 +++
Description of problem:
A failure of the migration network during the postcopy phase of migration is fatal; we can't restart the source, since the current state is spread between the two hosts. We want to be able to reconnect the migration stream and complete the migration after someone has fixed the network.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Start a migration
2. Switch to postcopy mode
3. Take an axe to the network cable
4. Wait for TCP timeout
5. Replace network cable
Actual results:
Migration fails after the TCP timeout, the destination VM fails, and can't be restarted.
Expected results:
Some way to recover.
--- Additional comment from Jiri Denemark on 2018-05-30 00:05:59 CEST ---
This is now implemented in QEMU by a series of commits ending with
Author: Peter Xu <firstname.lastname@example.org>
AuthorDate: Wed May 2 18:47:40 2018 +0800
Commit: Juan Quintela <email@example.com>
CommitDate: Tue May 15 22:13:08 2018 +0200
migration/hmp: add migrate_pause command
Wrapper for QMP command "migrate-pause".
Reviewed-by: Dr. David Alan Gilbert <firstname.lastname@example.org>
Signed-off-by: Peter Xu <email@example.com>
Signed-off-by: Juan Quintela <firstname.lastname@example.org>
The fundamental use case here is to recover gracefully from "network
failures" during post-copy migration (which Nova today supports
As of now, Nova doesn't have any action item until libvirt comes up with
an API design (and code) to support postcopy recovery. Then Nova can
work out how to wire things up.
For this RFE bug itself, there are a few pending items in QEMU and
(a) QEMU: Enable the out-of-band (OOB) execution of monitor commands
-- monitor: remove "x-oob", turn oob on by default
(b) libvirt: Create APIs that wire up QEMU's OOB execution feature
and the 'migrate-pause' QMP command:
"migration/postcopy: Handle network failures (libvirt)"
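For illustration, here is a rough sketch of the QMP message sequence a management layer might send to recover a broken postcopy migration. The command names ('migrate-pause', 'migrate-recover', and 'migrate' with 'resume') and the "exec-oob" key come from QEMU's QMP schema; the helper names and the URIs are made up for this example, and this is not how libvirt actually implements it.

```python
import json


def qmp(command, oob=False, **arguments):
    """Build one QMP request as a JSON string (hypothetical helper)."""
    # OOB requests use the "exec-oob" key instead of "execute".
    key = "exec-oob" if oob else "execute"
    message = {key: command}
    if arguments:
        message["arguments"] = arguments
    return json.dumps(message)


def recovery_sequence(listen_uri, connect_uri):
    """Return (target, message) pairs in the order they would be sent."""
    return [
        # Force the stuck source into the paused state without waiting
        # for the TCP timeout.
        ("source", qmp("migrate-pause", oob=True)),
        # Tell the destination where to listen for the new stream.
        ("destination", qmp("migrate-recover", oob=True, uri=listen_uri)),
        # Reconnect from the source and resume the paused migration.
        ("source", qmp("migrate", uri=connect_uri, resume=True)),
    ]


if __name__ == "__main__":
    # Placeholder addresses; substitute the real migration network.
    for target, message in recovery_sequence(
        "tcp:0.0.0.0:4444", "tcp:192.0.2.10:4444"
    ):
        print(target, message)
```

The first two commands are sent out-of-band so that they can still be processed if the main loop is stuck on the dead connection. This assumes the client enabled the "oob" capability when negotiating qmp_capabilities.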
* * *
Low-level QEMU context for those interested in such (thanks to Markus
Armbruster and Dave Gilbert for the discussion):
Problem: The QEMU "monitor" (the API between QEMU and libvirt +
management tools) runs in QEMU's main loop. If the main loop hangs
for whatever reason, the monitor commands (these can be any of the
various QMP / HMP commands) don't get executed until the main loop
One of the ways the main loop can hang is a monitor command taking a
long time to execute -- either because: (a) it does a lot of work
(because it depends on a network connection or might access guest
memory); or (b) because it unexpectedly runs into a blocking system
The idea of OOB execution is to move the QEMU monitor's core
functionality out of the main loop, and into an I/O thread. QEMU
monitor commands still need to be dispatched to the main loop, because
command handlers may have hidden assumptions. But now QEMU allows a
special kind of command that runs right in the I/O thread, and may even
"overtake" normal commands -- these are the OOB commands (and the
restrictions on what their handlers can do are severe).
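A toy model of the dispatch idea described above, for anyone who finds code easier to follow. This is not QEMU code: the class and method names are invented, and the I/O thread and main loop are modelled as plain method calls rather than real threads.

```python
from collections import deque


class ToyMonitor:
    """Toy model of OOB dispatch; names are hypothetical."""

    def __init__(self):
        self.main_loop_queue = deque()  # normal commands wait here
        self.completed = []             # commands in completion order

    def receive(self, command, oob=False):
        """Models the monitor I/O thread reading one command."""
        if oob:
            # OOB commands run right here, without the main loop.
            self.completed.append(command)
        else:
            # Normal commands are handed to the main loop.
            self.main_loop_queue.append(command)

    def run_main_loop_once(self):
        """Models one main loop iteration; never called while it is stuck."""
        if self.main_loop_queue:
            self.completed.append(self.main_loop_queue.popleft())


monitor = ToyMonitor()
monitor.receive("query-status")             # queued behind the stuck main loop
monitor.receive("migrate-pause", oob=True)  # overtakes the queued command
print(monitor.completed)
```

Until run_main_loop_once() is called (that is, while the main loop is hung), only the OOB command has completed, even though it arrived second.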
I remain unconvinced this is any kind of priority in practise for 2 reasons:
* network failure which would disrupt LM would more than likely make the VM unavailable anyway.
* the post-copy phase tends to be very short anyway.
Has a customer ever hit this? I can see that it would be tidy to be able to recover this, but if it's never actually going to help anybody in practise we should prioritise it accordingly.
(In reply to Matthew Booth from comment #2)
> I remain unconvinced this is any kind of priority in practise for 2 reasons:
> * network failure which would disrupt LM would more than likely make the VM
> unavailable anyway.
That does depend a bit on the network disruption; a temporary disruption that causes the migration to disconnect/hang might cause a blip for the VMs but in many setups they can recover.
> * the post-copy phase tends to be very short anyway.
> Has a customer ever hit this? I can see that it would be tidy to be able to
> recover this, but if it's never actually going to help anybody in practise
> we should prioritise it accordingly.
I'm not aware of anyone who has hit it; however, roughly 50% of the people I speak to are scared silly by postcopy because of the possibility; the other half agree that it's short and get on with their lives.