+++ This bug was initially created as a clone of Bug #1475431 +++

+++ This bug was initially created as a clone of Bug #1475305 +++

Description of problem:
A failure of the migration network during the postcopy phase of migration is fatal; we can't restart the source since the current state is spread between the two hosts. We want to be able to reconnect the migration stream and complete the migration after someone has fixed the network.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Start a migration
2. Switch to postcopy mode
3. Take an axe to the network cable
4. Wait for TCP timeout
5. Replace network cable

Actual results:
Migration fails after the TCP timeout, the destination VM fails, and can't be restarted.

Expected results:
Some way to recover.

Additional info:

--- Additional comment from Jiri Denemark on 2018-05-30 00:05:59 CEST ---

This is now implemented in QEMU by a series of commits ending with

commit d37297dc66202c33f9cafbc48ccae629e7d6dc31
Refs: v2.12.0-571-gd37297dc66
Author:     Peter Xu <peterx>
AuthorDate: Wed May 2 18:47:40 2018 +0800
Commit:     Juan Quintela <quintela>
CommitDate: Tue May 15 22:13:08 2018 +0200

    migration/hmp: add migrate_pause command

    Wrapper for QMP command "migrate-pause".

    Reviewed-by: Dr. David Alan Gilbert <dgilbert>
    Signed-off-by: Peter Xu <peterx>
    Message-Id: <20180502104740.12123-25-peterx>
    Signed-off-by: Juan Quintela <quintela>
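For readers who want to see what the reproduction and recovery steps above might look like at the QMP level, here is a minimal sketch that only builds the JSON requests; it does not talk to a QEMU instance. "migrate-pause" is the command named in the commit above. The other commands ("migrate-set-capabilities", "migrate-start-postcopy", "migrate-recover", and the "resume" argument of "migrate") come from my understanding of the QEMU QMP schema and should be checked against the QMP reference for your QEMU version. The `qmp` helper and both URIs are hypothetical, and the destination-side setup is omitted.

```python
import json


def qmp(command, **arguments):
    """Serialize one QMP request (hypothetical helper, not part of QEMU)."""
    msg = {"execute": command}
    if arguments:
        msg["arguments"] = arguments
    return json.dumps(msg)


# Hypothetical addresses, for illustration only.
DEST_URI = "tcp:192.0.2.10:4445"   # where the source connects to
LISTEN_URI = "tcp:0.0.0.0:4445"    # where the destination listens again

# Steps 1-2 of the reproducer, sent to the source QEMU.
source_start = [
    qmp("migrate-set-capabilities",
        capabilities=[{"capability": "postcopy-ram", "state": True}]),
    qmp("migrate", uri=DEST_URI),
    qmp("migrate-start-postcopy"),
]

# Optional: force the paused state instead of waiting for the TCP timeout.
pause = qmp("migrate-pause")

# After the network is fixed: the destination listens again,
# then the source reconnects and resumes the same migration.
dest_recover = qmp("migrate-recover", uri=LISTEN_URI)
source_resume = qmp("migrate", uri=DEST_URI, resume=True)

for line in source_start + [pause, dest_recover, source_resume]:
    print(line)
```

In a real deployment these requests would normally be issued by libvirt rather than by hand.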
The fundamental use case here is to recover gracefully from "network failures" during post-copy migration (which Nova supports today via `live_migration_permit_post_copy=true`).

As of now, Nova has no action item until libvirt comes up with an API design (and code) to support postcopy recovery. Then Nova can work out how to wire things up.

For this RFE bug itself, there are a few pending items in QEMU and libvirt:

(a) QEMU: Enable the out-of-band (OOB) execution of monitor commands by default:
    http://lists.gnu.org/archive/html/qemu-devel/2018-10/msg06671.html -- monitor: remove "x-oob", turn oob on by default

(b) libvirt: Create APIs that wire up QEMU's OOB execution feature and the 'migrate-pause' QMP command:
    https://bugzilla.redhat.com/show_bug.cgi?id=1475431#c6 -- "migration/postcopy: Handle network failures (libvirt)"

* * *

Low-level QEMU context for those interested in such (thanks to Markus Armbruster and Dave Gilbert for the discussion):

Problem: The QEMU "monitor" (the API interface between QEMU and libvirt + management tools) runs in QEMU's main loop. If the main loop hangs for whatever reason, monitor commands (these can be any of the various QMP / HMP commands) don't get executed until the main loop un-hangs. One of the ways the main loop can hang is a monitor command taking a long time to execute, either because:

(a) it does a lot of work (for example, it depends on a network connection or might access guest memory); or

(b) it unexpectedly runs into a blocking system call.

The idea of OOB execution is to move the QEMU monitor's core functionality out of the main loop and into an I/O thread. Normal monitor commands still need to be dispatched to the main loop, because command handlers may have hidden assumptions. But QEMU now allows a special kind of command that runs right in the I/O thread and may even "overtake" normal commands. These are the OOB commands, and the restrictions on what their handlers can do are severe.
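To make the "overtaking" behaviour described above concrete, here is a toy, single-threaded model of the dispatch idea. It is a sketch under stated assumptions, not QEMU's implementation: `ToyMonitor`, `OOB_CAPABLE`, and the request ids are made up for illustration. The only things borrowed from QMP are the "execute" key for normal requests and, as I understand the QMP specification, the "exec-oob" key for out-of-band ones.

```python
import json
from collections import deque

# Toy model only: these names are illustrative, not QEMU code.
OOB_CAPABLE = {"migrate-pause", "migrate-recover"}


class ToyMonitor:
    """Models an I/O thread that parses requests and a main loop that runs them."""

    def __init__(self):
        self.in_band = deque()   # commands waiting for the main loop
        self.completed = []      # ids in the order their handlers finished

    def receive(self, raw):
        """I/O-thread side: parse one request and route it."""
        msg = json.loads(raw)
        if "exec-oob" in msg:
            name = msg["exec-oob"]
            if name not in OOB_CAPABLE:
                raise ValueError(f"{name} may not run out-of-band")
            # Handled right here, without waiting for the main loop.
            self.completed.append(msg["id"])
        else:
            self.in_band.append(msg)

    def run_main_loop_once(self):
        """Main-loop side: dispatch a single queued in-band command."""
        if self.in_band:
            self.completed.append(self.in_band.popleft()["id"])


mon = ToyMonitor()
mon.receive(json.dumps({"execute": "query-migrate", "id": "slow-1"}))
mon.receive(json.dumps({"execute": "query-status", "id": "slow-2"}))
# The main loop has not run yet (imagine it is hung); the OOB command still completes.
mon.receive(json.dumps({"exec-oob": "migrate-pause", "id": "oob-1"}))
mon.run_main_loop_once()
mon.run_main_loop_once()
print(mon.completed)  # ['oob-1', 'slow-1', 'slow-2']
```

The OOB request finishes first even though it arrived last, which is what lets a management layer pause a stuck postcopy migration while the main loop is busy.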
I remain unconvinced this is any kind of priority in practice, for two reasons:

* A network failure which would disrupt LM would more than likely make the VM unavailable anyway.

* The post-copy phase tends to be very short anyway.

Has a customer ever hit this? I can see that it would be tidy to be able to recover from this, but if it's never actually going to help anybody in practice we should prioritise it accordingly.
(In reply to Matthew Booth from comment #2)
> I remain unconvinced this is any kind of priority in practice, for two reasons:
>
> * A network failure which would disrupt LM would more than likely make the VM
> unavailable anyway.

That does depend a bit on the network disruption; a temporary disruption that causes the migration to disconnect/hang might cause a blip for the VMs, but in many setups they can recover.

> * The post-copy phase tends to be very short anyway.
>
> Has a customer ever hit this? I can see that it would be tidy to be able to
> recover from this, but if it's never actually going to help anybody in practice
> we should prioritise it accordingly.

I'm not aware of anyone who has hit it; however, roughly 50% of the people I speak to are scared silly by postcopy because of the possibility; the other half agree that it's short and get on with their lives.