Bug 1654722 - [OpenStack] [RFE] migration/postcopy: Handle network failures
Summary: [OpenStack] [RFE] migration/postcopy: Handle network failures
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: nova-maint
QA Contact: nova-maint
URL:
Whiteboard:
Depends On: 1475431 1475305
Blocks:
Reported: 2018-11-29 14:17 UTC by Jaroslav Suchanek
Modified: 2020-02-23 05:25 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1475431
Environment:
Last Closed:
Target Upstream Version:


Description Jaroslav Suchanek 2018-11-29 14:17:34 UTC
+++ This bug was initially created as a clone of Bug #1475431 +++

+++ This bug was initially created as a clone of Bug #1475305 +++

Description of problem:
A failure of the migration network during the postcopy phase of migration is fatal: we can't restart the source, since the current state is spread between the two hosts. We want to be able to reconnect the migration stream and complete the migration after someone has fixed the network.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Start a migration
2. Switch to postcopy mode
3. Take an axe to the network cable
4. Wait for TCP timeout
5. Replace network cable

Actual results:
Migration fails after the TCP timeout, the destination VM fails, and can't be restarted.

Expected results:
Some way to recover.

Additional info:

--- Additional comment from Jiri Denemark on 2018-05-30 00:05:59 CEST ---

This is now implemented in QEMU by a series of commits ending with

commit d37297dc66202c33f9cafbc48ccae629e7d6dc31
Refs: v2.12.0-571-gd37297dc66
Author:     Peter Xu <peterx@redhat.com>
AuthorDate: Wed May 2 18:47:40 2018 +0800
Commit:     Juan Quintela <quintela@redhat.com>
CommitDate: Tue May 15 22:13:08 2018 +0200

    migration/hmp: add migrate_pause command

    Wrapper for QMP command "migrate-pause".

    Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Message-Id: <20180502104740.12123-25-peterx@redhat.com>
    Signed-off-by: Juan Quintela <quintela@redhat.com>
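
For illustration, the recovery flow this series enables can be sketched as the QMP messages a management layer would send. This is a minimal sketch, not libvirt's or Nova's implementation: the helper `recovery_plan` and both URIs are hypothetical, and it assumes the QMP commands "migrate-pause", "migrate-recover" and "migrate" with "resume": true behave as described in the QEMU QMP reference.

```python
import json


def recovery_plan(listen_uri, connect_uri):
    """Return (host, QMP message) pairs in the order they would be sent.

    Hypothetical helper: it only builds the messages and does not talk
    to a QEMU monitor.
    """
    return [
        # Force a hung migration stream into the paused state on the source.
        ("source", {"execute": "migrate-pause"}),
        # Tell the destination where to listen for the reconnecting source.
        ("destination", {"execute": "migrate-recover",
                         "arguments": {"uri": listen_uri}}),
        # Resume the paused postcopy migration from the source side.
        ("source", {"execute": "migrate",
                    "arguments": {"uri": connect_uri, "resume": True}}),
    ]


# Example addresses are placeholders from the documentation range.
for host, message in recovery_plan("tcp:0:4444", "tcp:192.0.2.10:4444"):
    print(f"{host}: {json.dumps(message)}")
```

The first step is only needed when the source has not noticed the failure on its own, for example while it is still waiting for the TCP timeout.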

Comment 1 Kashyap Chamarthy 2018-11-30 13:52:31 UTC
The fundamental use case here is to recover gracefully from "network
failures" during post-copy migration (which Nova supports today via
`live_migration_permit_post_copy=true`).

As of now, Nova doesn't have any action item until libvirt comes up with  
an API design (and code) to support postcopy recovery.  Then Nova can
work out how to wire things up.

For this RFE bug itself, there are a few pending items in QEMU and
libvirt:

  (a) QEMU: Enable the out-of-band (OOB) execution of monitor commands
      by default:
      http://lists.gnu.org/archive/html/qemu-devel/2018-10/msg06671.html
      -- monitor: remove "x-oob", turn oob on by default

  (b) libvirt: To create APIs that wire up QEMU's OOB execution feature
      and 'migrate-pause' QMP command:

      https://bugzilla.redhat.com/show_bug.cgi?id=1475431#c6 --
      "migration/postcopy: Handle network failures (libvirt)"

    * * *

Low-level QEMU context for those interested in such (thanks to Markus
Armbruster and Dave Gilbert for the discussion):

Problem: the QEMU "monitor" (the API interface between QEMU and libvirt +
management tools) runs in QEMU's main loop.  If the main loop hangs
for whatever reason, monitor commands (these can be any of the
various QMP / HMP commands) don't get executed until the main loop
un-hangs.

One of the ways the main loop can hang is a monitor command taking a
long time to execute, either because (a) it does a lot of work
(because it depends on a network connection or might access guest
memory), or (b) it unexpectedly runs into a blocking system call.

The idea of OOB execution is to move the QEMU monitor's core 
functionality out of the main loop, and into an I/O thread.  QEMU
monitor commands still need to be dispatched to the main loop, because
command handlers may have hidden assumptions.  But now QEMU allows a
special kind of command that runs right in the I/O thread and may even
"overtake" normal commands -- these are the OOB commands (and the
restrictions on what their handlers can do are severe).
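
To make the OOB mechanism concrete, here is a minimal sketch of how a client might frame the messages. The helper names `qmp_message` and `server_offers_oob` are hypothetical, and the framing assumes the QMP specification as I understand it: the server lists "oob" in its greeting capabilities, the client enables it through "qmp_capabilities", and an out-of-band command uses the key "exec-oob" where a normal command uses "execute".

```python
import json


def server_offers_oob(greeting_json):
    """Check whether a QMP greeting advertises the "oob" capability."""
    greeting = json.loads(greeting_json)
    return "oob" in greeting.get("QMP", {}).get("capabilities", [])


def qmp_message(command, arguments=None, oob=False, msg_id=None):
    """Serialize one QMP command (hypothetical helper, no I/O)."""
    # Out-of-band commands are marked with "exec-oob" instead of "execute".
    message = {"exec-oob" if oob else "execute": command}
    if arguments:
        message["arguments"] = arguments
    if msg_id is not None:
        # An id lets the client match replies that arrive out of order.
        message["id"] = msg_id
    return json.dumps(message)


# Sample greeting; the version field is left empty for brevity.
greeting = '{"QMP": {"version": {}, "capabilities": ["oob"]}}'
if server_offers_oob(greeting):
    print(qmp_message("qmp_capabilities", {"enable": ["oob"]}))
    print(qmp_message("migrate-pause", oob=True, msg_id="pause-1"))
```

Only commands that QEMU marks as OOB-capable can be sent this way; "migrate-pause" is assumed here to be one of them.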

Comment 2 Matthew Booth 2018-11-30 14:06:33 UTC
I remain unconvinced this is any kind of priority in practice, for 2 reasons:

* network failure which would disrupt LM would more than likely make the VM unavailable anyway.
* the post-copy phase tends to be very short anyway.

Has a customer ever hit this? I can see that it would be tidy to be able to recover from this, but if it's never actually going to help anybody in practice we should prioritise it accordingly.

Comment 3 Dr. David Alan Gilbert 2018-11-30 14:16:11 UTC
(In reply to Matthew Booth from comment #2)
> I remain unconvinced this is any kind of priority in practise for 2 reasons:
> 
> * network failure which would disrupt LM would more than likely make the VM
> unavailable anyway.

That does depend a bit on the network disruption; a temporary disruption that causes the migration to disconnect/hang might cause a blip for the VMs but in many setups they can recover.

> * the post-copy phase tends to be very short anyway.
> 
> Has a customer ever hit this? I can see that it would be tidy to be able to
> recover this, but if it's never actually going to help anybody in practise
> we should prioritise it accordingly.

I'm not aware of anyone who has hit it; however, roughly 50% of the people I speak to are scared silly by postcopy because of the possibility; the other half agree that it's short and get on with their lives.

