Bug 1654722

Summary:	[OpenStack] [RFE] migration/postcopy: Handle network failures
Product:	Red Hat OpenStack	Reporter:	Jaroslav Suchanek <jsuchane>
Component:	openstack-nova	Assignee:	OSP DFG:Compute <osp-dfg-compute>
Status:	CLOSED WONTFIX	QA Contact:	OSP DFG:Compute <osp-dfg-compute>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	unspecified	CC:	chayang, dasmith, dgilbert, dyuan, egallen, eglynn, fjin, jdenemar, jhakimra, juzhang, kchamart, lyarwood, mbooth, peterx, qzhang, sbauza, sgordon, virt-maint, vromanso, xianwang, xuzhang, yalzhang
Target Milestone:	---	Keywords:	FutureFeature, Triaged
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1475431	Environment:
Last Closed:	2020-09-29 09:41:58 UTC	Type:	Bug
Bug Depends On:	1475305, 1475431
Bug Blocks:

Description Jaroslav Suchanek 2018-11-29 14:17:34 UTC

+++ This bug was initially created as a clone of Bug #1475431 +++

+++ This bug was initially created as a clone of Bug #1475305 +++

Description of problem:
A failure of the migration network during the postcopy phase of migration is fatal; we can't restart the source since the current state is spread between the two hosts. We want to be able to reconnect the migration stream and complete the migration after someone has fixed the network.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Start a migration
2. Switch to postcopy mode
3. Take an axe to the network cable
4. Wait for TCP timeout
5. Replace network cable

Actual results:
Migration fails after the TCP timeout; the destination VM fails and can't be restarted.

Expected results:
Some way to recover.

Additional info:

--- Additional comment from Jiri Denemark on 2018-05-30 00:05:59 CEST ---

This is now implemented in QEMU by a series of commits ending with

commit d37297dc66202c33f9cafbc48ccae629e7d6dc31
Refs: v2.12.0-571-gd37297dc66
Author:     Peter Xu <peterx>
AuthorDate: Wed May 2 18:47:40 2018 +0800
Commit:     Juan Quintela <quintela>
CommitDate: Tue May 15 22:13:08 2018 +0200

    migration/hmp: add migrate_pause command

    Wrapper for QMP command "migrate-pause".

    Reviewed-by: Dr. David Alan Gilbert <dgilbert>
    Signed-off-by: Peter Xu <peterx>
    Message-Id: <20180502104740.12123-25-peterx>
    Signed-off-by: Juan Quintela <quintela>

Comment 1 Kashyap Chamarthy 2018-11-30 13:52:31 UTC

The fundamental use case here is to recover gracefully from "network
failures" during post-copy migration (which Nova supports today via
`live_migration_permit_post_copy=true`).

As of now, Nova doesn't have any action item until libvirt comes up with  
an API design (and code) to support postcopy recovery.  Then Nova can
work out how to wire things up.

For this RFE bug itself, there are a few pending items in QEMU and
libvirt:

  (a) QEMU: Enable the out-of-band (OOB) execution of monitor commands
      by default:
      http://lists.gnu.org/archive/html/qemu-devel/2018-10/msg06671.html
      -- monitor: remove "x-oob", turn oob on by default

  (b) libvirt: Create APIs that wire up QEMU's OOB execution feature
      and the 'migrate-pause' QMP command:

      https://bugzilla.redhat.com/show_bug.cgi?id=1475431#c6 --
      "migration/postcopy: Handle network failures (libvirt)"

    * * *
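
For illustration only, here is a rough sketch of the QMP messages a management layer might send once both items above are in place. It is not what libvirt does (that API does not exist yet). The message shapes follow my reading of the QEMU QMP reference: "exec-oob" as the request key for out-of-band commands, "migrate-recover" on the destination, and "migrate" with "resume" on the source are assumptions taken from there, not from this bug. The helper names and the URI are hypothetical.

```python
import json


def qmp_command(name, arguments=None, oob=False):
    """Build one QMP request as a dict.

    Assumption: out-of-band requests use the "exec-oob" key instead of
    "execute", as described in the QEMU QMP specification.
    """
    msg = {"exec-oob" if oob else "execute": name}
    if arguments is not None:
        msg["arguments"] = arguments
    return msg


def postcopy_recovery_messages(recover_uri):
    """Return the messages for a hypothetical postcopy recovery flow.

    recover_uri is a placeholder for wherever the destination should
    listen for the resumed migration stream.
    """
    return {
        # Both monitors: opt in to out-of-band execution at negotiation.
        "negotiate": qmp_command("qmp_capabilities", {"enable": ["oob"]}),
        # Source: stop waiting for the TCP timeout and pause the migration.
        "pause_source": qmp_command("migrate-pause", oob=True),
        # Destination: listen for a new migration stream (assumed command).
        "recover_destination": qmp_command(
            "migrate-recover", {"uri": recover_uri}, oob=True
        ),
        # Source: reconnect and resume the paused migration (assumed flag).
        "resume_source": qmp_command(
            "migrate", {"uri": recover_uri, "resume": True}
        ),
    }


if __name__ == "__main__":
    for step, msg in postcopy_recovery_messages("tcp:192.0.2.10:4444").items():
        print(step, json.dumps(msg))
```

The sketch only builds the JSON; it does not open a monitor socket, so it says nothing about ordering or error handling between the two hosts.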

Low-level QEMU context for those interested (thanks to Markus
Armbruster and Dave Gilbert for the discussion):

Problem: The QEMU "monitor" (the interface between QEMU and libvirt +
management tools) runs in QEMU's main loop.  If the main loop hangs
for whatever reason, monitor commands (any of the various QMP / HMP
commands) don't get executed until the main loop un-hangs.

One of the ways the main loop can hang is a monitor command taking a
long time to execute -- either because (a) it does a lot of work
(for example, it depends on a network connection or might access guest
memory); or (b) it unexpectedly runs into a blocking system call.

The idea of OOB execution is to move the QEMU monitor's core
functionality out of the main loop, and into an I/O thread.  Normal
monitor commands still need to be dispatched to the main loop, because
command handlers may have hidden assumptions.  But QEMU now allows a
special kind of command that runs right in the I/O thread, and may even
"overtake" normal commands -- these are the OOB commands (and the
restrictions on what their handlers can do are severe).
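
To make the "overtake" behaviour concrete, here is a toy model of the dispatch rule described above. It is only a sketch, not how QEMU implements it. It assumes the main loop is stuck for the whole time the commands arrive, so normal commands pile up in a queue while OOB commands complete straight away in the I/O thread. The function name is made up for the example.

```python
from collections import deque


def completion_order(arrivals):
    """Return command names in the order they would complete.

    arrivals is a list of (name, is_oob) pairs in arrival order.
    Assumption: the main loop is blocked until every command has arrived.
    """
    main_loop_queue = deque()
    completed = []
    for name, is_oob in arrivals:
        if is_oob:
            # OOB command: handled directly in the I/O thread.
            completed.append(name)
        else:
            # Normal command: must wait for the main loop to un-hang.
            main_loop_queue.append(name)
    # The main loop recovers and drains its queue in FIFO order.
    while main_loop_queue:
        completed.append(main_loop_queue.popleft())
    return completed


if __name__ == "__main__":
    arrivals = [
        ("query-status", False),
        ("query-migrate", False),
        ("migrate-pause", True),
    ]
    print(completion_order(arrivals))
    # prints ['migrate-pause', 'query-status', 'query-migrate']
```

In this model 'migrate-pause' completes first even though it arrived last, which is what a management tool needs when the main loop is wedged on a dead migration socket.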

Comment 2 Matthew Booth 2018-11-30 14:06:33 UTC

I remain unconvinced this is any kind of priority in practice for 2 reasons:

* network failure which would disrupt LM would more than likely make the VM unavailable anyway.
* the post-copy phase tends to be very short anyway.

Has a customer ever hit this? I can see that it would be tidy to be able to recover this, but if it's never actually going to help anybody in practice we should prioritise it accordingly.

Comment 3 Dr. David Alan Gilbert 2018-11-30 14:16:11 UTC

(In reply to Matthew Booth from comment #2)
> I remain unconvinced this is any kind of priority in practice for 2 reasons:
> 
> * network failure which would disrupt LM would more than likely make the VM
> unavailable anyway.

That does depend a bit on the network disruption; a temporary disruption that causes the migration to disconnect/hang might cause a blip for the VMs but in many setups they can recover.

> * the post-copy phase tends to be very short anyway.
> 
> Has a customer ever hit this? I can see that it would be tidy to be able to
> recover this, but if it's never actually going to help anybody in practice
> we should prioritise it accordingly.

I'm not aware of anyone who has hit it; however, roughly 50% of the people I speak to are scared silly by postcopy because of the possibility; the other half agree that it's short and get on with their lives.