Bug 1654722
| Summary: | [OpenStack] [RFE] migration/postcopy: Handle network failures | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Jaroslav Suchanek <jsuchane> |
| Component: | openstack-nova | Assignee: | OSP DFG:Compute <osp-dfg-compute> |
| Status: | CLOSED WONTFIX | QA Contact: | OSP DFG:Compute <osp-dfg-compute> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | unspecified | CC: | chayang, dasmith, dgilbert, dyuan, egallen, eglynn, fjin, jdenemar, jhakimra, juzhang, kchamart, lyarwood, mbooth, peterx, qzhang, sbauza, sgordon, virt-maint, vromanso, xianwang, xuzhang, yalzhang |
| Target Milestone: | --- | Keywords: | FutureFeature, Triaged |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | 1475431 | Environment: | |
| Last Closed: | 2020-09-29 09:41:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 1475305, 1475431 | ||
| Bug Blocks: | |||
Description
Jaroslav Suchanek
2018-11-29 14:17:34 UTC
The fundamental use case here is to recover gracefully from "network
failures" during post-copy migration (which Nova today supports when
`live_migration_permit_post_copy` is enabled).
As of now, Nova doesn't have any action item until libvirt comes up with
an API design (and code) to support postcopy recovery. Then Nova can
work out how to wire things up.
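For context, a minimal sketch of the Nova side of this: post-copy is opted into via a libvirt-driver option in `nova.conf` (fragment shown for illustration only; consult the Nova configuration reference for your release):

```ini
[libvirt]
# Allow Nova to switch a live migration to post-copy mode when it is
# not converging; requires libvirt/QEMU post-copy support on both hosts.
live_migration_permit_post_copy = true
```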
For this RFE bug itself, there are a few pending items in QEMU and
libvirt:
(a) QEMU: Enable the out-of-band (OOB) execution of monitor commands
by default:
http://lists.gnu.org/archive/html/qemu-devel/2018-10/msg06671.html
-- monitor: remove "x-oob", turn oob on by default
(b) libvirt: To create APIs that wire up QEMU's OOB execution feature
and 'migrate-pause' QMP command:
https://bugzilla.redhat.com/show_bug.cgi?id=1475431#c6 --
"migration/postcopy: Handle network failures (libvirt)"
* * *
Low-level QEMU context, for those interested (thanks to Markus
Armbruster and Dave Gilbert for the discussion):
Problem: QEMU "monitor" (the API interface between QEMU and libvirt +
management tools) runs in its main loop. And if QEMU's main loop hangs
for whatever reason, the monitor commands (these can be any of the
various QMP / HMP commands) don't get executed until the main loop
un-hangs.
One of the ways the main loop can hang is a monitor command taking a
long time to execute, either because (a) it does a lot of work (e.g. it
depends on a network connection or accesses guest memory), or (b) it
unexpectedly runs into a blocking system call.
The idea of OOB execution is to move the QEMU monitor's core
functionality out of the main loop and into an I/O thread. Most QEMU
monitor commands still need to be dispatched to the main loop, because
command handlers may have hidden assumptions. But QEMU now allows a
special kind of command that runs right in the I/O thread and may even
"overtake" normal commands -- these are the OOB commands (and the
restrictions on what their handlers can do are severe).
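To make the OOB mechanism concrete, here is a small illustrative sketch. The `qmp_command` helper is hypothetical (not part of any library); the message shapes follow the QMP wire protocol, where a client first enables the "oob" capability and then issues OOB-capable commands such as `migrate-pause` with the `exec-oob` key instead of `execute`:

```python
import json

def qmp_command(name, arguments=None, oob=False, exec_id=None):
    """Build a QMP command dict. OOB commands use the 'exec-oob' key
    (QEMU >= 3.1, where 'x-oob' was removed and OOB is on by default)."""
    msg = {"exec-oob" if oob else "execute": name}
    if arguments:
        msg["arguments"] = arguments
    if exec_id is not None:
        msg["id"] = exec_id  # an id lets the client match overtaking replies
    return msg

# Capability negotiation must request "oob" before exec-oob is accepted.
caps = qmp_command("qmp_capabilities", {"enable": ["oob"]})

# 'migrate-pause' is an OOB-capable command: it can run in the monitor's
# I/O thread even while the main loop is stuck in a hung postcopy stream.
pause = qmp_command("migrate-pause", oob=True, exec_id="pause-1")

print(json.dumps(caps))
print(json.dumps(pause))
```

Because `migrate-pause` runs out of band, a management layer (here, libvirt) could use it to cleanly pause a wedged post-copy migration and later attempt recovery, which is exactly the wiring item (b) above asks for.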
* * *

Comment 2
Matthew Booth

I remain unconvinced this is any kind of priority in practice, for two
reasons:

* A network failure that would disrupt live migration would more than
  likely make the VM unavailable anyway.
* The post-copy phase tends to be very short anyway.

Has a customer ever hit this? I can see that it would be tidy to be
able to recover from this, but if it's never actually going to help
anybody in practice, we should prioritise it accordingly.

* * *

Comment 3
(In reply to Matthew Booth from comment #2)

> * A network failure that would disrupt live migration would more than
>   likely make the VM unavailable anyway.

That does depend a bit on the network disruption; a temporary
disruption that causes the migration to disconnect/hang might cause a
blip for the VMs, but in many setups they can recover.

> * The post-copy phase tends to be very short anyway.
>
> Has a customer ever hit this? I can see that it would be tidy to be
> able to recover from this, but if it's never actually going to help
> anybody in practice, we should prioritise it accordingly.

I'm not aware of anyone who has hit it; however, roughly 50% of the
people I speak to are scared silly by postcopy because of this
possibility; the other half agree that it's short and get on with
their lives.