Bug 896716

Summary: [RFE] support migration with PCI passthrough network devices
Product: Red Hat Enterprise Linux 7
Reporter: Laine Stump <laine>
Component: libvirt
Assignee: Laine Stump <laine>
Status: CLOSED CANTFIX
QA Contact: Virtualization Bugs <virt-bugs>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 7.0
CC: berrange, cwei, dayleparker, dyuan, jishao, jsuchane, juzhang, kyulee, lnovich, mzhan, rbalakri, sherold, trichard, xuzhang, ydu, zpeng
Target Milestone: rc
Keywords: FutureFeature
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-05-26 15:09:43 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1205796

Description Laine Stump 2013-01-17 20:32:07 UTC
It is currently not possible to migrate a guest that has a device assigned with PCI passthrough, because the hardware device itself contains state that cannot be migrated.

This is particularly problematic when a guest is using a PCI passthrough network device - currently such a guest is stuck on a single host for its entire lifetime.

Solarflare has some patches that add a new "ephemeral" attribute to <hostdev> and <interface type='hostdev'> devices. When ephemeral='yes', a device is automatically detached from the guest when migration is started, and re-attached on the target when migration is complete. Of course, this by itself is not sufficient for a useful setup:

1) the source and target of the migration very likely won't have exactly the same hardware available at exactly the same PCI address.

2) whatever subsystem uses the passed-through device will be disrupted during the time of the migration.

(1) is solved by indirectly obtaining the physical hardware's PCI address via a libvirt <network> that has <forward mode='hostdev'> accompanied by a list of PCI devices available on a particular machine. The domain will have:

   <interface type='network'>
     <source network='blah'/>
     ...
   </interface>

and both the source and destination hosts will have:

   <network>
      <name>blah</name>
      <forward mode='hostdev' managed='yes'>
        <address type='pci' domain='0' bus='4' slot='0' function='1'/>
        <address type='pci' domain='0' bus='4' slot='0' function='2'/>
        <address type='pci' domain='0' bus='4' slot='0' function='3'/>
        <!-- etc. or whatever is available on this host -->
      </forward>
   </network>

At initial device attach time, libvirt picks a device from the list in the network and attaches that device. (This part is already implemented in libvirt.)

The interesting part comes when you add ephemeral='yes' to the <forward> definition. This marks devices assigned in this way as ephemeral, so they are detached at the beginning of migration, then on the destination a new (currently unused) device is selected from the destination host's 'blah' network, and that device is attached to the guest.
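To illustrate, the network definition with the proposed attribute would look roughly like this; note that the ephemeral attribute shown here comes from the never-merged Solarflare patches and is not part of upstream libvirt:

```xml
<!-- Hypothetical syntax from the proposed Solarflare patches;
     the ephemeral attribute was never merged upstream. -->
<network>
  <name>blah</name>
  <forward mode='hostdev' managed='yes' ephemeral='yes'>
    <address type='pci' domain='0' bus='4' slot='0' function='1'/>
    <address type='pci' domain='0' bus='4' slot='0' function='2'/>
    <!-- etc. -->
  </forward>
</network>
```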

That still leaves problem (2). It is solved externally - the guest must have a bond device that contains two network devices:

a) the physical adapter assigned via pci passthrough that we've been discussing
b) a virtio-net adapter attached to some other physical adapter via macvtap

The bond device will be set up such that adapter (a) is used whenever it is present, and adapter (b) takes over when (a) is removed. In this way, network connectivity is maintained (via the macvtap device) during migration, then switches back to the faster PCI passthrough device once migration is complete.
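As a sketch of the guest-visible side, the domain XML could define both interfaces roughly as below; the network name, source device, and MAC addresses are placeholders, and the active-backup bond itself is configured inside the guest OS, not by libvirt:

```xml
<!-- Sketch of the two guest interfaces to be bonded inside the guest.
     Network name, source dev, and MAC addresses are placeholders. -->
<interface type='network'>    <!-- (a) PCI passthrough, chosen from the hostdev pool -->
  <source network='blah'/>
  <mac address='52:54:00:11:22:33'/>
</interface>
<interface type='direct'>     <!-- (b) virtio-net fallback via macvtap -->
  <source dev='eth1' mode='bridge'/>
  <model type='virtio'/>
  <mac address='52:54:00:44:55:66'/>
</interface>
```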

The most recent version of Solarflare's patches are here:

  https://www.redhat.com/archives/libvir-list/2012-November/msg01324.html

and are reasonably close to going in upstream.

Comment 7 Rogan Kyuseok Lee 2015-01-28 09:32:17 UTC
If the same model of NIC is installed in the same PCI slot on both the source and destination hosts, and the NIC supports SR-IOV, is it technically possible to migrate a VM with a PCI passthrough network device between hosts?

What other factors can you think of that might interfere with this feature working?

Thanks in advance,
rogan

Comment 8 Laine Stump 2015-01-28 20:43:05 UTC
The description of this BZ states the obstacles fairly clearly, although the very first sentence could use more emphasis. It is technically impossible to migrate a guest that has any PCI passthrough devices attached, since the hardware itself contains state that qemu cannot know and therefore cannot migrate.

The method used by Solarflare was to automate the process of detaching the passed-through device before migration started, then attaching a new device once migration was complete and the guest had started up on the destination. This obviously requires cooperation from the guest, which must be able to cope with having the device temporarily removed. They solved this by bonding the passed-through device together with an emulated network device connected to the same physical net; during migration, performance would be degraded, but at least everything would still work.

Since, from the guest's point of view, it is not the same device on the source host and the destination host, it turns out that the actual hardware on the two hosts does not have to be identical - as long as the MAC address is set the same, the guest OS supports hotplug/unplug, and it can cope with two different drivers using the same MAC address at different times.

In the end it isn't simple (i.e. will take significant development time to get it right), and won't work with "just any" guest. That's why it still hasn't been done (at least not in a general way).

Comment 13 Laine Stump 2015-05-26 15:09:43 UTC
There has been a lot of discussion on this topic in two threads, initiated with patches from Chen Fan <chen.fan.fnst.com>:

  https://www.redhat.com/archives/libvir-list/2015-April/msg00803.html

(that thread has continued into the following month, which may require searching for the subject in the May archives)

and

  https://www.redhat.com/archives/libvir-list/2015-May/msg00384.html

The short form of all this from libvirt's point of view is that libvirt is not in the correct position to automatically do the detach and re-attach of the devices based on config options. I had previously been a proponent of that design, but the two mind-changing points for me were:

1) Consider the case where a guest has been migrated to the destination host and its CPUs restarted, but the auto-reattach fails. libvirt cannot report success to the management layer, because the guest is not in the same pre-migration state (modulo the move to a new host), but it also cannot report a failed migration, because that would indicate that the guest is still running on the original host. There are of course several ways this could be handled, but it is impossible for libvirt to pick a single recovery scheme that would work for everyone. Several other items would need to be configurable as well, for example the maximum amount of time to wait for the device detach to complete. By the time libvirt provided configuration options for all of these, and the management layer (e.g. OpenStack or oVirt) set up that configuration, it would be just as simple for management to itself issue the libvirt commands to detach the device on the source and reattach it on the destination after migration completes.

2) The only way that libvirt can usefully manage auto-detach and re-attach of devices is by utilizing its network device pools (a network with <forward mode='hostdev'>) - this is needed so that libvirt has a method of finding an equivalent device on the destination host (since it's impractical/unrealistic to expect that the device at the exact same PCI address as that on the source host will also be available on the destination). However, OpenStack *never* uses libvirt's network device pool; it has its own method of determining which device to use, outside the scope of libvirt. This means that even if libvirt did implement some sort of auto-detach/reattach, it would be unusable by OpenStack.

I'm closing the BZ. Feel free to re-open or open a new BZ requesting a modified form of participation by libvirt in this functionality.