Bug 1404963 - [RFE] fc_host support in virtio-scsi guests, with support for live migration
Summary: [RFE] fc_host support in virtio-scsi guests, with support for live migration
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.2
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 7.4
Assignee: John Ferlan
QA Contact: yisun
URL:
Whiteboard:
Depends On: 1349115 1404962
Blocks: 1349117
 
Reported: 2016-12-15 09:22 UTC by Martin Tessun
Modified: 2018-12-05 21:34 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-12-05 21:34:04 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Martin Tessun 2016-12-15 09:22:03 UTC
For implementing NPIV adapter passthrough with live migration, we need to add the corresponding fc_host support to libvirt:

* fc_host support in virtio-scsi guests, with support for live migration.  With this feature, each virtio-scsi device can switch between two WWPN/WWNN pairs on live migration.

See also https://bugzilla.redhat.com/show_bug.cgi?id=1349115#c10.

Comment 1 Paolo Bonzini 2016-12-15 11:58:57 UTC
This would be new XML under <controller>, for virtio-scsi only (and maybe Hyper-V):

<capability type='fc_host' mode='primary|secondary' state='active|inactive'>
    <wwnn>...</wwnn>
    <wwpn>...</wwpn>
</capability>

mode defaults to primary, state defaults to active.  Only one of the two can be active; if both primary and secondary are present, live migration flips active to inactive and vice versa.
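
For illustration, a controller carrying both pairs under this proposal might look roughly like the sketch below (this schema was never implemented, and the controller attributes and WWN values are placeholders only):

  <controller type='scsi' model='virtio-scsi' index='0'>
    <capability type='fc_host' mode='primary' state='active'>
      <wwnn>20000000c9831b4b</wwnn>
      <wwpn>10000000c9831b4b</wwpn>
    </capability>
    <capability type='fc_host' mode='secondary' state='inactive'>
      <wwnn>20000000c9831b4c</wwnn>
      <wwpn>10000000c9831b4c</wwpn>
    </capability>
  </controller>

On each live migration the two state attributes would swap, so the guest alternates between the two identities.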

Comment 2 Paolo Bonzini 2016-12-15 12:01:02 UTC
Alternatively, modeled on the storage pool XML:

<source mode='primary|secondary' state='active|inactive'>
  <adapter type='fc_host' wwnn='20000000c9831b4b' wwpn='10000000c9831b4b'/>
</source>

Comment 3 Paolo Bonzini 2016-12-15 12:08:42 UTC
Another thing: if possible, the VM would always start with the primary as the active address, so the state attribute would only apply to running VMs and snapshots.

Comment 4 John Ferlan 2016-12-17 14:44:29 UTC
What would the wwnn/wwpn pair for the controller be expected to be?

The wwpn/wwnn for the storage pool are the input to the vport_create command for the HBA in order to create the vHBA scsi_host. The parent argument is optional, to choose the parent HBA if the host has more than one available. If the patches for bz 1349696 get accepted, then it will be possible to pick the parent by its wwnn/wwpn or a fabric_wwn as well.

The purpose of the storage pool is to be able to create a vHBA with a "known" wwnn/wwpn pair as opposed to creating a 'transient' vHBA via nodedev-create with a generated wwnn/wwpn pair.
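
For reference, the existing vHBA storage pool being contrasted here is defined with XML along these lines (pool name, parent host, and WWNs are just examples):

  <pool type='scsi'>
    <name>vhba_pool</name>
    <source>
      <adapter type='fc_host' parent='scsi_host3' wwnn='20000000c9831b4b' wwpn='10000000c9831b4b'/>
    </source>
    <target>
      <path>/dev/disk/by-path</path>
    </target>
  </pool>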

Beyond that, though, it does seem a bit odd to have the <controller> have any XML to describe what essentially is or could become the vHBA. Certainly having a <source> element or an <adapter> element just doesn't feel right. I always consider the controller to be more like the 'bus' connecting the various parts together. This would seem to cross that boundary and ensure that only "one type of thing" could connect to that controller. That'd add more <controller> code to ensure other <devices> that don't have an <address> supplied don't choose any <controller> that is being used for vHBA/NPIV. Likewise, any <device> with a <controller bus='n'> where the <controller> bus number matches isn't used for anything else.

The <hostdev> would perhaps be a more appropriate vehicle in this case; this is perhaps what the last paragraph of my initial response to bz 1404962 was getting at.

One thing that might be possible is to not require the wwnn/wwpn and be more like the nodedev-create to force the code to generate the wwnn/wwpn when the vHBA is created. That way migration can freely create and destroy vHBAs as long as there's some way to uniquely identify which parent is to be used. Using just "parent" isn't good enough, since between reboots the parent scsi_hostX can change (e.g. the crux of the patches for bz 1349696).

Another option: if the <hostdev> listed a "fabric_wwn", then creation of a vHBA could be done on any host connected to that fabric. One downside of this is that I've been told not every driver provides it. The upside of using fabric_wwn is that no matter where the vHBA is migrated it could find/create a vHBA on the same fabric - which in the end is what I would think is desired. Whether the to-be-created vHBA is 'transient' or 'more permanent' could rely on whether a wwnn/wwpn pair is provided or generated (as can happen when using nodedev-create).
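
For context, a sketch of how such parent/fabric selection could look on the storage pool side, assuming the bz 1349696 work lands roughly as proposed (the parent_fabric_wwn attribute name and the WWN values here are illustrative, not confirmed):

  <source>
    <adapter type='fc_host' parent_fabric_wwn='2002000573de9a81' wwnn='20000000c9831b4b' wwpn='10000000c9831b4b'/>
  </source>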

As an aside, using primary/secondary would seem to limit migration between two hosts. The real goal is to ensure the same fabric is used. Whether the same wwnn/wwpn is used to create the vHBA perhaps matters to some as a mechanism to find the created vHBA. It's not like the <disk> XML requires it.

Comment 5 Paolo Bonzini 2016-12-21 11:44:59 UTC
> it does seem a bit odd to have the <controller> have any XML to describe
> what essentially is or could become the vHBA

This part of the proposal was modeled after <interface type='hostdev'>.  It was my understanding (based on past discussion with VFIO) that it was better to customize the element specific to a device type, than to add a device-type-specific schema to <hostdev>.
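
For reference, the <interface type='hostdev'> precedent mentioned here is the existing libvirt schema for handing a host PCI network device (for example an SR-IOV VF) to the guest; the PCI address and MAC below are only examples:

  <interface type='hostdev' managed='yes'>
    <source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </source>
    <mac address='52:54:00:6d:90:02'/>
  </interface>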

In this case, sub-elements of <controller> (queues, ioeventfd, iothread, etc.) would all apply to the passed-through HBA, so I chose to extend <controller>.

That said, I think that first of all we need to delimit the scope of the three bugs, and define how they are related.  In particular, this bug is independent from the NPIV passthrough functionality, and only considers passing fc_host info (WWNN/WWPN) to QEMU.  This bug, at least for testing, does _not_ need either target passthrough (bz 1404962) or FC hardware (unlike bz 1404964).

It is of course related to both bugs, and the XML defined here should be defined so that it can be used or extended for those two bugs.

> if the <hostdev> listed a "fabric_wwn", then creation of a vHBA could be
> done on any host connected to that fabric

This is interesting (though again related more to bz 1404964 than this one).  The question is whether such functionality should be baked into the <domain> XML, or whether to leave it to the storage pool instead.  If we leave it to storage pools, it would be a separate bug (separate from bz 1404964 as well).

One note:

> One thing that might be possible is to not require the wwnn/wwpn and be more
> like the nodedev-create to force the code to generate the wwnn/wwpn when the
> vHBA is created

> The upside of using fabric_wwn is that no matter where the vHBA is migrated 
> it could find/create a vHBA on the same fabric - which in the end is what I 
> would think is desired

Because of zoning, you cannot just use a random vHBA on the same fabric; you need to use one with the provided WWNN/WWPN.  There is a pair of primary/secondary WWNN/WWPN per VM, not a WWNN/WWPN per host.

At least, you need to use one of the two provided WWNN/WWPNs.  So if a vHBA already exists on the host for the secondary, I guess in this case it would be okay to start a VM with the secondary.

> As an aside, using primary/secondary would seem to limit migration between 
> two hosts.

No, it doesn't, as long as each migration flips from primary to secondary and back.  So a migration from A to B to C would use primary for A and C, and secondary for B.

Comment 6 John Ferlan 2016-12-21 16:45:12 UTC
I'm stuck on the primary/secondary terminology, the need to define this in/for the controller, and the relationship with the other two bugs. 

If libvirt is going to be managing the creation of the vHBA in the domain using a specific wwnn/wwpn in order to pass through the vHBA LUNs to the domain, then I'm missing what/why the controller needs the wwnn/wwpn. Perhaps that's part of the migration magic (something I don't have 'in memory' right now).

I guess I assume now after thinking a bit about this that the migration code would ship over the source domain XML and the target would be able to create the vHBA in the same manner as the source, except using what is being referred to here as secondary wwnn/wwpn would be the primary on the target and the source primary would become the target secondary. 

Maybe things will become clearer the more I start working on/in the code.

The fabric_wwn is something suggested in a different bug as a way to find an HBA, and yes, it would seemingly not be related to this particular bz.

Not sure it helps, but the existing <disk> XML:

  <disk type='volume' device='disk'>
    <driver name='qemu' type='raw'/>
    <source pool='vhba3_pool_byparent' volume='unit:0:4:0'/>
    <target dev='hda' bus='ide'/>
  </disk>

creates the qemu command options:

  -drive file=/dev/disk/by-path/pci-0000:10:00.0-fc-0x5006016844602198-lun-0,format=raw,if=none,id=drive-ide0-0-0
  -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0

and 

  <disk type='volume' device='lun'>
    <driver name='qemu' type='raw'/>
    <source pool='vhba3_pool_byparent' volume='unit:0:5:0'/>
    <address type='drive' controller='0' bus='0' target='0' unit='0'/>
  </disk>

creates the qemu command options:

  -drive file=/dev/disk/by-path/pci-0000:10:00.0-fc-0x5006016044602198-lun-0,format=raw,if=none,id=drive-scsi0-0-0
  -device scsi-block,bus=scsi0.0,scsi-id=0,drive=drive-scsi0-0-0,id=scsi0-0-0 

Those two paths are just some /dev/sdN device on the host (udev magic).

The <source pool='$NAME'> would I think end up being replaced by a libvirt domain managed scsi_hostN and scsi_target{X|Y|Z}, but the result would be the same - the domain code would need to find a path to the scsi_target luns in order to formulate the qemu command line to "find" the disk/lun based <hostdev> or <vhba> logic.

Comment 7 Martin Tessun 2017-01-27 14:48:52 UTC
Hi John,

(In reply to John Ferlan from comment #6)
> I'm stuck on the primary/secondary terminology, the need to define this
> in/for the controller, and the relationship with the other two bugs. 


I will set up a call once I am back from my travels to discuss this with the qemu team as well as the RHV team, so we get a common understanding.
Still, let me try to answer these questions.

> If libvirt is going to be managing the creation of the vHBA in the domain
> using a specific wwnn/wwpn in order to pass through the vHBA LUNs to the
> domain, then I'm missing what/why the controller needs the wwnn/wwpn.
> Perhaps that's part of the migration magic (something I don't have 'in memory'
> right now).

I don't think that qemu needs the wwnn/wwpn, but libvirt needs this information for creating the "correct" NPIV devices on the host.

> I guess I assume now after thinking a bit about this that the migration code
> would ship over the source domain XML and the target would be able to create
> the vHBA in the same manner as the source, except using what is being
> referred to here as secondary wwnn/wwpn would be the primary on the target
> and the source primary would become the target secondary. 

Exactly. This is to avoid having the same wwnn/wwpn twice on the fabric. So from a SAN point of view, the LUNs/devices need to be presented to both (primary and secondary) wwnn/wwpn pairs. Thus the target VM can attach the relevant devices for migration.

The important part is that these wwnn/wwpn pairs are always the same and don't change with every migration, but switch from primary to secondary and, with the next migration, back to primary.

> Maybe things will become clearer the more I start working on/in the code.
> 
> The fabric_wwn is something suggested in a different bug as a way to find an
> HBA, and yes, it would seemingly not be related to this particular bz.
> 
> Not sure it helps, but the existing <disk> XML:
> 
>   <disk type='volume' device='disk'>
>     <driver name='qemu' type='raw'/>
>     <source pool='vhba3_pool_byparent' volume='unit:0:4:0'/>
>     <target dev='hda' bus='ide'/>
>   </disk>
> 
> creates the qemu command options:
> 
>   -drive
> file=/dev/disk/by-path/pci-0000:10:00.0-fc-0x5006016844602198-lun-0,
> format=raw,if=none,id=drive-ide0-0-0
>   -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0
> 
> and 
> 
>   <disk type='volume' device='lun'>
>     <driver name='qemu' type='raw'/>
>     <source pool='vhba3_pool_byparent' volume='unit:0:5:0'/>
>     <address type='drive' controller='0' bus='0' target='0' unit='0'/>
>   </disk>
> 
> creates the qemu command options:
> 
>   -drive
> file=/dev/disk/by-path/pci-0000:10:00.0-fc-0x5006016044602198-lun-0,
> format=raw,if=none,id=drive-scsi0-0-0
>   -device
> scsi-block,bus=scsi0.0,scsi-id=0,drive=drive-scsi0-0-0,id=scsi0-0-0 
> 
> Those two paths are just some /dev/sdN device on the host (udev magic).
> 
> The <source pool='$NAME'> would I think end up being replaced by a libvirt
> domain managed scsi_hostN and scsi_target{X|Y|Z}, but the result would be
> the same - the domain code would need to find a path to the scsi_target luns
> in order to formulate the qemu command line to "find" the disk/lun based
> <hostdev> or <vhba> logic.

Hm, maybe we can even use the pool logic here and just not specify the volume identification to get the "NPIV adapter passthrough" behaviour. Still, libvirt would need to create the pools, attach the discovered volumes, monitor changes, and promote these to the VM.
Also, the migrations need to make sure that the same devices are added back to the VM (so the pools need to be configured to have access to the same SAN devices).

Anyway, let's discuss this together with the qemu and RHV teams.

Comment 8 Paolo Bonzini 2017-01-31 15:53:08 UTC
> I guess I assume now after thinking a bit about this that the migration code 
> would ship over the source domain XML and the target would be able to create 
> the vHBA in the same manner as the source, except using what is being referred 
> to here as secondary wwnn/wwpn would be the primary on the target and the 
> source primary would become the target secondary. 

Almost... If the virtio-scsi driver needs to expose fc_host parameters in sysfs, QEMU needs to know them.  Because we're modelling the API over what Hyper-V has done before, the primary/secondary wouldn't change, instead QEMU would have an option to pick between the primary and secondary.

The migration code (either source or destination, no idea which is better) would flip the primary to inactive and the secondary to active.

>     <source pool='vhba3_pool_byparent' volume='unit:0:5:0'/>

This is disk information, while wwnn/wwpn is controller information.  But it does look similar to my alternative suggestion in comment 2:

  <source mode='primary|secondary' state='active|inactive'>
    <adapter type='fc_host' wwnn='20000000c9831b4b' wwpn='10000000c9831b4b'/>
  </source>

which would be part of <controller> or, for the case of SCSI target passthrough (bug 1404962), <hostdev>.  You could then have two sources in <hostdev> for the bug 1404962 case:

    <source pool='vhba3_pool_byparent' mode='primary' state='active'/>
    <source pool='vhba4_pool_byparent' mode='secondary' state='inactive'/>

Comment 9 John Ferlan 2017-01-31 20:54:08 UTC
(In reply to Paolo Bonzini from comment #8)
> > I guess I assume now after thinking a bit about this that the migration code 
> > would ship over the source domain XML and the target would be able to create 
> > the vHBA in the same manner as the source, except using what is being referred 
> > to here as secondary wwnn/wwpn would be the primary on the target and the 
> > source primary would become the target secondary. 
> 
> Almost... If the virtio-scsi driver needs to expose fc_host parameters in
> sysfs, QEMU needs to know them.  Because we're modelling the API over what
> Hyper-V has done before, the primary/secondary wouldn't change, instead QEMU
> would have an option to pick between the primary and secondary.
> 

I'm not clear how to interpret the above paragraph.

I am under the impression that the domain vHBA wasn't being passed through to QEMU, but rather that the LUNs associated with the vHBA were being added to (or deleted from) the guest just as would happen if someone added/removed a single LUN using the storage pool (via coldplug or hotplug). The primary difference is that for a domain vHBA the addition/deletion is automagically done via the event notification and hotplug. So while, yes, there is a controller, what I'm missing is how it differs from what already exists for a controller that's being used for storage pool LUNs. That is: is there a "new" parameter that is expected to be passed along to QEMU by these RFEs?

The <controller> being described here would be required to be the same controller model as the storage pool uses - OK, aside from the new fields to describe the adapter parent and the wwnn/wwpn for the vHBA.

> The migration code (either source or destination, no idea which is better)
> would flip the primary to inactive and the secondary to active.
> 
> >     <source pool='vhba3_pool_byparent' volume='unit:0:5:0'/>
> 
> This is disk information, while wwnn/wwpn is controller information.  But it
> does look similar to my alternative suggestion in comment 2:
> 
>   <source mode='primary|secondary' state='active|inactive'>
>     <adapter type='fc_host' wwnn='20000000c9831b4b' wwpn='10000000c9831b4b'/>
>   </source>
> 
> which would be part of <controller> or, for the case of SCSI target
> passthrough (bug 1404962), <hostdev>.  You could then have two sources in
> <hostdev> for the bug 1404962 case:
> 
>     <source pool='vhba3_pool_byparent' mode='primary' state='active'/>
>     <source pool='vhba4_pool_byparent' mode='secondary' state='inactive'/>

The whole point of the example was to show how a LUN is currently sent to qemu using the storage pool.

So after working through the code, augmenting the <controller> would appear to be better than creating a new device type (e.g. <vhba>) whose only purpose is to define the adapter and use the controller.

Rather than primary/secondary mode and active/inactive state, perhaps an attribute "standby" on the controller would be better - although it's not clear yet whether it's necessary. Beyond that, having a <source pool=...> in the <controller> would seem to mean there'd need to be a storage pool defined before the controller. While that would somewhat simplify a few things, it would seemingly then require a storage pool to be defined. If that's the case, then why even have a changed controller object - instead, have the mgmt software above libvirt listen for the existing "scsi_target" node device events to add/remove LUNs for the guest. Of course, knowing which scsi_target is a real LUN would be determined by taking the path from the node dev XML and searching for the "block" device, like libvirt does.

Just pointing out that it's already possible to add the LUNs - primarily, these RFEs are just making it easier for the mgmt app.
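
To make the "already possible" path concrete, a management app could roughly do the following today (a sketch only; the device and guest names are made up, nodedev events need libvirt >= 2.2, and lun.xml would be a <disk device='lun'> definition built from the discovered path):

  # watch node device lifecycle events, looking for new scsi_target devices
  virsh nodedev-event --event lifecycle --loop

  # inspect a newly reported target; its sysfs <path> leads to the block device
  virsh nodedev-dumpxml scsi_target10_0_0

  # hot-add the corresponding LUN to the running guest
  virsh attach-device rhel7-guest lun.xml --live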

OK - so back to this primary/secondary thing. I'm still a bit unclear on a few things - I understand the concept, but I'm wondering if it's necessary. 

Does anyone have an environment with two hosts connected to the same HBA? 

Why wouldn't creating a vHBA using a wwnn/wwpn cause all hosts connected to the HBA to receive some udev notification that a new scsi_host exists (e.g. the vHBA), resulting in the creation of a vHBA on each host connected to the HBA?

After all, the wwnn/wwpn is supposed to be unique - so what would preclude someone (other than our own knowledge) from sending the same wwnn/wwpn to the HBA via the "vport_create" function?

If something precludes that, then it would seem "logical" (ok at least to me) that all hosts connected to the HBA would be able to "see" or "know about" the vHBA. If that's the case, then I see no reason for primary/secondary as the vHBA would already be defined.
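
For reference, the "vport_create" mentioned above is the kernel's NPIV sysfs interface on the physical HBA; creating and deleting a vHBA by hand looks roughly like this (the host number and WWNs are examples, and the value is the wwpn:wwnn pair as hex strings):

  # create a vHBA (vport) on physical HBA host3
  echo '10000000c9831b4b:20000000c9831b4b' > /sys/class/fc_host/host3/vport_create

  # tear the same vport down again
  echo '10000000c9831b4b:20000000c9831b4b' > /sys/class/fc_host/host3/vport_delete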

Back to the code....

Comment 10 John Ferlan 2017-03-07 13:14:15 UTC
Just marking this bz the same as the others with a condnak (conditional NAK) of design.  I also altered the dependency tree. In the end, the first thing that's needed is a way to automatically add/remove LUNs (e.g. bz 1404964), followed by a way to define a vHBA in the domain/guest (e.g. bz 1404962), and ultimately a mechanism to ensure migration works smoothly.

Comment 13 Ademar Reis 2018-12-05 21:34:04 UTC
This is not the approach we're pursuing anymore, so I'm closing the BZ (the QEMU BZ is also closed: bug 1349115).

