Bug 1404964
| Summary: | [RFE] Add Automatic NPIV nodedev creation/deletion | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Martin Tessun <mtessun> |
| Component: | libvirt | Assignee: | John Ferlan <jferlan> |
| Status: | CLOSED WONTFIX | QA Contact: | yisun |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.2 | CC: | coli, dyuan, jfehlig, jsuchane, michen, pbonzini, rbalakri, xuwei, xuzhang |
| Target Milestone: | pre-dev-freeze | Keywords: | FutureFeature |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-12-06 17:20:03 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1349117, 1404962 | | |
Description
Martin Tessun
2016-12-15 09:29:16 UTC
This would extend the XML in bug 1404963 with a "managed='yes|no'" attribute, for example:

    <source mode='primary' state='active' managed='yes'>
      <adapter type='fc_host' wwnn='20000000c9831b4b' wwpn='10000000c9831b4b'/>
    </source>
    <source mode='secondary' state='inactive' managed='yes'>
      <adapter type='fc_host' wwnn='20000000fdc531bb' wwpn='10000000fdc531bb'/>
    </source>

An optional attribute "parent" for the adapter element would work the same as for vHBA storage pools.

Another possibility might be to specify the source as a vHBA storage pool, like:

    <source mode='primary' state='active' pool='vhba1'/>
    <source mode='secondary' state='inactive' pool='vhba2'/>

but RHV probably would not use it.

I think as I noted in bz 1404963 - the usage of primary/secondary would seem to imply only two hosts could be used for migration. In the long run, vHBA usage would want to ensure a specific HBA was used.

In any case, I would think that managed would be part of the design for bz 1404963 from the start and not as an add-on follow-up.

Currently it's documented that "For SCSI devices, user is responsible to make sure the device is not used by host." So perhaps for vHBA's this gets "extended" a bit:

If managed=no, then it would be expected that the vHBA exists in some manner, and domain startup would fail if it couldn't be found.

If managed=yes, then a vHBA would be created at domain startup and destroyed at domain shutdown.

The primary reason to use "no" would be if someone had a storage pool which created the vHBA's that used a specific wwnn/wwpn pair. Although it wouldn't necessarily matter if the wwnn/wwpn for the vHBA were generated, as long as a particular parent was used. There can be many vHBA's per HBA (limited by the "vports" count against "max_vports").

In the long run it matters what the wwnn/wwpn are meant to be when used from the domain XML. At this point I'm wondering why wwnn/wwpn are required for the storage pool and why they couldn't be generated (different issue though).
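To make the managed='yes' semantics discussed here a bit more concrete, the following is a minimal, purely illustrative sketch (libvirt Python bindings) of the create-at-startup/destroy-at-shutdown steps such an implementation would effectively automate. The parent HBA name is a placeholder; the wwnn/wwpn elements are deliberately omitted so that libvirt generates a pair.

```python
import libvirt

# Hypothetical parent HBA; in practice one whose fc_host capability still has
# spare vports ("virsh nodedev-dumpxml scsi_hostN" reports vports/max_vports).
VHBA_XML = """
<device>
  <parent>scsi_host5</parent>
  <capability type='scsi_host'>
    <capability type='fc_host'/>
  </capability>
</device>
"""

conn = libvirt.open('qemu:///system')

# With no <wwnn>/<wwpn> elements supplied, libvirt generates a pair for the
# new vHBA; supplying them instead pins the vHBA to a fixed identity, which
# is what SAN zoning and LUN presentation require.
vhba = conn.nodeDeviceCreateXML(VHBA_XML, 0)
print(vhba.name())       # e.g. scsi_host6
print(vhba.XMLDesc(0))   # includes the (generated) wwnn/wwpn

# A managed='yes' implementation would do the equivalent of this at domain
# shutdown (or after the guest has been migrated away):
vhba.destroy()
```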
(In reply to John Ferlan from comment #2)
> I think as I noted in bz 1404963 - the usage of primary/secondary would seem
> to imply only two hosts could be used for migration. In the long run, vHBA
> usage would want to ensure a specific HBA was used.

Hm, I think I don't get your point here. Assuming we have pvWWN and svWWN (primary and secondary WWNN/WWPN) and HostA, HostB and HostC to migrate that VM between:

HostA:
- VM currently active
- using pvWWN

Now migrating to HostB.

HostB:
- set up svWWN
- migrate guest

HostA:
- after successful migration destroy pvWWN

Now migrating to HostC.

HostC:
- set up pvWWN (being the secondary right now)
- migrate guest

HostB:
- after successful migration destroy svWWN

As I believe the same vWWN cannot be used across two servers simultaneously (so each vWWN needs to exist only once at a specific time), we need this switch between these two WWNs. Please correct me if I am wrong.

> In any case, I would think that managed would be part of the design for bz
> 1404963 from the start and not as an add-on follow-up.

Agreed.

> Currently it's documented that "For SCSI devices, user is responsible to
> make sure the device is not used by host." So perhaps for vHBA's this gets
> "extended" a bit:
>
> If managed=no, then it would be expected that the vHBA exists in some
> manner, and domain startup would fail if it couldn't be found.
>
> If managed=yes, then a vHBA would be created at domain startup and
> destroyed at domain shutdown.

Adding here "or in case the VM is migrated away".

> The primary reason to use "no" would be if someone had a storage pool which
> created the vHBA's that used a specific wwnn/wwpn pair. Although it wouldn't
> necessarily matter if the wwnn/wwpn for the vHBA were generated, as long as a
> particular parent was used. There can be many vHBA's per HBA (limited by the
> "vports" count against "max_vports").

Exactly. Also some other management layer (OSP/RHV) could create the ports on their own if needed. But I don't think that they would use managed=no just to reinvent the wheel ;)

> In the long run it matters what the wwnn/wwpn are meant to be when used
> from the domain XML. At this point I'm wondering why wwnn/wwpn are required
> for the storage pool and why they couldn't be generated (different issue
> though).

Because the presented LUNs and the zoning in the SAN need to stay intact. So probably they can be "generated" the first time the VM is defined, but from then on they need to stay static.

I agree entirely with everything that Martin said, except that:

> > In any case, I would think that managed would be part of the design for bz
> > 1404963 from the start and not as an add-on follow-up.

... managed is not part of bz 1404963, because managed='no' matters not only if some other layer creates the ports(*) but also because for testing you could create an fc_host virtio-scsi controller with emulated devices. Of course this configuration makes zero sense in production.

(*) Regarding this, I think neither OSP nor RHV are using storage pools, so I do think they'd reinvent the wheel (just like they don't even use persistent domains). Nevertheless, I think managed='yes' makes sense at the libvirt level to complete the integration of domains with storage pools.

The need for a defined wwnn/wwpn for a vHBA and zoning is a detail that I don't keep fresh in memory... But yes, with that in mind I can see the reasoning behind two defined wwnn/wwpn's, as opposed to what I had in memory of being able to generate a wwnn/wwpn (like the nodedev-create code can). Still, something to document, e.g. why wwnn/wwpn are "required" or "optional".

The primary/secondary notation just seems awkward - it makes me think of failover capabilities. If they're to be used for migration, then perhaps we need a better name/syntax to call that out.

I have tried to put together a libvirt test for vHBA/NPIV utilizing the existing infrastructure, but the udev event processing continues to present obstacles and it just hasn't been a high priority. There is a sort of chicken/egg problem between udev notification and qemu hotplug that'll need to be worked out with 1404963. The infrastructure could exist prior to it actually doing anything, but doing anything requires the ability to actually find and hotplug the device that was found.

Keeping track of things across libvirtd restarts will be a challenge too. The restart code would need to essentially mimic what pool-refresh does, except that it needs to be smarter when traversing the tree and be able to distinguish between existing devices, new devices, and deleted devices since the last start within the "active" domain XML, and then of course issue the appropriate qmp commands as a result. Thus, it seems the active domain is going to need to keep track of udev's seen over time, but that won't be written out to the config domain XML...

> There is a sort of chicken/egg problem between udev notification and qemu
> hotplug that'll need to be worked out with 1404963

Did you mean bug 1404962?
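Tying back to the HostA/HostB/HostC sequence Martin describes in his reply above, this is a rough sketch of what a management layer (or a future managed='yes' implementation) would have to drive around each migration, using the libvirt Python bindings. The connection URIs, domain name, parent HBA and vHBA node-device name are all hypothetical; the WWN pair is the secondary one from the description.

```python
import libvirt

# Node-device XML carrying the standby WWN pair (svWWN); parent is a placeholder.
SV_VHBA_XML = """
<device>
  <parent>scsi_host5</parent>
  <capability type='scsi_host'>
    <capability type='fc_host'>
      <wwnn>20000000fdc531bb</wwnn>
      <wwpn>10000000fdc531bb</wwpn>
    </capability>
  </capability>
</device>
"""

src = libvirt.open('qemu+ssh://hostA/system')   # hypothetical URIs
dst = libvirt.open('qemu+ssh://hostB/system')

# 1. Bring up the standby vHBA (svWWN) on the destination and wait for its
#    LUNs to appear there (udev timing, as discussed later in this thread).
dst.nodeDeviceCreateXML(SV_VHBA_XML, 0)

# 2. Live-migrate the guest.
dom = src.lookupByName('guest')                 # hypothetical domain name
dom.migrate(dst, libvirt.VIR_MIGRATE_LIVE, None, None, 0)

# 3. Tear down the now-unused primary vHBA (pvWWN) on the source, so the same
#    vWWN never exists on two hosts at once.
src.nodeDeviceLookupByName('scsi_host6').destroy()   # hypothetical vHBA name
```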
> Thus, it seems the active domain is going to need to keep track of udev's seen
> over time, but that won't be written out to the config domain XML...

Can you use query-block to get this from QEMU? In any case, I think hotplug should be a separate bug, and 7.4 cond-nak capacity.

(In reply to Paolo Bonzini from comment #6)
> > There is a sort of chicken/egg problem between udev notification and qemu
> > hotplug that'll need to be worked out with 1404963
>
> Did you mean bug 1404962?

Yes

> > Thus, it seems the active domain is going to need to keep track of udev's seen
> > over time, but that won't be written out to the config domain XML...
>
> Can you use query-block to get this from QEMU? In any case, I think hotplug
> should be a separate bug, and 7.4 cond-nak capacity.

That last paragraph wasn't about hotplug - without writing any code at this point, I'm assuming LUN presentation is going to be all hotplug. The issue is the "time" it takes between writing wwnn:wwpn to the HBA vport_create and the "wait" for udev to do its magic, *plus* the time to make sure the data in the vHBA files is correct (read this gem - bz 1319544), and then the time it takes for udev to finish populating the LUN's in the target directory (storage pools essentially survey /dev/disk/{by-path|by-uuid|by-id} looking for details). There's no way for domain startup to determine whether udev is done - so we cannot wait on something prior to creating the qemu command line that'll be used to start the guest. IOW: all this code will be some asynchronous domain thread/job.

The last paragraph was a reminder (mostly to myself) that this code will have to deal with the libvirtd stop/restart case, where we have to read the active domain XML's in the hypervisor state directory in order to set libvirtd state back up (for a privileged domain, /var/run/libvirt/qemu/$DOMAIN.xml). In libvirt terms it's the qemuProcessReconnect processing via qemuProcessReconnectAll. For a storage pool, processing a pool refresh (e.g. LUN lookup) is accomplished by essentially clearing the internal pool state and performing a new/fresh search. It's the classic choice of, rather than determining whether the next object you found already exists before deciding what to do, just removing the object and repopulating your list. For libvirtd restart processing that cannot be done...

We'll see if query-block would work. I'm not sure at this point as I suspect libvirt needs to store more information about the device than what gets sent to QEMU.
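A toy illustration of the "smarter than pool-refresh" bookkeeping described above: instead of clearing and repopulating, the restart path would have to diff what was recorded for the active domain against what the node-device list reports now. This is a sketch only, using the libvirt Python bindings and hypothetical device names.

```python
import libvirt

conn = libvirt.open('qemu:///system')

def scsi_node_devices(c):
    """Names of the SCSI-related node devices libvirt currently knows about."""
    return {d.name() for d in c.listAllDevices(0) if d.name().startswith('scsi_')}

# What the active domain state recorded before libvirtd stopped (hypothetical).
previously_seen = {'scsi_host6', 'scsi_5_0_0_0'}

now_seen = scsi_node_devices(conn)

new_devices = now_seen - previously_seen    # would need hot-plugging into the guest
gone_devices = previously_seen - now_seen   # would need hot-unplugging
unchanged = now_seen & previously_seen      # already wired up, nothing to do
```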
> We'll see if query-block would work. I'm not sure at this point as I suspect
> libvirt needs to store more information about the device than what gets sent
> to QEMU.
Indeed. Maybe you could store it in advance, and then use query-block to confirm that the QEMU device is actually there.
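Along the lines Paolo suggests, here is a hedged sketch of cross-checking the recorded devices against query-block through the QEMU monitor passthrough. The domain name and drive alias are placeholders, and this assumes the LUNs end up as QEMU block devices (rather than SCSI generic passthrough, which query-block would not report).

```python
import json
import libvirt
import libvirt_qemu   # monitor passthrough lives in the libvirt-qemu binding

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('guest')              # hypothetical domain name

# Ask QEMU directly which block devices it currently has.
reply = json.loads(libvirt_qemu.qemuMonitorCommand(
    dom, '{"execute": "query-block"}', 0))
present = {blk.get('device') for blk in reply['return']}

# Compare against what was recorded as attached beforehand.
expected = {'drive-scsi0-0-0-1'}              # hypothetical drive alias
missing = expected - present
```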
Some patches were posted upstream (http://www.redhat.com/archives/libvir-list/2017-February/msg01091.html) that were essentially the basis for the planned mechanism to support adding/removing LUNs to a domain. However, those patches, and the concept behind them, have not been accepted. In fact the follow-up dialogue regarding what the patches do and the ultimate goal for the related bzs 1404962 and 1404963 indicates that, at an architectural level, the desire is to not have one device (e.g. a vHBA) turn into many devices (e.g. LUNs) which all need tracking. From that same architectural viewpoint, it would seem the desire is that the mgmt apps using libvirt would handle adding/removing LUN's when the node device events are seen. How they set up the vHBA (via nodedev or storage pool) is up to them. I've changed the condnak to design... It may be that all these related bz's get closed, indicating this type of work isn't going to be done in libvirt.

The kernel and QEMU BZs have been deferred to RHEL-7.6 as the design is still under discussion.

Based on email discussion, closing this as WONTFIX since the storage/HBA vendors did not want to invest more in the NPIV/vHBA infrastructure. A new methodology is "under discussion".
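For reference, the "mgmt apps react to node device events" direction mentioned in the architectural comment above could look roughly like the following with the libvirt Python bindings. This is a rough sketch only, assuming a libvirt recent enough to deliver node-device lifecycle events; the reaction logic is left as comments.

```python
import libvirt

def lifecycle_cb(conn, dev, event, detail, opaque):
    # A management app could react here: build <disk>/<hostdev> XML for a new
    # LUN and call dom.attachDeviceFlags(xml, libvirt.VIR_DOMAIN_AFFECT_LIVE),
    # or detach the corresponding guest device when the LUN disappears.
    if event == libvirt.VIR_NODE_DEVICE_EVENT_CREATED:
        print('node device appeared:', dev.name())
    elif event == libvirt.VIR_NODE_DEVICE_EVENT_DELETED:
        print('node device removed:', dev.name())

libvirt.virEventRegisterDefaultImpl()
conn = libvirt.open('qemu:///system')
conn.nodeDeviceEventRegisterAny(
    None, libvirt.VIR_NODE_DEVICE_EVENT_ID_LIFECYCLE, lifecycle_cb, None)

while True:
    libvirt.virEventRunDefaultImpl()
```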