Bug 1657468
| Field | Value |
|---|---|
| Summary | Different behaviors when creating a storage pool via NPIV in RHEL 7.5 and RHEL 7.6 |
| Product | Red Hat Enterprise Linux 7 |
| Component | libvirt |
| Version | 7.6 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | high |
| Reporter | Adam Xu <xingya.xu> |
| Assignee | John Ferlan <jferlan> |
| QA Contact | yisun |
| CC | gveitmic, hhan, jdenemar, jferlan, meili, mkalinin, sirao, xingya.xu, xuzhang, yalzhang |
| Target Milestone | rc |
| Keywords | ZStream |
| Fixed In Version | libvirt-4.5.0-11.el7 |
| Doc Type | No Doc Update |
| Clones | 1687715 (view as bug list) |
| Bug Blocks | 1687715 |
| Type | Bug |
| Last Closed | 2019-08-06 13:14:02 UTC |
Description (Adam Xu, 2018-12-08 16:27:29 UTC)
yisun:

Hi Adam, for the issue you met, please help to check the following:

1. What does "multipath -ll" show? In my 7.6 env, things are as follows (scsi_host13 is the vHBA):

```
# lsscsi
....
[13:0:0:0]  disk  IBM  1726-4xx FAStT  0617  /dev/sdd
[13:0:0:1]  disk  IBM  1726-4xx FAStT  0617  /dev/sde
[13:0:1:0]  disk  IBM  1726-4xx FAStT  0617  /dev/sdf
[13:0:1:1]  disk  IBM  1726-4xx FAStT  0617  /dev/sdg

# multipath -ll
mpathd (3600a0b80005b10ca00005e115729093f) dm-14 IBM ,1726-4xx FAStT
size=10G features='1 queue_if_no_path' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=6 status=active
| `- 13:0:0:1 sde 8:64 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 13:0:1:1 sdg 8:96 active ghost running
mpathc (3600a0b80005b0acc00004f875728fe8e) dm-13 IBM ,1726-4xx FAStT
size=10G features='1 queue_if_no_path' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=6 status=active
| `- 13:0:1:0 sdf 8:80 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 13:0:0:0 sdd 8:48 active ghost running

# virsh vol-list test
Name                 Path
------------------------------------------------------------------------------
unit:0:0:1           /dev/disk/by-path/pci-0000:95:00.0-vport-0x2101001b32a90000-fc-0x203500a0b85b0acc-lun-1
unit:0:1:0           /dev/disk/by-path/pci-0000:95:00.0-vport-0x2101001b32a90000-fc-0x203400a0b85b0acc-lun-0
```

So only the paths that are "ready running" show up in the vol list when they are just multipath devices pointing to the same backend LUN.

2. For your step 4, it's an expected change. When starting an NPIV pool with a wwnn/wwpn indicated, libvirt tries to create a vHBA with that wwnn/wwpn, so an existing vHBA with the same wwnn/wwpn blocks the process and an error is reported. You don't need to reboot the KVM server; you can "nodedev-destroy" the existing vHBA and then define/start the pool, as sketched below.
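A minimal sequence for that recovery path, using the names from this thread (scsi_host13 from the nodedev-create step, and the vhbapool_host1.xml definition quoted later in the thread; the pool name vhbapool_host1 is an assumption based on the file name):

```
# remove the transient vHBA that nodedev-create made, freeing its wwnn/wwpn
virsh nodedev-destroy scsi_host13

# the pool can now create the vHBA itself with the same wwnn/wwpn
virsh pool-define vhbapool_host1.xml
virsh pool-start vhbapool_host1
```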
Adam Xu (comment 3):

In the RHEL 7.6 server, it shows:

```
# multipath -ll
mpatha (360002ac000000000000000270001e23d) dm-3 3PARdata,VV
size=256G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 13:0:0:0 sdb 8:16  active ready running
  |- 13:0:1:0 sdc 8:32  active ready running
  |- 13:0:2:0 sdd 8:48  active ready running
  |- 13:0:3:0 sde 8:64  active ready running
  |- 14:0:0:0 sdf 8:80  active ready running
  |- 14:0:1:0 sdg 8:96  active ready running
  |- 14:0:2:0 sdh 8:112 active ready running
  `- 14:0:3:0 sdi 8:128 active ready running
```

In another RHEL 7.5 server, it shows:

```
[root@kvm2 ~]# multipath -ll
mpatha (360002ac0000000000000005b0001e23d) dm-3 3PARdata,VV
size=256G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 13:0:0:0 sdb 8:16  active ready running
  |- 13:0:1:0 sdd 8:48  active ready running
  |- 13:0:2:0 sdf 8:80  active ready running
  |- 13:0:3:0 sdi 8:128 active ready running
  |- 14:0:0:0 sdc 8:32  active ready running
  |- 14:0:1:0 sde 8:64  active ready running
  |- 14:0:2:0 sdg 8:96  active ready running
  `- 14:0:3:0 sdh 8:112 active ready running
```

Almost the same in 7.5 and 7.6. But "virsh vol-list poolname" in RHEL 7.6 shows:

```
Name                 Path
------------------------------------------------------------------------------
unit:0:0:0           /dev/disk/by-path/pci-0000:05:00.0-vport-0x5001a4a000000001-fc-0x21010002ac01e23d-lun-0
```

while in another 7.5 it shows:

```
Name                 Path
------------------------------------------------------------------------------
unit:0:0:0           /dev/disk/by-path/pci-0000:04:00.0-vport-0x5001a4ac1102f84d-fc-0x21010002ac01e23d-lun-0
unit:0:1:0           /dev/disk/by-path/pci-0000:04:00.0-vport-0x5001a4ac1102f84d-fc-0x22010002ac01e23d-lun-0
unit:0:2:0           /dev/disk/by-path/pci-0000:04:00.0-vport-0x5001a4ac1102f84d-fc-0x23010002ac01e23d-lun-0
unit:0:3:0           /dev/disk/by-path/pci-0000:04:00.0-vport-0x5001a4ac1102f84d-fc-0x20010002ac01e23d-lun-0
```

Because of this change, a VM using the vHBA fails to boot, with an error like "unit:0:3:0 missing".

yisun:

(In reply to Adam Xu from comment #3)
> In the RHEL 7.6 server, it shows
> # multipath -ll

This looks like an intentional change, but I do not have an exactly matching environment to debug it. Could you please attach the libvirtd log by following these steps?

1. Turn on the libvirtd debug log by editing the conf:
   # vim /etc/libvirt/libvirtd.conf
   log_outputs="1:file:/var/log/libvirtd-debug.log"
   log_level=1
2. Restart libvirtd to enable the debug log:
   # service libvirtd restart
   Redirecting to /bin/systemctl restart libvirtd.service
3. Clear your env to make sure there is no existing NPIV pool or vHBA.
4. Clear the debug log:
   # echo "" > /var/log/libvirtd-debug.log
5. Create/start the NPIV pool again.
6. Upload the debug log /var/log/libvirtd-debug.log here in Bugzilla.
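For convenience, steps 1, 2, and 4 as one copy-paste block (a sketch; stock RHEL paths assumed):

```
# enable libvirtd debug logging (steps 1-2)
cat >> /etc/libvirt/libvirtd.conf <<'EOF'
log_outputs="1:file:/var/log/libvirtd-debug.log"
log_level=1
EOF
systemctl restart libvirtd

# step 4: start from an empty log before reproducing
> /var/log/libvirtd-debug.log
```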
For example, in my env the log for the ghost-running LUNs is something like:

```
2018-12-11 09:45:54.673+0000: 22615: debug : virStorageBackendSCSIFindLUs:4125 : Found possible LU '16:0:0:0'
2018-12-11 09:45:54.673+0000: 22615: debug : processLU:4054 : Processing LU 16:0:0:0
2018-12-11 09:45:54.673+0000: 22615: debug : getDeviceType:4025 : Device type is 0
2018-12-11 09:45:54.673+0000: 22615: debug : processLU:4071 : 16:0:0:0 is a Direct-Access LUN
2018-12-11 09:45:54.673+0000: 22615: debug : getNewStyleBlockDevice:3852 : Looking for block device in '/sys/bus/scsi/devices/16:0:0:0/block'
2018-12-11 09:45:54.673+0000: 22615: debug : getNewStyleBlockDevice:3861 : Block device is 'sdd'
2018-12-11 09:45:54.673+0000: 22615: debug : virStorageBackendSCSINewLun:3788 : Trying to create volume for '/dev/sdd'
2018-12-11 09:45:55.205+0000: 22615: warning : virStorageBackendDetectBlockVolFormatFD:1463 : ignoring failed saferead of file '/dev/disk/by-path/pci-0000:95:00.0-vport-0x2101001b32a90000-fc-0x203500a0b85b0acc-lun-0'
2018-12-11 09:45:55.205+0000: 22615: debug : virFileClose:111 : Closed fd 24
2018-12-11 09:45:55.205+0000: 22615: debug : processLU:4082 : Failed to create new storage volume for 16:0:0:0
```

Adam Xu (comment 5):

Mine is something like this:

```
2018-12-12 02:34:46.205+0000: 17898: debug : virFileClose:111 : Closed fd 22
2018-12-12 02:34:46.205+0000: 17898: debug : virFileClose:111 : Closed fd 25
2018-12-12 02:34:46.205+0000: 17898: debug : virStorageBackendSCSIFindLUs:4125 : Found possible LU '10:0:0:0'
2018-12-12 02:34:46.205+0000: 17898: debug : processLU:4054 : Processing LU 10:0:0:0
2018-12-12 02:34:46.205+0000: 17898: debug : getDeviceType:4025 : Device type is 0
2018-12-12 02:34:46.205+0000: 17898: debug : processLU:4071 : 10:0:0:0 is a Direct-Access LUN
2018-12-12 02:34:46.205+0000: 17898: debug : getNewStyleBlockDevice:3852 : Looking for block device in '/sys/bus/scsi/devices/10:0:0:0/block'
2018-12-12 02:34:46.205+0000: 17898: debug : getNewStyleBlockDevice:3861 : Block device is 'sdb'
2018-12-12 02:34:46.206+0000: 17898: debug : virStorageBackendSCSINewLun:3788 : Trying to create volume for '/dev/sdb'
2018-12-12 02:34:46.207+0000: 17898: debug : virStorageBackendDetectBlockVolFormatFD:1485 : cannot determine the target format for '/dev/disk/by-path/pci-0000:04:00.1-vport-0x5001a4a000000007-fc-0x20010002ac01e23d-lun-0'
2018-12-12 02:34:46.207+0000: 17898: debug : virFileClose:111 : Closed fd 22
2018-12-12 02:34:46.207+0000: 17898: debug : virCommandRunAsync:2476 : About to run /lib/udev/scsi_id --replace-whitespace --whitelisted --device /dev/disk/by-path/pci-0000:04:00.1-vport-0x5001a4a000000007-fc-0x20010002ac01e23d-lun-0
2018-12-12 02:34:46.208+0000: 17898: debug : virFileClose:111 : Closed fd 22
2018-12-12 02:34:46.208+0000: 17898: debug : virFileClose:111 : Closed fd 25
2018-12-12 02:34:46.208+0000: 17898: debug : virFileClose:111 : Closed fd 27
2018-12-12 02:34:46.208+0000: 17898: debug : virCommandRunAsync:2479 : Command result 0, with PID 18044
2018-12-12 02:34:46.227+0000: 17898: debug : virCommandRun:2327 : Result status 0, stdout: '360002ac000000000000005860001e23d
' stderr: '2018-12-12 02:34:46.216+0000: 18044: debug : virFileClose:111 : Closed fd 25
2018-12-12 02:34:46.216+0000: 18044: debug : virFileClose:111 : Closed fd 27
2018-12-12 02:34:46.216+0000: 18044: debug : virFileClose:111 : Closed fd 22
```

The full log can be got from MS OneDrive: https://1drv.ms/u/s!AgdSC5Ad4nVqx1oDlZljFFOP2pwa

Adam Xu:

I forgot to add something. In the above example I created one pool, and when I run "lsblk" there are four extra block devices:

```
sdb               8:16   0    60G  0 disk
sdc               8:32   0    60G  0 disk
sdd               8:48   0    60G  0 disk
sde               8:64   0    60G  0 disk
```

yisun:

(In reply to Adam Xu from comment #5)
> [debug log from comment 5 quoted in full]
This log suggests the vol was created. BTW, I cannot open the URL of the log. @John, did we change the logic of the SCSI pool to remove duplicated multipath devices? Thanks.

Adam Xu (comment 8):

I put it in Google Drive: https://drive.google.com/open?id=1xAnnzjYUUGOessg7oPnzWnG92LObdRWP
Hope it works.

Since there are fewer units in the pool with libvirt 4.5, will the performance of the multipath device be lower than before? Taking my VM for example, there were 8 sdX devices before and there are only 2 sdX devices left now.

yisun:

(In reply to Adam Xu from comment #8)

Thanks a lot, the log can be accessed.

Hi John, I saw the log has an error at:
```
2018-12-12 02:34:46.227+0000: 17898: debug : virStorageBackendSCSIFindLUs:4125 : Found possible LU '10:0:1:0'
2018-12-12 02:34:46.227+0000: 17898: debug : processLU:4054 : Processing LU 10:0:1:0
2018-12-12 02:34:46.227+0000: 17898: debug : getDeviceType:4025 : Device type is 0
2018-12-12 02:34:46.227+0000: 17898: debug : processLU:4071 : 10:0:1:0 is a Direct-Access LUN
2018-12-12 02:34:46.227+0000: 17898: debug : getNewStyleBlockDevice:3852 : Looking for block device in '/sys/bus/scsi/devices/10:0:1:0/block'
2018-12-12 02:34:46.227+0000: 17898: debug : getNewStyleBlockDevice:3861 : Block device is 'sdc'
2018-12-12 02:34:46.227+0000: 17898: debug : virStorageBackendSCSINewLun:3788 : Trying to create volume for '/dev/sdc'
2018-12-12 02:34:46.228+0000: 17898: debug : virStorageBackendDetectBlockVolFormatFD:1485 : cannot determine the target format for '/dev/disk/by-path/pci-0000:04:00.1-vport-0x5001a4a000000007-fc-0x21010002ac01e23d-lun-0'
2018-12-12 02:34:46.228+0000: 17898: debug : virFileClose:111 : Closed fd 22
2018-12-12 02:34:46.228+0000: 17898: debug : virCommandRunAsync:2476 : About to run /lib/udev/scsi_id --replace-whitespace --whitelisted --device /dev/disk/by-path/pci-0000:04:00.1-vport-0x5001a4a000000007-fc-0x21010002ac01e23d-lun-0
2018-12-12 02:34:46.229+0000: 17898: debug : virFileClose:111 : Closed fd 22
2018-12-12 02:34:46.229+0000: 17898: debug : virFileClose:111 : Closed fd 25
2018-12-12 02:34:46.229+0000: 17898: debug : virFileClose:111 : Closed fd 27
2018-12-12 02:34:46.229+0000: 17898: debug : virCommandRunAsync:2479 : Command result 0, with PID 18045
2018-12-12 02:34:46.240+0000: 17898: debug : virCommandRun:2327 : Result status 0, stdout: '360002ac000000000000005860001e23d
' stderr: '2018-12-12 02:34:46.237+0000: 18045: debug : virFileClose:111 : Closed fd 25
2018-12-12 02:34:46.237+0000: 18045: debug : virFileClose:111 : Closed fd 27
2018-12-12 02:34:46.237+0000: 18045: debug : virFileClose:111 : Closed fd 22
'
2018-12-12 02:34:46.240+0000: 17898: debug : virFileClose:111 : Closed fd 26
2018-12-12 02:34:46.241+0000: 17898: debug : virFileClose:111 : Closed fd 23
2018-12-12 02:34:46.241+0000: 17898: info : virObjectNew:248 : OBJECT_NEW: obj=0x7f2b941e1030 classname=virStorageVolObj
2018-12-12 02:34:46.241+0000: 17898: error : virHashAddOrUpdateEntry:341 : internal error: Duplicate key
2018-12-12 02:34:46.241+0000: 17898: info : virObjectUnref:344 : OBJECT_UNREF: obj=0x7f2b941e1030
2018-12-12 02:34:46.241+0000: 17898: info : virObjectUnref:346 : OBJECT_DISPOSE: obj=0x7f2b941e1030
2018-12-12 02:34:46.241+0000: 17898: debug : processLU:4087 : Created new storage volume for 10:0:1:0 successfully
```

In commit be1bb6c9 you added virHashAddEntry(volumes->objsKey, voldef->key, volobj) in virStoragePoolObjAddVol(). Since Adam's LUNs in comment 0 under /dev/disk/by-path/ actually share the same backend device (they are just multipath paths), "/lib/udev/scsi_id --replace-whitespace --whitelisted --device <each of these devices>" returns the same SCSI id, which is assigned to vol->key. That causes a "Duplicate key" error when the second/third/... vol is created. So I assume this was an intentional change by you; if so, this could be a NOTABUG. But one thing confusing me: when the "Duplicate key" error is reported, all the functions return -1, so it seems the code should never reach "processLU:4087 : Created new storage volume for 10:0:1:0 successfully". Maybe something is wrong? Thanks.
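The collision can be shown directly by running the same scsi_id invocation libvirt runs against two of the by-path entries; the paths and the serial below are taken from the debug logs above:

```
# two different target ports, same backend 3PAR LUN
/lib/udev/scsi_id --replace-whitespace --whitelisted \
  --device /dev/disk/by-path/pci-0000:04:00.1-vport-0x5001a4a000000007-fc-0x20010002ac01e23d-lun-0
/lib/udev/scsi_id --replace-whitespace --whitelisted \
  --device /dev/disk/by-path/pci-0000:04:00.1-vport-0x5001a4a000000007-fc-0x21010002ac01e23d-lun-0
# both print 360002ac000000000000005860001e23d, so vol->key collides
```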
John Ferlan:

First off, let me go back to the problem statement. Not that this is the problem here, but there is perhaps a "slight misconception" over the steps to create vHBAs. In particular, the described steps:

> # virsh nodedev-create vhba_host1.xml
>
> 3. show wwnn and wwpn id of the vhba
> # virsh nodedev-dumpxml scsi_host13
> ...
> <wwnn>5001a4a695e4124d</wwnn>
> <wwpn>5001a4ad261be7ed</wwpn>
> ...
>
> 4. create storage pool:
> # cat vhbapool_host1.xml
> ...
> <source>
>   <adapter type='fc_host' wwnn='5001a4a695e4124d' wwpn='5001a4ad261be7ed'/>
> </source>
> ...
>
> # virsh pool-define vhbapool_host1.xml
>
> it will succeed in rhel 7.5. but failed in rhel 7.6. errors like "the wwnn and wwpn id have been used by a hba device"
> I have to reboot the kvm server and run that command again, this time, it succeed.

The description on https://wiki.libvirt.org/page/NPIV_in_libvirt indicates creating a vHBA is "either directly using the node device driver or indirectly via a libvirt storage pool". What you attempted to do was create the "same" vHBA when creating the storage pool. The "reboot" of the KVM server would have been unnecessary if you had done a 'virsh nodedev-destroy scsi_hostXX', where XX is replaced by the scsi_hostXX created by the nodedev-create command (in your case scsi_host13). This method is an example of a dynamic or transient vHBA. The "reason" for the storage pool definition is to remove that transience and create a more persistent definition through a storage pool (and storage pools can be transient too, if using virsh pool-create instead of virsh pool-define). So the reason your reboot worked was that the transient scsi_host13 was removed. But I digress; back to the question...

Commit be1bb6c9 was to move volumes from a forward linked list into a hash table for "faster" search capabilities, so it's not an intentional change as it relates to vHBA LUN additions. Volumes are "stored" under 3 different lookup keys (key, name, path), with the goal being to use any of the keys to access the volume. The theory is that each should be unique, but that doesn't seem to be the case here, since your research shows a flaw in at least how vHBA LUN keys are generated using the /lib/udev/scsi_id command. Perhaps the key needs to be constructed differently when the LUN search is being done for a vHBA. I have a couple of ideas (although they get ugly fast). I found that scsi_id also has an --export option, which for a vHBA and vHBA LUNs will print an "ID_TARGET_PORT=#" string. That value could be folded into the serial/key string. Let's see what I can put together.
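A rough sketch of that idea; the ID_TARGET_PORT line is per John's description, and the concrete values shown are illustrative rather than captured output:

```
# --export prints KEY=value pairs instead of a bare serial
/lib/udev/scsi_id --export --whitelisted \
  --device /dev/disk/by-path/pci-0000:04:00.1-vport-0x5001a4a000000007-fc-0x20010002ac01e23d-lun-0
# ID_SERIAL=360002ac000000000000005860001e23d
# ID_TARGET_PORT=1        <== differs for each target port to the same LUN
# combining the serial with the target port would give each path a unique volume key
```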
Adam Xu (comment 11):

Hi John, thank you for your reply. In fact, the "slight misconception" of creating a vHBA comes from here:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Virtualization_Deployment_and_Administration_Guide/sect-NPIV_storage.html

In chapters 13.7.1 and 13.7.2, creating the storage pool uses the same wwnn and wwpn that come from the 13.7.1 step, so I thought these two steps were meant to be related. In fact, we can define the storage pool directly after we generate the wwnn and wwpn, right?

In RHEL 7.5 and earlier, this tutorial gives no error; only in RHEL 7.6 does it report an error like "the wwnn and wwpn id have been used by a hba device". Now I know that I just need to delete the node device before I create the storage pool. Thank you and yisun.

Last question: the VM has the multipath package installed; it has 2 LUN devices now while it had 8 LUN devices in the past. Will the performance in RHEL 7.6 be lower than RHEL 7.5, in theory?

John Ferlan:

(In reply to Adam Xu from comment #11)
> In chapters 13.7.1 and 13.7.2, creating the storage pool uses the same wwnn
> and wwpn that come from the 13.7.1 step, so I thought these two steps were
> meant to be related.

The RHEL docs were created from the wiki I listed above, and I suppose I can see how they can be confusing.

> In RHEL 7.5 and earlier, this tutorial gives no error; only in RHEL 7.6 does
> it report an error like "the wwnn and wwpn id have been used by a hba device"

Strange, I don't recall anything changing that would cause that, but I'll look. It's a separate issue; let me focus on the missing/unreported LUNs first. I can reproduce what was seen, but my test environment is "flaky" right now. I'm hoping to post something upstream shortly.

> Will the performance in RHEL 7.6 be lower than RHEL 7.5, in theory?

The performance has nothing to do with the LUNs themselves; it's more a number-of-LUNs thing. If you had 100 or 1000 LUNs, it would potentially take compute time to walk that list in order to find any particular LUN, whereas with a hash table the lookup is much faster, since there are at most, I think, 6-10 LUNs in any one hash bucket. I haven't done any real characterization; it's mostly an algorithmic observation.

John Ferlan:

Posted a patch upstream for this:
https://www.redhat.com/archives/libvir-list/2018-December/msg00562.html

A second round of patches was posted:
https://www.redhat.com/archives/libvir-list/2019-January/msg00657.html

With a couple of review mods, these are now pushed upstream:

```
commit 850cfd75beb7872b20439eccda0bcf7b68cab525
Author: John Ferlan <jferlan>
Date:   Fri Jan 18 08:33:10 2019 -0500

    storage: Fetch a unique key for vHBA/NPIV LUNs

    ...

    Commit be1bb6c95 changed the way volumes were stored from a forward
    linked list to a hash table. In doing so, it required that each vol
    object would have 3 unique values as keys into tables - key, name,
    and path. Due to how vHBA/NPIV LUNs are created/used this resulted
    in a failure to utilize all the LUN's found during processing.

    During virStorageBackendSCSINewLun processing fetch the key (or
    serial value) for NPIV LUN's using virStorageFileGetNPIVKey which
    will formulate a more unique key based on the serial value and the
    port for the LUN.

    Signed-off-by: John Ferlan <jferlan>
    ACKed-by: Michal Privoznik <mprivozn>
    Reviewed-by: Ján Tomko <jtomko>

$ git describe 850cfd75beb7872b20439eccda0bcf7b68cab525
v5.0.0-203-g850cfd75be
$
```
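On a fixed build, volumes that share a backend LUN should therefore report distinct keys. A hypothetical spot check against the verification pool below (vol names taken from step 5; the exact key format is internal to virStorageFileGetNPIVKey, so only the fact that the two keys differ matters):

```
# keys for two paths to the same LUN should now differ
virsh vol-key --pool vp unit:0:0:0
virsh vol-key --pool vp unit:0:1:0
```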
yisun:

Tested with libvirt-4.5.0-12.virtcov.el7.x86_64; result: PASSED.

Test steps:

1. Have a pool XML as follows:

```
[root@dell-per730-58 ~]# cat pool
<pool type='scsi'>
  <name>vp</name>
  <source>
    <adapter type='fc_host' wwnn='20000000c99e2b80' wwpn='1000000000000002' parent='scsi_host11'/>
  </source>
  <target>
    <path>/dev/disk/by-path</path>
    <permissions>
      <mode>0700</mode>
      <owner>0</owner>
      <group>0</group>
    </permissions>
  </target>
</pool>
```

2. Start the pool:

```
[root@dell-per730-58 ~]# virsh pool-define pool
Pool vp defined from pool

[root@dell-per730-58 ~]# virsh pool-start vp
Pool vp started
```

3. Check the newly connected LUNs:

```
[root@dell-per730-58 ~]# lsscsi
...
[120:0:0:0]  disk  IBM  2145  0000  /dev/sdd
[120:0:0:1]  disk  IBM  2145  0000  /dev/sde
[120:0:1:0]  disk  IBM  2145  0000  /dev/sdf
[120:0:1:1]  disk  IBM  2145  0000  /dev/sdg
```

4. Check their multipath devices; sdg & sde and sdd & sdf have the same backends:

```
[root@dell-per730-58 ~]# multipath -ll
mpathe (360050763008084e6e000000000000062) dm-6 IBM ,2145
size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| `- 120:0:1:1 sdg 8:96 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  `- 120:0:0:1 sde 8:64 active ready running
mpathd (360050763008084e6e000000000000066) dm-5 IBM ,2145
size=15G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| `- 120:0:1:0 sdf 8:80 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  `- 120:0:0:0 sdd 8:48 active ready running
```

5. vol-list the pool; all devices should be listed even if they have the same backends:

```
[root@dell-per730-58 ~]# virsh vol-list vp
Name                 Path
------------------------------------------------------------------------------
unit:0:0:0           /dev/disk/by-path/pci-0000:06:00.0-vport-0x1000000000000002-fc-0x50050768030939b6-lun-0
unit:0:0:1           /dev/disk/by-path/pci-0000:06:00.0-vport-0x1000000000000002-fc-0x50050768030939b6-lun-1
unit:0:1:0           /dev/disk/by-path/pci-0000:06:00.0-vport-0x1000000000000002-fc-0x50050768030939b7-lun-0
unit:0:1:1           /dev/disk/by-path/pci-0000:06:00.0-vport-0x1000000000000002-fc-0x50050768030939b7-lun-1
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:2294