Bug 1476227 - libvirt hangs after failed to create a vHBA (npiv vport)
Status: CLOSED WORKSFORME
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.4
Hardware: x86_64 Linux
Priority: medium Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: John Ferlan
QA Contact: yisun
Docs Contact:
Depends On:
Blocks:
Reported: 2017-07-28 06:58 EDT by yisun
Modified: 2017-12-04 09:40 EST

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-12-04 09:40:56 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Description yisun 2017-07-28 06:58:42 EDT
Description:
libvirt hangs after a failed attempt to create a vHBA (NPIV vport)

How reproduced:
100%

Versions:
kernel-3.10.0-693.el7.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.3.x86_64
libvirt-3.2.0-14.el7_4.2.x86_64


Steps:
1. Have an online HBA
## virsh nodedev-dumpxml scsi_host7
<device>
  <name>scsi_host7</name>
  <path>/sys/devices/pci0000:00/0000:00:01.0/0000:20:00.1/host7</path>
  <parent>pci_0000_20_00_1</parent>
  <capability type='scsi_host'>
    <host>7</host>
    <unique_id>1</unique_id>
    <capability type='fc_host'>
      <wwnn>20000000c99e2b81</wwnn>
      <wwpn>10000000c99e2b81</wwpn>
      <fabric_wwn>2001547feeb71cc1</fabric_wwn>
    </capability>
    <capability type='vport_ops'>
      <max_vports>255</max_vports>
      <vports>0</vports>
    </capability>
  </capability>
</device>


2. Prepare an XML file for vHBA creation, with the HBA above as parent
## cat nodedev.xml
<device>
    <capability type="scsi_host">
        <capability type="fc_host">
            <wwnn>20000000c99e2b80</wwnn>
            <wwpn>1000000000000001</wwpn>
        </capability>
    </capability>
    <parent>scsi_host7</parent>
</device>

3. Try to create the vHBA (this fails in my environment; the reason is given in the *Additional info* part)
## virsh nodedev-create nodedev.xml
error: Disconnected from qemu:///system due to keepalive timeout
error: Failed to create node device from nodedev.xml
error: internal error: connection closed due to keepalive timeout

4. Now libvirt hangs; a "virsh list" just hangs there until I press Ctrl-C
## time virsh list
^C
real    4m26.726s
user    0m0.006s
sys    0m0.003s
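
The hang in step 4 can be detected without waiting minutes for a manual Ctrl-C by running the command under coreutils `timeout`. This is only a minimal sketch: the helper name `probe_with_deadline` and the 30-second limit in the usage line are assumptions, not something from the original reproduction.

```shell
# Hypothetical helper: run a command under a deadline and say whether it hung.
# Usage: probe_with_deadline SECONDS COMMAND [ARGS...]
probe_with_deadline() {
    limit="$1"
    shift
    timeout "$limit" "$@"
    rc=$?
    # GNU timeout exits with status 124 when it had to kill the command.
    if [ "$rc" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$rc"
}
```

For example, `probe_with_deadline 30 virsh list` returns 124 if libvirtd does not answer within 30 seconds, and otherwise returns the exit status of virsh itself.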


Expected result:
libvirt should not hang even if vHBA creation fails

Actual result:
libvirt hangs



Additional info:
1. The vHBA cannot be created by writing directly to the kernel sysfs interface either, as follows:
## echo "1000000000000001:20000000c99e2b80" > /sys/class/fc_host/host7/vport_create
-bash: echo: write error: Interrupted system call

2. The messages log shows the following errors:
## cat /var/log/messages
Jul 28 17:50:15 bootp-73-75-161 kernel: scsi host10: Emulex LPe12002-M8 8Gb 2-port PCIe Fibre Channel Adapter on PCI bus 20 device 01 irq 17 port 1
Jul 28 17:50:15 bootp-73-75-161 kernel: lpfc 0000:20:00.1: 1:(1):2528 Mailbox command x8d cannot issue Data: x0 x2
Jul 28 17:50:15 bootp-73-75-161 kernel: lpfc 0000:20:00.1: 1:(1):1818 VPort failed init, mbxCmd x8d READ_SPARM mbxStatus x0, rc = xff
Jul 28 17:50:15 bootp-73-75-161 kernel: lpfc 0000:20:00.1: 1:(1):1813 Create VPORT failed. Cannot get sparam
Jul 28 17:50:15 bootp-73-75-161 kernel: FC Virtual Port LLDD Create failed

3. Our HBA card is from Emulex Corporation
## lspci
...
20:00.0 Fibre Channel: Emulex Corporation Saturn-X: LightPulse Fibre Channel Host Adapter (rev 03)
20:00.1 Fibre Channel: Emulex Corporation Saturn-X: LightPulse Fibre Channel Host Adapter (rev 03)

4. I looked up its manual, and these appear to be driver errors (the message numbers are 1813 and 1818, as in the messages log above):

ftp://ftp.software.ibm.com/systems/support/system_x_cluster/hbanyware-4.1a36a.pdf
...
elx_mes1813 Create VPORT failed. Cannot get sparam.
DESCRIPTION: The port could not be created because it could not be initialized possibly due to
unavailable resources.
DATA: None
SEVERITY: Error
LOG: LOG_VPORT verbose
ACTION: Software driver error. If this problem
persists, report these errors to Technical Support.
...
elx_mes1818 VPort failed init, mbxCmd <mailbox command> READ_SPARM mbxStatus <mailbox status>, rc = <status>
DESCRIPTION: A pending mailbox command issued to initialize port, failed.
DATA: (1) mbxCommand (2) mbxStatus (3) rc
SEVERITY: Error
LOG: LOG_VPORT verbose
ACTION: Software driver error. If this problem
persists, report these errors to Technical Support.
...
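
To pull only the lpfc lines with those message numbers out of /var/log/messages, something like the following could be used. It is a sketch under assumptions: the helper name `lpfc_messages` is made up, and the pattern is based only on the log line layout quoted above, so other lpfc versions may format their messages differently.

```shell
# Hypothetical helper: print lpfc kernel messages carrying the given message IDs.
# Usage: lpfc_messages LOGFILE ID [ID...]
lpfc_messages() {
    logfile="$1"
    shift
    # Join the IDs into an alternation such as "1813|1818".
    ids=$(printf '%s|' "$@")
    ids="${ids%|}"
    # Matches e.g. "kernel: lpfc 0000:20:00.1: 1:(1):1818 VPort failed init"
    grep -E "kernel: lpfc [^ ]+ [^ ]+:(${ids}) " "$logfile"
}
```

For example, `lpfc_messages /var/log/messages 1813 1818` prints the two vport failure lines and skips the 2528 mailbox line.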

The strange thing is that this card worked well with older kernel versions. Not sure whether there is also a kernel issue?
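
To retest the kernel path on its own, without libvirt, the string written to vport_create can be built and sanity-checked first. As item 1 of the additional info shows, the order is "wwpn:wwnn". This is a minimal sketch; the helper name `vport_spec` and the 16-hex-digit check are assumptions added for illustration.

```shell
# Hypothetical helper: build the "wwpn:wwnn" string written to vport_create,
# rejecting any WWN that is not exactly 16 hex digits.
# Usage: vport_spec WWPN WWNN
vport_spec() {
    wwpn="$1"
    wwnn="$2"
    for wwn in "$wwpn" "$wwnn"; do
        if ! printf '%s\n' "$wwn" | grep -Eq '^[0-9A-Fa-f]{16}$'; then
            echo "invalid WWN: $wwn" >&2
            return 1
        fi
    done
    printf '%s:%s\n' "$wwpn" "$wwnn"
}
```

Used as `spec=$(vport_spec 1000000000000001 20000000c99e2b80) && echo "$spec" > /sys/class/fc_host/host7/vport_create`, it performs the same write as the manual echo above, but only after both WWNs pass the format check.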
Comment 3 John Ferlan 2017-08-02 17:07:17 EDT
Well, not being able to use the vport_create command directly suggests to me that something libvirt relies on is behaving badly.

The "hang" you ^C'd out of doesn't help at all unless you attach to the libvirtd daemon in gdb and then provide the results of a 'bt' for all the threads. That'll at least give a shred of possibility of figuring out why libvirtd is very unhappy when the creation of a vport has issues. I'd "assume" it has to do with nodedev driver interaction with udev, but that's purely a guess.
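
One way to collect those backtraces non-interactively is sketched below. The helper name `dump_libvirtd_backtraces` is hypothetical; it assumes gdb is installed and that it is run as root, and the output is only really readable with the libvirt debuginfo packages installed.

```shell
# Hypothetical helper: dump a backtrace of every libvirtd thread.
# Usage: dump_libvirtd_backtraces [PID]   (PID defaults to the running libvirtd)
dump_libvirtd_backtraces() {
    pid="${1:-$(pidof -s libvirtd 2>/dev/null)}"
    if [ -z "$pid" ]; then
        echo "libvirtd is not running" >&2
        return 1
    fi
    # -batch makes gdb detach and exit once the command has run.
    gdb -p "$pid" -batch -ex 'thread apply all bt'
}
```

Running `dump_libvirtd_backtraces > libvirtd-bt.txt 2>&1` while "virsh list" is hung would give a file that can be attached to the bug.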

In any case, the commands w/ the provided wwnn/wwpn worked for me with recent upstream:

# virsh version
Compiled against library: libvirt 3.6.0
Using library: libvirt 3.6.0
Using API: QEMU 3.6.0
Running hypervisor: QEMU 2.6.2


I also checked out a v3.2-maint release, rebuilt, and ran successfully.

# virsh version
Compiled against library: libvirt 3.2.1
Using library: libvirt 3.2.1
Using API: QEMU 3.2.1
Running hypervisor: QEMU 2.6.2

Example run:

# virsh nodedev-create bz1476227.xml
Node device scsi_host36 created from bz1476227.xml

# virsh nodedev-dumpxml scsi_host36
<device>
  <name>scsi_host36</name>
  <path>/sys/devices/pci0000:00/0000:00:04.0/0000:10:00.1/host4/vport-4:0-1/host36</path>
  <parent>scsi_host4</parent>
  <capability type='scsi_host'>
    <host>36</host>
    <unique_id>33</unique_id>
    <capability type='fc_host'>
      <wwnn>20000000c99e2b80</wwnn>
      <wwpn>1000000000000001</wwpn>
      <fabric_wwn>2002000573de9681</fabric_wwn>
    </capability>
  </capability>
</device>
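
When repeating a run like this one, the vHBA can be removed again afterwards with "virsh nodedev-destroy". The following sketch pulls the device name out of the nodedev-create success message; the helper name `created_nodedev_name` is hypothetical, and the pattern assumes the English message text shown in the example run.

```shell
# Hypothetical helper: read nodedev-create output on stdin and print the device
# name, e.g. "Node device scsi_host36 created from bz1476227.xml" -> scsi_host36.
# Prints nothing if the success message is not present.
created_nodedev_name() {
    sed -n 's/^Node device \([^ ]*\) created from .*$/\1/p'
}
```

For example: `name=$(virsh nodedev-create bz1476227.xml | created_nodedev_name); [ -n "$name" ] && virsh nodedev-destroy "$name"`.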
Comment 4 John Ferlan 2017-12-04 09:40:56 EST
Closing this as WORKSFORME since I cannot reproduce it, and from the problem report this seems to be a kernel driver problem rather than a libvirt problem.
