Bug 1692199 - OvS fails to attach to vfio-pci PCI device on boot
Summary: OvS fails to attach to vfio-pci PCI device on boot
Keywords:
Status: CLOSED DUPLICATE of bug 1683817
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openvswitch
Version: 10.0 (Newton)
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Aaron Conole
QA Contact: Roee Agiman
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-25 00:54 UTC by Brendan Shephard
Modified: 2022-06-03 05:45 UTC (History)
7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-05 17:04:21 UTC
Target Upstream Version:
Embargoed:


Attachments
lspci output before restarting OvS (4.48 KB, text/plain)
2019-03-25 00:57 UTC, Brendan Shephard
lspci output after restarting OvS (4.53 KB, text/plain)
2019-03-25 00:58 UTC, Brendan Shephard


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-15522 0 None None None 2022-06-03 05:45:40 UTC

Description Brendan Shephard 2019-03-25 00:54:47 UTC
Description of problem:
After upgrading from

Version-Release number of selected component (if applicable):
openvswitch-2.9.0-83.el7fdp.1.x86_64
kernel-3.10.0-957.5.1.el7.x86_64

How reproducible:
Inconsistent. A large number of nodes have been upgraded; the issue occurs without fail every time on a handful of them, but not at all on the others.

Steps to Reproduce:
1. Perform overcloud upgrade
2. Reboot compute dpdk nodes

Actual results:
Overview of events attached:

Summary: 
ComputeDPDK nodes are booted and the PCI cards are initialised. lsmod shows no devices using vfio_pci (use count 0):

vfio_pci               41268  0
vfio_iommu_type1       22300  0
vfio                   32656  2 vfio_iommu_type1,vfio_pci
irqbypass              13503  2 kvm,vfio_pci

The lspci output shows that the card is using vfio-pci:
        Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration-, Interrupt Message Number: 000
                IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
                IOVSta: Migration-
                Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function Dependency Link: 01
                VF offset: 128, stride: 2, Device ID: 10ed
                Supported Page Size: 00000553, System Page Size: 00000001
                Region 0: Memory at 00000000c9200000 (64-bit, prefetchable)
                Region 3: Memory at 00000000c9100000 (64-bit, prefetchable)
                VF Migration: offset: 00000000, BIR: 0
        Kernel driver in use: vfio-pci		<--- lspci reports vfio-pci as the driver in use
        Kernel modules: ixgbe

DPDK fails to attach to this NIC:
Mar 21 01:10:17 RETRACTED_HOSTNAME network: Bringing up interface bond-link1:  ovs-vsctl: Error detected while setting up 'dpdk0': Error attaching device '0000:84:00.0' to DPDK.  See ovs-vswitchd log for details.
Mar 21 01:10:17 RETRACTED_HOSTNAME network: ovs-vsctl: Error detected while setting up 'dpdk1': Error attaching device '0000:84:00.1' to DPDK.  See ovs-vswitchd log for details.



Expected results:
An openvswitch restart shouldn't be required to make the DPDK NICs work.

Additional info:
Summary:

The issue here almost certainly is that the PCI card is still using the ixgbe driver rather than vfio-pci. The lspci output reports the card as using vfio-pci, but lsmod shows a use count of 0 for vfio_pci:


vfio_pci               41268  0
vfio_iommu_type1       22300  0
vfio                   32656  2 vfio_iommu_type1,vfio_pci
irqbypass              13503  2 kvm,vfio_pci


84:00.1 Ethernet controller [0200]: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection [8086:10f8] (rev 01)
        Physical Slot: 0-5

...

        Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration-, Interrupt Message Number: 000
                IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
                IOVSta: Migration-
                Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function Dependency Link: 01
                VF offset: 128, stride: 2, Device ID: 10ed
                Supported Page Size: 00000553, System Page Size: 00000001
                Region 0: Memory at 00000000c9200000 (64-bit, prefetchable)
                Region 3: Memory at 00000000c9100000 (64-bit, prefetchable)
                VF Migration: offset: 00000000, BIR: 0
        Kernel driver in use: vfio-pci		<--- lspci reports vfio-pci as the driver in use
        Kernel modules: ixgbe

Consequently, DPDK isn't able to use this card. This is clearly the problem, but why does it happen?
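One way to separate the two signals: the lsmod "used by" count for vfio_pci only rises once a process (such as ovs-vswitchd/DPDK) actually opens the VFIO group device, whereas the authoritative record of which driver a device is bound to is the driver symlink under sysfs — which is exactly what lspci's "Kernel driver in use:" line reflects. A minimal sketch for checking the binding directly; the optional second argument is a hypothetical sysfs root used only so the function can be exercised against a fake tree, and defaults to /sys on a real host:

```shell
#!/bin/sh
# Print which kernel driver a PCI device is bound to, straight from sysfs.
# This matches lspci's "Kernel driver in use:" line and is independent of
# the lsmod use count, which only tracks open file descriptors on the
# VFIO group device.
pci_driver() {
    # $1 = PCI address (e.g. 0000:84:00.0)
    # $2 = optional sysfs root for off-box testing; omit on a real system
    dev="${2:-/sys}/bus/pci/devices/$1"
    if [ -e "$dev/driver" ]; then
        basename "$(readlink -f "$dev/driver")"
    else
        echo "none"
    fi
}
```

On the affected nodes, `pci_driver 0000:84:00.0` would be expected to print vfio-pci even while lsmod still shows a use count of 0.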

Once OvS is restarted after the reboot, this changes and lsmod reflects all 4 devices using vfio_pci.

[root@cairo02dc01ovsdpdkcompute13 ~]# systemctl restart neutron-openvswitch-agent openvswitch.service
[root@cairo02dc01ovsdpdkcompute13 ~]# lsmod | grep vfio
vfio_pci               41268  4
vfio_iommu_type1       22300  1
vfio                   32656  11 vfio_iommu_type1,vfio_pci
irqbypass              13503  6 kvm,vfio_pci
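For reference, the rebinding that the restart effectively forces can also be performed explicitly through the kernel's generic driver_override mechanism — essentially what DPDK's dpdk-devbind.py does. A hedged sketch, again with a parameterised sysfs root purely so it can be exercised against a fake tree; on a real host these writes go to /sys and require root:

```shell
#!/bin/sh
# Unbind a PCI device from its current driver (e.g. ixgbe) and re-probe it
# as vfio-pci via driver_override. The optional second argument is a
# hypothetical sysfs root for off-box testing; omit it on a real system.
rebind_to_vfio() {
    addr="$1"
    root="${2:-/sys}"
    dev="$root/bus/pci/devices/$addr"
    # Detach from whatever driver is currently bound, if any.
    if [ -e "$dev/driver/unbind" ]; then
        printf '%s' "$addr" > "$dev/driver/unbind"
    fi
    # Pin the next probe to vfio-pci, then ask the PCI core to re-probe.
    printf '%s' vfio-pci > "$dev/driver_override"
    printf '%s' "$addr" > "$root/bus/pci/drivers_probe"
}
```

E.g. `rebind_to_vfio 0000:84:00.0` followed by re-adding the OvS port, rather than restarting the whole openvswitch service.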


diff between working (<) and non-working (>) pci device:

diff working_pci not_working_pci 
3c3
<         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
---
>         Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
5d4
<         Latency: 0, Cache Line Size: 32 bytes
8,10c7,9
<         Region 0: Memory at c8400000 (64pcilib: sysfs_read_vpd: read failed: Input/output error
< -bit, non-prefetchable) [size=1M]
<         Region 2: I/O ports at 9020 [size=32]
---
>         Region 0: Memory at c8400000 (64-bit, non-prefetchable) [size=1M]
>         Regipcilib: sysfs_read_vpd: read failed: Input/output error
> on 2: I/O ports at 9020 [size=32]
19c18
<         Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
---
>         Capabilities: [70] MSI-X: Enable- Count=64 Masked-
28c27
<                 DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
---
>                 DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
61,62c60,61
<                 Region 0: Memory at 0000000000000000 (64-bit, prefetchable)
<                 Region 3: Memory at 0000000000000000 (64-bit, prefetchable)
---
>                 Region 0: Memory at 00000000c9100000 (64-bit, prefetchable)
>                 Region 3: Memory at 00000000c9200000 (64-bit, prefetchable)

The outputs were taken before and after restarting ovs on the same system. The above output shows the changes.

Comment 2 Brendan Shephard 2019-03-25 00:57:47 UTC
Created attachment 1547548 [details]
lspci output before restarting OvS

Comment 3 Brendan Shephard 2019-03-25 00:58:09 UTC
Created attachment 1547549 [details]
lspci output after restarting OvS

