Description of problem:
After upgrading from

Version-Release number of selected component (if applicable):
openvswitch-2.9.0-83.el7fdp.1.x86_64
kernel-3.10.0-957.5.1.el7.x86_64

How reproducible:
Inconsistent. A large number of nodes have been updated. The issue occurs every time on a handful of them, but never on the others.

Steps to Reproduce:
1. Perform overcloud upgrade
2. Reboot compute dpdk nodes
3.

Actual results:
Overview of events attached:

Summary:
ComputeDPDK nodes are booted, PCI cards initialised.

lsmod shows no devices using vfio-pci:

vfio_pci               41268  0
vfio_iommu_type1       22300  0
vfio                   32656  2 vfio_iommu_type1,vfio_pci
irqbypass              13503  2 kvm,vfio_pci

The lspci output shows that the card is using vfio-pci:

    Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
        IOVCap: Migration-, Interrupt Message Number: 000
        IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
        IOVSta: Migration-
        Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function Dependency Link: 01
        VF offset: 128, stride: 2, Device ID: 10ed
        Supported Page Size: 00000553, System Page Size: 00000001
        Region 0: Memory at 00000000c9200000 (64-bit, prefetchable)
        Region 3: Memory at 00000000c9100000 (64-bit, prefetchable)
        VF Migration: offset: 00000000, BIR: 0
    Kernel driver in use: vfio-pci <--- lspci reports vfio-pci as the driver for this PCI device
    Kernel modules: ixgbe

DPDK fails to attach to this NIC:

Mar 21 01:10:17 RETRACTED_HOSTNAME network: Bringing up interface bond-link1: ovs-vsctl: Error detected while setting up 'dpdk0': Error attaching device '0000:84:00.0' to DPDK. See ovs-vswitchd log for details.
Mar 21 01:10:17 RETRACTED_HOSTNAME network: ovs-vsctl: Error detected while setting up 'dpdk1': Error attaching device '0000:84:00.1' to DPDK. See ovs-vswitchd log for details.

Expected results:
An openvswitch restart shouldn't be required to make the dpdk nics work.
Additional info:

Summary:
The PCI card almost certainly appears to still be using the ixgbe driver rather than vfio-pci. The lspci output reports the card as using vfio-pci, but lsmod shows 0 devices using vfio_pci:

vfio_pci               41268  0
vfio_iommu_type1       22300  0
vfio                   32656  2 vfio_iommu_type1,vfio_pci
irqbypass              13503  2 kvm,vfio_pci

84:00.1 Ethernet controller [0200]: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection [8086:10f8] (rev 01)
    Physical Slot: 0-5
    ...
    Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
        IOVCap: Migration-, Interrupt Message Number: 000
        IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
        IOVSta: Migration-
        Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function Dependency Link: 01
        VF offset: 128, stride: 2, Device ID: 10ed
        Supported Page Size: 00000553, System Page Size: 00000001
        Region 0: Memory at 00000000c9200000 (64-bit, prefetchable)
        Region 3: Memory at 00000000c9100000 (64-bit, prefetchable)
        VF Migration: offset: 00000000, BIR: 0
    Kernel driver in use: vfio-pci <--- lspci reports vfio-pci as the driver for this PCI device
    Kernel modules: ixgbe

As a result, DPDK isn't able to use this card. This is definitely the problem, but why is this the case?

Once OvS is restarted after the reboot, this changes and lsmod shows all 4 devices using vfio_pci.
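For anyone triaging the lsmod/lspci mismatch described above, here is a minimal Python sketch of a cross-check. It reads the `driver` symlink under `/sys/bus/pci/devices/<address>`, the standard sysfs location for a device's bound driver, and separately extracts the use-count column from lsmod text. The function names `bound_driver` and `module_use_count` are illustrative and do not come from any OvS or DPDK tooling. One caveat, stated as an assumption rather than a finding: the third lsmod column is the module's reference count, so it may not correspond one-to-one to the number of bound devices. I have not verified how it behaves for vfio_pci on this kernel.

```python
import os


def bound_driver(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the kernel driver bound to pci_addr, or None.

    sysfs_root is a parameter only so the function can be exercised
    against a fake directory tree; on a real node the default applies.
    """
    link = os.path.join(sysfs_root, pci_addr, "driver")
    try:
        # The 'driver' entry is a symlink whose last path component
        # is the driver name (for example 'vfio-pci' or 'ixgbe').
        return os.path.basename(os.readlink(link))
    except OSError:
        # No symlink: the device is absent or no driver is bound.
        return None


def module_use_count(lsmod_text, module):
    """Return the 'Used by' count for module from lsmod output, or None."""
    for line in lsmod_text.splitlines():
        fields = line.split()
        # Data rows look like: name size count [users]; the header row
        # is skipped because its third field is not numeric.
        if len(fields) >= 3 and fields[0] == module and fields[2].isdigit():
            return int(fields[2])
    return None
```

On an affected node, `bound_driver("0000:84:00.0")` should agree with the "Kernel driver in use" line from lspci. Comparing that value with `module_use_count(...)` before and after the OvS restart would show whether the binding or only the use count changes.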
[root@cairo02dc01ovsdpdkcompute13 ~]# systemctl restart neutron-openvswitch-agent openvswitch.service
[root@cairo02dc01ovsdpdkcompute13 ~]# lsmod | grep vfio
vfio_pci               41268  4
vfio_iommu_type1       22300  1
vfio                   32656  11 vfio_iommu_type1,vfio_pci
irqbypass              13503  6 kvm,vfio_pci

diff between working (<) and non-working (>) pci device:

diff working_pci not_working_pci
3c3
< Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
---
> Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
5d4
< Latency: 0, Cache Line Size: 32 bytes
8,10c7,9
< Region 0: Memory at c8400000 (64pcilib: sysfs_read_vpd: read failed: Input/output error
< -bit, non-prefetchable) [size=1M]
< Region 2: I/O ports at 9020 [size=32]
---
> Region 0: Memory at c8400000 (64-bit, non-prefetchable) [size=1M]
> Regipcilib: sysfs_read_vpd: read failed: Input/output error
> on 2: I/O ports at 9020 [size=32]
19c18
< Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
---
> Capabilities: [70] MSI-X: Enable- Count=64 Masked-
28c27
< DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
---
> DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
61,62c60,61
< Region 0: Memory at 0000000000000000 (64-bit, prefetchable)
< Region 3: Memory at 0000000000000000 (64-bit, prefetchable)
---
> Region 0: Memory at 00000000c9100000 (64-bit, prefetchable)
> Region 3: Memory at 00000000c9200000 (64-bit, prefetchable)

The two outputs were taken on the same system, before and after restarting OvS. The diff above shows what changed.
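To make the flag changes in the diff above easier to spot when comparing many nodes, here is a small Python sketch that parses the `+`/`-` flag lists lspci prints on lines such as `Control:` or `MSI-X:` and reports which flags differ between two captures. The helper names `parse_flags` and `changed_flags` are hypothetical. The sketch handles only single lines whose tokens end in `+` or `-`, and it is a convenience for reading the output, not a diagnosis.

```python
def parse_flags(line):
    """Map flag names to booleans for one lspci line.

    Example input: 'Control: I/O+ Mem+ BusMaster- ...'
    Tokens that do not end in '+' or '-' (such as 'Count=64') are ignored.
    """
    _, _, rest = line.partition(":")
    flags = {}
    for tok in rest.split():
        tok = tok.rstrip(",")  # e.g. 'Migration-,' in the IOVCap line
        if len(tok) > 1 and tok[-1] in "+-":
            flags[tok[:-1]] = tok[-1] == "+"
    return flags


def changed_flags(working_line, failing_line):
    """Return {flag: (working_value, failing_value)} for flags that differ."""
    a = parse_flags(working_line)
    b = parse_flags(failing_line)
    return {k: (a[k], b[k]) for k in a if k in b and a[k] != b[k]}


if __name__ == "__main__":
    # The two Control lines are copied from the diff in this report.
    working = ("Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- "
               "ParErr+ Stepping- SERR+ FastB2B- DisINTx+")
    failing = ("Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- "
               "ParErr+ Stepping- SERR+ FastB2B- DisINTx-")
    for name, (ok, bad) in changed_flags(working, failing).items():
        print(f"{name}: working={ok} non-working={bad}")
```

Applied to the two `Control:` lines from this report, it flags `BusMaster` and `DisINTx` as the only differences, matching what the diff shows by eye.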
Created attachment 1547548
lspci output before restarting OvS

Created attachment 1547549
lspci output after restarting OvS