Description of problem:
After upgrading from 4.2.8-300.fc23.x86_64 to 4.3.3-300.fc23.x86_64, dmesg | egrep -i 'mlx|dmar' shows:

[ 17.816756] mlx4_core 0000:82:00.0: Mapped 1 chunks/256 KB at 120040000 for ICM
[ 17.825330] mlx4_core 0000:8a:00.0: SRIOV, disabling HA mode for intf proto 0
[ 17.825541] <mlx4_ib> mlx4_ib_add: counter index 0 for port 1 allocated 0
[ 17.833869] <mlx4_ib> mlx4_ib_add: counter index 1 for port 2 allocated 0
[ 17.906397] mlx4_core 0000:8a:00.0: Mapped 1 chunks/256 KB at 120040000 for ICM
[ 17.911403] mlx4_core 0000:8a:00.0: mlx4_ib: multi-function enabled
[ 17.925065] mlx4_core 0000:8a:00.0: mlx4_ib: initializing demux service for 128 qp1 clients
[ 17.937459] mlx4_core 0000:8a:00.0: Mapped 1 chunks/256 KB at 128040000 for ICM
[ 17.938766] mlx4_core 0000:8a:00.0: Mapped 1 chunks/256 KB at 1200c0000 for ICM
[ 29.527780] mlx4_core 0000:8a:00.0: Mapped 1 chunks/256 KB at 128080000 for ICM
[ 29.529083] mlx4_core 0000:8a:00.0: Mapped 1 chunks/256 KB at 120140000 for ICM
[ 31.330799] DMAR: DRHD: handling fault status reg 2
[ 31.330803] DMAR: DMAR:[DMA Write] Request device [8a:06.1] fault addr fc26e000 DMAR:[fault reason 02] Present bit in context entry is clear
[ 31.330865] DMAR: DRHD: handling fault status reg 102
[ 31.330868] DMAR: DMAR:[DMA Read] Request device [8a:06.1] fault addr fc632000 DMAR:[fault reason 02] Present bit in context entry is clear
[ 31.530006] DMAR: DRHD: handling fault status reg 202
. . .

I have two IB cards, both on firmware version 2.9.1000:

82:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
8a:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)

The first one has SR-IOV off; the second has SR-IOV on. Each card has only its first port connected, both to the same switch. The link on the second card also only ever shows "Initializing". If I disable SR-IOV by not setting num_vfs, the card works fine.
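Every fault in the excerpt above comes from a single requester, 8a:06.1, which sits on the same bus as the SR-IOV-enabled card at 8a:00.0 and is therefore presumably one of its virtual functions. For longer logs, a minimal Python sketch like the following can tally DMAR faults per requesting device. The helper names dmar_faults and faults_by_device are hypothetical, and the pattern assumes the 4.3-era message format shown above; other kernel versions word these lines differently.

```python
import re
from collections import Counter

# Assumes the 4.3-era format quoted in this report, e.g.
# "DMAR:[DMA Write] Request device [8a:06.1] fault addr fc26e000 DMAR:[fault reason 02] ..."
# \s+ between the address and the reason also tolerates the two parts being on separate lines.
FAULT_RE = re.compile(
    r"\[DMA (?P<op>Read|Write)\] Request device "
    r"\[(?P<dev>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\] "
    r"fault addr (?P<addr>[0-9a-f]+)\s+DMAR:\s*\[fault reason (?P<reason>[0-9]+)\]"
)


def dmar_faults(log: str):
    """Return (op, device, fault address, reason code) for each DMAR fault in the log."""
    return [
        (m["op"], m["dev"], int(m["addr"], 16), m["reason"])
        for m in FAULT_RE.finditer(log)
    ]


def faults_by_device(log: str) -> Counter:
    """Count DMAR faults per requesting PCI device (bus:device.function)."""
    return Counter(dev for _, dev, _, _ in dmar_faults(log))


if __name__ == "__main__":
    # Two fault lines copied from the dmesg excerpt in this report.
    sample = (
        "[ 31.330803] DMAR: DMAR:[DMA Write] Request device [8a:06.1] fault addr fc26e000 "
        "DMAR:[fault reason 02] Present bit in context entry is clear\n"
        "[ 31.330868] DMAR: DMAR:[DMA Read] Request device [8a:06.1] fault addr fc632000 "
        "DMAR:[fault reason 02] Present bit in context entry is clear\n"
    )
    for op, dev, addr, reason in dmar_faults(sample):
        print(f"{op:5} {dev} addr={addr:#x} reason={reason}")
    print(faults_by_device(sample))
```

Piping the full dmesg output into faults_by_device makes it easy to confirm whether the faults are confined to the virtual functions or also hit the physical functions at 82:00.0 and 8a:00.0.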
I've further verified that the problem does not exist on stock kernel.org 4.2.8 but does exist on 4.3-rc1, and it still exists on the latest kernel I tried, 4.5.0-0.rc0.git6.1.vanilla.knurd.1.fc23.x86_64. I would guess the problem was introduced by the iommu or mlx4 commits in 4.3-rc1, but there are quite a few of them. If there is anything else I can do to help debug or bisect, please let me know.

Version-Release number of selected component (if applicable):
See description of IB cards above

How reproducible:
Just boot with 4.3.3-301.fc23

Steps to Reproduce:
Enable SR-IOV on an IB card and set the mlx4_core parameter num_vfs.

Actual results:
An IB card with SR-IOV and virtual functions enabled does not work.

Expected results:
An IB card with SR-IOV and virtual functions enabled works.

Additional info:
This issue is resolved on this particular hardware by upgrading from:

Bios Version: SE5C600.86B.02.02.0002.122320131210
BMC Firmware Version: 1.20.5446
SDR Version: SDR Package 1.12
ME Firmware Version: 2.1.7.328
Platform ID: S2600GZ

to:

Bios Version: SE5C600.86B.02.03.0003.041920141333
BMC Firmware Version: 1.21.6580
SDR Version: SDR Package 1.13
ME Firmware Version: 2.1.7.328
Platform ID: S2600GZ
I was too quick to believe the BIOS update fixed it. Flashing the BIOS reset the VT-d (I/O remapping) setting to off. I need this feature to map the SR-IOV IB virtual functions into the virtual machines. After re-enabling it in the BIOS, the DMAR errors return and the IB virtual functions are inaccessible.
*********** MASS BUG UPDATE ************** We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 23 kernel bugs. Fedora 23 has now been rebased to 4.7.4-100.fc23. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel. If you have moved on to Fedora 24 or 25, and are still experiencing this issue, please change the version to Fedora 24 or 25. If you experience different issues, please open a new bug report for those.
*********** MASS BUG UPDATE ************** This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 4 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.
I still have this problem with an InfiniBand card (Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB DDR / 10GigE] (rev a0)) on Fedora 25 (kernel-4.9.13-201.fc25.x86_64). For now my workaround is to downgrade the firmware on the InfiniBand card to version 2.7.000. With version 2.9.1000 I get the messages from the description above, and the card, or rather the way it interacts with the kernel, breaks everything (the system is unusable this way).
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days