Bug 1301210 - DMAR:[DMA Write] Request device [8a:06.1] fault addr fc26e000 [NEEDINFO]
Status: CLOSED INSUFFICIENT_DATA
Product: Fedora
Classification: Fedora
Component: kernel
Version: 23
Hardware: x86_64 Linux
Priority: unspecified  Severity: high
Assigned To: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Depends On:
Blocks:
Reported: 2016-01-22 16:24 EST by Nate Pearlstein
Modified: 2017-03-14 08:47 EDT (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-10-26 12:45:49 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
labbott: needinfo? (npearl)


Attachments: None
Description Nate Pearlstein 2016-01-22 16:24:04 EST
Description of problem:

After upgrading from 4.2.8-300.fc23.x86_64 to 4.3.3-300.fc23.x86_64,

dmesg | egrep -i 'mlx|dmar'

[   17.816756] mlx4_core 0000:82:00.0: Mapped 1 chunks/256 KB at 120040000 for ICM
[   17.825330] mlx4_core 0000:8a:00.0: SRIOV, disabling HA mode for intf proto 0
[   17.825541] <mlx4_ib> mlx4_ib_add: counter index 0 for port 1 allocated 0
[   17.833869] <mlx4_ib> mlx4_ib_add: counter index 1 for port 2 allocated 0
[   17.906397] mlx4_core 0000:8a:00.0: Mapped 1 chunks/256 KB at 120040000 for ICM
[   17.911403] mlx4_core 0000:8a:00.0: mlx4_ib: multi-function enabled
[   17.925065] mlx4_core 0000:8a:00.0: mlx4_ib: initializing demux service for 128 qp1 clients
[   17.937459] mlx4_core 0000:8a:00.0: Mapped 1 chunks/256 KB at 128040000 for ICM
[   17.938766] mlx4_core 0000:8a:00.0: Mapped 1 chunks/256 KB at 1200c0000 for ICM
[   29.527780] mlx4_core 0000:8a:00.0: Mapped 1 chunks/256 KB at 128080000 for ICM
[   29.529083] mlx4_core 0000:8a:00.0: Mapped 1 chunks/256 KB at 120140000 for ICM
[   31.330799] DMAR: DRHD: handling fault status reg 2
[   31.330803] DMAR: DMAR:[DMA Write] Request device [8a:06.1] fault addr fc26e000
             DMAR:[fault reason 02] Present bit in context entry is clear
[   31.330865] DMAR: DRHD: handling fault status reg 102
[   31.330868] DMAR: DMAR:[DMA Read] Request device [8a:06.1] fault addr fc632000
             DMAR:[fault reason 02] Present bit in context entry is clear
[   31.530006] DMAR: DRHD: handling fault status reg 202
.
.
.
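
For anyone triaging similar logs, here is a minimal sketch (not part of the original report) that pulls the fields out of DMAR fault lines like the ones above. The function name parse_dmar_faults and the dictionary keys are hypothetical, and the regular expressions only assume the message layout shown in this dmesg output. The requester ID [8a:06.1] decodes as bus 0x8a, device 6, function 1, which is on the same bus as the SR-IOV-enabled card at 8a:00.0, so it is presumably one of that card's virtual functions.

```python
import re

# Matches the request line, e.g.
# "DMAR:[DMA Write] Request device [8a:06.1] fault addr fc26e000"
REQUEST_RE = re.compile(
    r"DMAR:\[DMA (?P<access>Read|Write)\] Request device "
    r"\[(?P<bus>[0-9a-f]{2}):(?P<dev>[0-9a-f]{2})\.(?P<fn>[0-7])\] "
    r"fault addr (?P<addr>[0-9a-f]+)"
)
# Matches the continuation line, e.g.
# "DMAR:[fault reason 02] Present bit in context entry is clear"
REASON_RE = re.compile(r"DMAR:\[fault reason (?P<code>\d+)\] (?P<text>.+)")


def parse_dmar_faults(log_text):
    """Return one dict per DMAR fault found in log_text (hypothetical helper)."""
    faults = []
    for line in log_text.splitlines():
        request = REQUEST_RE.search(line)
        if request:
            faults.append({
                "access": request["access"],
                "bus": int(request["bus"], 16),
                "device": int(request["dev"], 16),
                "function": int(request["fn"], 16),
                "addr": int(request["addr"], 16),
                "reason_code": None,
                "reason": None,
            })
            continue
        # Attach the reason line to the most recent fault that lacks one.
        reason = REASON_RE.search(line)
        if reason and faults and faults[-1]["reason"] is None:
            faults[-1]["reason_code"] = int(reason["code"])
            faults[-1]["reason"] = reason["text"].strip()
    return faults
```

Feeding it the output of the dmesg command above gives one entry per fault, which makes it easier to see whether every fault comes from the same requester ID.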

I have two IB cards, both with firmware version 2.9.1000:

82:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
8a:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)

The first one has SR-IOV off; the second has SR-IOV on.

Each card only has the first port connected, in this case to the same switch.

The connection on the second card also just shows "initializing".
If I disable SR-IOV by not setting num_vfs, then the card works fine.

I've further verified that the problem does not exist on stock kernel.org
4.2.8 but does exist on 4.3-rc1, and it still exists on the latest kernel I tried,
4.5.0-0.rc0.git6.1.vanilla.knurd.1.fc23.x86_64.

I would have to guess the problem was introduced by the commits to iommu
or mlx4 in 4.3-rc1 but there are quite a few of them.

If there is anything else I can do to help debug or bisect, please let me know.


Version-Release number of selected component (if applicable):

See description of IB cards above

How reproducible:

Just boot with 4.3.3-301.fc23

Steps to Reproduce:

Enable SR-IOV on an IB card and set the mlx4_core parameter num_vfs.
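
To confirm that the num_vfs setting actually reached the driver before looking for the DMAR faults, a small check like the following may help. It is only a sketch: read_module_param is a hypothetical helper, and it assumes mlx4_core exposes num_vfs read-only under /sys/module/mlx4_core/parameters/, which is the usual sysfs location for module parameters.

```python
from pathlib import Path


def read_module_param(module, param, sysfs_root="/sys/module"):
    """Return a module parameter as a stripped string, or None if unreadable.

    sysfs_root is overridable so the helper can be exercised without the
    real module loaded (an assumption made for testing, not a kernel feature).
    """
    path = Path(sysfs_root) / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    value = read_module_param("mlx4_core", "num_vfs")
    print("mlx4_core num_vfs:", value if value is not None else "not available")
```

A result of None means the module is not loaded or the parameter is not exported on that kernel, in which case the reproduction steps have not taken effect yet.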


Actual results:

IB card with SR-IOV and virtual functions enabled does not work

Expected results:

IB card with SR-IOV and virtual functions enabled works

Additional info:
Comment 1 Nate Pearlstein 2016-01-28 15:35:44 EST
This issue is resolved on this particular hardware by upgrading

from:

Bios Version: SE5C600.86B.02.02.0002.122320131210
BMC Firmware Version: 1.20.5446
SDR Version: SDR Package 1.12
ME Firmware Version: 2.1.7.328
Platform ID: S2600GZ

to:

Bios Version: SE5C600.86B.02.03.0003.041920141333
BMC Firmware Version: 1.21.6580
SDR Version: SDR Package 1.13
ME Firmware Version: 2.1.7.328
Platform ID: S2600GZ
Comment 2 Nate Pearlstein 2016-02-01 14:11:52 EST
I was too quick to believe the BIOS update fixed it. The BIOS flash reset the VT I/O remapping setting to off. I need this feature to map the SR-IOV IB virtual functions into the virtual machines. Upon re-enabling it in the BIOS, the DMAR errors resume and there is no access to the IB virtual functions.
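
Since a BIOS flash can silently turn remapping off, it may be worth checking whether the kernel sees an active IOMMU before concluding that a firmware change fixed anything. The sketch below counts the entries under /sys/kernel/iommu_groups; count_iommu_groups is a hypothetical helper, and treating an empty or missing directory as "remapping not active" is an assumption that holds on typical setups but is not guaranteed.

```python
import os


def count_iommu_groups(groups_dir="/sys/kernel/iommu_groups"):
    """Return the number of IOMMU group directories, or 0 if none are visible."""
    try:
        return sum(1 for entry in os.scandir(groups_dir) if entry.is_dir())
    except OSError:
        # Directory missing or unreadable: report no groups.
        return 0


if __name__ == "__main__":
    groups = count_iommu_groups()
    if groups:
        print(f"IOMMU appears active: {groups} group(s) found")
    else:
        print("No IOMMU groups found; remapping is probably disabled")
```

If the count is zero, the absence of DMAR errors says nothing about whether the underlying problem is gone.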
Comment 3 Laura Abbott 2016-09-23 15:28:23 EDT
*********** MASS BUG UPDATE **************
 
We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 23 kernel bugs.
 
Fedora 23 has now been rebased to 4.7.4-100.fc23.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
 
If you have moved on to Fedora 24 or 25, and are still experiencing this issue, please change the version to Fedora 24 or 25.
 
If you experience different issues, please open a new bug report for those.
Comment 4 Laura Abbott 2016-10-26 12:45:49 EDT
*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 4 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.
Comment 5 JM 2017-03-14 08:47:46 EDT
I still have this problem with an InfiniBand card

InfiniBand: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB DDR / 10GigE] (rev a0)

and Fedora 25 (kernel-4.9.13-201.fc25.x86_64).

For now my workaround is to downgrade the firmware on the InfiniBand card to version 2.7.000. With version 2.9.1000 I get the messages from the description above, and the card (or rather the way it works with the kernel) breaks everything; I can't use the system this way.
