Description of problem:

Upgraded firmware on Mellanox InfiniBand ConnectX hardware and got the following spam in the system log. The Mellanox device no longer works in f13/f14, but does work with iommu=off and RHEL 6.0.

[ 1680.962538] DRHD: handling fault status reg 302
[ 1680.967048] DMAR:[DMA Read] Request device [04:00.6] fault addr f647a000
[ 1680.967049] DMAR:[fault reason 02] Present bit in context entry is clear

This also broke InfiniBand on f13/f14 machines. I used the iommu workaround described in Bug #605888, and InfiniBand works, but I'm not sure what the performance impact is.

RHEL 6.0 doesn't exhibit this problem with the updated firmware:

Linux mrg-03.mpc.lab.eng.bos.redhat.com 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue Mx

Is this a kernel.org regression?

lspci results:

[root@mrg-04 ~]# lspci
00:00.0 Host bridge: Intel Corporation 5520 I/O Hub to ESI Port (rev 13)
00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Po)
00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Po)
00:04.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 4 )
00:05.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 5 )
00:06.0 PCI bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 6 )
00:07.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Po)
00:09.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Po)
00:14.0 PIC: Intel Corporation 5520/5500/X58 I/O Hub System Management Register)
00:14.1 PIC: Intel Corporation 5520/5500/X58 I/O Hub GPIO and Scratch Pad Regis)
00:14.2 PIC: Intel Corporation 5520/5500/X58 I/O Hub Control Status and RAS Reg)
00:1a.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Control)
00:1a.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Control)
00:1a.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Contro)
00:1d.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Control)
00:1d.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Control)
00:1d.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Contro)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 92)
00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller ()
00:1f.2 IDE interface: Intel Corporation 82801IB (ICH9) 2 port SATA IDE Control)
01:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit )
01:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit )
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit )
02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit )
03:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 0)
04:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s )
08:03.0 VGA compatible controller: Matrox Graphics, Inc. MGA G200eW WPCM450 (re)
fe:00.0 Host bridge: Intel Corporation Xeon 5500/Core i7 QuickPath Architecture)
fe:00.1 Host bridge: Intel Corporation Xeon 5500/Core i7 QuickPath Architecture)
fe:02.0 Host bridge: Intel Corporation Xeon 5500/Core i7 QPI Link 0 (rev 05)
fe:02.1 Host bridge: Intel Corporation Xeon 5500/Core i7 QPI Physical 0 (rev 05)
fe:02.4 Host bridge: Intel Corporation Xeon 5500/Core i7 QPI Link 1 (rev 05)
fe:02.5 Host bridge: Intel Corporation Xeon 5500/Core i7 QPI Physical 1 (rev 05)
fe:03.0 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
fe:03.1 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
fe:03.2 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
fe:03.4 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
fe:04.0 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
fe:04.1 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
fe:04.2 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
fe:04.3 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
fe:05.0 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
fe:05.1 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
fe:05.2 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
fe:05.3 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
fe:06.0 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
fe:06.1 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
fe:06.2 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
fe:06.3 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
ff:00.0 Host bridge: Intel Corporation Xeon 5500/Core i7 QuickPath Architecture)
ff:00.1 Host bridge: Intel Corporation Xeon 5500/Core i7 QuickPath Architecture)
ff:02.0 Host bridge: Intel Corporation Xeon 5500/Core i7 QPI Link 0 (rev 05)
ff:02.1 Host bridge: Intel Corporation Xeon 5500/Core i7 QPI Physical 0 (rev 05)
ff:02.4 Host bridge: Intel Corporation Xeon 5500/Core i7 QPI Link 1 (rev 05)
ff:02.5 Host bridge: Intel Corporation Xeon 5500/Core i7 QPI Physical 1 (rev 05)
ff:03.0 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
ff:03.1 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
ff:03.2 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
ff:03.4 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
ff:04.0 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
ff:04.1 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
ff:04.2 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
ff:04.3 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
ff:05.0 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
ff:05.1 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
ff:05.2 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
ff:05.3 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
ff:06.0 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
ff:06.1 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
ff:06.2 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
ff:06.3 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Cont)
[root@mrg-04 ~]#

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
This is either A) an upstream regression (I don't put any InfiniBand patches into the Fedora kernels; whatever is upstream is what Fedora gets), or B) buggy Mellanox firmware. And if it's the Mellanox SRIOV-enabled firmware, then I would almost expect it's their firmware, or at least a firmware/BIOS interaction issue.
Problem still exists in f15 with the latest firmware from Mellanox:

Linux mrg-02.mpc.lab.eng.bos.redhat.com 2.6.40-4.fc15.x86_64 #1 SMP Fri Jul 29 18:46:53 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

[root@mrg-02 ~]# flint -d /dev/mst/*cr0 q full
Image type:      ConnectX
FW Version:      2.9.1000
Device ID:       26428
Description:     Node             Port1            Port2            Sys image
GUIDs:           0002c903000839f0 0002c903000839f1 0002c903000839f2 0002c903000839f3
MACs:                             000000000000     000000000001
Board ID:        (MT_0BB0120003)
VSD:
PSID:            MT_0BB0120003

Doug, can you do anything to kick Mellanox into action?

Thanks
-steve
I'll see what I can do. Vlad?
This is basically with a pristine upstream kernel. We renamed 3.0 to 2.6.40 in order to avoid problems with user space packages that break with the 3.0 version string.
BTW, my Mellanox cards work fine with the latest kernel. I suspect your issue might not be in the Mellanox driver but instead in the core IOMMU code.
Doug,

Did you try an f13/f14/f15 kernel and see the results reported in this bz? Previously I was not having any trouble with these kernels until I updated the firmware in the Mellanox ConnectX device. Are you using older firmware? Either there is a regression in the firmware, a regression in the kernel, or a regression in their combination. Many other devices are reporting these problems, and the device driver developers seem to think it has to do with buggy firmware in those cases.

Regards
-steve
[root@schwoop ~]# uname -a
Linux schwoop.usersys.redhat.com 2.6.40.3-0.fc15.x86_64 #1 SMP Tue Aug 16 04:10:59 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
[root@schwoop ~]# lspci | grep Mellanox
07:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
[root@schwoop ~]# mstflint -d 0000:07:00.0 q
Image type:      ConnectX
FW Version:      2.9.1000
Device ID:       26428
Chip Revision:   B0
Description:     Node             Port1            Port2            Sys image
GUIDs:           0002c903000ba740 0002c903000ba741 0002c903000ba742 0002c903000ba743
MACs:                             0002c90ba740     0002c90ba741
Board ID:        (MT_0D80120009)
VSD:
PSID:            MT_0D80120009
[root@schwoop ~]#

To answer your questions: yes, it's the latest Fedora kernel; yes, it's a recent Mellanox card; and yes, it's the latest firmware. No DMAR spew. However, I'm on an AMD platform, which uses different IOMMU logic (if any at all; this is a single-CPU workstation, not a server, but it does have 8GB of memory, the BIOS is configured to remap around the 4GB PCI hole, and I'm pretty sure the IOMMU is turned on in the BIOS too).

Now, it could still be buggy firmware, but the issue appears to need an Intel-based, big iron server to manifest. That puts at least a little bit of suspicion on the upstream Intel IOMMU code.
Could be. My server is a Dell R710, for those interested.
Maybe related to your error. We use Mellanox cards with ESXi. In our case an MHGH19-XTC ConnectX (Gen 1) card shows the same behaviour. Our hardware consists of a Core i7 920 on a Supermicro X8ST3 mainboard. From the following snippet you can see that this hardware is quite similar to Steven's:

000:000:00.0 Bridge: Intel Corporation 5520/5500/X58 I/O Hub to ESI Port [PCIe RP[000:000:00.0]]
000:000:01.0 Bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 [PCIe RP[000:000:01.0]]
000:000:03.0 Bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 [PCIe RP[000:000:03.0]]
000:000:05.0 Bridge: Intel Corporation 5520/X58 I/O Hub PCI Express Root Port 5 [PCIe RP[000:000:05.0]]
000:000:07.0 Bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7 [PCIe RP[000:000:07.0]]
000:000:09.0 Bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 9 [PCIe RP[000:000:09.0]]
...

We tested the following firmware releases:

2.7.000  -> IB link active
2.8.000  -> No link; port only gets into armed state
2.9.1000 -> No link; port only gets into armed state

From my point of view and the above explanations, the error must be related to a firmware/BIOS issue.

Best regards
Markus
What are you connecting to the other side of the port? Can you try to connect port 1 to port 2 and see if the link comes up?
This is quite difficult, because we only have a one-port InfiniBand card (MHGH19). What can I do to provide more debug information? The card is connected to a Cisco SFS7000D switch with the newest firmware. We disabled the included subnet manager and activated the newest opensm 3.3 in our infrastructure, but no success at all. The only solution is to go back to Mellanox firmware 2.7.000.
So there is an interoperability problem between the two ports. I suggest that you pass this to customer support.
Eli,

The issue I raised occurred after updating the firmware. The problem does not occur in older versions of Fedora/RHEL. I'm not sure how that is an interoperability problem between ports. I cannot cross-connect ports; the hardware is on the other side of the country in a lab I don't easily have access to.

Regards
-steve
Steve, when you say the problem did not occur in older versions of Fedora/RHEL, do you mean you checked with the new firmware? If you made two changes at once (upgraded both the OS and the firmware), then maybe you should do only one at a time.
Eli, I'm with Steven. The problem doesn't appear to be port-compatibility related. We have two separate people with the same problem, both with Intel 5520/5500/X58 I/O Hubs; we have my confirmed report that the latest firmware works with a non-Intel I/O Hub; and we have confirmation that the problem goes away if you use the iommu=off command line switch to the kernel. I would lay 10 to 1 odds this is one of three things:

1) an IOMMU bug in the Intel IOMMU driver,
2) a driver bug in the mlx4 driver related to IOMMU mapping that only manifests on the X58 I/O Hub (possibly due to ordering or locking issues that only show up there), or
3) a firmware bug related to DMA transactions happening outside the temporal scope of the IOMMU mapping (which, again, might actually be a driver bug just uncovered by the later firmware).

Do you have an Intel Core i7/5520/X58 based system in house to reproduce?
(In reply to comment #14)
> Steve, where you're saying the problem did not occur in older versions of
> Fedora/RHEL, do you mean you checked with the new firmware?

It did not occur with older firmware. When he upgraded the firmware (which he first did in f13, I think), the problem started (and he did not upgrade Fedora at the same time). The problem has persisted through f14 and f15. I'm pretty sure Steve has confirmed that a firmware downgrade on the card solves the problem in all three Fedora versions.

> If you made two changes at once (upgrade both OS and firmware), then maybe
> you should do only one at a time.

I'm positive that's not the case. He's done upgrade/downgrade testing of the firmware on a single kernel version and demonstrated that the problem happens with the latest firmware and not with older firmware.
OK, so it's not related to OS at all. You should contact your FAE to continue handling this case.
Some additional info that may help (or not). We did some fresh installations on our "faulty" machine with the ConnectX card and firmware 2.9.1000. Booting with Ubuntu 10.10 or 11.10 and modprobing the required modules (mlx4_ib, ib_ipoib, ...), the device can easily connect to the IB infrastructure. The link comes up. Summing up our tests:

VMware ESXi 4.1 + latest driver + firmware 2.7.000         -> Link
VMware ESXi 4.1 + latest driver + firmware 2.9.1000        -> No link
Ubuntu 10.10 + driver of kernel 2.6.35 + firmware 2.9.1000 -> Link
Ubuntu 11.10 + driver of kernel 3.0 + firmware 2.9.1000    -> Link

Regarding the IOMMU: here are the dmesg lines of kernel 3.0.0. It seems as if the IOMMU is active.

...
[ 0.011590] CPU0: Thermal monitoring enabled (TM1)
[ 0.011597] using mwait in idle threads.
[ 0.013326] ACPI: Core revision 20110413
[ 0.016224] ftrace: allocating 26000 entries in 102 pages
[ 0.022980] DMAR: Host address width 39
[ 0.022984] DMAR: DRHD base: 0x000000fbffe000 flags: 0x1
[ 0.022994] IOMMU 0: reg_base_addr fbffe000 ver 1:0 cap c90780106f0462 ecap f02076
[ 0.022998] DMAR: RMRR base: 0x000000000ec000 end: 0x000000000effff
[ 0.023001] DMAR: RMRR base: 0x000000df7ec000 end: 0x000000df7fffff
[ 0.023003] DMAR: ATSR flags: 0x0
[ 0.023097] Switched APIC routing to physical flat.
[ 0.023410] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[ 0.123172] CPU0: Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz stepping 04
[ 0.238402] Performance Events: PEBS fmt1+, erratum AAJ80 worked around, Nehalem events, Intel PMU driver.
...
Eli,

IMO the firmware is broken. I believe some change in kernel.org exposes this existing non-PCIe-compliant firmware issue, but kernel/firmware is not my area of expertise, so I could be wrong. I asked Doug to involve Mellanox, since Mellanox would probably want to know its firmware doesn't work with new kernels. Please note that only Intel platforms exhibit this error (their IOMMU is different), not AMD platforms.

F15   + Mellanox new firmware = error
RHEL6 + Mellanox new firmware = NO error
RHEL6 + old firmware          = NO error
F15   + old firmware          = NO error

Recognizing a pattern here: new firmware + new kernel.org = problems.
Eli, Responding to comment #17, I don't know what an FAE is, and I don't have time to contact anyone else regarding this matter. Best of luck -steve
Mellanox kindly provided us with different firmware images for our MHGH19-XTC card, so we are able to get a better picture. In our case the error starts with firmware 2.7.9112. The latest working firmware is 2.7.8100. Disabling VT-d in the mainboard BIOS or in VMware ESXi 4.1 does not help to get a link with the card driven by a recent firmware.

From the Mellanox support documents I can see the following changes for our ConnectX card in 2.7.9112:

Bug fix: PCIe slow handling of configuration cycles may cause NMI
New feature: Support for link speed/width changing via SET_PORT

Changes that should only be relevant for ConnectX-2 cards in this firmware release:

New feature: Added PCIe Multifunction support
New feature: Added WOL over Ethernet support

@Steve: Maybe you could request the same firmware images for your card and do a short cross-check.
FAE is Field Application Engineer. I will notify tech support of the issue.
We are looking into this issue internally. Meanwhile, can you please let us know how many Cisco switches you have? And do you have any other IB switches you can test with, or can you connect the HCAs back-to-back, to see if this is a compatibility issue with the specific Cisco switch?

Thanks
Tziporet
Markus hijacked my bugzilla, so I'm not sure who you want info from. I am not using a Cisco switch and have none to test with; I have a Mellanox switch of some type. The HCAs cannot be connected back to back; they are in an engineering lab which I cannot physically access.
I'm sorry if I confused the thread with additional info from our environment, but hopefully my feedback helps to identify Steve's problem, as he cannot change his setup.

@Tziporet: Remember, we can only reproduce the link failure on an X58-chipset VMware ESX host. We have only one switch between the machines. Changing the switch did not help to correct the error. We changed over to a Flextronics F-X430060 SDR (which should be based on InfiniScale III, like the Cisco). The problem stays the same: no link with firmware 2.7.9112 or a higher version. I'll check if we have free ports on a machine with a subnet manager so that we can test the requested direct-link scenario.
Nice try, but a direct link does no better. We removed the switch from our setup and connected two hosts directly with an InfiniBand cable. The setup is:

Host 1:
Xeon 5420, Intel 5400 chipset
Ubuntu Linux 10.04 LTS
OpenSM
Mellanox MHGH DDR IB card (firmware 2.8.0000)
Default drivers

Host 2:
Core i7, Intel X58 chipset
VMware ESXi 4.1
Mellanox MHGH DDR IB card (firmware 2.7.9112)
Mellanox official drivers

The link comes up as soon as we go back to firmware 2.7.8100 on the ESX server.
Hi All. Do you still see this problem? Dotan
Hi All,

I have encountered this exact issue on Fedora 17. I do not know if adding to this bug report is the correct course of action; however, it did seem to still be open and totally relevant. The details (if it helps) are as follows:

System: SGI C1001
CPU: Intel(R) Xeon(R) CPU X5570 @ 2.93GHz x 2
Mainboard: uses the Intel Corporation 5520/5500/X58 chipset
InfiniBand adapter: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev a0)
Firmware on IB card: fw-25408-2_9_1000-MHQH19-XTC_A1-A4.bin
Kernel version: 3.4.4-5.fc17.x86_64 #1 SMP Thu Jul 5 20:20:59 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

And the repeating error is:

[ 6460.316497] DRHD: handling fault status reg 602
[ 6460.316504] DMAR:[DMA Read] Request device [07:00.6] fault addr 3043d0000
[ 6460.316506] DMAR:[fault reason 02] Present bit in context entry is clear

If there is any testing you would like performed, please let me know.
Hello, you are right, this issue has not been resolved yet. I guess we won't see a fix for this until Mellanox provides new firmware, so we are still on 2.7.8100 on the affected servers.

Markus
This message is a notice that Fedora 14 is now at end of life. Fedora has stopped maintaining and issuing updates for Fedora 14. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At this time, all open bugs with a Fedora 'version' of '14' have been closed as WONTFIX. (Please note: Our normal process is to give advanced warning of this occurring, but we forgot to do that. A thousand apologies.)

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, feel free to reopen this bug and simply change the 'version' to a later Fedora version.

Bug Reporter: Thank you for reporting this issue and we are sorry that we were unable to fix it before Fedora 14 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to click on "Clone This Bug" (top right of this page) and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Guys, hi.

Regarding the following messages:

[ 6460.316497] DRHD: handling fault status reg 602
[ 6460.316504] DMAR:[DMA Read] Request device [07:00.6] fault addr 3043d0000
[ 6460.316506] DMAR:[fault reason 02] Present bit in context entry is clear

Do you see them only when the link does not come up? In other words, can I connect the link issues to those error messages?

Thanks
- Saeed
Saeed, the issue prevents the cards from being usable, and reliably so. However, a lack of link up for other reasons does not mean this problem will suddenly materialize. So I think it's fair to say that this problem causes a failure to get link up, but not the other way around. Additionally, the problem goes away with older firmware and is 100% reproducible on later firmware (see comment #21 for the exact version that introduces this issue).

The problem here is that different distro kernels have different default settings for the Intel IOMMU support. In RHEL (all versions so far), the Intel IOMMU driver is disabled by default and must be enabled by passing intel_iommu=on via the kernel command line. In earlier Fedora kernels, the Intel IOMMU is off by default, and in later Fedora kernels it is enabled by default. So, here are the requirements as I understand them to reproduce this issue:

Intel IOMMU present and enabled in BIOS
Intel IOMMU driver enabled in kernel
mlx4 driver loaded
later mlx4 firmware on card (2.7.9112 or later)

However (and this is an important however!), the silence of this issue when the IOMMU is not enabled does not mean this bug can be ignored! There are two possible causes here: 1) a bug in the Intel IOMMU driver and/or hardware, or 2) a bug in the mlx4 driver and/or firmware. If the answer to this problem is #2, then we have a silent data corrupter on our hands, and the fact that the Intel IOMMU driver catches it does not mean it doesn't happen elsewhere; it just means that without the Intel IOMMU enabled we don't know when it happens.

So please take a look at the difference between the 2.7.8100 and 2.7.9112 firmware source code and root-cause why updating to that firmware causes this bug to suddenly appear (whether because the driver is doing something the firmware doesn't like, or because the firmware is doing something the driver doesn't like, or because the Intel IOMMU is getting confused by something the firmware is doing, or whatever).
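For anyone trying to reproduce, the "Intel IOMMU driver enabled" requirement is easy to get wrong, since the knob is intel_iommu=on, not iommu=on. Here's a minimal sketch of the check (the helper name is mine, not from this bug; on a live system you would pass "$(cat /proc/cmdline)"):

```shell
# Illustrative helper: report whether intel_iommu=on is on a kernel cmdline.
iommu_enabled() {
  # $1: a kernel command line string
  case " $1 " in
    *" intel_iommu=on "*) echo yes ;;  # Intel IOMMU driver explicitly enabled
    *) echo no ;;                      # absent, or only the generic iommu= knob
  esac
}

iommu_enabled "root=/dev/sda1 rhgb quiet"                 # -> no
iommu_enabled "root=/dev/sda1 iommu=on rhgb"              # -> no  (wrong knob)
iommu_enabled "root=/dev/sda1 intel_iommu=on rhgb quiet"  # -> yes
```

Note that on later Fedora kernels the Intel IOMMU defaults to on, so an empty result from this check does not by itself prove the driver is disabled there.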
Doug, I believe this is somehow related to the SRIOV implementation in the FW, since the DMA Read error messages show that "someone" is trying to read using the address of a non-existing virtual function. Looking at the bug description:

[ 1680.967048] DMAR:[DMA Read] Request device [04:00.6] fault addr f647a000

while the HCA is configured as a single function:

04:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s]

Since you said this is 100% reproducible, I will try to reproduce and debug. I will update ASAP.

Thanks for the helpful information
-Saeed
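The phantom-function observation above can be checked mechanically: pull the requester BDF out of the DMAR fault line and look for it in lspci output. A sketch (the helper names and the shortened lspci line are illustrative, not from the bug report):

```shell
# Extract the requester BDF (bus:dev.fn) from a DMAR fault line.
dmar_fault_bdf() {
  echo "$1" | sed -n 's/.*Request device \[\([0-9a-f:.]*\)\].*/\1/p'
}

# Report whether a BDF appears at the start of any lspci output line.
bdf_exists() {
  # $1: lspci output, $2: BDF to look for
  echo "$1" | grep -q "^$2 " && echo present || echo absent
}

fault="[ 1680.967048] DMAR:[DMA Read] Request device [04:00.6] fault addr f647a000"
lspci_out="04:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s]"

dmar_fault_bdf "$fault"                               # -> 04:00.6
bdf_exists "$lspci_out" "$(dmar_fault_bdf "$fault")"  # -> absent
```

Here the fault cites function 04:00.6 while lspci only ever enumerated 04:00.0, which is what points at the firmware issuing DMA with a requester ID for a function that does not exist.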
Saeed, could you please give us an update on this? You said "the HCA is configured as single function." Is this something user configurable? Does configuring the HCA as multi-function resolve the problem?
This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle. Changing version to '19'. (As we did not run this process for some time, it could affect also pre-Fedora 19 development cycle bugs. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.) More information and reason for this action is here: https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19
What's happening with this bug? If this is still present in latest Mellanox firmware, I guess they're not actively working on a fix. If there's a need for it, it would be possible to handle this with a quirk, similar to what's being done for other devices with buggy DMA source tags.
Hi Andrew, and sorry for the late response.

> You said "the HCA is configured as single function." Is this something user
> configurable? Does configuring the HCA as multi-function resolve the problem?

You will need to upgrade the driver in order to enable SRIOV. Also, the native driver should work even with SRIOV-capable FW. Anyway, I tried in the past and also this week to reproduce the issue, with no success.

My configuration:

uname -a
Linux reg-ovm-036 2.6.38.6-26.rc1.fc15.x86_64 #1 SMP Mon May 9 20:45:15 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

FW: 2.9.1000
CPU: Intel(R) Xeon(R) CPU E5606
server: product: X8DTN (1234567890)
        vendor: Supermicro
        version: 1234567890
        serial: 1234567890
        width: 64 bits

IOMMU is enabled:

[ 0.000000] Command line: root=UUID=28c5fb12-fcf3-46dc-9ee4-f57c2735d661 rhgb iommu=on
[ 0.000000] Kernel command line: root=UUID=28c5fb12-fcf3-46dc-9ee4-f57c2735d661 rhgb iommu=on
[ 0.085235] IOMMU 0: reg_base_addr fbffe000 ver 1:0 cap c90780106f0462 ecap f020f6
[ 1.381990] IOMMU 0 0xfbffe000: using Queued invalidation
[ 1.382283] IOMMU: Setting RMRR:
[ 1.382580] IOMMU: Setting identity map for device 0000:00:1d.0 [0xbf7ec000 - 0xbf800000]
[ 1.383171] IOMMU: Setting identity map for device 0000:00:1d.1 [0xbf7ec000 - 0xbf800000]
[ 1.383737] IOMMU: Setting identity map for device 0000:00:1d.2 [0xbf7ec000 - 0xbf800000]
[ 1.384307] IOMMU: Setting identity map for device 0000:00:1d.7 [0xbf7ec000 - 0xbf800000]
[ 1.384872] IOMMU: Setting identity map for device 0000:00:1a.0 [0xbf7ec000 - 0xbf800000]
[ 1.385443] IOMMU: Setting identity map for device 0000:00:1a.1 [0xbf7ec000 - 0xbf800000]
[ 1.386014] IOMMU: Setting identity map for device 0000:00:1a.2 [0xbf7ec000 - 0xbf800000]
[ 1.386572] IOMMU: Setting identity map for device 0000:00:1a.7 [0xbf7ec000 - 0xbf800000]
[ 1.387127] IOMMU: Setting identity map for device 0000:00:1d.0 [0xe6000 - 0xea000]
[ 1.387660] IOMMU: Setting identity map for device 0000:00:1d.1 [0xe6000 - 0xea000]
[ 1.388198] IOMMU: Setting identity map for device 0000:00:1d.2 [0xe6000 - 0xea000]
[ 1.388728] IOMMU: Setting identity map for device 0000:00:1d.7 [0xe6000 - 0xea000]
[ 1.389264] IOMMU: Setting identity map for device 0000:00:1a.0 [0xe6000 - 0xea000]
[ 1.389796] IOMMU: Setting identity map for device 0000:00:1a.1 [0xe6000 - 0xea000]
[ 1.390333] IOMMU: Setting identity map for device 0000:00:1a.2 [0xe6000 - 0xea000]
[ 1.390862] IOMMU: Setting identity map for device 0000:00:1a.7 [0xe6000 - 0xea000]
[ 1.391401] IOMMU: Prepare 0-16MiB unity mapping for LPC
[ 1.391700] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0x1000000]

dmesg | grep -i DMAR
[ 0.000000] ACPI: DMAR 00000000bf77e0d0 00140 (v01 AMI OEMDMAR 00000001 MSFT 00000097)
[ 0.084649] DMAR: Host address width 40
[ 0.084941] DMAR: DRHD base: 0x000000fbffe000 flags: 0x1
[ 0.085750] DMAR: RMRR base: 0x000000000e6000 end: 0x000000000e9fff
[ 0.086046] DMAR: RMRR base: 0x000000bf7ec000 end: 0x000000bf7fffff
[ 0.086332] DMAR: ATSR flags: 0x0

Is this the correct configuration required to reproduce?
Saeed, I'm pretty sure you need intel_iommu=on on the kernel command line to reproduce.
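To make that concrete, a sketch of the string edit involved; the helper name is mine, and on Fedora the usual place to persist this is the GRUB_CMDLINE_LINUX line in /etc/default/grub (followed by regenerating the grub config):

```shell
# Illustrative helper: append intel_iommu=on to a kernel cmdline value,
# leaving it untouched if the option is already present.
add_intel_iommu() {
  # $1: e.g. the GRUB_CMDLINE_LINUX value
  case " $1 " in
    *" intel_iommu=on "*) echo "$1" ;;       # already there
    *) echo "$1 intel_iommu=on" ;;           # append the Intel IOMMU knob
  esac
}

add_intel_iommu "rhgb quiet"                 # -> rhgb quiet intel_iommu=on
add_intel_iommu "rhgb intel_iommu=on quiet"  # -> rhgb intel_iommu=on quiet
```

Note that iommu=on (as in the boot log in the previous comment) does not enable the Intel IOMMU driver; only intel_iommu=on does.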
This message is a notice that Fedora 19 is now at end of life. Fedora has stopped maintaining and issuing updates for Fedora 19. It is Fedora's policy to close all bug reports from releases that are no longer maintained. Approximately 4 (four) weeks from now this bug will be closed as EOL if it remains open with a Fedora 'version' of '19'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 19 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 19 changed to end-of-life (EOL) status on 2015-01-06. Fedora 19 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora, please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug.

Thank you for reporting this bug and we are sorry it could not be fixed.