Bug 889749 - IOMMU / AMD Vi Event: IO PAGE FAULT causes gbit NIC lockups
Summary: IOMMU / AMD Vi Event: IO PAGE FAULT causes gbit NIC lockups
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 18
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-12-23 09:17 UTC by bob
Modified: 2013-05-15 06:02 UTC
CC List: 6 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2013-04-08 13:16:51 UTC
Type: Bug
Embargoed:


Attachments
dmesg output (82.92 KB, text/plain)
2012-12-23 09:17 UTC, bob
Kernel log from startup till restart with the problem visible (98.45 KB, text/x-log)
2013-04-05 18:48 UTC, Erkki Seppälä

Description bob 2012-12-23 09:17:31 UTC
Created attachment 667943
dmesg output

Description of problem:

AMD box with IOMMU and virtualized I/O: When transferring large files using gigabit ethernet, the transfers will stall as the ethernet card locks up.  An error message is displayed on the console stating that there has been an IOMMU IO PAGE FAULT related to the NIC.  The NIC is locked up and unresponsive, requiring a system reboot.

Version-Release number of selected component (if applicable):

Linux 3.6.11-3.fc18.x86_64 #1 SMP Mon Dec 17 21:35:39 UTC 2012 x86_64

How reproducible:

Variable amount of data transferred before failure, but page faults always occur if you transfer enough data at Gbit speed.

Steps to Reproduce:
1. Enable IOMMU
2. Transfer large files (> 1 GB in size) over LAN using rsync or scp at gigabit speed.
3. Observe page fault and NIC lockup.
  
Actual results:

Ethernet lockups caused by I/O page faults in IOMMU.

Expected results:

Error free file transfers.

Additional info:

# uname -srvp
Linux 3.6.11-3.fc18.x86_64 #1 SMP Mon Dec 17 21:35:39 UTC 2012 x86_64

# cat /var/log/messages | grep "AMD-Vi"
Dec 23 01:12:03 kernel: [    1.004407] AMD-Vi: Found IOMMU at 0000:00:00.2 cap 0x40
Dec 23 01:12:03 kernel: [    1.012109] AMD-Vi: Lazy IO/TLB flushing enabled
Dec 23 01:46:22 kernel: [ 2080.961658] AMD-Vi: Event logged [IO_PAGE_FAULT device=02:00.0 domain=0x0014 address=0x0000000000003000 flags=0x0050]

# lspci | grep "02:00.0"
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 09)

# dmesg
(see attachment)
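
For anyone comparing logs against this report, here is a minimal sketch of pulling the fields out of an AMD-Vi event line. The regular expression is modeled only on the single IO_PAGE_FAULT line quoted above, so it is an assumption that other kernels use the same layout; the function name `parse_amd_vi_event` is made up for illustration.

```python
import re

# Pattern modeled on the one AMD-Vi event line quoted in this report.
# Assumption: other kernel versions may format the event differently.
EVENT_RE = re.compile(
    r"AMD-Vi: Event logged \[(?P<event>\w+) "
    r"device=(?P<device>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) "
    r"domain=(?P<domain>0x[0-9a-f]+) "
    r"address=(?P<address>0x[0-9a-f]+) "
    r"flags=(?P<flags>0x[0-9a-f]+)\]"
)


def parse_amd_vi_event(line):
    """Return the event fields as a dict, or None if the line does not match."""
    match = EVENT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the hexadecimal fields to integers for easier comparison.
    for key in ("domain", "address", "flags"):
        fields[key] = int(fields[key], 16)
    return fields


if __name__ == "__main__":
    sample = (
        "Dec 23 01:46:22 kernel: [ 2080.961658] AMD-Vi: Event logged "
        "[IO_PAGE_FAULT device=02:00.0 domain=0x0014 "
        "address=0x0000000000003000 flags=0x0050]"
    )
    print(parse_amd_vi_event(sample))
```

The `device` field is kept as a string so it can be matched directly against `lspci` output such as the `02:00.0` entry below.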

These errors were initially encountered with both IOMMU and HPET enabled in BIOS.  Disabling HPET in BIOS while leaving IOMMU enabled did not solve the problem.  It would appear that the only way to prevent this bug from occurring would be to disable IOMMU.  (Not a desirable option.)

These problems are occurring on a new AMD FX-8350 8-core CPU with an ASUS M5A97 motherboard that has the AMD 970 northbridge and 950 southbridge chipsets. I had been using the setup without issue for a couple of weeks at fast ethernet (100 Mbit) speeds. The problem first appeared today when I upgraded from a 10/100 switch to a gigabit switch and started doing backups and large file transfer benchmarks on the LAN. The problem is not in the switch; it only occurs during transfers on this box. The other boxes run different operating systems on Intel platforms and are not affected.

The Xen guys have been working on problems related to AMD-Vi, IOMMU and IO PAGE FAULTS.

Links that I hope may help:

http://support.amd.com/us/Embedded_TechDocs/34434-IOMMU-Rev_1.26_2-11-09.pdf

http://permalink.gmane.org/gmane.comp.emulators.xen.devel/138886

http://lists.xen.org/archives/html/xen-devel/2012-09/msg01859.html

Comment 1 Josh Boyer 2013-04-01 20:25:11 UTC
Are you still seeing this with 3.8.x?

Comment 2 Erkki Seppälä 2013-04-05 18:47:19 UTC
I'm not a Fedora user, but I noticed that my problem seems very similar to this one, so if it is the same problem, it hasn't been solved for all cases. I'm running kernel 3.8.5 (from Debian experimental). I cannot disable IOMMU entirely, as for some reason that breaks my USB. Disabling HPET has no effect. I have yet to try reducing the link speed to 100 Mbit, but that would not be a solution :). The problem occurs only when transferring data out, never when transferring data in.

I too am running a board (Asus Sabertooth 990FX 2.0) with RealTek R8168 and AMD FX-8350. The chipset is 990FX/SB950. The system has 16 GB memory installed.

Fragment from my kernel logs:

Apr  5 21:26:58 aiee kernel: [  288.814737] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x001e address=0x0000000000003000 flags=0x0050]
Apr  5 21:27:17 aiee kernel: [  307.928142] ------------[ cut here ]------------
Apr  5 21:27:17 aiee kernel: [  307.928155] WARNING: at /build/buildd-linux_3.8.5-1~experimental.1-amd64-_t_ZfP/linux-3.8.5/net/sched/sch_generic.c:254 dev_watchdog+0xe3/0x153()
Apr  5 21:27:17 aiee kernel: [  307.928159] Hardware name: To be filled by O.E.M.
Apr  5 21:27:17 aiee kernel: [  307.928163] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out

aiee# uname -a
Linux aiee 3.8-trunk-amd64 #1 SMP Debian 3.8.5-1~experimental.1 x86_64 GNU/Linux

aiee# lspci | grep 0a:00.0    
0a:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 09)

aiee# dmesg (in attachment)

Comment 3 Erkki Seppälä 2013-04-05 18:48:21 UTC
Created attachment 731992
Kernel log from startup till restart with the problem visible

Comment 4 bob 2013-04-05 23:11:32 UTC
> Are you still seeing this with 3.8.x?

I wouldn't know.  I filed this bug three months ago in December 2012 and never got a response.  When it was evident that nobody cared enough to respond to the bug report I gave up on getting support for the on-board chipset and bought an Intel Gigabit PCIE card.  Now things work.  Now I have no reason to worry about the kernel/Realtek driver problem.

Sorry I can't help, but I couldn't wait three months for someone to even acknowledge that the problem exists.

Comment 5 Erkki Seppälä 2013-04-10 04:23:37 UTC
I should add that the kernel option iommu=pt resolved my problem. I'm not sure what impact that might have on using KVM, though.
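
To confirm that the workaround is actually in effect after a reboot, a minimal sketch along these lines could inspect the running kernel's command line. It assumes a Linux system exposing `/proc/cmdline` and that `iommu=` may carry a comma-separated list of values; the helper names `iommu_options` and `passthrough_enabled` are hypothetical.

```python
def iommu_options(cmdline):
    """Collect the values of every iommu= parameter on a kernel command line."""
    values = []
    for token in cmdline.split():
        # Match only the generic iommu= parameter, not amd_iommu= or intel_iommu=.
        if token.startswith("iommu="):
            # Assumption: values may be given as a comma-separated list.
            values.extend(token[len("iommu="):].split(","))
    return values


def passthrough_enabled(cmdline):
    """Return True if iommu=pt was requested on the given command line."""
    return "pt" in iommu_options(cmdline)


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    with open("/proc/cmdline") as f:
        print(passthrough_enabled(f.read()))
```

This only reports what was requested at boot; it does not verify how the IOMMU driver acted on the option.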

Comment 6 Concerned Citizen 2013-05-15 01:52:38 UTC
This bug has nothing to do with IOMMU. Real cause is Realtek driver. 

Here is the real bug and patch to fix it (kudos to Francois Romieu):
https://bugzilla.kernel.org/show_bug.cgi?id=14962

Comment 7 Erkki Seppälä 2013-05-15 05:48:39 UTC
That change was integrated in v3.5-rc2-237-geb2dc35 and the problem still persisted, so it doesn't seem to be the root cause.

Comment 8 Erkki Seppälä 2013-05-15 05:59:13 UTC
..but maybe a similar fix by simply enumerating the version in the switch-case statement would apply here. I don't promise to try it, though :), the iommu=pt kernel switch has indeed been a 100% workaround for the issue for me.

After some digging I find that my card (one of the 8168F family) would be either RTL_GIGA_MAC_VER_35 or RTL_GIGA_MAC_VER_36, and the patch is only for RTL_GIGA_MAC_VER_34, so it may very well be the solution. Should've realized it earlier, I had seen the patch :(. Thanks for the pointer!

If it turns out to be the case then it should really be a module option (as well) so people can easily try it out.


