Created attachment 667943 [details]
dmesg output

Description of problem:
AMD box with IOMMU and virtualized I/O: When transferring large files using gigabit ethernet, the transfers will stall as the ethernet card locks up. An error message is displayed on the console stating that there has been an IOMMU IO PAGE FAULT related to the NIC. The NIC is locked up and unresponsive, requiring a system reboot.

Version-Release number of selected component (if applicable):
Linux 3.6.11-3.fc18.x86_64 #1 SMP Mon Dec 17 21:35:39 UTC 2012 x86_64

How reproducible:
Variable amount of data transferred before failure, but page faults always occur if you transfer enough data at Gbit speed.

Steps to Reproduce:
1. Enable IOMMU
2. Transfer large files ( > 1 GB in size) over LAN using RSYNC or SCP at gigabit speed.
3. Observe page fault and NIC lockup.

Actual results:
Ethernet lockups caused by I/O page faults in IOMMU.

Expected results:
Error free file transfers.

Additional info:

# uname -srvp
Linux 3.6.11-3.fc18.x86_64 #1 SMP Mon Dec 17 21:35:39 UTC 2012 x86_64

# cat /var/log/messages | grep "AMD-Vi"
Dec 23 01:12:03 kernel: [ 1.004407] AMD-Vi: Found IOMMU at 0000:00:00.2 cap 0x40
Dec 23 01:12:03 kernel: [ 1.012109] AMD-Vi: Lazy IO/TLB flushing enabled
Dec 23 01:46:22 kernel: [ 2080.961658] AMD-Vi: Event logged [IO_PAGE_FAULT device=02:00.0 domain=0x0014 address=0x0000000000003000 flags=0x0050]

# lspci | grep "02:00.0"
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 09)

# dmesg
(see attachment)

These errors were initially encountered with both IOMMU and HPET enabled in BIOS. Disabling HPET in BIOS while leaving IOMMU enabled did not solve the problem. It would appear that the only way to prevent this bug from occurring would be to disable IOMMU. (Not a desirable option.)
These problems are occurring on a new AMD FX-8350 8-core CPU with an ASUS M5A97 motherboard that has the AMD 970 northbridge and 950 southbridge chipsets. I had been using the setup without issue for a couple of weeks at Fast Ethernet (100 Mbit) speeds. The problem appeared today when I upgraded from a 10/100 switch to a gigabit switch and started doing backups and large file transfer benchmarks on the LAN. The problem is not in the switch; it only occurs during transfers on this box. The other boxes run different operating systems on Intel platforms and are not affected.

The Xen guys have been working on problems related to AMD-Vi, IOMMU and IO PAGE FAULTS. Links that I hope may help:

http://support.amd.com/us/Embedded_TechDocs/34434-IOMMU-Rev_1.26_2-11-09.pdf
http://permalink.gmane.org/gmane.comp.emulators.xen.devel/138886
http://lists.xen.org/archives/html/xen-devel/2012-09/msg01859.html
Are you still seeing this with 3.8.x?
I'm not a Fedora user, but I noticed that my problem seems very similar to this one, so if it is the same problem, it hasn't been solved for all cases. I'm running kernel 3.8.5 (from Debian experimental).

I cannot disable IOMMU entirely, as for some reason that breaks my USB. Disabling HPET has no effect. I have yet to try reducing the link speed to 100 Mbit, but that would not be a solution :). The problem occurs only when transferring data out, never when transferring data in.

I too am running a board (Asus Sabertooth 990FX 2.0) with a Realtek R8168 and an AMD FX-8350. The chipset is 990FX/SB950. The system has 16 GB of memory installed.

Fragment from my kernel logs:

Apr 5 21:26:58 aiee kernel: [ 288.814737] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x001e address=0x0000000000003000 flags=0x0050]
Apr 5 21:27:17 aiee kernel: [ 307.928142] ------------[ cut here ]------------
Apr 5 21:27:17 aiee kernel: [ 307.928155] WARNING: at /build/buildd-linux_3.8.5-1~experimental.1-amd64-_t_ZfP/linux-3.8.5/net/sched/sch_generic.c:254 dev_watchdog+0xe3/0x153()
Apr 5 21:27:17 aiee kernel: [ 307.928159] Hardware name: To be filled by O.E.M.
Apr 5 21:27:17 aiee kernel: [ 307.928163] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out

aiee# uname -a
Linux aiee 3.8-trunk-amd64 #1 SMP Debian 3.8.5-1~experimental.1 x86_64 GNU/Linux

aiee# lspci | grep 0a:00.0
0a:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 09)

aiee# dmesg
(in attachment)
Created attachment 731992 [details] Kernel log from startup till restart with the problem visible
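Both fault events quoted so far report the same address (0x3000) and the same flags (0x0050), differing only in device and domain. For anyone comparing logs from further affected machines, a minimal Python sketch for pulling those fields out of a log line might look like the following. The regular expression is written against the two log lines quoted in this report only; the name `parse_io_page_fault` is made up for illustration, and other kernel versions may format the event differently.

```python
import re

# Matches the AMD-Vi IO_PAGE_FAULT event format seen in the two logs
# quoted in this report. Assumption: other kernels may print it differently.
FAULT_RE = re.compile(
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT "
    r"device=(?P<device>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) "
    r"domain=(?P<domain>0x[0-9a-f]+) "
    r"address=(?P<address>0x[0-9a-f]+) "
    r"flags=(?P<flags>0x[0-9a-f]+)\]"
)


def parse_io_page_fault(line):
    """Return the fault fields as a dict, or None if the line is not a fault event."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    return {
        "device": m.group("device"),             # PCI bus:device.function
        "domain": int(m.group("domain"), 16),    # IOMMU domain id
        "address": int(m.group("address"), 16),  # faulting I/O address
        "flags": int(m.group("flags"), 16),
    }


# The two fault lines quoted in this report.
SAMPLES = [
    "Dec 23 01:46:22 kernel: [ 2080.961658] AMD-Vi: Event logged "
    "[IO_PAGE_FAULT device=02:00.0 domain=0x0014 "
    "address=0x0000000000003000 flags=0x0050]",
    "Apr 5 21:26:58 aiee kernel: [ 288.814737] AMD-Vi: Event logged "
    "[IO_PAGE_FAULT device=0a:00.0 domain=0x001e "
    "address=0x0000000000003000 flags=0x0050]",
]

if __name__ == "__main__":
    for sample in SAMPLES:
        print(parse_io_page_fault(sample))
```

Feeding it the output of `grep "AMD-Vi"` line by line would show quickly whether other machines fault at the same address.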
> Are you still seeing this with 3.8.x?

I wouldn't know. I filed this bug three months ago in December 2012 and never got a response. When it was evident that nobody cared enough to respond to the bug report I gave up on getting support for the on-board chipset and bought an Intel Gigabit PCIE card. Now things work. Now I have no reason to worry about the kernel/Realtek driver problem. Sorry I can't help, but I couldn't wait three months for someone to even acknowledge that the problem exists.
I should add that the kernel option iommu=pt resolved my problem. I'm not sure what impact that might have on using KVM, though.
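To confirm that the workaround is actually in effect after a reboot, the running kernel's command line can be read from /proc/cmdline. A small Python sketch of that check, assuming the option is given as an `iommu=` parameter (possibly with several comma-separated values); the helper names are hypothetical:

```python
def iommu_options(cmdline):
    """Collect every value passed via iommu= on a kernel command line."""
    values = []
    for token in cmdline.split():
        if token.startswith("iommu="):
            # iommu= may carry several comma-separated values.
            values.extend(token.split("=", 1)[1].split(","))
    return values


def passthrough_requested(cmdline):
    """True if iommu=pt appears on the given kernel command line."""
    return "pt" in iommu_options(cmdline)


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            running = f.read()
    except OSError:
        running = ""  # not on Linux, or /proc is unavailable
    print("iommu=pt requested:", passthrough_requested(running))
```

This only shows what was requested at boot; whether the fault events stop is still best judged from the kernel log.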
This bug has nothing to do with the IOMMU. The real cause is the Realtek driver. Here is the real bug and the patch to fix it (kudos to Francois Romieu): https://bugzilla.kernel.org/show_bug.cgi?id=14962
That change was integrated in v3.5-rc2-237-geb2dc35 and the problem persisted, so it doesn't seem to be the root cause.
...but maybe a similar fix, simply adding the version to the switch-case statement, would apply here. I don't promise to try it, though :). The iommu=pt kernel option has indeed been a 100% workaround for the issue for me.

After some digging I find that my card (one of the 8168F family) would be either RTL_GIGA_MAC_VER_35 or RTL_GIGA_MAC_VER_36, and the patch is only for RTL_GIGA_MAC_VER_34, so it may very well be the solution. I should have realized it earlier, since I had seen the patch :(. Thanks for the pointer!

If it turns out to be the case, then it should really be a module option (as well) so people can easily try it out.