Bug 787054 - DMA: Out of SW-IOMMU space for 16 bytes
Summary: DMA: Out of SW-IOMMU space for 16 bytes
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 16
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Neil Horman
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2012-02-03 02:42 UTC by Juan Urroa
Modified: 2013-04-12 14:51 UTC

Fixed In Version:
Clone Of:
: 924733 (view as bug list)
Environment:
Last Closed: 2012-11-14 15:19:53 UTC
Type: ---
Embargoed:


Attachments
stap script to track dma applications (405 bytes, application/octet-stream)
2012-03-14 15:03 UTC, Neil Horman

Description Juan Urroa 2012-02-03 02:42:00 UTC
Description of problem:
From time to time I get this message
 kernel: [120431.630837] DMA: Out of SW-IOMMU space for 16 bytes at device 0000:00:1d.7
 in /var/log/messages and dmesg. The message repeats thousands of times and the messages file grows to hundreds of megabytes.
Eventually I have to reboot, most of the time because a panic occurs and occasionally because I can't use the USB network adapter.

Version-Release number of selected component (if applicable):

2.6.41.10-3.fc15.x86_64
but the problem has occurred in other kernel versions

How reproducible:
It happens after several hours or days of continuous use (it's a home server)

Steps to Reproduce:
1. Leave the system turned on
  
Actual results:
The system panics

Expected results:
The system should run as long as required

Additional info:
I think the device is a usb hub

Comment 1 Stanislaw Gruszka 2012-02-05 17:53:49 UTC
If you install kernel-debug, does it print some warnings?

Comment 2 Juan Urroa 2012-02-05 20:04:47 UTC
Sorry I don't have experience with kernel debugging, what do I have to do? just install the package and look for warnings when the problem occurs?
Thanks

Comment 3 Stanislaw Gruszka 2012-02-06 07:45:44 UTC
Yes, just do "yum install kernel-debug", boot the newly installed kernel, run the dmesg command from time to time and see if there is a "WARNING: at" message. Warnings should also be detected by the ABRT tool.

Comment 4 Dave Jones 2012-02-10 17:17:24 UTC
Try the kernel currently in updates-testing too. It's been rebased to a newer upstream release, so it may have fixes in this area.

Comment 5 Juan Urroa 2012-02-11 03:57:19 UTC
After two days running the debug kernel I found these warnings:

[  135.692479] avahi-daemon[1469]: WARNING: No NSS support for mDNS detected, consider installing nss-mdns!
[  689.292402] WARNING: at drivers/net/wireless/ath/ath9k/htc_drv_txrx.c:501 ath9k_htc_tx_process+0x3bb/0x3d0 [ath9k_htc]()


I will try the new kernel tomorrow
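
Of the two lines quoted in this comment, only the second matches the "WARNING: at" pattern that comment 3 asks about; the avahi-daemon line is a userspace message. A minimal sketch of that filter follows. It assumes the "WARNING: at <file>:<line>" format seen in this thread, and the function name kernel_warnings is made up for illustration.

```python
import re

# Kernel warnings in this thread begin with "WARNING: at <file>:<line>";
# userspace messages that merely contain "WARNING:" lack the "at" part.
KERNEL_WARNING = re.compile(r"WARNING: at (\S+):(\d+)")


def kernel_warnings(log_text):
    """Return (source_file, line_number) pairs for kernel warnings."""
    return [(m.group(1), int(m.group(2)))
            for m in KERNEL_WARNING.finditer(log_text)]


# The two lines quoted in this comment.
sample = (
    "[  135.692479] avahi-daemon[1469]: WARNING: No NSS support for mDNS "
    "detected, consider installing nss-mdns!\n"
    "[  689.292402] WARNING: at drivers/net/wireless/ath/ath9k/"
    "htc_drv_txrx.c:501 ath9k_htc_tx_process+0x3bb/0x3d0 [ath9k_htc]()\n"
)
for source_file, line_number in kernel_warnings(sample):
    print(f"{source_file}:{line_number}")
```

The same function can be given the saved output of dmesg instead of the inline sample.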

Comment 6 Juan Urroa 2012-02-21 18:43:58 UTC
After trying 2.6.42.3-2.fc15.x86_64.debug I got the same warnings:

[  146.880902] avahi-daemon[1441]: WARNING: No NSS support for mDNS detected, consider installing nss-mdns!
[  446.314648] WARNING: at drivers/net/wireless/ath/ath9k/htc_drv_txrx.c:501 ath9k_htc_tx_process+0x3bb/0x3d0 [ath9k_htc]()

The error still hasn't appeared, but sometimes several days pass without it.

Comment 7 Robin Rainton 2012-03-07 09:18:58 UTC
I'm getting something similar too. This is on a Lenovo X220. FC16, kernel:

3.2.7-1.fc16.x86_64 #1 SMP Tue Feb 21 01:40:47 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

This message is repeated in /var/log/messages about 10 times a second:

DMA: Out of SW-IOMMU space for 92 bytes at device 0000:03:00.0

lspci shows the problem appears to be with the Realtek wireless LAN (wifi) adapter, which makes sense as wireless networking stops working at this point:

# lspci
00:00.0 Host bridge: Intel Corporation 2nd Generation Core Processor Family DRAM Controller (rev 09)
00:02.0 VGA compatible controller: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller (rev 09)
00:16.0 Communication controller: Intel Corporation 6 Series/C200 Series Chipset Family MEI Controller #1 (rev 04)
00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (rev 04)
00:1a.0 USB Controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 04)
00:1b.0 Audio device: Intel Corporation 6 Series/C200 Series Chipset Family High Definition Audio Controller (rev 04)
00:1c.0 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 1 (rev b4)
00:1c.1 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 2 (rev b4)
00:1c.3 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 4 (rev b4)
00:1c.4 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 5 (rev b4)
00:1c.6 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 7 (rev b4)
00:1d.0 USB Controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 04)
00:1f.0 ISA bridge: Intel Corporation QM67 Express Chipset Family LPC Controller (rev 04)
00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset Family 6 port SATA AHCI Controller (rev 04)
00:1f.3 SMBus: Intel Corporation 6 Series/C200 Series Chipset Family SMBus Controller (rev 04)
03:00.0 Network controller: Realtek Semiconductor Co., Ltd. RTL8188CE 802.11b/g/n WiFi Adapter (rev 01)
0d:00.0 System peripheral: Ricoh Co Ltd Device e823 (rev 07)
0e:00.0 USB Controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 04)

There seems to be no way to fix this but to reboot. The problem begins to happen at random times after a reboot.
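
With the message repeating this often, a per-device tally may be easier to read than the raw log. The sketch below assumes the message format quoted in this report; tally_failures is a name chosen for illustration.

```python
import re
from collections import Counter

SWIOTLB_MSG = re.compile(
    r"DMA: Out of SW-IOMMU space for (\d+) bytes at device (\S+)")


def tally_failures(log_lines):
    """Count failed mappings and sum the requested bytes per PCI device."""
    counts, requested = Counter(), Counter()
    for line in log_lines:
        match = SWIOTLB_MSG.search(line)
        if match:
            size, device = int(match.group(1)), match.group(2)
            counts[device] += 1
            requested[device] += size
    return counts, requested


# Inline sample built from the message quoted in this comment.
sample = [
    "DMA: Out of SW-IOMMU space for 92 bytes at device 0000:03:00.0",
    "DMA: Out of SW-IOMMU space for 92 bytes at device 0000:03:00.0",
    "some unrelated log line",
]
counts, requested = tally_failures(sample)
for device, hits in counts.most_common():
    print(device, hits, requested[device])
```

Any iterable of lines works as input, so an open file such as /var/log/messages can be passed directly.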

Comment 8 Robin Rainton 2012-03-09 18:11:54 UTC
Just updated and can confirm the problem persists with:

3.2.9-1.fc16.x86_64 #1 SMP Thu Mar 1 01:41:10 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Comment 9 Neil Horman 2012-03-09 19:08:43 UTC
It would seem you're losing DMA space either to a leak in the driver or to simple exhaustion of the available space. Can you tell if this exhaustion is coupled with any other sort of behavior (an interruption of wireless service, perhaps, that would cause the tx queue to back up, or something of that nature)? The rtl tx queue is 128 skbs long. If that queue backs up I could imagine how the swiotlb space might get exhausted, especially if you have other devices competing for it.
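
For a rough sense of scale, the arithmetic can be sketched as below. The 64 MiB pool and 2 KiB slot size are assumptions (they are the swiotlb defaults as far as I recall, and the kernel boot log should show the real pool size), and the 1500-byte frame size is hypothetical.

```python
import math

POOL_BYTES = 64 * 1024 * 1024  # assumed default swiotlb pool size (64 MiB)
SLOT_BYTES = 2 * 1024          # assumed slot size (2 KiB)


def slots_needed(mapping_bytes):
    """A mapping occupies whole slots, so even 16 bytes costs one slot."""
    return max(1, math.ceil(mapping_bytes / SLOT_BYTES))


total_slots = POOL_BYTES // SLOT_BYTES
# A full tx queue of 128 frames, each assumed to be at most 1500 bytes.
queue_slots = 128 * slots_needed(1500)
print(total_slots, queue_slots, queue_slots / total_slots)
```

Under these assumptions the pool holds 32768 slots and a full 128-frame queue takes 128 of them, so other consumers (or mappings that are never released) would have to account for most of the space.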

Comment 10 Robin Rainton 2012-03-09 19:54:07 UTC
(In reply to comment #9)
> Can you tell if this exhaustion is coupled with any other sort of behavior...

Thanks for helping, but sadly I have no idea how to check these things. If someone can give some commands to enter that could show some sort of debugging that could help here I'm happy to try them though.

Comment 11 Neil Horman 2012-03-09 20:53:50 UTC
It's going to be tough to check; it's the sort of thing you'd likely want a stap script for. I'll see what I can write up for it.

Comment 12 Juan Urroa 2012-03-14 01:24:25 UTC
With 2.6.42.3-2.fc15.x86_64 I'm having an uptime of 22 days, maybe the original problem is solved

Comment 13 Robin Rainton 2012-03-14 07:59:19 UTC
Sorry, but I'm seeing these messages still with...

3.2.9-2.fc16.x86_64 #1 SMP Mon Mar 5 20:55:39 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

... so if it was fixed in a 2.6 kernel it's back in 3.2.

Also, I had previously only seen the issue with a WiFi device, but I plugged a USB thumb drive in the other day while watching the log and noticed that it also caused the "DMA: Out of SW-IOMMU space" message on a different device ID, so it seems the problem is not limited to just WiFi, at least in my setup (Lenovo X220, FC16).

Comment 14 Neil Horman 2012-03-14 14:18:50 UTC
Hm, I assume you mean to indicate that, in comment 13, you didn't have the wifi device in use?  If so, that suggests you have some other piece of hardware sucking up your swiotlb space.

Comment 15 Robin Rainton 2012-03-14 14:28:29 UTC
Yes, I also saw...

Mar 10 15:04:33 x220 kernel: [ 4502.707910] DMA: Out of SW-IOMMU space for 3 bytes at device 0000:00:1a.0

From my previous lspci you can see that device 1a.0 is a USB controller.
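
The lookup from a logged device address to its lspci entry can be sketched as follows. The function names are illustrative, and the sketch assumes plain lspci output, which omits the 0000: domain prefix that the kernel message includes.

```python
def parse_lspci(text):
    """Map short PCI addresses (bus:dev.fn) to their lspci descriptions."""
    devices = {}
    for line in text.splitlines():
        address, _, description = line.strip().partition(" ")
        if address and description:
            devices[address] = description
    return devices


def describe(kernel_address, devices):
    """Look up a kernel-style address such as 0000:00:1a.0."""
    # Strip the leading PCI domain when the address has the form dddd:bb:dd.f.
    if kernel_address.count(":") == 2:
        kernel_address = kernel_address.split(":", 1)[1]
    return devices.get(kernel_address, "unknown device")


# Two entries copied from the lspci output in comment 7.
lspci_text = """\
00:1a.0 USB Controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 04)
03:00.0 Network controller: Realtek Semiconductor Co., Ltd. RTL8188CE 802.11b/g/n WiFi Adapter (rev 01)
"""
devices = parse_lspci(lspci_text)
print(describe("0000:00:1a.0", devices))
print(describe("0000:03:00.0", devices))
```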

Comment 16 Neil Horman 2012-03-14 14:46:52 UTC
Well, yes, I know about the USB controller from comment 13. What I'm saying is that if you previously saw your wifi NIC getting errors without this USB device plugged in, and are now seeing this error on your USB device without using your wifi NIC, then you most likely have some third device that is hogging all the space in the software IOMMU. It is possibly allocating a huge chunk of DMA space and never releasing it, causing other devices in your system to run out of DMA-able space in the IOMMU. I'm writing a stap script to track this down now.
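
The failure mode described here can be illustrated with a toy model. This is not how swiotlb is implemented (it ignores slot contiguity and alignment, for instance); SlotPool and the device names are hypothetical and only show why a tiny request fails once one owner holds every slot.

```python
class SlotPool:
    """Toy fixed-size pool standing in for the software IOMMU buffer."""

    def __init__(self, total_slots):
        self.total_slots = total_slots
        self.in_use = {}  # owner name -> number of slots currently held

    def free_slots(self):
        return self.total_slots - sum(self.in_use.values())

    def map(self, owner, slots=1):
        """Reserve slots for owner; return False when the pool is exhausted."""
        if slots > self.free_slots():
            return False
        self.in_use[owner] = self.in_use.get(owner, 0) + slots
        return True

    def unmap(self, owner, slots=1):
        """Release slots previously reserved by owner."""
        held = self.in_use.get(owner, 0)
        self.in_use[owner] = max(0, held - slots)


pool = SlotPool(total_slots=8)
# A hypothetical "hog" keeps mapping buffers and never unmaps them.
while pool.map("hog"):
    pass
# A well-behaved device now fails even when asking for a single slot.
print(pool.map("usb"), pool.in_use)
```

Once the hog releases even one slot, the next small request succeeds again, which is why tracking who holds the mappings matters more than the size of the failing request.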

Comment 17 Neil Horman 2012-03-14 15:03:47 UTC
Created attachment 570008 [details]
stap script to track dma applications

Here you go. This is a stap script to track DMA memory allocations in the software IOMMU. Please boot your system and run this script under systemtap. Please try to keep any hardware that might use DMA dormant until after you start the stap script. If you send me the output I can take a look and see what's eating all your swiotlb memory.

Comment 18 Robin Rainton 2012-03-15 18:22:14 UTC
Re: #17, I assume one is meant to use this script with 'stap dma.stap'? I tried that and got...

ERROR: kernel read fault at 0x          (null) (addr) near identifier '$hwdev' at dma.stap:4:3
WARNING: Number of errors: 1, skipped probes: 1
WARNING: /usr/bin/staprun exited with status: 1
Pass 5: run failed.  Try again with another '--vp 00001' option.

Sorry once more if I've done this wrong - it's all new to me!

Comment 19 Neil Horman 2012-03-15 19:38:31 UTC
Yes, that's all you should have to do. Sounds like you don't have the debuginfo packages for your running kernel installed.

Comment 20 Robin Rainton 2012-03-16 11:25:49 UTC
Hmmm... well... call me a n00b at kernel debugging but just can't get this to work.

I have all these packages installed:

kernel.x86_64                           3.2.9-2.fc16           @updates         
kernel-debug.x86_64                     3.2.9-2.fc16           @updates         
kernel-debug-debuginfo.x86_64           3.2.9-2.fc16           @updates-debuginfo
kernel-debug-devel.x86_64               3.2.9-2.fc16           @updates         
kernel-debuginfo.x86_64                 3.2.9-2.fc16           @updates-debuginfo
kernel-debuginfo-common-x86_64.x86_64   3.2.9-2.fc16           @updates-debuginfo
kernel-devel.x86_64                     3.2.9-2.fc16           @updates         
kernel-headers.x86_64                   3.2.9-2.fc16           @updates         
kernel-tools.x86_64                     3.2.9-2.fc16           @updates         
kernel-tools-debuginfo.x86_64           3.2.9-2.fc16           @updates-debuginfo
kernel-tools-devel.x86_64               3.2.9-2.fc16           @updates  

Whether I boot into the standard kernel or the debug kernel, I still see that 'read fault' error when trying to execute the stap script :(

Any idea what I'm doing wrong?
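
One way to double-check the package list in this comment against the running kernel is sketched below. It assumes Fedora naming, where "uname -r" prints something like 3.2.9-2.fc16.x86_64 (with a trailing .debug for the debug kernel), and the helper names are made up for illustration.

```python
def required_debuginfo(uname_release):
    """Return the (package_name, version) expected for the running kernel."""
    variant = ""
    if uname_release.endswith(".debug"):
        uname_release = uname_release[: -len(".debug")]
        variant = "-debug"
    # Split "3.2.9-2.fc16.x86_64" into version and architecture.
    version, _, arch = uname_release.rpartition(".")
    return f"kernel{variant}-debuginfo.{arch}", version


def has_matching_debuginfo(uname_release, package_list_text):
    """Check a yum-list style listing for the matching debuginfo package."""
    name, version = required_debuginfo(uname_release)
    for line in package_list_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == name and fields[1] == version:
            return True
    return False


# Two of the entries listed in this comment.
packages = """\
kernel-debug-debuginfo.x86_64           3.2.9-2.fc16           @updates-debuginfo
kernel-debuginfo.x86_64                 3.2.9-2.fc16           @updates-debuginfo
"""
print(has_matching_debuginfo("3.2.9-2.fc16.x86_64", packages))
print(has_matching_debuginfo("3.2.9-2.fc16.x86_64.debug", packages))
```

If this check passes for the release that "uname -r" reports, missing debuginfo is probably not the whole explanation for the read fault.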

Comment 21 Neil Horman 2012-03-16 12:46:43 UTC
Not sure, it's working fine for me under only a slightly more recent kernel. You can try running stap with --skip-badvars to see if that helps. It's possible that hwdev is NULL when it's passed into the swiotlb code, but that would be bad for lots of reasons.

Comment 22 Robin Rainton 2012-03-19 07:55:57 UTC
Well - I just can't get any sense out of stap :( With '--skip-badvars' some output does appear, but percent placeholders are shown rather than any values. Also, the output doesn't seem to vary based on the problem being present or not.

I'm now on 3.2.10-3.fc16.x86_64 and the problem persists.

One thing I have noticed is this:

I'm using a bridge between Wifi and wired LAN connection on an X220 laptop. hostapd is running to turn the Wifi into a hotspot and the bridge allows traffic between clients on the hotspot and the main LAN. When clients are near to the laptop then data rates would be assumed to be good, when the signal gets weaker you would assume rates to drop. Interestingly I note that it takes a lot longer for the SW-IOMMU issue to occur when the clients (2 smartphones) have a good signal.

That is, when one or both are very far from the laptop hotspot with a poor signal, the SW-IOMMU problem occurs quite quickly (within an hour perhaps).

Could this be related to buffering of packets in the bridge? I wouldn't have a clue but it seems very suspicious. To be honest though, if some buffer were filling how come the system cannot recover after that is drained? When the SW-IOMMU issue occurs, the hotspot stops working and the clients disconnect, but... the messages keep flowing from the log. The wired LAN connection carries on functioning just fine but the only way to get the Wifi adapter to work once more is to reboot.

I'm currently travelling and only have the laptop to use as a Wifi hotspot and this is a bit frustrating.

Does this help at all?

Comment 23 Neil Horman 2012-03-19 12:40:06 UTC
Unfortunately no, it doesn't help much. Without any visibility into the problem it's difficult to tell what's going wrong on your system. If stap just isn't working for you, I can find some time to build you a kernel with some extra debug information.

In the interim, if you feel like the problem is definitely related to your wireless interface, the above description wouldn't point me to the bridge, but rather to the wireless hardware itself. My guess would be that the wireless NIC you're using is having to retransmit packets as your radio quality degrades, and that extra time (coupled with what is likely a deep hardware queue) is causing the lifetime of a DMA-mapped frame to grow, to the point where we are running out of space. I would recommend using iwconfig to shorten the number of retries per packet to see if dropping those retransmitted frames earlier prevents the problem from occurring.
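
The reasoning about frame lifetime can be put in numbers with Little's law (average items outstanding = arrival rate x time held). The sketch below is only an illustration: the frame rates and hold times are hypothetical, and it assumes a frame stays mapped only while it sits in the 128-entry hardware queue mentioned in comment 9.

```python
def in_flight_mappings(frames_per_second, mapped_seconds, queue_depth=128):
    """Estimate outstanding DMA mappings for one transmit queue.

    Little's law gives rate x holding time; the hardware queue depth
    caps how many frames can be mapped at once.
    """
    return min(frames_per_second * mapped_seconds, queue_depth)


# Hypothetical numbers, chosen only to show how the estimate scales.
good_signal = in_flight_mappings(frames_per_second=1000, mapped_seconds=0.002)
poor_signal = in_flight_mappings(frames_per_second=1000, mapped_seconds=0.5)
print(good_signal, poor_signal)
```

Fewer retries per packet shorten the hold time, which is the term this estimate is most sensitive to before the queue-depth cap is reached.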

Comment 24 Neil Horman 2012-04-09 17:53:13 UTC
Ping: any feedback on the suggestions in comment 23?

Comment 25 Robin Rainton 2012-04-10 15:33:27 UTC
I removed the bridge and used a NAT solution instead but the problem persists.

I also note that these issues can still occur (albeit less frequently) when USB devices are used.

Am now on 3.3.1-3.fc16.x86_64 #1 SMP Wed Apr 4 18:08:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Comment 26 Neil Horman 2012-04-10 16:41:15 UTC
That's great, but it does nothing in relation to the suggestions I gave in comment 23. Could you please try using iwconfig to shorten the number of retries per packet on your wireless interface?

Comment 27 Robin Rainton 2012-04-10 18:12:08 UTC
I tried this with no noticeable effect:

ifconfig wlan0 txqueuelen 10000
iwconfig wlan0 retry 1 rts 1024

Comment 28 Neil Horman 2012-04-10 20:04:06 UTC
Why did you make the txqueuelen so large? That's going to undo any effects from the retry changes. Having 10000 frames backed up is going to wind up causing similar problems.
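
To confirm what queue length is actually in effect, the value can be read back from sysfs. This is a minimal sketch: the tx_queue_len attribute under /sys/class/net/<interface>/ is standard, while the sysfs_root parameter is a hypothetical convenience so the function can be pointed at a test directory.

```python
from pathlib import Path


def tx_queue_len(interface, sysfs_root="/sys/class/net"):
    """Read the current transmit queue length of a network interface."""
    path = Path(sysfs_root) / interface / "tx_queue_len"
    return int(path.read_text().strip())
```

Calling tx_queue_len("wlan0") after the ifconfig command in comment 27 should report the configured value; the default is commonly 1000, though that depends on the driver.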

Comment 29 Robin Rainton 2012-04-10 20:17:05 UTC
The rationale was that if signal problems were causing the network to back up, and the problem occurred only when that backlog overflowed something, then making the queue larger would help.

I will try again with the default queue.

Comment 30 Robin Rainton 2012-04-10 21:48:46 UTC
Nope. This on its own (without the queuelen change) has no effect on the problem, which still persists:

iwconfig wlan0 retry 1 rts 1024

Comment 31 Neil Horman 2012-04-12 14:44:33 UTC
Hmm, that's odd. That points back to this not being a wireless networking problem. If you don't use the wireless NIC, the problem doesn't occur though, right?

Comment 32 Robin Rainton 2012-04-12 16:45:44 UTC
All I can say is that this problem occurs more often when wireless is in use, but I have seen it occur from time to time when using USB devices.

It would be true to say that this does not appear to be related solely to the wireless hardware then.

Comment 33 Josh Boyer 2012-09-04 17:06:55 UTC
Is this still happening with the 3.4 or 3.5 kernel updates?

Comment 34 Dave Jones 2012-10-23 15:33:13 UTC
# Mass update to all open bugs.

Kernel 3.6.2-1.fc16 has just been pushed to updates.
This update is a significant rebase from the previous version.

Please retest with this kernel, and let us know if your problem has been fixed.

In the event that you have upgraded to a newer release and the bug you reported
is still present, please change the version field to the newest release you have
encountered the issue with.  Before doing so, please ensure you are testing the
latest kernel update in that release and attach any new and relevant information
you may have gathered.

If you are not the original bug reporter and you still experience this bug,
please file a new report, as it is possible that you may be seeing a
different problem. 
(Please don't clone this bug, a fresh bug referencing this bug in the comment is sufficient).

Comment 35 Justin M. Forbes 2012-11-14 15:19:53 UTC
With no response, we are closing this bug under the assumption that it is no longer an issue. If you still experience this bug, please feel free to reopen the bug report.

Comment 36 Jan Jurko 2013-03-17 15:24:02 UTC
Hi. I wanted to create a new bug report but this one seems to suit my problem.

OS: Fedora 18 x86_64
HW: Lenovo ThinkPad E320
RAM: 8G (upgraded from 4)

Linux jarvis 3.8.3-201.fc18.x86_64 #1 SMP Thu Mar 14 21:28:05 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

I am not able to upload any data bigger than a few megs to the target storage over the ethernet (1Gbit nic, 1Gbit connection).

I tried:
scp, nfs, sshfs, ftp

When I start the upload, the speed drops rapidly and dmesg starts to show thousands of messages like:

DMA: Out of SW-IOMMU space for 1448 bytes at device 0000:08:00.0

The number of bytes varies.

The speed goes down from MB to kB and if I let it continue, the system will freeze.

With Fedora kernel 3.7.7 everything is OK. With 3.8.1, 3.8.2 and 3.8.3 (see my uname above) it is not.

lspci:

00:00.0 Host bridge: Intel Corporation 2nd Generation Core Processor Family DRAM Controller (rev 09)
00:02.0 VGA compatible controller: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller (rev 09)
00:16.0 Communication controller: Intel Corporation 6 Series/C200 Series Chipset Family MEI Controller #1 (rev 04)
00:1a.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 04)
00:1b.0 Audio device: Intel Corporation 6 Series/C200 Series Chipset Family High Definition Audio Controller (rev 04)
00:1c.0 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 1 (rev b4)
00:1c.1 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 2 (rev b4)
00:1c.2 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 3 (rev b4)
00:1c.5 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 6 (rev b4)
00:1d.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 04)
00:1f.0 ISA bridge: Intel Corporation HM65 Express Chipset Family LPC Controller (rev 04)
00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset Family 6 port SATA AHCI Controller (rev 04)
00:1f.3 SMBus: Intel Corporation 6 Series/C200 Series Chipset Family SMBus Controller (rev 04)
02:00.0 Network controller: Intel Corporation Centrino Wireless-N 1000 [Condor Peak]
03:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS5209 PCI Express Card Reader (rev 01)
03:00.1 SD Host controller: Realtek Semiconductor Co., Ltd. RTS5209 PCI Express Card Reader (rev 01)
08:00.0 Ethernet controller: Atheros Communications Inc. AR8151 v2.0 Gigabit Ethernet (rev c0)


In kernel 3.7.7 - never happened
In kernels 3.8.1, 3.8.2, 3.8.3 - always happens with high network traffic

Thank you for help.

JJ

Comment 37 ultima.ratio.regum69 2013-03-17 16:13:38 UTC
Hi, same problem as Jan Jurko

OS: Fedora 18 x86_64
lspci:
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor DRAM Controller (rev 09)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port (rev 09)
00:14.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller (rev 04)
00:16.0 Communication controller: Intel Corporation 7 Series/C210 Series Chipset Family MEI Controller #1 (rev 04)
00:1a.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #2 (rev 04)
00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 1 (rev c4)
00:1c.3 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 4 (rev c4)
00:1c.5 PCI bridge: Intel Corporation 82801 PCI Bridge (rev c4)
00:1c.6 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 7 (rev c4)
00:1c.7 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 8 (rev c4)
00:1d.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #1 (rev 04)
00:1f.0 ISA bridge: Intel Corporation Z77 Express Chipset LPC Controller (rev 04)
00:1f.2 SATA controller: Intel Corporation 7 Series/C210 Series Chipset Family 6-port SATA Controller [AHCI mode] (rev 04)
00:1f.3 SMBus: Intel Corporation 7 Series/C210 Series Chipset Family SMBus Controller (rev 04)
01:00.0 VGA compatible controller: NVIDIA Corporation Device 11c0 (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 0e0b (rev a1)
03:00.0 PCI bridge: PLX Technology, Inc. PEX8112 x1 Lane PCI Express-to-PCI Bridge (rev aa)
04:04.0 Multimedia audio controller: C-Media Electronics Inc CMI8788 [Oxygen HD Audio]
05:00.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 30)
07:00.0 Ethernet controller: Atheros Communications Inc. AR8151 v2.0 Gigabit Ethernet (rev c0)
08:00.0 USB controller: Etron Technology, Inc. EJ168 USB 3.0 Host Controller (rev 01)

It starts with a message like:
DMA: Out of SW-IOMMU space for 30432 bytes at device 0000:07:00.0
then 
DMA: Out of SW-IOMMU space for 2336 bytes at device 0000:07:00.0
and
DMA: Out of SW-IOMMU space for 54 bytes at device 0000:07:00.0

after that, a reboot is needed.

