Bug 475156 - Out of SW-IOMMU space: External hard disk inaccessible
Summary: Out of SW-IOMMU space: External hard disk inaccessible
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 9
Hardware: x86_64
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Jarod Wilson
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2008-12-08 08:52 UTC by Emmanuel Kowalski
Modified: 2009-07-26 12:09 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 513827 (view as bug list)
Environment:
Last Closed: 2008-12-24 18:47:47 UTC
Type: ---
Embargoed:


Attachments
lspci -vv (31.79 KB, text/plain)
2008-12-08 08:52 UTC, Emmanuel Kowalski
Output of lsmod (3.94 KB, application/octet-stream)
2008-12-08 08:53 UTC, Emmanuel Kowalski
[patch] firewire: fw-ohci: fix possible IOMMU resource exhaustion (3.24 KB, patch)
2008-12-09 23:26 UTC, Stefan Richter

Description Emmanuel Kowalski 2008-12-08 08:52:36 UTC
Created attachment 326099
lspci -vv

Description of problem:
A FireWire-connected external hard disk becomes inaccessible after some time; from then on, any access produces the error message

Dec  8 09:42:05 localhost kernel: DMA: Out of SW-IOMMU space for 4096 bytes at device 0000:15:00.1

in the log.
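
To gauge how often the failure occurs and for which device, the log can be scanned for this message. The following is a minimal sketch in Python. The regular expression is modeled only on the single line quoted above, and the function name count_iommu_failures is made up for illustration.

```python
import re
from collections import Counter

# Pattern modeled on the one log line quoted in this report; other kernel
# versions may word the message differently (assumption).
PATTERN = re.compile(
    r"DMA: Out of SW-IOMMU space for (?P<size>\d+) bytes "
    r"at device (?P<device>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"
)


def count_iommu_failures(lines):
    """Return a Counter mapping PCI device address to number of failures."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group("device")] += 1
    return counts


# Both sample lines are taken from this report.
sample = [
    "Dec  8 09:42:05 localhost kernel: DMA: Out of SW-IOMMU space "
    "for 4096 bytes at device 0000:15:00.1",
    "firewire_sbp2: fw1.1: logged in to LUN 0000 (0 retries)",
]
failures = count_iommu_failures(sample)
```

In practice the lines would come from the system log rather than from a hard-coded list.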

Version-Release number of selected component (if applicable): 

kernel-2.6.27.5-41.fc9.x86_64.rpm

Linux zeta 2.6.27.5-41.fc9.x86_64 #1 SMP Thu Nov 13 20:29:07 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
Always on my Thinkpad T61p.

Steps to Reproduce:
1. Connect external hard drive (WD 3200JS, Firewire 400)
firewire_core: created device fw1: GUID 0090a990e011379d, S400
firewire_core: phy config: card 0, new root=ffc0, gap_count=5
scsi5 : SBP-2 IEEE-1394
firewire_sbp2: fw1.1: logged in to LUN 0000 (0 retries)
firewire_sbp2: fw1.1: sbp2_scsi_abort
scsi 5:0:0:0: Direct-Access     WD       3200JS External  106a PQ: 0 ANSI: 4
sd 5:0:0:0: [sdb] 625142448 512-byte hardware sectors (320073 MB)

2. Perform disk IO, e.g., run badblocks on the disk
  
Actual results:
After a variable amount of time, any disk access fails with the error message

Dec  8 09:42:05 localhost kernel: DMA: Out of SW-IOMMU space for 4096 bytes at device 0000:15:00.1

Expected results:
Disk functions normally.

Additional info:
I also tested the older kernel

kernel-2.6.26.6-79.fc9.x86_64.rpm

with the same behavior. 

The SATA controller is configured in the BIOS in AHCI mode. I will attach the output of lspci -vv and lsmod.

I suspect the disk may have hardware problems, but there was no message to that effect in the log before the SW-IOMMU errors.

Comment 1 Emmanuel Kowalski 2008-12-08 08:53:22 UTC
Created attachment 326100
Output of lsmod

Comment 2 Emmanuel Kowalski 2008-12-09 08:29:21 UTC
I have checked that the problem is also present on the Fedora 10 LiveCD.

Comment 3 Stefan Richter 2008-12-09 20:15:41 UTC
"Out of SW-IOMMU space" is a kernel bug (driver bug), not a disk problem.  Alas I currently have no PC which uses the SW-IOMMU or a real IOMMU to try to reproduce the issue.

Comment 4 Jarod Wilson 2008-12-09 20:32:45 UTC
Hm... I hit this one in the past, but it took quite a while to reproduce it, and I've not seen it in a while...

A few questions:
-How much memory is in your T61?
-How quickly can you reproduce the bug running the live CD?
-What is it you're doing to reproduce the bug?

I've actually got a T61 myself with the same firewire controller and 4GB of RAM that I can try to reproduce on...

Comment 5 Stefan Richter 2008-12-09 20:50:28 UTC
I just started to look through the code for potential dma_map_ imbalances, beginning with fw-ohci.

It seems ohci_cancel_packet() lacks dma_unmap_single(...payload...).  This code is executed whenever an outbound transaction was finished before the AT req DMA context tasklet was executed.
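
The effect of a missing unmap on one code path can be illustrated with a toy model. This is only a sketch, not the fw-ohci code: MappingPool, send_packets, the pool size of 16 slots, and the rate of cancellations are all hypothetical. They are chosen to show how leaking one mapping per cancelled packet eventually exhausts a fixed pool.

```python
class MappingPool:
    """Toy stand-in for a fixed-size DMA mapping pool (not the kernel's)."""

    def __init__(self, slots):
        self.free = slots

    def map(self):
        if self.free == 0:
            raise MemoryError("out of mapping space")
        self.free -= 1

    def unmap(self):
        self.free += 1


def send_packets(pool, count, cancel_every, unmap_on_cancel):
    """Map one payload per packet and release it on completion.

    Every `cancel_every`-th packet takes the cancel path instead of the
    normal completion path. Returns how many packets were sent before the
    pool ran dry, or `count` if it never did.
    """
    for sent in range(count):
        try:
            pool.map()
        except MemoryError:
            return sent
        cancelled = sent % cancel_every == 0
        if cancelled:
            if unmap_on_cancel:
                pool.unmap()  # behaviour with the unmap added
            # else: the mapping is leaked, as suspected in the comment above
        else:
            pool.unmap()      # normal completion path
    return count


buggy = send_packets(MappingPool(slots=16), count=1000,
                     cancel_every=10, unmap_on_cancel=False)
fixed = send_packets(MappingPool(slots=16), count=1000,
                     cancel_every=10, unmap_on_cancel=True)
```

With these made-up numbers, the leaking variant stops after 151 packets, while the variant that unmaps on cancel sends all 1000.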

Comment 6 Emmanuel Kowalski 2008-12-09 21:15:30 UTC
Here is the requested information:

-How much memory is in your T61?

4 GB

-How quickly can you reproduce the bug running the live CD?
-What is it you're doing to reproduce the bug?

It is quite quick: I connect the external drive, and then run 

# badblocks -v /dev/sdb1

and after about 1 or 2 minutes (about 1% of the 320 GB disk being checked), the log starts to fill up with the error message.

The drive also has a USB port, but I haven't tried to see what happens if I connect it in this way (I need to find the correct USB cable first).  I'll try to do this tomorrow.

Comment 7 Stefan Richter 2008-12-09 21:46:05 UTC
I added payload unmapping to ohci_cancel_packet and it gets called frequently on my ICH7 + FW323 based Mac mini with a FireWire disk.  This is obviously due to fw-sbp2's ORB pointer writes.

Stress test commences now, patch will follow.

Comment 8 Stefan Richter 2008-12-09 23:26:45 UTC
Created attachment 326429
[patch] firewire: fw-ohci: fix possible IOMMU resource exhaustion

Someone who can reproduce the bug please test the patch.
Posting: http://lkml.org/lkml/2008/12/9/333

Comment 9 Stefan Richter 2008-12-10 07:36:59 UTC
> Someone who can reproduce the bug please test the patch.

PS:
It can only be reproduced if the responder node, e.g. SBP-2 harddisk, concludes the transactions as split transactions ( = request, ack-pending, response, ack-complete) and if the tasklet timing happens to be ordered as mentioned in the patch.  Unified transactions ( = request, ack-complete) cannot cause the DMA mappings leak.

I got frequent ohci_cancel_packet with an OXUF924DSB.  Emmanuel's WD 3200JS is a SATA disk, hence it is quite likely that his bridge is an OXUF92?DS? too.  But I guess that many other bridge chips conclude ORB pointer writes as split transactions as well.

On the other hand, hitting non-coherent memory on a 4 GB PC may not be as likely.
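
The dependence on transaction type and tasklet ordering described in this comment can be sketched as a small event model. It is a simplification under stated assumptions: the event labels "at_req_tasklet" and "response_handled", the function leaked_mappings, and the `patched` flag are hypothetical. The model also assumes that a cancelled packet is not unmapped later by the AT req tasklet.

```python
def leaked_mappings(transaction_kind, events, patched=False):
    """Return how many payload mappings one block write leaves behind.

    transaction_kind: "split" (request, ack-pending, response, ack-complete)
                      or "unified" (request, ack-complete).
    events: order in which the two handlers run (hypothetical labels).
    patched: if True, the cancel path also unmaps the payload.
    """
    mapped = 1             # payload is mapped when the request is queued
    packet_pending = True  # request not yet processed by the AT req tasklet

    if transaction_kind == "unified":
        # No separate response, so only the AT req tasklet completes it.
        events = ["at_req_tasklet"]

    for event in events:
        if event == "at_req_tasklet":
            if packet_pending:
                mapped -= 1            # normal completion unmaps the payload
                packet_pending = False
            # else: packet was cancelled earlier; nothing is unmapped here
            # (modeling assumption)
        elif event == "response_handled" and packet_pending:
            # Transaction finished before the tasklet ran: packet cancelled.
            if patched:
                mapped -= 1
            packet_pending = False
    return mapped


in_order = leaked_mappings("split", ["at_req_tasklet", "response_handled"])
reordered = leaked_mappings("split", ["response_handled", "at_req_tasklet"])
unified = leaked_mappings("unified", ["response_handled", "at_req_tasklet"])
reordered_patched = leaked_mappings(
    "split", ["response_handled", "at_req_tasklet"], patched=True)
```

In this model only the reordered split transaction without the patch leaves a mapping behind; the other three cases return 0.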

Comment 10 Emmanuel Kowalski 2008-12-10 07:42:07 UTC
I will try to test the patch today and report on the outcome.

Comment 11 Emmanuel Kowalski 2008-12-10 10:34:13 UTC
I have built a new kernel (on Fedora 9) with the proposed patch and it seems to work: I was able to run the badblocks command for about 40 minutes without the error messages appearing (previously, it would stop after about 1 or 2 minutes). It didn't complete because (as I suspected) many bad blocks were indeed found at that point.

(I got the following in the log, the last message being repeated constantly, but I assume this is because of these problems and is unrelated to the other problem: 

Dec 10 11:28:50 localhost kernel: firewire_sbp2: fw1.1: sbp2_scsi_abort
Dec 10 11:28:50 localhost kernel: sd 5:0:0:0: rejecting I/O to offline device
)

By the way, the latest kernel has about 4 times better throughput than the latest 2.6.26 when accessing this external disk: congratulations to all developers for this impressive work!

Comment 12 Stefan Richter 2008-12-10 11:58:36 UTC
> I was able to run the badblocks command for about 40 minutes without the
> error messages appearing (previously, it would stop after about 1 or 2
> minutes)

Great.  Thanks for the report and for testing the patch.

> it didn't complete because (as I suspected) many bad blocks were
> indeed found at that point
> 
> (I got the following in the log, the last message being repeated
> constantly, but I assume this is because of these problems and is
> unrelated to the other problem:
> 
> Dec 10 11:28:50 localhost kernel: firewire_sbp2: fw1.1: sbp2_scsi_abort
> Dec 10 11:28:50 localhost kernel: sd 5:0:0:0: rejecting I/O to offline
> device

You could attach the kernel log from firewire-sbp2's login until the device is put offline; maybe there are clues about the nature of this other problem.
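
One way to cut out that part of the log is sketched below in Python. The helper name slice_login_to_offline is made up, and the two marker strings are taken from the messages quoted in this report; a different device or kernel may print different text.

```python
def slice_login_to_offline(
    lines,
    start_marker="firewire_sbp2: fw1.1: logged in to LUN",
    end_marker="rejecting I/O to offline device",
):
    """Return the lines from the last login before the first offline
    message through that offline message.

    Returns an empty list if either marker is missing.
    """
    end = next((i for i, line in enumerate(lines) if end_marker in line), None)
    if end is None:
        return []
    # Walk backwards from the offline message to the nearest login line.
    for start in range(end, -1, -1):
        if start_marker in lines[start]:
            return lines[start:end + 1]
    return []


# Sample lines taken from this report (timestamps omitted).
log = [
    "firewire_core: created device fw1: GUID 0090a990e011379d, S400",
    "firewire_sbp2: fw1.1: logged in to LUN 0000 (0 retries)",
    "firewire_sbp2: fw1.1: sbp2_scsi_abort",
    "sd 5:0:0:0: rejecting I/O to offline device",
    "sd 5:0:0:0: rejecting I/O to offline device",
]
excerpt = slice_login_to_offline(log)
```

The excerpt stops at the first "offline" line, so the constantly repeated message is not included more than once.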

Comment 13 Chuck Ebbert 2008-12-14 08:40:05 UTC
Fix is in F9 kernel 2.6.27.8-63

Comment 14 Fedora Update System 2008-12-17 16:22:08 UTC
kernel-2.6.27.9-73.fc9 has been submitted as an update for Fedora 9.
http://admin.fedoraproject.org/updates/kernel-2.6.27.9-73.fc9

Comment 15 Fedora Update System 2008-12-21 08:22:37 UTC
kernel-2.6.27.9-73.fc9 has been pushed to the Fedora 9 testing repository.  If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
su -c 'yum --enablerepo=updates-testing-newkey update kernel'.
You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F9/FEDORA-2008-11618

Comment 16 Fedora Update System 2008-12-24 18:47:33 UTC
kernel-2.6.27.9-73.fc9 has been pushed to the Fedora 9 stable repository.  If problems still persist, please make note of it in this bug report.

