Bug 475156

Summary: Out of SW-IOMMU space: External hard disk inaccessible
Product: [Fedora] Fedora
Reporter: Emmanuel Kowalski <emmanuel.kowalski>
Component: kernel
Assignee: Jarod Wilson <jarod>
Status: CLOSED NEXTRELEASE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: low
Version: 9
CC: fedora-bugs, kernel-maint, quintela, stefan-r-rhbz
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Doc Type: Bug Fix
: 513827 (view as bug list)
Last Closed: 2008-12-24 18:47:47 UTC
Attachments (description, flags):
- lspci -vv (flags: none)
- Output of lsmod (flags: none)
- [patch] firewire: fw-ohci: fix possible IOMMU resource exhaustion (flags: none)

Description Emmanuel Kowalski 2008-12-08 08:52:36 UTC
Created attachment 326099
lspci -vv

Description of problem:
A FireWire-connected external hard disk becomes inaccessible after a certain amount of time. From then on, any access produces the following error message in the log:

Dec  8 09:42:05 localhost kernel: DMA: Out of SW-IOMMU space for 4096 bytes at device 0000:15:00.1

Version-Release number of selected component (if applicable): 

kernel-2.6.27.5-41.fc9.x86_64.rpm

Linux zeta 2.6.27.5-41.fc9.x86_64 #1 SMP Thu Nov 13 20:29:07 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
Always on my ThinkPad T61p.

Steps to Reproduce:
1. Connect external hard drive (WD 3200JS, Firewire 400)
firewire_core: created device fw1: GUID 0090a990e011379d, S400
firewire_core: phy config: card 0, new root=ffc0, gap_count=5
scsi5 : SBP-2 IEEE-1394
firewire_sbp2: fw1.1: logged in to LUN 0000 (0 retries)
firewire_sbp2: fw1.1: sbp2_scsi_abort
scsi 5:0:0:0: Direct-Access     WD       3200JS External  106a PQ: 0 ANSI: 4
sd 5:0:0:0: [sdb] 625142448 512-byte hardware sectors (320073 MB)

2. Perform disk IO, e.g., run badblocks on the disk
  
Actual results:
After a variable amount of time, every disk access fails with the error message

Dec  8 09:42:05 localhost kernel: DMA: Out of SW-IOMMU space for 4096 bytes at device 0000:15:00.1

Expected results:
Disk functions normally.

Additional info:
I also tested the older kernel

kernel-2.6.26.6-79.fc9.x86_64.rpm

with the same behavior. 

The SATA controller is configured in the BIOS in AHCI mode. I will attach the output of lspci -vv and lsmod.

I suspect the disk may have hardware problems, but there was no message to that effect in the log before the SW-IOMMU errors.

Comment 1 Emmanuel Kowalski 2008-12-08 08:53:22 UTC
Created attachment 326100
Output of lsmod

Comment 2 Emmanuel Kowalski 2008-12-09 08:29:21 UTC
I have checked that the problem is also present on the Fedora 10 LiveCD.

Comment 3 Stefan Richter 2008-12-09 20:15:41 UTC
"Out of SW-IOMMU space" is a kernel bug (driver bug), not a disk problem.  Alas I currently have no PC which uses the SW-IOMMU or a real IOMMU to try to reproduce the issue.

Comment 4 Jarod Wilson 2008-12-09 20:32:45 UTC
Hm... I hit this one in the past, but it took quite a while to reproduce it, and I've not seen it in a while...

A few questions:
-How much memory is in your T61?
-How quickly can you reproduce the bug running the live CD?
-What is it you're doing to reproduce the bug?

I've actually got a T61 myself with the same firewire controller and 4GB of RAM that I can try to reproduce on...

Comment 5 Stefan Richter 2008-12-09 20:50:28 UTC
I just started to look through the code for potential dma_map_ imbalances, beginning with fw-ohci.

It seems ohci_cancel_packet() lacks dma_unmap_single(...payload...).  This code is executed whenever an outbound transaction was finished before the AT req DMA context tasklet was executed.
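
As a rough illustration of the imbalance described here, the following toy model (not kernel code) shows how a mapping that is never released on one code path slowly drains a fixed-size pool. The class BouncePool, the function packets_sent_before_exhaustion, the pool size of 64 slots, and the "every 10th packet is cancelled" rate are all assumptions made up for this sketch; only the idea that the cancel path skipped the payload unmap comes from the comment above.

```python
class BouncePool:
    """Toy stand-in for a fixed-size pool of DMA bounce slots (hypothetical)."""

    def __init__(self, slots):
        self.free = slots

    def map_single(self):
        # Mirrors the failure mode in the report: no slot left, mapping fails.
        if self.free == 0:
            raise MemoryError("out of bounce-buffer slots")
        self.free -= 1

    def unmap_single(self):
        self.free += 1


def packets_sent_before_exhaustion(slots, packets, cancel_every, unmap_on_cancel):
    """Send packets; every `cancel_every`-th one takes the cancel path.

    Returns how many packets were sent before the pool ran dry, or
    `packets` if it never did.
    """
    pool = BouncePool(slots)
    for sent in range(packets):
        try:
            pool.map_single()        # payload is mapped when the packet is queued
        except MemoryError:
            return sent
        cancelled = (sent + 1) % cancel_every == 0
        if not cancelled or unmap_on_cancel:
            pool.unmap_single()      # normal completion, or a cancel path that unmaps
    return packets


# Assumed numbers, chosen only to make the effect visible.
leaky = packets_sent_before_exhaustion(64, 10_000, cancel_every=10, unmap_on_cancel=False)
fixed = packets_sent_before_exhaustion(64, 10_000, cancel_every=10, unmap_on_cancel=True)
print("leaky cancel path:", leaky, "packets; unmapping cancel path:", fixed, "packets")
```

In this model the leaky variant stops after a bounded number of packets no matter how large the pool is, while the variant that unmaps on cancel runs to the end, which matches the "works for a while, then every access fails" symptom.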

Comment 6 Emmanuel Kowalski 2008-12-09 21:15:30 UTC
Here is the requested information:

-How much memory is in your T61?

4 GB

-How quickly can you reproduce the bug running the live CD?
-What is it you're doing to reproduce the bug?

It is quite quick: I connect the external drive, and then run 

# badblocks -v /dev/sdb1

and after about 1 or 2 minutes (about 1% of the 320 GB disk being checked), the log starts to fill up with the error message.
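
To see how quickly the log fills up, one could count the error messages per device. The sketch below is only an illustration: the regular expression is modelled on the single log line quoted in this report, and the function name count_iommu_errors and the sample list are hypothetical.

```python
import re
from collections import Counter

# Pattern modelled on the log line quoted in this report.
PATTERN = re.compile(
    r"DMA: Out of SW-IOMMU space for (?P<size>\d+) bytes at device (?P<dev>\S+)"
)


def count_iommu_errors(lines):
    """Count 'Out of SW-IOMMU space' messages per PCI device address."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts


# Two lines taken from this report: one matching, one unrelated.
sample = [
    "Dec  8 09:42:05 localhost kernel: DMA: Out of SW-IOMMU space "
    "for 4096 bytes at device 0000:15:00.1",
    "Dec 10 11:28:50 localhost kernel: sd 5:0:0:0: rejecting I/O to offline device",
]
print(count_iommu_errors(sample))
```

The same function accepts any iterable of lines, so an open syslog file (for example /var/log/messages, a path assumed here) can be passed in directly.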

The drive also has a USB port, but I haven't tried to see what happens if I connect it in this way (I need to find the correct USB cable first).  I'll try to do this tomorrow.

Comment 7 Stefan Richter 2008-12-09 21:46:05 UTC
I added payload unmapping to ohci_cancel_packet and it gets called frequently on my ICH7 + FW323 based Mac mini with a FireWire disk.  This is obviously due to fw-sbp2's ORB pointer writes.

Stress test commences now, patch will follow.

Comment 8 Stefan Richter 2008-12-09 23:26:45 UTC
Created attachment 326429
[patch] firewire: fw-ohci: fix possible IOMMU resource exhaustion

Someone who can reproduce the bug please test the patch.
Posting: http://lkml.org/lkml/2008/12/9/333

Comment 9 Stefan Richter 2008-12-10 07:36:59 UTC
> Someone who can reproduce the bug please test the patch.

PS:
It can only be reproduced if the responder node, e.g. an SBP-2 hard disk, concludes the transactions as split transactions ( = request, ack-pending, response, ack-complete) and if the tasklet timing happens to be ordered as mentioned in the patch.  Unified transactions ( = request, ack-complete) cannot cause the DMA mapping leak.
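
The conditions can be written down as a small decision table. This is one reading of comments 5 and 9 rather than the driver's actual logic: the assumption is that a split transaction whose response is processed before the AT request tasklet has handled the sent packet ends up in the cancel path, and that this path leaks only if it does not unmap the payload. The names completion_path and leaks_mapping are hypothetical.

```python
from itertools import product


def completion_path(transaction, at_tasklet_runs_first):
    """Return which path finishes the outbound request packet in this toy model.

    transaction: "unified" (request, ack-complete) or
                 "split" (request, ack-pending, response, ack-complete).
    at_tasklet_runs_first: whether the AT request tasklet handles the sent
                 packet before the response is processed.
    """
    if transaction == "unified":
        # No separate response exists, so only the AT tasklet can finish it.
        return "at_tasklet"
    if transaction == "split":
        # Response handled first: the transaction is already finished, so the
        # still-queued packet is cancelled instead of completed normally.
        return "at_tasklet" if at_tasklet_runs_first else "cancel_packet"
    raise ValueError(f"unknown transaction type: {transaction!r}")


def leaks_mapping(transaction, at_tasklet_runs_first, cancel_unmaps):
    """A mapping leaks only on a cancel path that skips the unmap."""
    path = completion_path(transaction, at_tasklet_runs_first)
    return path == "cancel_packet" and not cancel_unmaps


for transaction, at_first in product(("unified", "split"), (True, False)):
    leak = leaks_mapping(transaction, at_first, cancel_unmaps=False)
    print(f"{transaction:8} at_tasklet_first={at_first!s:5} leak={leak}")
```

Under these assumptions only one of the four combinations leaks, which is consistent with the bug depending on both the bridge chip's behaviour and on timing.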

I got frequent ohci_cancel_packet with an OXUF924DSB.  Emmanuel's WD 3200JS is a SATA disk, hence it is quite likely that his bridge is an OXUF92?DS? too.  But I guess that many other bridge chips conclude ORB pointer writes as split transactions as well.

On the other hand, hitting non-coherent memory on a 4 GB PC may not be as likely.

Comment 10 Emmanuel Kowalski 2008-12-10 07:42:07 UTC
I will try to test the patch today and report on the outcome.

Comment 11 Emmanuel Kowalski 2008-12-10 10:34:13 UTC
I have built a new kernel (on Fedora 9) with the proposed patch and it seems to work: I was able to run the badblocks command for about 40 minutes without the error messages appearing (previously, it would stop after about 1 or 2 minutes); it didn't complete because (as I suspected) many bad blocks were indeed found at that point

(I got the following in the log, the last message being repeated constantly, but I assume this is because of these problems and is unrelated to the other problem: 

Dec 10 11:28:50 localhost kernel: firewire_sbp2: fw1.1: sbp2_scsi_abort
Dec 10 11:28:50 localhost kernel: sd 5:0:0:0: rejecting I/O to offline device
)

By the way, the latest kernel has about 4 times the throughput of the latest 2.6.26 when accessing this external disk: congratulations to all developers for this impressive work!

Comment 12 Stefan Richter 2008-12-10 11:58:36 UTC
> I was able to run the badblocks command for about 40 minutes without the
> error messages appearing (previously, it would stop after about 1 or 2
> minutes)

Great.  Thanks for the report and for testing the patch.

> it didn't complete because (as I suspected) many bad blocks were
> indeed found at that point
> 
> (I got the following in the log, the last message being repeated
> constantly, but I assume this is because of these problems and is
> unrelated to the other problem:
> 
> Dec 10 11:28:50 localhost kernel: firewire_sbp2: fw1.1: sbp2_scsi_abort
> Dec 10 11:28:50 localhost kernel: sd 5:0:0:0: rejecting I/O to offline
> device

You could attach the kernel log from firewire-sbp2's login until the device is put offline; maybe there are clues about the nature of this other problem.

Comment 13 Chuck Ebbert 2008-12-14 08:40:05 UTC
Fix is in F9 kernel 2.6.27.8-63

Comment 14 Fedora Update System 2008-12-17 16:22:08 UTC
kernel-2.6.27.9-73.fc9 has been submitted as an update for Fedora 9.
http://admin.fedoraproject.org/updates/kernel-2.6.27.9-73.fc9

Comment 15 Fedora Update System 2008-12-21 08:22:37 UTC
kernel-2.6.27.9-73.fc9 has been pushed to the Fedora 9 testing repository.  If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
 su -c 'yum --enablerepo=updates-testing-newkey update kernel'
You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F9/FEDORA-2008-11618

Comment 16 Fedora Update System 2008-12-24 18:47:33 UTC
kernel-2.6.27.9-73.fc9 has been pushed to the Fedora 9 stable repository.  If problems still persist, please make note of it in this bug report.