Created attachment 326099 [details]
lspci -vv

Description of problem:
A firewire-connected external hard disk becomes inaccessible after a certain amount of time, any access leading to the error message

  Dec 8 09:42:05 localhost kernel: DMA: Out of SW-IOMMU space for 4096 bytes at device 0000:15:00.1

in the log.

Version-Release number of selected component (if applicable):
kernel-2.6.27.5-41.fc9.x86_64.rpm
Linux zeta 2.6.27.5-41.fc9.x86_64 #1 SMP Thu Nov 13 20:29:07 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
Always on my Thinkpad T61p.

Steps to Reproduce:
1. Connect external hard drive (WD 3200JS, Firewire 400)

  firewire_core: created device fw1: GUID 0090a990e011379d, S400
  firewire_core: phy config: card 0, new root=ffc0, gap_count=5
  scsi5 : SBP-2 IEEE-1394
  firewire_sbp2: fw1.1: logged in to LUN 0000 (0 retries)
  firewire_sbp2: fw1.1: sbp2_scsi_abort
  scsi 5:0:0:0: Direct-Access WD 3200JS External 106a PQ: 0 ANSI: 4
  sd 5:0:0:0: [sdb] 625142448 512-byte hardware sectors (320073 MB)

2. Perform disk IO, e.g., run badblocks on the disk

Actual results:
After a (variable) amount of time, any disk access fails with error message

  Dec 8 09:42:05 localhost kernel: DMA: Out of SW-IOMMU space for 4096 bytes at device 0000:15:00.1

Expected results:
Disk functions normally.

Additional info:
I also tested the older kernel kernel-2.6.26.6-79.fc9.x86_64.rpm, with the same behavior. The SATA controller is configured in the BIOS in AHCI mode. I will attach the output of lspci -vv and lsmod. I suspect the disk may have hardware problems, but there was no message to that effect in the log before the SW-IOMMU errors.
Created attachment 326100 [details] Output of lsmod
I have checked that the problem is also present on the Fedora 10 LiveCD.
"Out of SW-IOMMU space" is a kernel bug (driver bug), not a disk problem. Alas I currently have no PC which uses the SW-IOMMU or a real IOMMU to try to reproduce the issue.
Hm... I hit this one in the past, but it took quite a while to reproduce it, and I've not seen it in a while...

A few questions:
- How much memory is in your T61?
- How quickly can you reproduce the bug running the live CD?
- What is it you're doing to reproduce the bug?

I've actually got a T61 myself with the same firewire controller and 4GB of RAM that I can try to reproduce on...
I just started to look through the code for potential dma_map_ imbalances, beginning with fw-ohci. It seems ohci_cancel_packet() lacks dma_unmap_single(...payload...). This code is executed whenever an outbound transaction was finished before the AT req DMA context tasklet was executed.
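To illustrate why an unbalanced dma_map_/dma_unmap_ pair eventually ends in an "out of space" failure, here is a minimal user-space sketch in C. It is not fw-ohci code: the pool, POOL_SLOTS, struct packet, and every function name below are hypothetical stand-ins, and the pool is far smaller than any real bounce-buffer area. It only models a cancel path that may or may not release the payload mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for a bounded set of DMA bounce slots. */
#define POOL_SLOTS 8

struct pool {
    bool used[POOL_SLOTS];
};

/* Model of "map": returns a slot index, or -1 when the pool is exhausted. */
static int pool_map(struct pool *p)
{
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (!p->used[i]) {
            p->used[i] = true;
            return i;
        }
    }
    return -1;
}

/* Model of "unmap": gives the slot back to the pool. */
static void pool_unmap(struct pool *p, int slot)
{
    assert(slot >= 0 && slot < POOL_SLOTS && p->used[slot]);
    p->used[slot] = false;
}

/* Hypothetical packet that owns at most one payload mapping. */
struct packet {
    int payload_slot; /* -1 when no payload is mapped */
};

/* Normal completion path: always releases the payload mapping. */
static void complete_packet(struct pool *p, struct packet *pkt)
{
    pool_unmap(p, pkt->payload_slot);
    pkt->payload_slot = -1;
}

/*
 * Cancel path. unmap_on_cancel == false models the suspected imbalance:
 * the packet is dropped, but its slot is never returned to the pool.
 */
static void cancel_packet(struct pool *p, struct packet *pkt,
                          bool unmap_on_cancel)
{
    if (unmap_on_cancel)
        pool_unmap(p, pkt->payload_slot);
    pkt->payload_slot = -1;
}

/*
 * Sends up to n packets. Every cancel_every-th packet (0 = never) takes the
 * cancel path, all others complete normally. Returns how many packets could
 * be mapped before the pool ran out.
 */
int run_transfers(int n, int cancel_every, bool unmap_on_cancel)
{
    struct pool p;
    int mapped = 0;

    memset(&p, 0, sizeof p);
    for (int i = 0; i < n; i++) {
        struct packet pkt = { .payload_slot = pool_map(&p) };

        if (pkt.payload_slot < 0)
            break; /* analogous to the "out of space" error */
        mapped++;
        if (cancel_every > 0 && i % cancel_every == 0)
            cancel_packet(&p, &pkt, unmap_on_cancel);
        else
            complete_packet(&p, &pkt);
    }
    return mapped;
}
```

Under these assumptions, run_transfers(100, 1, false) stops after POOL_SLOTS mappings, while run_transfers(100, 1, true) maps all 100. A rarer cancel path only delays the failure, which would fit a bug that shows up after a variable amount of I/O.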
Here is the requested information:

> -How much memory is in your T61?

4 GB

> -How quickly can you reproduce the bug running the live CD?
> -What is it you're doing to reproduce the bug?

It is quite quick: I connect the external drive, and then run

  # badblocks -v /dev/sdb1

and after about 1 or 2 minutes (about 1% of the 320 GB disk being checked), the log starts to fill up with the error message.

The drive also has a USB port, but I haven't tried to see what happens if I connect it that way (I need to find the correct USB cable first). I'll try to do this tomorrow.
I added payload unmapping to ohci_cancel_packet and it gets called frequently on my ICH7 + FW323 based Mac mini with a FireWire disk. This is obviously due to fw-sbp2's ORB pointer writes. Stress test commences now, patch will follow.
Created attachment 326429 [details]
[patch] firewire: fw-ohci: fix possible IOMMU resource exhaustion

Someone who can reproduce the bug please test the patch.

Posting: http://lkml.org/lkml/2008/12/9/333
> Someone who can reproduce the bug please test the patch.

PS: It can only be reproduced if the responder node, e.g. an SBP-2 hard disk, concludes the transactions as split transactions ( = request, ack-pending, response, ack-complete) and if the tasklet timing happens to be ordered as mentioned in the patch. Unified transactions ( = request, ack-complete) cannot cause the DMA mapping leak.

I got frequent ohci_cancel_packet calls with an OXUF924DSB. Emmanuel's WD 3200JS is a SATA disk, hence it is quite likely that his bridge is an OXUF92?DS? too. But I guess that many other bridge chips conclude ORB pointer writes as split transactions as well. On the other hand, hitting non-coherent memory on a 4 GB PC may not be as likely.
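The conditions above can be written down as a small decision function. This is only a sketch of the reasoning in this comment, not driver code: enum txn_kind, mapping_leaks(), and its parameters are hypothetical names, and the model assumes the AT tasklet is the only other place where the payload gets unmapped.

```c
#include <stdbool.h>

/* Hypothetical labels for the two ways a responder can conclude a request. */
enum txn_kind {
    TXN_UNIFIED, /* request, ack-complete */
    TXN_SPLIT    /* request, ack-pending, response, ack-complete */
};

/*
 * Returns true if, in this simplified model, the payload mapping of a
 * request is never released.
 *
 * response_before_at_tasklet: the response was processed before the AT
 *     request tasklet got to run for this packet.
 * cancel_unmaps: the cancel path releases the payload mapping
 *     (false models the behavior before the patch).
 */
bool mapping_leaks(enum txn_kind kind, bool response_before_at_tasklet,
                   bool cancel_unmaps)
{
    /* Unified transaction: there is no separate response that could
     * overtake the AT tasklet, so the cancel path is not involved. */
    if (kind == TXN_UNIFIED)
        return false;

    /* Split transaction, AT tasklet ran first: it releases the mapping
     * and the later response finds nothing left to cancel. */
    if (!response_before_at_tasklet)
        return false;

    /* Split transaction, response handled first: the packet is cancelled,
     * and the mapping survives only if the cancel path skips the unmap. */
    return !cancel_unmaps;
}
```

Only one of the eight input combinations returns true: a split transaction whose response is handled first, with a cancel path that does not unmap.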
I will try to test the patch today and report on the outcome.
I have built a new kernel (on Fedora 9) with the proposed patch and it seems to work: I was able to run the badblocks command for about 40 minutes without the error messages appearing (previously, it would stop after about 1 or 2 minutes); it didn't complete because (as I suspected) many bad blocks were indeed found at that point

(I got the following in the log, the last message being repeated constantly, but I assume this is because of these problems and is unrelated to the other problem:

  Dec 10 11:28:50 localhost kernel: firewire_sbp2: fw1.1: sbp2_scsi_abort
  Dec 10 11:28:50 localhost kernel: sd 5:0:0:0: rejecting I/O to offline device
)

By the way, the latest kernel has about 4 times better throughput than the latest 2.6.26 when accessing this external disk: congratulations to all developers for this impressive work!
> I was able to run the badblocks command for about 40 minutes without the
> error messages appearing (previously, it would stop after about 1 or 2
> minutes)

Great. Thanks for the report and for testing the patch.

> it didn't complete because (as I suspected) many bad blocks were
> indeed found at that point
>
> (I got the following in the log, the last message being repeated
> constantly, but I assume this is because of these problems and is
> unrelated to the other problem:
>
> Dec 10 11:28:50 localhost kernel: firewire_sbp2: fw1.1: sbp2_scsi_abort
> Dec 10 11:28:50 localhost kernel: sd 5:0:0:0: rejecting I/O to offline
> device

You could attach the kernel log from firewire-sbp2's login until the device is put offline; maybe there are clues about the nature of this other problem.
Fix is in F9 kernel 2.6.27.8-63
kernel-2.6.27.9-73.fc9 has been submitted as an update for Fedora 9. http://admin.fedoraproject.org/updates/kernel-2.6.27.9-73.fc9
kernel-2.6.27.9-73.fc9 has been pushed to the Fedora 9 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing-newkey update kernel'. You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F9/FEDORA-2008-11618
kernel-2.6.27.9-73.fc9 has been pushed to the Fedora 9 stable repository. If problems still persist, please make note of it in this bug report.