+++ This bug was initially created as a clone of Bug #475156 +++

Created an attachment (id=326099)
lspci -vv

Description of problem:
A firewire-connected external hard disk becomes inaccessible after a certain amount of time, any access leading to the error message

Dec 8 09:42:05 localhost kernel: DMA: Out of SW-IOMMU space for 4096 bytes at device 0000:15:00.1

in the log.

Version-Release number of selected component (if applicable):
kernel-2.6.27.5-41.fc9.x86_64.rpm
Linux zeta 2.6.27.5-41.fc9.x86_64 #1 SMP Thu Nov 13 20:29:07 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
Always on my Thinkpad T61p.

Steps to Reproduce:
1. Connect external hard drive (WD 3200JS, Firewire 400)
firewire_core: created device fw1: GUID 0090a990e011379d, S400
firewire_core: phy config: card 0, new root=ffc0, gap_count=5
scsi5 : SBP-2 IEEE-1394
firewire_sbp2: fw1.1: logged in to LUN 0000 (0 retries)
firewire_sbp2: fw1.1: sbp2_scsi_abort
scsi 5:0:0:0: Direct-Access WD 3200JS External 106a PQ: 0 ANSI: 4
sd 5:0:0:0: [sdb] 625142448 512-byte hardware sectors (320073 MB)
2. Perform disk IO, e.g., run badblocks on the disk
3.

Actual results:
After a (variable) amount of time, any disk access fails with error message

Dec 8 09:42:05 localhost kernel: DMA: Out of SW-IOMMU space for 4096 bytes at device 0000:15:00.1

Expected results:
Disk functions normally.

Additional info:
I tested also the older kernel kernel-2.6.26.6-79.fc9.x86_64.rpm with the same behavior. The SATA controller is configured in the BIOS in AHCI mode. I will attach the output of lspci -vv and lsmod. I suspect the disk may have hardware problems, but there was no message to that effect in the log before the SW-IOMMU errors.

--- Additional comment from stefan-r-rhbz.de on 2008-12-09 15:50:28 EDT ---

I just started to look through the code for potential dma_map_ imbalances, beginning with fw-ohci. It seems ohci_cancel_packet() lacks dma_unmap_single(...payload...). This code is executed whenever an outbound transaction was finished before the AT req DMA context tasklet was executed.

--- Additional comment from stefan-r-rhbz.de on 2008-12-09 18:26:45 EDT ---

Created an attachment (id=326429)
[patch] firewire: fw-ohci: fix possible IOMMU resource exhaustion

Someone who can reproduce the bug please test the patch.
Posting: http://lkml.org/lkml/2008/12/9/333

--- Additional comment from stefan-r-rhbz.de on 2008-12-10 02:36:59 EDT ---

> Someone who can reproduce the bug please test the patch.

PS: It can only be reproduced if the responder node, e.g. SBP-2 harddisk, concludes the transactions as split transactions ( = request, ack-pending, response, ack-complete) and if the tasklet timing happens to be ordered as mentioned in the patch. Unified transactions ( = request, ack-complete) cannot cause the DMA mappings leak.

I got frequent ohci_cancel_packet with an OXUF924DSB. Emmanuel's WD 3200JS is a SATA disk, hence it is quite likely that his bridge is an OXUF92?DS? too. But I guess that many other bridge chips conclude ORB pointer writes as split transactions as well. On the other hand, hitting non-coherent memory on a 4 GB PC may not be as likely.
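For readers following the analysis above, here is a minimal userspace sketch of how a cancel path that skips the unmap step can drain a bounded mapping pool. It is not the fw-ohci code and not the attached patch: the pool size, the struct layouts, and every sketch_* name are hypothetical, chosen only to model a dma_map_single()/dma_unmap_single() imbalance.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a bounded IOMMU / swiotlb pool. */
#define POOL_SLOTS 8

struct pool {
    int used;              /* slots currently held by live mappings */
};

struct packet {
    bool payload_mapped;   /* true while the payload holds a pool slot */
};

/* Models dma_map_single(): fails once the pool is exhausted. */
static bool sketch_map(struct pool *p, struct packet *pkt)
{
    if (p->used == POOL_SLOTS)
        return false;
    p->used++;
    pkt->payload_mapped = true;
    return true;
}

/* Models dma_unmap_single(): gives the slot back to the pool. */
static void sketch_unmap(struct pool *p, struct packet *pkt)
{
    if (pkt->payload_mapped) {
        p->used--;
        pkt->payload_mapped = false;
    }
}

/*
 * Models the cancel path. With unmap_on_cancel == false the slot is
 * never returned, which is the imbalance suspected in this report.
 */
static void sketch_cancel(struct pool *p, struct packet *pkt,
                          bool unmap_on_cancel)
{
    if (unmap_on_cancel)
        sketch_unmap(p, pkt);
    /* The packet is dropped here; nothing unmaps it later. */
}

/* Issues up to n requests that all get cancelled; returns how many
 * could be mapped before the pool ran dry. */
static int sketch_run(int n, bool unmap_on_cancel)
{
    struct pool p = { 0 };
    int mapped = 0;

    for (int i = 0; i < n; i++) {
        struct packet pkt = { false };

        if (!sketch_map(&p, &pkt))
            break;         /* the "Out of SW-IOMMU space" situation */
        mapped++;
        sketch_cancel(&p, &pkt, unmap_on_cancel);
    }
    return mapped;
}

#ifdef SKETCH_STANDALONE
int main(void)
{
    printf("leaky cancel:   %d requests mapped\n", sketch_run(100, false));
    printf("unmapping cancel: %d requests mapped\n", sketch_run(100, true));
    return 0;
}
#endif
```

Compiled with -DSKETCH_STANDALONE, the leaky variant stops after POOL_SLOTS requests while the unmapping variant serves all of them. In the real driver only some transactions take the cancel path (split transactions with unlucky tasklet timing), so the pool would drain more slowly than in this sketch.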
Created attachment 355191
fix firewire iommu mapping leak
I don't see why this won't make 5.5, either as a standalone patch, or as part of a re-backport of the current upstream firewire stack.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Hello! Are we able to reproduce this in-house? Could we get confirmation on testing this from customer side?
Afaik this is fairly easily reproduced with firewire storage. I don't have any suitable hardware available here to test with but I can ask around. I'll also check with the original reporter to see if they are willing to verify.
FireWire on boxes which do have an IOMMU or use swiotlb should indeed exhaust IOMMU resources rather quickly without the fix, since the respective DMA mappings happen very frequently during asynchronous FireWire I/O. Both the bug and the fix are rather obvious in hindsight, I'd say. The fix has one problem though: the test for "was this DMA-mapped?" may yield a false negative on certain rare architectures under a very unlikely condition, namely when a payload was mapped to bus address zero. I therefore posted a suggested fix for the fix just now: http://lkml.org/lkml/2009/10/14/362
PS: "the respective DMA mappings happen very frequently during asynchronous FireWire I/O" == e.g. one of these mappings for each SCSI request on behalf of firewire-sbp2, one of these for each IP datagram on behalf of firewire-net.
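The false negative described above can be illustrated with a small userspace sketch. It compares a "was this DMA-mapped?" test that treats bus address zero as "not mapped" with one that keeps an explicit flag. The types and names (bus_addr_t, struct pkt, pkt_map, the two needs_unmap_* helpers) are hypothetical, and the flag is shown only as one possible way to avoid the ambiguity, not necessarily what the posted follow-up does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel's dma_addr_t. */
typedef uint64_t bus_addr_t;

struct pkt {
    bus_addr_t payload_bus;   /* bus address handed back by the mapping */
    bool payload_mapped;      /* explicit "a mapping exists" marker */
};

/* Records a successful mapping at the given bus address. */
static void pkt_map(struct pkt *p, bus_addr_t addr)
{
    p->payload_bus = addr;
    p->payload_mapped = true;
}

/* Address-as-sentinel test: wrong if zero is a valid bus address. */
static bool needs_unmap_by_address(const struct pkt *p)
{
    return p->payload_bus != 0;
}

/* Flag-based test: independent of the address value. */
static bool needs_unmap_by_flag(const struct pkt *p)
{
    return p->payload_mapped;
}

#ifdef SKETCH_STANDALONE
int main(void)
{
    struct pkt p = { 0, false };

    pkt_map(&p, 0);           /* payload mapped at bus address zero */
    printf("by address: %d, by flag: %d\n",
           needs_unmap_by_address(&p), needs_unmap_by_flag(&p));
    return 0;
}
#endif
```

With a payload mapped at bus address zero, the address-based test reports that nothing needs unmapping, so the mapping would leak, while the flag-based test still reports it. For any non-zero address both tests agree.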
in kernel-2.6.18-170.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~ RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here, by March 3rd 2010 (2010-03-03) or sooner. Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value. If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug per each request and escalate through your support representative.
I've not managed to reproduce this so far but can attempt to borrow some FW hardware to do so.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0178.html