
Bug 513827

Summary: Out of SW-IOMMU space: External hard disk inaccessible
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.3
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Target Milestone: rc
Reporter: Bryn M. Reeves <bmr>
Assignee: Jay Fenlason <fenlason>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: cward, dzickus, fedora-bugs, jfeeney, jtluka, kernel-maint, peterm, qcai, stefan-r-rhbz, tao
Doc Type: Bug Fix
Clone Of: 475156
Bug Blocks: 499522, 525215, 533192
Last Closed: 2010-03-30 07:15:21 UTC
Attachments: fix firewire iommu mapping leak (flags: none)

Description Bryn M. Reeves 2009-07-26 12:08:20 UTC
+++ This bug was initially created as a clone of Bug #475156 +++

Created an attachment (id=326099)
lspci -vv

Description of problem:
A FireWire-connected external hard disk becomes inaccessible after a certain amount of time. Any access then produces the following error message in the log:

Dec  8 09:42:05 localhost kernel: DMA: Out of SW-IOMMU space for 4096 bytes at device 0000:15:00.1

Version-Release number of selected component (if applicable): 

kernel-2.6.27.5-41.fc9.x86_64.rpm

Linux zeta 2.6.27.5-41.fc9.x86_64 #1 SMP Thu Nov 13 20:29:07 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
Always on my Thinkpad T61p.

Steps to Reproduce:
1. Connect external hard drive (WD 3200JS, Firewire 400)
firewire_core: created device fw1: GUID 0090a990e011379d, S400
firewire_core: phy config: card 0, new root=ffc0, gap_count=5
scsi5 : SBP-2 IEEE-1394
firewire_sbp2: fw1.1: logged in to LUN 0000 (0 retries)
firewire_sbp2: fw1.1: sbp2_scsi_abort
scsi 5:0:0:0: Direct-Access     WD       3200JS External  106a PQ: 0 ANSI: 4
sd 5:0:0:0: [sdb] 625142448 512-byte hardware sectors (320073 MB)

2. Perform disk I/O, e.g., run badblocks on the disk.

Actual results:
After a variable amount of time, any disk access fails with the error message

Dec  8 09:42:05 localhost kernel: DMA: Out of SW-IOMMU space for 4096 bytes at device 0000:15:00.1

Expected results:
Disk functions normally.

Additional info:
I also tested the older kernel

kernel-2.6.26.6-79.fc9.x86_64.rpm

with the same behavior. 

The SATA controller is configured in the BIOS in AHCI mode. I will attach the output of lspci -vv and lsmod.

I suspect the disk may have hardware problems, but there was no message to that effect in the log before the SW-IOMMU errors.

--- Additional comment from stefan-r-rhbz.de on 2008-12-09 15:50:28 EDT ---

I just started to look through the code for potential dma_map_ imbalances, beginning with fw-ohci.

It seems ohci_cancel_packet() lacks dma_unmap_single(...payload...).  This code is executed whenever an outbound transaction was finished before the AT req DMA context tasklet was executed.
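
To illustrate why a missing unmap on one code path eventually exhausts the mapping space, here is a minimal userspace model in Python. It is only a sketch and not the driver code: BounceBufferPool, run_transactions, and all parameters are hypothetical names, and the pool is a plain counter standing in for the fixed-size SW-IOMMU area.

```python
class BounceBufferPool:
    """Toy stand-in for a fixed-size SW-IOMMU mapping area (hypothetical model)."""

    def __init__(self, slots):
        self.free = slots

    def map_single(self):
        # Models a DMA mapping request: fails once every slot is taken.
        if self.free == 0:
            raise MemoryError("Out of SW-IOMMU space")
        self.free -= 1

    def unmap_single(self):
        # Models releasing a mapping back to the pool.
        self.free += 1


def run_transactions(pool, count, cancel_every, unmap_on_cancel):
    """Queue `count` packets; every `cancel_every`-th one takes the cancel path.

    Returns how many packets were queued before mapping failed
    (or `count` if the pool never ran dry).
    """
    for sent in range(count):
        try:
            pool.map_single()        # payload is mapped when the packet is queued
        except MemoryError:
            return sent
        cancelled = sent % cancel_every == cancel_every - 1
        if not cancelled:
            pool.unmap_single()      # normal completion path unmaps the payload
        elif unmap_on_cancel:
            pool.unmap_single()      # balanced cancel path: unmap here as well
        # else: the mapping is leaked, as suspected for the cancel path
    return count
```

In this model, with a 4-slot pool and every fifth packet cancelled, the leaky variant (unmap_on_cancel=False) stops after 20 packets, while the balanced variant completes all packets and leaves the pool full. The real numbers depend on the size of the mapping area and on how often the cancel path is taken.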

--- Additional comment from stefan-r-rhbz.de on 2008-12-09 18:26:45 EDT ---

Created an attachment (id=326429)
[patch] firewire: fw-ohci: fix possible IOMMU resource exhaustion

Someone who can reproduce the bug please test the patch.
Posting: http://lkml.org/lkml/2008/12/9/333

--- Additional comment from stefan-r-rhbz.de on 2008-12-10 02:36:59 EDT ---

> Someone who can reproduce the bug please test the patch.

PS:
It can only be reproduced if the responder node, e.g. SBP-2 harddisk, concludes the transactions as split transactions ( = request, ack-pending, response, ack-complete) and if the tasklet timing happens to be ordered as mentioned in the patch.  Unified transactions ( = request, ack-complete) cannot cause the DMA mappings leak.

I got frequent ohci_cancel_packet with an OXUF924DSB.  Emmanuel's WD 3200JS is a SATA disk, hence it is quite likely that his bridge is an OXUF92?DS? too.  But I guess that many other bridge chips conclude ORB pointer writes as split transactions as well.

On the other hand, hitting non-coherent memory on a 4 GB PC may not be as likely.

Comment 2 Bryn M. Reeves 2009-07-26 12:12:33 UTC
Created attachment 355191 [details]
fix firewire iommu mapping leak

Comment 4 Jay Fenlason 2009-08-05 20:06:55 UTC
I don't see why this won't make 5.5, either as a standalone patch, or as part of a re-backport of the current upstream firewire stack.

Comment 5 RHEL Program Management 2009-10-01 19:56:01 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 6 Jan Tluka 2009-10-14 13:55:19 UTC
Hello!
Are we able to reproduce this in-house? Could we get confirmation on testing this from customer side?

Comment 7 Bryn M. Reeves 2009-10-14 14:15:41 UTC
Afaik this is fairly easily reproduced with firewire storage. I don't have any suitable hardware available here to test with but I can ask around. I'll also check with the original reporter to see if they are willing to verify.

Comment 9 Stefan Richter 2009-10-14 19:22:33 UTC
FireWire on boxes which do have an IOMMU or use swiotlb should indeed exhaust IOMMU resources rather quickly without the fix since the respective DMA mappings happen very frequently during asynchronous FireWire I/O.  Both the bug and the fix are rather obvious in hindsight, I'd say.

The fix has one problem, though: the test for "was this DMA-mapped?" may yield a false negative on certain rare architectures under a very unlikely condition, namely when a payload was mapped to bus address zero.  I therefore just posted a suggested fix for the fix:  http://lkml.org/lkml/2009/10/14/362
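
The concern about testing an address against zero can be sketched in a few lines of Python. This is an illustration under assumptions, not the content of the linked patch: Packet, payload_bus, payload_mapped, and the helper functions are hypothetical names, and tracking the mapped state in an explicit flag is just one common way to avoid relying on a sentinel address.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    payload_bus: int = 0           # bus address of the payload (0 by default)
    payload_mapped: bool = False   # explicit "was mapped" flag (hypothetical)


def map_payload(pkt, bus_address):
    """Record a mapping; bus_address is whatever the mapping layer returned."""
    pkt.payload_bus = bus_address
    pkt.payload_mapped = True


def needs_unmap_by_address(pkt):
    """Fragile test: treats bus address 0 as 'never mapped'."""
    return pkt.payload_bus != 0


def needs_unmap_by_flag(pkt):
    """Robust test: consults the explicit flag, whatever the address is."""
    return pkt.payload_mapped
```

If a payload happens to be mapped at bus address zero, needs_unmap_by_address() reports that nothing needs unmapping and the mapping would be leaked, whereas needs_unmap_by_flag() still reports it correctly.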

Comment 10 Stefan Richter 2009-10-14 19:26:58 UTC
PS: "the respective DMA mappings happen very frequently during asynchronous FireWire I/O" == e.g. one of these mappings for each SCSI request on behalf of firewire-sbp2, one of these for each IP datagram on behalf of firewire-net.

Comment 11 Don Zickus 2009-10-21 19:12:26 UTC
The patch is in kernel-2.6.18-170.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 14 Chris Ward 2010-02-11 10:34:26 UTC
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this 
release that addresses your request. Please test and report back results 
here, by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update 
the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set 
this bug into NEED_INFO. If you encounter new defects or have additional 
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.

Comment 15 Bryn M. Reeves 2010-02-11 13:57:19 UTC
I've not managed to reproduce this so far but can attempt to borrow some FW hardware to do so.

Comment 19 errata-xmlrpc 2010-03-30 07:15:21 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html