513827 – Out of SW-IOMMU space: External hard disk inaccessible

Summary: Out of SW-IOMMU space: External hard disk inaccessible

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.3
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Jay Fenlason
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	499522 525215 533192

Reported:	2009-07-26 12:08 UTC by Bryn M. Reeves
Modified:	2018-10-27 15:01 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:	475156
Environment:
Last Closed:	2010-03-30 07:15:21 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
fix firewire iommu mapping leak (2.87 KB, patch), 2009-07-26 12:12 UTC, Bryn M. Reeves

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2010:0178	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 5.5 kernel security and bug fix update	2010-03-29 12:18:21 UTC

Description Bryn M. Reeves 2009-07-26 12:08:20 UTC

+++ This bug was initially created as a clone of Bug #475156 +++

Created an attachment (id=326099)
lspci -vv

Description of problem:
A FireWire-connected external hard disk becomes inaccessible after a variable amount of time. From then on, any access fails with the error message

Dec  8 09:42:05 localhost kernel: DMA: Out of SW-IOMMU space for 4096 bytes at device 0000:15:00.1

in the log.

Version-Release number of selected component (if applicable): 

kernel-2.6.27.5-41.fc9.x86_64.rpm

Linux zeta 2.6.27.5-41.fc9.x86_64 #1 SMP Thu Nov 13 20:29:07 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
Always on my Thinkpad T61p.

Steps to Reproduce:
1. Connect external hard drive (WD 3200JS, Firewire 400)
firewire_core: created device fw1: GUID 0090a990e011379d, S400
firewire_core: phy config: card 0, new root=ffc0, gap_count=5
scsi5 : SBP-2 IEEE-1394
firewire_sbp2: fw1.1: logged in to LUN 0000 (0 retries)
firewire_sbp2: fw1.1: sbp2_scsi_abort
scsi 5:0:0:0: Direct-Access     WD       3200JS External  106a PQ: 0 ANSI: 4
sd 5:0:0:0: [sdb] 625142448 512-byte hardware sectors (320073 MB)

2. Perform disk IO, e.g., run badblocks on the disk
  
Actual results:
After a (variable) amount of time, any disk access fails with error message

Dec  8 09:42:05 localhost kernel: DMA: Out of SW-IOMMU space for 4096 bytes at device 0000:15:00.1

Expected results:
Disk functions normally.

Additional info:
I also tested the older kernel

kernel-2.6.26.6-79.fc9.x86_64.rpm

with the same behavior. 

The SATA controller is configured in the BIOS in AHCI mode. I will attach the output of lspci -vv and lsmod.

I suspect the disk may have hardware problems, but there was no message to that effect in the log before the SW-IOMMU errors.

--- Additional comment from stefan-r-rhbz.de on 2008-12-09 15:50:28 EDT ---

I just started to look through the code for potential dma_map_ imbalances, beginning with fw-ohci.

It seems ohci_cancel_packet() lacks dma_unmap_single(...payload...).  This code is executed whenever an outbound transaction was finished before the AT req DMA context tasklet was executed.
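
To illustrate the kind of imbalance described above, here is a minimal user-space sketch. It is not the fw-ohci code: `pool_map()` and `pool_unmap()` are hypothetical stand-ins for dma_map_single() and dma_unmap_single(), `POOL_SLOTS` is an arbitrary pool size, and `run_requests()` is an invented driver loop. The sketch only shows that a completion path which skips the unmap eventually exhausts a fixed pool, however rarely that path is taken.

```c
#include <assert.h>
#include <stdbool.h>

#define POOL_SLOTS 8 /* hypothetical size of the bounce-buffer pool */

struct pool {
    bool in_use[POOL_SLOTS];
};

/* Stand-in for dma_map_single(): returns a slot index, or -1 if the
 * pool is exhausted (the "Out of SW-IOMMU space" condition). */
static int pool_map(struct pool *p)
{
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (!p->in_use[i]) {
            p->in_use[i] = true;
            return i;
        }
    }
    return -1;
}

/* Stand-in for dma_unmap_single(): releases a previously mapped slot. */
static void pool_unmap(struct pool *p, int slot)
{
    assert(slot >= 0 && slot < POOL_SLOTS && p->in_use[slot]);
    p->in_use[slot] = false;
}

/* Issues n requests. Every cancel_every-th request (starting with the
 * first) completes through the "cancel" path; cancel_every <= 0 means
 * no request is cancelled. If cancel_unmaps is false, the cancel path
 * forgets to unmap, as suspected for ohci_cancel_packet().
 * Returns the number of requests issued before the first mapping
 * failure, or n if no mapping ever failed. */
static int run_requests(int n, int cancel_every, bool cancel_unmaps)
{
    struct pool p = {0};

    for (int i = 0; i < n; i++) {
        int slot = pool_map(&p);
        if (slot < 0)
            return i; /* pool exhausted */

        bool cancelled = cancel_every > 0 && (i % cancel_every) == 0;
        if (!cancelled || cancel_unmaps)
            pool_unmap(&p, slot);
        /* else: mapping leaked */
    }
    return n;
}
```

Calling `run_requests(1000, 4, false)` stops early because each cancelled request pins one slot for good, whereas `run_requests(1000, 4, true)` runs all 1000 requests.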

--- Additional comment from stefan-r-rhbz.de on 2008-12-09 18:26:45 EDT ---

Created an attachment (id=326429)
[patch] firewire: fw-ohci: fix possible IOMMU resource exhaustion

Someone who can reproduce the bug please test the patch.
Posting: http://lkml.org/lkml/2008/12/9/333

--- Additional comment from stefan-r-rhbz.de on 2008-12-10 02:36:59 EDT ---

> Someone who can reproduce the bug please test the patch.

PS:
It can only be reproduced if the responder node, e.g. SBP-2 harddisk, concludes the transactions as split transactions ( = request, ack-pending, response, ack-complete) and if the tasklet timing happens to be ordered as mentioned in the patch.  Unified transactions ( = request, ack-complete) cannot cause the DMA mappings leak.

I got frequent ohci_cancel_packet with an OXUF924DSB.  Emmanuel's WD 3200JS is a SATA disk, hence it is quite likely that his bridge is an OXUF92?DS? too.  But I guess that many other bridge chips conclude ORB pointer writes as split transactions as well.

On the other hand, hitting non-coherent memory on a 4 GB PC may not be as likely.

Comment 2 Bryn M. Reeves 2009-07-26 12:12:33 UTC

Created attachment 355191
fix firewire iommu mapping leak

Comment 4 Jay Fenlason 2009-08-05 20:06:55 UTC

I don't see why this won't make 5.5, either as a standalone patch, or as part of a re-backport of the current upstream firewire stack.

Comment 5 RHEL Program Management 2009-10-01 19:56:01 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 6 Jan Tluka 2009-10-14 13:55:19 UTC

Hello!
Are we able to reproduce this in-house? Could we get confirmation of testing from the customer side?

Comment 7 Bryn M. Reeves 2009-10-14 14:15:41 UTC

Afaik this is fairly easily reproduced with firewire storage. I don't have any suitable hardware available here to test with but I can ask around. I'll also check with the original reporter to see if they are willing to verify.

Comment 9 Stefan Richter 2009-10-14 19:22:33 UTC

FireWire on boxes which do have an IOMMU or use swiotlb should indeed exhaust IOMMU resources rather quickly without the fix since the respective DMA mappings happen very frequently during asynchronous FireWire I/O.  Both the bug and the fix are rather obvious in hindsight, I'd say.

The fix has one problem though: the test for "was this DMA-mapped?" may yield a false negative on certain rare architectures under a very unlikely condition, namely when a payload was mapped to bus address zero.  I therefore posted a suggested fix for the fix just now:  http://lkml.org/lkml/2009/10/14/362
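
The false negative can be shown with a small user-space sketch. The struct and function names below are hypothetical and only loosely modelled on a packet descriptor; `bus_addr_t` stands in for dma_addr_t. The second variant, which tracks the mapping state in an explicit flag, is one way to avoid the problem and is not necessarily what the linked patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t bus_addr_t; /* stand-in for dma_addr_t */

/* Variant A: "mapped" is inferred from a non-zero bus address. */
struct packet_zero_test {
    bus_addr_t payload_bus;
};

static bool needs_unmap_zero_test(const struct packet_zero_test *p)
{
    /* False negative if the payload was really mapped at address 0. */
    return p->payload_bus != 0;
}

/* Variant B: the mapping state is recorded explicitly. */
struct packet_flagged {
    bus_addr_t payload_bus;
    bool payload_mapped;
};

static void mark_mapped(struct packet_flagged *p, bus_addr_t bus)
{
    p->payload_bus = bus;
    p->payload_mapped = true;
}

static void mark_unmapped(struct packet_flagged *p)
{
    assert(p->payload_mapped); /* unmapping twice would be a bug */
    p->payload_mapped = false;
}

static bool needs_unmap_flagged(const struct packet_flagged *p)
{
    return p->payload_mapped;
}
```

With a payload mapped at bus address 0, `needs_unmap_zero_test()` reports that nothing needs unmapping and the mapping leaks, while `needs_unmap_flagged()` still reports it correctly.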

Comment 10 Stefan Richter 2009-10-14 19:26:58 UTC

PS: "the respective DMA mappings happen very frequently during asynchronous FireWire I/O" == e.g. one of these mappings for each SCSI request on behalf of firewire-sbp2, one of these for each IP datagram on behalf of firewire-net.
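
For a rough feel of how many leaked mappings it takes to exhaust the pool, here is a back-of-the-envelope sketch. All figures are assumptions for illustration only: a 64 MiB bounce pool carved into 2 KiB slabs (the swiotlb defaults as far as I recall them; the real values depend on the kernel and boot parameters) and one leaked mapping that pins whole slabs. `leaks_until_exhaustion()` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Number of leaked mappings that fit into the pool before it is full.
 * Each leaked mapping pins ceil(payload_bytes / slab_bytes) slabs. */
static uint64_t leaks_until_exhaustion(uint64_t pool_bytes,
                                       uint64_t slab_bytes,
                                       uint64_t payload_bytes)
{
    assert(slab_bytes > 0 && payload_bytes > 0);

    uint64_t pool_slabs = pool_bytes / slab_bytes;
    uint64_t slabs_per_leak = (payload_bytes + slab_bytes - 1) / slab_bytes;

    return pool_slabs / slabs_per_leak;
}
```

Under these assumptions, `leaks_until_exhaustion(64ULL << 20, 2048, 4096)` gives 16384 leaked 4096-byte mappings. With one mapping per SCSI request, even a small fraction of leaking requests reaches that count during sustained I/O such as a badblocks run.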

Comment 11 Don Zickus 2009-10-21 19:12:26 UTC

in kernel-2.6.18-170.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 14 Chris Ward 2010-02-11 10:34:26 UTC

~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this 
release that addresses your request. Please test and report back results 
here, by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update 
the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set 
this bug into NEED_INFO. If you encounter new defects or have additional 
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.

Comment 15 Bryn M. Reeves 2010-02-11 13:57:19 UTC

I've not managed to reproduce this so far but can attempt to borrow some FW hardware to do so.

Comment 19 errata-xmlrpc 2010-03-30 07:15:21 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html
