Bug 475156
Summary: | Out of SW-IOMMU space: External hard disk inaccessible
---|---
Product: | [Fedora] Fedora
Reporter: | Emmanuel Kowalski <emmanuel.kowalski>
Component: | kernel
Assignee: | Jarod Wilson <jarod>
Status: | CLOSED NEXTRELEASE
QA Contact: | Fedora Extras Quality Assurance <extras-qa>
Severity: | medium
Priority: | low
Version: | 9
CC: | fedora-bugs, kernel-maint, quintela, stefan-r-rhbz
Hardware: | x86_64
OS: | Linux
Doc Type: | Bug Fix
: | 513827
Last Closed: | 2008-12-24 18:47:47 UTC
Created attachment 326100
Output of lsmod
I have checked that the problem is also present on the Fedora 10 LiveCD.

"Out of SW-IOMMU space" is a kernel bug (driver bug), not a disk problem. Alas I currently have no PC which uses the SW-IOMMU or a real IOMMU to try to reproduce the issue.

Hm... I hit this one in the past, but it took quite a while to reproduce it, and I've not seen it in a while... A few questions:
-How much memory is in your T61?
-How quickly can you reproduce the bug running the live CD?
-What is it you're doing to reproduce the bug?
I've actually got a T61 myself with the same firewire controller and 4GB of RAM that I can try to reproduce on...

I just started to look through the code for potential dma_map_ imbalances, beginning with fw-ohci. It seems ohci_cancel_packet() lacks dma_unmap_single(...payload...). This code is executed whenever an outbound transaction was finished before the AT req DMA context tasklet was executed.

Here is the requested information:
-How much memory is in your T61?
4 GB
-How quickly can you reproduce the bug running the live CD?
-What is it you're doing to reproduce the bug?
It is quite quick: I connect the external drive, and then run
# badblocks -v /dev/sdb1
and after about 1 or 2 minutes (about 1% of the 320 GB disk being checked), the log starts to fill up with the error message.
The drive also has a USB port, but I haven't tried to see what happens if I connect it in this way (I need to find the correct USB cable first). I'll try to do this tomorrow.

I added payload unmapping to ohci_cancel_packet and it gets called frequently on my ICH7 + FW323 based Mac mini with a FireWire disk. This is obviously due to fw-sbp2's ORB pointer writes. Stress test commences now, patch will follow.

Created attachment 326429
[patch] firewire: fw-ohci: fix possible IOMMU resource exhaustion
Someone who can reproduce the bug please test the patch.
Posting: http://lkml.org/lkml/2008/12/9/333

> Someone who can reproduce the bug please test the patch.
PS:
It can only be reproduced if the responder node, e.g. SBP-2 harddisk, concludes the transactions as split transactions ( = request, ack-pending, response, ack-complete) and if the tasklet timing happens to be ordered as mentioned in the patch. Unified transactions ( = request, ack-complete) cannot cause the DMA mappings leak.
I got frequent ohci_cancel_packet with an OXUF924DSB. Emmanuel's WD 3200JS is a SATA disk, hence it is quite likely that his bridge is an OXUF92?DS? too. But I guess that many other bridge chips conclude ORB pointer writes as split transactions as well.
On the other hand, hitting non-coherent memory on a 4 GB PC may not be as likely.
I will try to test the patch today and report on the outcome.

I have built a new kernel (on Fedora 9) with the proposed patch and it seems to work: I was able to run the badblocks command for about 40 minutes without the error messages appearing (previously, it would stop after about 1 or 2 minutes); it didn't complete because (as I suspected) many bad blocks were indeed found at that point

(I got the following in the log, the last message being repeated constantly, but I assume this is because of these problems and is unrelated to the other problem:

Dec 10 11:28:50 localhost kernel: firewire_sbp2: fw1.1: sbp2_scsi_abort
Dec 10 11:28:50 localhost kernel: sd 5:0:0:0: rejecting I/O to offline device
)

By the way, the latest kernel has throughput about 4 times better than the latest 2.6.26 when accessing this external disk: congratulations to all developers for this impressive work!

> I was able to run the badblocks command for about 40 minutes without the
> error messages appearing (previously, it would stop after about 1 or 2
> minutes)

Great. Thanks for the report and for testing the patch.

> it didn't complete because (as I suspected) many bad blocks were
> indeed found at that point
>
> (I got the following in the log, the last message being repeated
> constantly, but I assume this is because of these problems and is
> unrelated to the other problem:
>
> Dec 10 11:28:50 localhost kernel: firewire_sbp2: fw1.1: sbp2_scsi_abort
> Dec 10 11:28:50 localhost kernel: sd 5:0:0:0: rejecting I/O to offline
> device

You could attach the kernel log from firewire-sbp2's login until the device is put offline; maybe there are clues about the nature of this other problem.

Fix is in F9 kernel 2.6.27.8-63

kernel-2.6.27.9-73.fc9 has been submitted as an update for Fedora 9.
http://admin.fedoraproject.org/updates/kernel-2.6.27.9-73.fc9

kernel-2.6.27.9-73.fc9 has been pushed to the Fedora 9 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
su -c 'yum --enablerepo=updates-testing-newkey update kernel'.
You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F9/FEDORA-2008-11618

kernel-2.6.27.9-73.fc9 has been pushed to the Fedora 9 stable repository. If problems still persist, please make note of it in this bug report.
Created attachment 326099
lspci -vv

Description of problem:
A firewire-connected external hard disk becomes inaccessible after a certain amount of time, any access leading to the error message
Dec 8 09:42:05 localhost kernel: DMA: Out of SW-IOMMU space for 4096 bytes at device 0000:15:00.1
in the log.

Version-Release number of selected component (if applicable):
kernel-2.6.27.5-41.fc9.x86_64.rpm
Linux zeta 2.6.27.5-41.fc9.x86_64 #1 SMP Thu Nov 13 20:29:07 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
Always on my Thinkpad T61p.

Steps to Reproduce:
1. Connect external hard drive (WD 3200JS, Firewire 400)
firewire_core: created device fw1: GUID 0090a990e011379d, S400
firewire_core: phy config: card 0, new root=ffc0, gap_count=5
scsi5 : SBP-2 IEEE-1394
firewire_sbp2: fw1.1: logged in to LUN 0000 (0 retries)
firewire_sbp2: fw1.1: sbp2_scsi_abort
scsi 5:0:0:0: Direct-Access WD 3200JS External 106a PQ: 0 ANSI: 4
sd 5:0:0:0: [sdb] 625142448 512-byte hardware sectors (320073 MB)
2. Perform disk IO, e.g., run badblocks on the disk
3.

Actual results:
After a (variable) amount of time, any disk access fails with error message
Dec 8 09:42:05 localhost kernel: DMA: Out of SW-IOMMU space for 4096 bytes at device 0000:15:00.1

Expected results:
Disk functions normally.

Additional info:
I tested also the older kernel kernel-2.6.26.6-79.fc9.x86_64.rpm with the same behavior.
The SATA controller is configured in the BIOS in AHCI mode.
I will attach the output of lspci -vv and lsmod.
I suspect the disk may have hardware problems, but there was no message to that effect in the log before the SW-IOMMU errors.