Bug 471329

Summary: 2.6.27 + ath9k + x86_64 + 4GB RAM => out of SW-IOMMU => disk corruption (at least on Macbook Pro 3,1)
Product: [Fedora] Fedora Reporter: Maciej Żenczykowski <zenczykowski>
Component: kernel Assignee: John W. Linville <linville>
Status: CLOSED RAWHIDE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high Docs Contact:
Priority: medium    
Version: 9 CC: axet, dcantrell, kernel-maint, matthias, mcgrof, notting, quintela, tcallawa
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Last Closed: 2008-11-19 03:32:45 UTC Type: ---
Bug Depends On:    
Bug Blocks: 466414    

Description Maciej Żenczykowski 2008-11-13 00:05:38 UTC
Description of problem:

ath9k fails to work on a MacBook Pro 3,1 with 4 GB of RAM on an x86_64 kernel,
(often) resulting in disk corruption.

Version-Release number of selected component (if applicable):

2.6.27.5-32.fc9.x86_64 from koji

How reproducible:

Install fc9 with a 2.6.27 kernel on a MacBook Pro 3,1 with 4 GB of RAM and use the ath9k driver. Soon the following message starts filling the logs:

kernel: DMA: Out of SW-IOMMU space for 4224 bytes at device 0000:0b:00.0

0b:00.0 is a 168c:0024 (rev 01) Network Controller: Atheros Communications Inc. AR5418 802.11abgn Wireless PCI Express Adapter (rev 01)

Soon after, you hit I/O errors on the hard drive, and ext3 detects an aborted journal and remounts read-only. The possibility of a corrupt ext3 partition is HIGH: I spent hours recovering from the first of these crashes (corrupt/missing inodes and directories), and a second crash resulted in some corrupt files, but at least no corrupt directories. I'm guessing it's some kind of buffer memory leak?
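
For anyone triaging similar logs, here is a minimal Python sketch (not part of the original report) that counts these messages per PCI device. The regular expression is modelled only on the single log line quoted above, and the function name count_swiotlb_failures is hypothetical.

```python
import re
from collections import Counter

# Pattern modelled on the one log line quoted in this report (assumption:
# other kernels may word the message differently).
SWIOTLB_FULL = re.compile(
    r"DMA: Out of SW-IOMMU space for (\d+) bytes at device (\S+)"
)

def count_swiotlb_failures(log_text):
    """Return a Counter mapping PCI device address -> number of failures."""
    counts = Counter()
    for line in log_text.splitlines():
        match = SWIOTLB_FULL.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts

if __name__ == "__main__":
    sample = (
        "kernel: DMA: Out of SW-IOMMU space for 4224 bytes "
        "at device 0000:0b:00.0\n"
    )
    for device, hits in count_swiotlb_failures(sample).items():
        print(device, hits)
```

A rapidly growing count for the wireless adapter's address would match the symptom described here; it says nothing by itself about the cause.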

[I'm currently using the out-of-tree madwifi ath_pci driver (from svn head) on 2.6.27.5-32.fc9.x86_64 with no problems (so far).]

Comment 1 Luis R. Rodriguez 2008-11-13 20:01:20 UTC
I'm looking into this bug report now.

Comment 2 John W. Linville 2008-11-13 20:26:30 UTC
Seems like two problems -- even if ath9k burns-up all the DMA, the SATA driver shouldn't crap all over the drive...

Comment 3 Luis R. Rodriguez 2008-11-14 00:34:30 UTC
OK the closest I was able to get in hardware is a Macbook with less than 1 GB of RAM and with kernel:

2.6.27.5-101.fc10.x86_64

This is newer than yours.

Wireless card:
168c:0024

02:00.0 Network controller: Atheros Communications Inc. AR5418 802.11abgn Wireless PCI Express Adapter (rev 01)

The ATA drivers on this one are:

[root@localhost ~]# lsmod| grep ata
ata_generic            13956  0 
pata_acpi              13056  0 

I tried connecting to an 802.11g AP and an 802.11n AP and I don't get the issues you are seeing. What SATA controller do you have?

Also I see similar reports on the Ubuntu side:

https://bugs.edge.launchpad.net/mactel-support/+bug/267089/

They all involve the MacBook Pro 3,1 with 4 GB of RAM. Seems we'll need to get one of those.

Comment 4 Luis R. Rodriguez 2008-11-14 03:02:32 UTC
OK, I was not able to reproduce this on an x86_64 box with > 7 GB of memory on 2.6.27.5-101.fc10.x86_64.

Do you guys have something like this on lspci:

Intel Corporation Mobile PM965/GM965/GL960 Memory Controller Hub

It seems the Intel 965 memory controller does IOMMU in software, so I wonder whether that software implementation is 100% correct. Not sure, it's just something someone pointed out to me.

You can try these patches: the first one just clarifies the type of memory we want to deal with, and the second one makes it consistent.

http://www.kernel.org/pub/linux/kernel/people/mcgrof/patches/ath9k/2008-11-14/

Comment 5 Luis R. Rodriguez 2008-11-14 03:03:08 UTC
Oh, and Ubuntu folks also ran into this, again on the MacBook Pro 3,1 with >= 4 GB of memory:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/267089

Comment 6 Maciej Żenczykowski 2008-11-14 04:19:59 UTC
$ /sbin/lspci
00:00.0 Host bridge: Intel Corporation Mobile PM965/GM965/GL960 Memory Controller Hub (rev 03)
00:01.0 PCI bridge: Intel Corporation Mobile PM965/GM965/GL960 PCI Express Root Port (rev 03)
00:1a.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #4 (rev 03)
00:1a.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #5 (rev 03)
00:1a.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #2 (rev 03)
00:1b.0 Audio device: Intel Corporation 82801H (ICH8 Family) HD Audio Controller (rev 03)
00:1c.0 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 1 (rev 03)
00:1c.2 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 3 (rev 03)
00:1c.4 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 5 (rev 03)
00:1c.5 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 6 (rev 03)
00:1d.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #1 (rev 03)
00:1d.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #2 (rev 03)
00:1d.2 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #3 (rev 03)
00:1d.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #1 (rev 03)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev f3)
00:1f.0 ISA bridge: Intel Corporation 82801HEM (ICH8M) LPC Interface Controller (rev 03)
00:1f.1 IDE interface: Intel Corporation 82801HBM/HEM (ICH8M/ICH8M-E) IDE Controller (rev 03)
00:1f.2 IDE interface: Intel Corporation 82801HBM/HEM (ICH8M/ICH8M-E) SATA IDE Controller (rev 03)
00:1f.3 SMBus: Intel Corporation 82801H (ICH8 Family) SMBus Controller (rev 03)
01:00.0 VGA compatible controller: nVidia Corporation GeForce 8600M GT (rev a1)
0b:00.0 Network controller: Atheros Communications Inc. AR5418 802.11abgn Wireless PCI Express Adapter (rev 01)
0c:00.0 Ethernet controller: Marvell Technology Group Ltd. Marvell Yukon 88E8058 PCI-E Gigabit Ethernet Controller (rev 13)
0d:03.0 FireWire (IEEE 1394): Texas Instruments TSB82AA2 IEEE-1394b Link Layer Controller (rev 02)

So yes.  Machine has been running since morning on 2.6.27.5 + ath_pci with no other problems, so this is somehow related to ath9k.

Comment 7 Maciej Żenczykowski 2008-11-14 05:41:25 UTC
$ sudo dmidecode | grep Product
        Product Name: MacBookPro3,1
        Product Name: Mac-F4238BC8

So this is exactly the same issue as reported in the Ubuntu bug.

Comment 8 Luis R. Rodriguez 2008-11-14 06:21:10 UTC
Right, so I'm not able to reproduce it on an x86_64 box with > 7 GB of memory. The only difference, though, is that I am trying with a new ath9k device, AR9280, and yours is AR5418, so I suppose it could be that. I'll see if I can get one of these tomorrow, but I'm not sure that would be the culprit.

I am trying to get details of how the IOMMU is handled for Intel's 965 memory controller: is it in hardware or in software? If it's in software, I want to review it for > 2 GB.

Try removing 2 GB from your box and see if you still see the issue, for example.

Comment 9 Matthias Goldhoorn 2008-11-14 08:05:09 UTC
I posted the same problem on the kernel.org bugzilla.
It seems to depend strongly on RAM; try my workaround and boot with the mem=3G option:

http://bugzilla.kernel.org/show_bug.cgi?id=11811

Greets
Matthias

Comment 10 Maciej Żenczykowski 2008-11-14 10:05:49 UTC
mem=3G will likely fix it because the 3-4G 'logical' memory range is the only memory above the physical 32-bit barrier (i.e. the 4th and final full GB of RAM is mapped at the physical 4-5G area). I was planning on testing this myself, just to verify my supposition that all RAM would then be 32-bit addressable, and thus there should be no need for SW-IOMMU. I'm compiling a kernel with the above-mentioned patches at the moment; will see if it helps...

BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000beec4000 (usable)
 BIOS-e820: 00000000beec4000 - 00000000bf0c5000 (ACPI NVS)
 BIOS-e820: 00000000bf0c5000 - 00000000bfeb9000 (ACPI data)
 BIOS-e820: 00000000bfeb9000 - 00000000bfebf000 (reserved)
 BIOS-e820: 00000000bfebf000 - 00000000bfed2000 (ACPI data)
 BIOS-e820: 00000000bfed2000 - 00000000bfed4000 (ACPI NVS)
 BIOS-e820: 00000000bfed4000 - 00000000bfed7000 (ACPI data)
 BIOS-e820: 00000000bfed7000 - 00000000bfeda000 (ACPI NVS)
 BIOS-e820: 00000000bfeda000 - 00000000bfedb000 (ACPI data)
 BIOS-e820: 00000000bfedb000 - 00000000bfeef000 (ACPI NVS)
 BIOS-e820: 00000000bfeef000 - 00000000bff00000 (ACPI data)
 BIOS-e820: 00000000bff00000 - 00000000c0000000 (reserved)
 BIOS-e820: 00000000f0000000 - 00000000f4000000 (reserved)
 BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
 BIOS-e820: 00000000fed14000 - 00000000fed1a000 (reserved)
 BIOS-e820: 00000000fed1c000 - 00000000fed20000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000ffe00000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000140000000 (usable)
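
The arithmetic behind this can be sketched in Python (illustrative only, not from the original thread). It parses the "usable" ranges from an e820 map like the one above and sums the bytes at or above the 4 GiB physical boundary, optionally clipping at a mem= limit. The helper names are hypothetical, and treating mem=3G as a simple cut-off at the 3 GiB physical address is a simplifying assumption about how the kernel applies the option.

```python
import re

# Matches lines such as:
#  BIOS-e820: 0000000100000000 - 0000000140000000 (usable)
E820_LINE = re.compile(
    r"BIOS-e820:\s+([0-9a-f]+)\s+-\s+([0-9a-f]+)\s+\((.+)\)"
)
FOUR_GIB = 1 << 32

def usable_ranges(e820_text):
    """Return (start, end) pairs for every range marked 'usable'."""
    ranges = []
    for line in e820_text.splitlines():
        match = E820_LINE.search(line)
        if match and match.group(3) == "usable":
            ranges.append((int(match.group(1), 16), int(match.group(2), 16)))
    return ranges

def bytes_above(ranges, boundary=FOUR_GIB, mem_limit=None):
    """Sum usable bytes at or above `boundary`.

    If `mem_limit` is given, every range is first clipped at that
    physical address (assumed model of the mem= boot option).
    """
    total = 0
    for start, end in ranges:
        if mem_limit is not None:
            end = min(end, mem_limit)
        low = max(start, boundary)
        if end > low:
            total += end - low
    return total

# The three usable ranges from the map above.
SAMPLE = """\
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 0000000000100000 - 00000000beec4000 (usable)
 BIOS-e820: 0000000100000000 - 0000000140000000 (usable)
"""

ranges = usable_ranges(SAMPLE)
above_4g = bytes_above(ranges)                          # 1 GiB on this map
above_4g_mem3g = bytes_above(ranges, mem_limit=3 << 30)  # 0 under the assumed model
```

On this map, exactly 1 GiB of usable RAM sits above the 32-bit boundary, and none does once the map is clipped at 3 GiB, which is consistent with the supposition that a 32-bit-only device would then need no bounce buffers.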


(side note... how much frickin' space do you need for a kernel compile... 4G free and still failing to compile because I run out of disk space...)

Comment 11 Luis R. Rodriguez 2008-11-15 02:15:48 UTC
OK, so there are a few bug reports on different sites. Please stop posting to this one and refer to the kernel.org one from now on:

http://bugzilla.kernel.org/show_bug.cgi?id=11811

If a fix comes up, it will be propagated. For now, try booting with mem=3G.

Comment 12 Luis R. Rodriguez 2008-11-15 16:51:12 UTC
Please subscribe to the kernel.org bugzilla bug instead and please comment on the questions posted there:

http://bugzilla.kernel.org/show_bug.cgi?id=11811

Comment 13 Chuck Ebbert 2008-11-18 18:15:18 UTC
Workaround applied in F-10 kernel -117 and F-9 kernel -44. The driver now won't activate the adapter if swiotlb is in use.

Comment 14 Jesse Keating 2008-11-19 03:32:45 UTC
Closing as 117 has been tagged for F10-final.