Bug 672723 - ATA errors and loss of DMA after Suspend / Resume on a Macbook
Summary: ATA errors and loss of DMA after Suspend / Resume on a Macbook
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 14
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-01-26 01:45 UTC by Steven Ellis
Modified: 2012-08-16 13:44 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-08-16 13:44:22 UTC
Type: ---


Attachments
Dmidecode output (13.17 KB, text/plain)
2011-01-26 01:47 UTC, Steven Ellis
Output of sdparm -a /dev/sda before the error occurs. (992 bytes, application/octet-stream)
2011-01-26 01:48 UTC, Steven Ellis
Output of smartctl -a /dev/sda before the issue occurs (4.59 KB, application/octet-stream)
2011-01-26 01:49 UTC, Steven Ellis

Description Steven Ellis 2011-01-26 01:45:34 UTC
Description of problem:

Generation 3 white MacBook running Fedora 14 (64-bit).
CPU Intel(R) Core(TM)2 Duo CPU     T8300  @ 2.40GHz

Version-Release number of selected component (if applicable):

Kernel - 2.6.35.10-74.fc14.x86_64
OS = Fedora 14 64bit

How reproducible:

Consistently

Steps to Reproduce:
1. Boot Fedora 14 on Macbook and login
2. Suspend via the menu or by closing the lid
3. Resume system and perform normal operations
4. Repeat steps 2 & 3 until the following appears in the system logs.
5. Once the error occurs, I/O performance is seriously degraded because DMA is no longer working.
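Checking the log by eye after every cycle of steps 2 and 3 is tedious. A minimal Python sketch of an automated check follows, assuming the kernel log has already been captured as text (for example from `dmesg` or `/var/log/messages`). The function name `find_dma_failure_markers` is hypothetical, and the two patterns are copied from the messages quoted under "Actual results"; they are not an exhaustive list of libata failure messages.

```python
import re

# Patterns copied from the messages in this report (illustrative, not exhaustive).
IRQ_DISABLED = re.compile(r"irq (\d+): nobody cared")
LOST_INTERRUPT = re.compile(r"(ata\d+): lost interrupt \(Status (0x[0-9a-fA-F]+)\)")


def find_dma_failure_markers(log_text):
    """Return (disabled_irqs, lost_interrupts) found in a kernel log excerpt.

    disabled_irqs is a list of IRQ numbers the kernel reported as unhandled;
    lost_interrupts is a list of (port, status_byte) tuples.
    """
    disabled_irqs = [int(m.group(1)) for m in IRQ_DISABLED.finditer(log_text)]
    lost_interrupts = [
        (m.group(1), int(m.group(2), 16)) for m in LOST_INTERRUPT.finditer(log_text)
    ]
    return disabled_irqs, lost_interrupts


if __name__ == "__main__":
    sample = (
        '[ 5172.307016] irq 18: nobody cared (try booting with the "irqpoll" option)\n'
        "[ 5200.736090] ata3: lost interrupt (Status 0x51)\n"
    )
    irqs, lost = find_dma_failure_markers(sample)
    print(irqs, lost)  # [18] [('ata3', 81)]
```

A non-empty result for either list after a resume marks the cycle on which the problem first appeared.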
  
Actual results:

[ 5172.307016] irq 18: nobody cared (try booting with the "irqpoll" option)
[ 5172.307022] Pid: 0, comm: swapper Tainted: P            2.6.35.10-74.fc14.x86_64 #1
[ 5172.307024] Call Trace:
[ 5172.307026]  <IRQ>  [<ffffffff810a6fdb>] __report_bad_irq.clone.1+0x3d/0x8b
[ 5172.307035]  [<ffffffff810a7143>] note_interrupt+0x11a/0x17f
[ 5172.307039]  [<ffffffff810a7c23>] handle_fasteoi_irq+0xa8/0xce
[ 5172.307043]  [<ffffffff8100c2ea>] handle_irq+0x88/0x90
[ 5172.307046]  [<ffffffff8146fb44>] do_IRQ+0x5c/0xb4
[ 5172.307050]  [<ffffffff8146a093>] ret_from_intr+0x0/0x11
[ 5172.307051]  <EOI>  [<ffffffff8128f900>] ? raw_local_irq_enable+0x10/0x12
[ 5172.307058]  [<ffffffff81290526>] acpi_idle_enter_c1+0x98/0xb6
[ 5172.307062]  [<ffffffff81394201>] cpuidle_idle_call+0x8b/0xe9
[ 5172.307066]  [<ffffffff81008325>] cpu_idle+0xaa/0xcc
[ 5172.307069]  [<ffffffff81451906>] rest_init+0x8a/0x8c
[ 5172.307074]  [<ffffffff81ba1c49>] start_kernel+0x40b/0x416
[ 5172.307077]  [<ffffffff81ba12c6>] x86_64_start_reservations+0xb1/0xb5
[ 5172.307080]  [<ffffffff81ba13c2>] x86_64_start_kernel+0xf8/0x107
[ 5172.307082] handlers:
[ 5172.307083] [<ffffffff81314106>] (ata_bmdma_interrupt+0x0/0x1a)
[ 5172.307088] [<ffffffff813335a4>] (usb_hcd_irq+0x0/0x7c)
[ 5172.307092] Disabling IRQ #18
[ 5200.736090] ata3: lost interrupt (Status 0x51)
[ 5200.736123] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 5200.736131] ata3.00: BMDMA stat 0x6, BMDMA stat 0x0, BMDMA stat 0x0, BMDMA stat 0x0, BMDMA stat 0x0,
[ 5200.736140] ata3.00: failed command: READ DMA EXT
[ 5200.736155] ata3.00: cmd 25/00:00:7a:9d:29/00:01:2d:00:00/e0 tag 0 dma 131072 in
[ 5200.736158]          res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x24 (host bus error)
[ 5200.736166] ata3.00: status: { DRDY }
[ 5200.736189] ata3: soft resetting link
[ 5201.008176] ata3.00: configured for UDMA/133
[ 5201.008190] ata3.00: device reported invalid CHS sector 0
[ 5201.008217] ata3: EH complete
[ 5259.744199] ata3: lost interrupt (Status 0x51)
[ 5259.744235] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 5259.744244] ata3.00: BMDMA stat 0x6, BMDMA stat 0x0, BMDMA stat 0x0, BMDMA stat 0x0, BMDMA stat 0x0,
[ 5259.744282] ata3.00: failed command: READ DMA EXT
[ 5259.744298] ata3.00: cmd 25/00:00:ba:15:62/00:02:2d:00:00/e0 tag 0 dma 262144 in
[ 5259.744301]          res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x24 (host bus error)
[ 5259.744310] ata3.00: status: { DRDY }
[ 5259.744335] ata3: soft resetting link
[ 5260.008298] ata3.00: configured for UDMA/133
[ 5260.008311] ata3.00: device reported invalid CHS sector 0
[ 5260.008337] ata3: EH complete
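The `cmd` and `res` lines dump the ATA taskfile registers. Assuming libata's layout of command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device, the Python sketch below decodes the first failed command. The helpers `decode_lba48_cmd` and `decode_status` are hypothetical, and the status-bit names follow the `ATA_*` constants in the kernel's `include/linux/ata.h`. Decoding shows a READ DMA EXT (0x25) of 256 sectors starting at LBA 0x2D299D7A, which matches the `dma 131072` byte count at 512 bytes per sector, and a `lost interrupt` status of 0x51 (DRDY, DSC and ERR set).

```python
# Hypothetical helpers for reading libata error dumps such as
#   cmd 25/00:00:7a:9d:29/00:01:2d:00:00/e0 tag 0 dma 131072 in
# Assumed field layout (LBA48 command):
#   command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device

SECTOR_SIZE = 512  # assumes 512-byte logical sectors

# ATA status register bits, named after the ATA_* constants in include/linux/ata.h
ATA_STATUS_BITS = {
    0x80: "BUSY",
    0x40: "DRDY",
    0x20: "DF",
    0x10: "DSC",
    0x08: "DRQ",
    0x04: "CORR",
    0x02: "IDX",
    0x01: "ERR",
}


def decode_lba48_cmd(field):
    """Return (command, lba, sector_count) decoded from a taskfile dump."""
    command, low, high, _device = field.split("/")
    _feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    _hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high.split(":")
    )
    # The "hob" (high-order byte) registers supply LBA bits 24..47.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    count = (hob_nsect << 8) | nsect
    if count == 0:
        count = 65536  # LBA48: a sector count of 0 means 65536 sectors
    return int(command, 16), lba, count


def decode_status(value):
    """Return the names of the status bits set in value, high bit first."""
    return [name for bit, name in ATA_STATUS_BITS.items() if value & bit]


if __name__ == "__main__":
    cmd, lba, count = decode_lba48_cmd("25/00:00:7a:9d:29/00:01:2d:00:00/e0")
    print(hex(cmd), hex(lba), count, count * SECTOR_SIZE)  # 0x25 0x2d299d7a 256 131072
    print(decode_status(0x51))  # ['DRDY', 'DSC', 'ERR']
    print(decode_status(0x40))  # ['DRDY']
```

Applied to the second failed command (`25/00:00:ba:15:62/00:02:2d:00:00/e0`), the same decoding gives 512 sectors, matching `dma 262144`.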

Expected results:

Suspend/Resume should not cause DMA errors.

Additional info:

Once the issue has occurred, a full power cycle won't fix it unless the MacBook is booted into OS X before re-running Fedora. Even then, although DMA is restored after the reboot, the above error messages reappear after a short period and DMA is lost again.

After the error has occurred, the DMA issue persists across suspend/resume, and we can't get DMA back without a power cycle.

Comment 1 Steven Ellis 2011-01-26 01:46:36 UTC
Output of /proc/interrupts

           CPU0       CPU1       
  0:      55314      60166   IO-APIC-edge      timer
  8:          1          0   IO-APIC-edge      rtc0
  9:       7062       1161   IO-APIC-fasteoi   acpi
 16:     151300      11718   IO-APIC-fasteoi   uhci_hcd:usb4, uhci_hcd:usb5, eth1
 18:      19990       8744   IO-APIC-fasteoi   ata_piix, uhci_hcd:usb6
 19:          1          0   IO-APIC-fasteoi   firewire_ohci
 20:        241        243   IO-APIC-fasteoi   ehci_hcd:usb2, uhci_hcd:usb3
 21:      28931      28675   IO-APIC-fasteoi   ata_piix, ehci_hcd:usb1, uhci_hcd:usb7
 40:          0          0   PCI-MSI-edge      pciehp
 41:          0          0   PCI-MSI-edge      pciehp
 42:          0          0   PCI-MSI-edge      pciehp
 43:       1801        838   PCI-MSI-edge      i915
 44:          1          0   PCI-MSI-edge      sky2@pci:0000:03:00.0
 45:       1631        117   PCI-MSI-edge      hda_intel
NMI:          0          0   Non-maskable interrupts
LOC:     124710     119642   Local timer interrupts
SPU:          0          0   Spurious interrupts
PMI:          0          0   Performance monitoring interrupts
PND:          0          0   Performance pending work
RES:       6926       8436   Rescheduling interrupts
CAL:       2087       1766   Function call interrupts
TLB:       1006       1123   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
MCE:          0          0   Machine check exceptions
MCP:          4          4   Machine check polls
ERR:          1
MIS:          0
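IRQ 18, which the kernel disabled in the trace, is shared here by `ata_piix` and `uhci_hcd:usb6`, matching the two handlers (`ata_bmdma_interrupt` and `usb_hcd_irq`) listed in the trace. Once that line is disabled, the `ata_piix` instance on it no longer receives interrupts, which is consistent with the later `lost interrupt` errors. A small Python sketch that lists every shared IRQ from `/proc/interrupts` text follows; the function name `shared_irqs` is hypothetical, and it assumes the column layout shown above (one count per CPU, then a single chip/trigger token such as `IO-APIC-fasteoi`), which newer kernels print differently.

```python
def shared_irqs(proc_interrupts_text):
    """Map IRQ number -> handler names for IRQs with more than one handler.

    Assumes the 2.6.35-era layout: a CPU header row, then rows of
    "<irq>: <count per CPU> <chip-type> <handler>, <handler>, ...".
    """
    lines = proc_interrupts_text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():
            continue  # skip summary rows such as NMI, LOC and ERR
        fields = rest.split()
        # Drop the per-CPU counts and the chip/trigger token; the rest are handlers.
        handlers = [h.strip() for h in " ".join(fields[ncpus + 1:]).split(",")]
        handlers = [h for h in handlers if h]
        if len(handlers) > 1:
            shared[int(label)] = handlers
    return shared


if __name__ == "__main__":
    sample = "\n".join([
        "           CPU0       CPU1",
        " 18:      19990       8744   IO-APIC-fasteoi   ata_piix, uhci_hcd:usb6",
        " 19:          1          0   IO-APIC-fasteoi   firewire_ohci",
        "ERR:          1",
    ])
    for irq, handlers in shared_irqs(sample).items():
        print(irq, handlers)  # 18 ['ata_piix', 'uhci_hcd:usb6']
```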

Comment 2 Steven Ellis 2011-01-26 01:47:17 UTC
Created attachment 475315
Dmidecode output

Comment 3 Steven Ellis 2011-01-26 01:47:58 UTC
hdparm output

hdparm -i /dev/sda

/dev/sda:

 Model=WDC WD5000BEKT-00KA9T0, FwRev=01.01A01, SerialNo=WD-WXM1E60CC325
 Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=50
 BuffType=unknown, BuffSize=16384kB, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=976773168
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio3 pio4 
 DMA modes:  mdma0 mdma1 mdma2 
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
 AdvancedPM=yes: unknown setting WriteCache=enabled
 Drive conforms to: Unspecified:  ATA/ATAPI-1,2,3,4,5,6,7

 * signifies the current active mode
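In the `hdparm -i` output the active transfer mode is the one prefixed with `*`, here `udma6`. libata reports the same mode by its nominal speed, so `udma6` corresponds to the `configured for UDMA/133` lines in the error log: the drive is reconfigured at full UDMA speed after each soft reset, yet the errors recur. A minimal Python sketch of that mapping follows; `active_modes` and `libata_mode_name` are hypothetical helpers, and the name table covers UDMA modes 0 to 6 only.

```python
# Nominal libata names for UDMA modes 0..6 (speeds in MB/s).
UDMA_NAMES = ["UDMA/16", "UDMA/25", "UDMA/33", "UDMA/44", "UDMA/66", "UDMA/100", "UDMA/133"]


def active_modes(hdparm_text):
    """Return the modes marked with '*' (currently active) in `hdparm -i` output."""
    return [tok[1:] for tok in hdparm_text.split() if tok.startswith("*") and len(tok) > 1]


def libata_mode_name(mode):
    """Translate an hdparm UDMA mode such as 'udma6' into libata's name."""
    if mode.startswith("udma"):
        return UDMA_NAMES[int(mode[len("udma"):])]
    return mode  # PIO/MDMA names are returned unchanged in this sketch


if __name__ == "__main__":
    text = " UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6 \n * signifies the current active mode"
    for mode in active_modes(text):
        print(mode, "->", libata_mode_name(mode))  # udma6 -> UDMA/133
```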

Comment 4 Steven Ellis 2011-01-26 01:48:54 UTC
Created attachment 475316
Output of sdparm -a /dev/sda before the error occurs.

Comment 5 Steven Ellis 2011-01-26 01:49:47 UTC
Created attachment 475317
Output of smartctl -a /dev/sda before the issue occurs

Comment 6 Steven Ellis 2011-01-26 01:54:12 UTC
APM status of the disk before we suspend

[root@macdora steve]# hdparm -B /dev/sda

/dev/sda:
 APM_level	= 128
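An APM level of 128 sits at the bottom of the range that the hdparm(8) man page documents as not permitting spin-down (1 to 127 permit spin-down, 128 to 254 do not, and 255 disables APM). A minimal Python sketch of that classification follows, which could be used to compare the value reported before and after a suspend; `describe_apm_level` is a hypothetical name, and the ranges are taken from the man page rather than from any drive's firmware documentation.

```python
def describe_apm_level(level):
    """Classify an `hdparm -B` value using the ranges documented in hdparm(8)."""
    if not 1 <= level <= 255:
        raise ValueError(f"APM level out of range: {level}")
    if level == 255:
        return "APM disabled"
    if level <= 127:
        return "APM enabled, spin-down permitted"
    return "APM enabled, spin-down not permitted"


if __name__ == "__main__":
    print(describe_apm_level(128))  # APM enabled, spin-down not permitted
```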

Comment 7 Steven Ellis 2011-01-26 01:56:07 UTC
Tried booting kernel with various combinations of irqpoll and noacpi neither of which resolved the issue.

Comment 8 Steven Ellis 2011-01-26 02:02:30 UTC
Had the same issue with a Seagate ST9500420ASG drive.

Used https://bugzilla.redhat.com/show_bug.cgi?id=549981 to try to troubleshoot this; it looks like a different problem.

Checked for NCQ, which isn't enabled:
cat /sys/block/sd[abc]/device/queue_depth
1
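A `queue_depth` of 1 means only one command is outstanding at a time, i.e. NCQ is not in use. That is expected here, because the disk sits behind `ata_piix` with the controller in IDE mode (the `SATA IDE Controller` entry in the lspci output), and that driver does not support NCQ. The same check across all `sd*` devices can be sketched in Python as follows; `queue_depths` is a hypothetical helper, and the `sd*` glob assumes the usual SCSI-disk device naming.

```python
from pathlib import Path


def queue_depths(sys_block=Path("/sys/block")):
    """Return {device: queue_depth} for sd* devices that expose the attribute."""
    depths = {}
    for attr in sorted(Path(sys_block).glob("sd*/device/queue_depth")):
        device = attr.parent.parent.name  # e.g. "sda"
        depths[device] = int(attr.read_text().strip())
    return depths


if __name__ == "__main__":
    for device, depth in queue_depths().items():
        # A depth of 1 means one outstanding command at a time (no NCQ).
        status = "NCQ not in use" if depth == 1 else f"up to {depth} queued commands"
        print(f"{device}: {depth} ({status})")
```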


Output from lspci:
00:00.0 Host bridge: Intel Corporation Mobile PM965/GM965/GL960 Memory Controller Hub (rev 03)
00:02.0 VGA compatible controller: Intel Corporation Mobile GM965/GL960 Integrated Graphics Controller (rev 03)
00:02.1 Display controller: Intel Corporation Mobile GM965/GL960 Integrated Graphics Controller (rev 03)
00:1a.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #4 (rev 03)
00:1a.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #5 (rev 03)
00:1a.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #2 (rev 03)
00:1b.0 Audio device: Intel Corporation 82801H (ICH8 Family) HD Audio Controller (rev 03)
00:1c.0 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 1 (rev 03)
00:1c.4 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 5 (rev 03)
00:1c.5 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 6 (rev 03)
00:1d.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #1 (rev 03)
00:1d.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #2 (rev 03)
00:1d.2 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #3 (rev 03)
00:1d.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #1 (rev 03)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev f3)
00:1f.0 ISA bridge: Intel Corporation 82801HEM (ICH8M) LPC Interface Controller (rev 03)
00:1f.1 IDE interface: Intel Corporation 82801HBM/HEM (ICH8M/ICH8M-E) IDE Controller (rev 03)
00:1f.2 IDE interface: Intel Corporation 82801HBM/HEM (ICH8M/ICH8M-E) SATA IDE Controller (rev 03)
00:1f.3 SMBus: Intel Corporation 82801H (ICH8 Family) SMBus Controller (rev 03)
02:00.0 Network controller: Broadcom Corporation BCM4321 802.11a/b/g/n (rev 03)
03:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8058 PCI-E Gigabit Ethernet Controller (rev 13)
04:03.0 FireWire (IEEE 1394): Agere Systems FW322/323 (rev 61)

Comment 9 Steven Ellis 2011-01-26 02:03:28 UTC
(In reply to comment #7)
> Tried booting kernel with various combinations of irqpoll and noacpi neither of
> which resolved the issue.

Based on https://bugzilla.redhat.com/show_bug.cgi?id=462425#c80 I actually tried noapic. I didn't change any ACPI settings.

Comment 10 Steven Ellis 2011-01-26 03:39:39 UTC
Got a fresh trace after two suspend/resume events and plugging the laptop into mains

Jan 26 13:35:39 macdora kernel: [ 3754.946362] irq 18: nobody cared (try booting with the "irqpoll" option)
Jan 26 13:35:39 macdora kernel: [ 3754.946367] Pid: 0, comm: swapper Tainted: P            2.6.35.10-74.fc14.x86_64 #1
Jan 26 13:35:39 macdora kernel: [ 3754.946369] Call Trace:
Jan 26 13:35:39 macdora kernel: [ 3754.946371]  <IRQ>  [<ffffffff810a6fdb>] __report_bad_irq.clone.1+0x3d/0x8b
Jan 26 13:35:39 macdora kernel: [ 3754.946381]  [<ffffffff810a7143>] note_interrupt+0x11a/0x17f
Jan 26 13:35:39 macdora kernel: [ 3754.946384]  [<ffffffff810a7c23>] handle_fasteoi_irq+0xa8/0xce
Jan 26 13:35:39 macdora kernel: [ 3754.946388]  [<ffffffff8100c2ea>] handle_irq+0x88/0x90
Jan 26 13:35:39 macdora kernel: [ 3754.946392]  [<ffffffff8146fb44>] do_IRQ+0x5c/0xb4
Jan 26 13:35:39 macdora kernel: [ 3754.946396]  [<ffffffff8146a093>] ret_from_intr+0x0/0x11
Jan 26 13:35:39 macdora kernel: [ 3754.946397]  <EOI>  [<ffffffff8128f900>] ? raw_local_irq_enable+0x10/0x12
Jan 26 13:35:39 macdora kernel: [ 3754.946404]  [<ffffffff81290526>] acpi_idle_enter_c1+0x98/0xb6
Jan 26 13:35:39 macdora kernel: [ 3754.946408]  [<ffffffff81394201>] cpuidle_idle_call+0x8b/0xe9
Jan 26 13:35:39 macdora kernel: [ 3754.946412]  [<ffffffff81008325>] cpu_idle+0xaa/0xcc
Jan 26 13:35:39 macdora kernel: [ 3754.946416]  [<ffffffff81451906>] rest_init+0x8a/0x8c
Jan 26 13:35:39 macdora kernel: [ 3754.946420]  [<ffffffff81ba1c49>] start_kernel+0x40b/0x416
Jan 26 13:35:39 macdora kernel: [ 3754.946423]  [<ffffffff81ba12c6>] x86_64_start_reservations+0xb1/0xb5
Jan 26 13:35:39 macdora kernel: [ 3754.946426]  [<ffffffff81ba13c2>] x86_64_start_kernel+0xf8/0x107
Jan 26 13:35:39 macdora kernel: [ 3754.946428] handlers:
Jan 26 13:35:39 macdora kernel: [ 3754.946430] [<ffffffff81314106>] (ata_bmdma_interrupt+0x0/0x1a)
Jan 26 13:35:39 macdora kernel: [ 3754.946434] [<ffffffff813335a4>] (usb_hcd_irq+0x0/0x7c)
Jan 26 13:35:39 macdora kernel: [ 3754.946438] Disabling IRQ #18
Jan 26 13:36:08 macdora kernel: [ 3783.776065] ata3: lost interrupt (Status 0x51)
Jan 26 13:36:08 macdora kernel: [ 3783.776091] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jan 26 13:36:08 macdora kernel: [ 3783.776095] ata3.00: BMDMA stat 0x6, BMDMA stat 0x0, BMDMA stat 0x0, BMDMA stat 0x0, BMDMA stat 0x0,
Jan 26 13:36:08 macdora kernel: [ 3783.776102] ata3.00: failed command: READ DMA EXT
Jan 26 13:36:08 macdora kernel: [ 3783.776112] ata3.00: cmd 25/00:00:b2:a8:9f/00:01:2e:00:00/e0 tag 0 dma 131072 in
Jan 26 13:36:08 macdora kernel: [ 3783.776119]          res 40/00:00:09:4f:c2/00:00:00:00:00/00 Emask 0x24 (host bus error)
Jan 26 13:36:08 macdora kernel: [ 3783.776122] ata3.00: status: { DRDY }
Jan 26 13:36:08 macdora kernel: [ 3783.776133] ata3: soft resetting link
Jan 26 13:36:08 macdora kernel: [ 3784.046178] ata3.00: configured for UDMA/133
Jan 26 13:36:08 macdora kernel: [ 3784.046193] ata3: EH complete
Jan 26 13:37:10 macdora kernel: [ 3846.708082] ata3: lost interrupt (Status 0x51)
Jan 26 13:37:10 macdora kernel: [ 3846.708110] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jan 26 13:37:10 macdora kernel: [ 3846.708115] ata3.00: BMDMA stat 0x6, BMDMA stat 0x0, BMDMA stat 0x0, BMDMA stat 0x0, BMDMA stat 0x0,
Jan 26 13:37:10 macdora kernel: [ 3846.708122] ata3.00: failed command: READ DMA EXT
Jan 26 13:37:10 macdora kernel: [ 3846.708132] ata3.00: cmd 25/00:00:32:3a:a6/00:02:2e:00:00/e0 tag 0 dma 262144 in
Jan 26 13:37:10 macdora kernel: [ 3846.708134]          res 40/00:00:09:4f:c2/00:00:00:00:00/00 Emask 0x24 (host bus error)
Jan 26 13:37:10 macdora kernel: [ 3846.708139] ata3.00: status: { DRDY }
Jan 26 13:37:10 macdora kernel: [ 3846.708157] ata3: soft resetting link
Jan 26 13:37:11 macdora kernel: [ 3846.963154] ata3.00: configured for UDMA/133
Jan 26 13:37:11 macdora kernel: [ 3846.963172] ata3: EH complete

Comment 11 Steven Ellis 2011-01-26 04:00:12 UTC
Similar issue under Ubuntu
 * https://bugs.launchpad.net/ubuntu/+source/mdadm/+bug/664400

Comment 12 Fedora End Of Life 2012-08-16 13:44:27 UTC
This message is a notice that Fedora 14 is now at end of life. Fedora 
has stopped maintaining and issuing updates for Fedora 14. It is 
Fedora's policy to close all bug reports from releases that are no 
longer maintained.  At this time, all open bugs with a Fedora 'version'
of '14' have been closed as WONTFIX.

(Please note: Our normal process is to give advance warning of this 
occurring, but we forgot to do that. A thousand apologies.)

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, feel free to reopen 
this bug and simply change the 'version' to a later Fedora version.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we were unable to fix it before Fedora 14 reached end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" (top right of this page) and open it against that 
version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

