Bug 1161899 - SATA bus error on resume from hibernation, machine unresponsive, requires reboot to recover
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 20
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-11-09 02:46 UTC by Dimitris
Modified: 2014-12-03 06:01 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2014-12-03 06:01:05 UTC
Type: Bug
Embargoed:



Description Dimitris 2014-11-09 02:46:46 UTC
Description of problem:

- Hibernate laptop.
- Power on again.
- Laptop appears to resume OK.
- While doing something in a terminal, e.g. running "yum update", the machine becomes unresponsive and the disk LED stays almost solidly on.
- On those occasions I see this in the log:

Nov 08 16:57:51 gaspode kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x40000 action 0x6 frozen
Nov 08 16:57:51 gaspode kernel: ata1: SError: { CommWake }
Nov 08 16:57:51 gaspode kernel: ata1.00: failed command: FLUSH CACHE EXT
Nov 08 16:57:51 gaspode kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 20
                                         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 08 16:57:51 gaspode kernel: ata1.00: status: { DRDY }
Nov 08 16:57:51 gaspode kernel: ata1: hard resetting link
Nov 08 16:57:51 gaspode kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Nov 08 16:57:51 gaspode kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) succeeded
Nov 08 16:57:51 gaspode kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
Nov 08 16:57:51 gaspode kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES) filtered out
Nov 08 16:57:51 gaspode kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) succeeded
Nov 08 16:57:51 gaspode kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
Nov 08 16:57:51 gaspode kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES) filtered out
Nov 08 16:57:51 gaspode kernel: ata1.00: configured for UDMA/133
Nov 08 16:57:51 gaspode kernel: ata1.00: retrying FLUSH 0xea Emask 0x4
Nov 08 16:57:51 gaspode kernel: ata1.00: device reported invalid CHS sector 0
Nov 08 16:57:51 gaspode kernel: ata1: EH complete

I have to reboot the machine in order to stop it from periodically going unresponsive after that.
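
For anyone reading the log above, the SErr value can be decoded by hand. The following is a minimal sketch, not kernel code: the bit positions follow the SError register layout from the SATA/AHCI specifications as I understand them, the short names mirror the labels libata prints, and `decode_serr` is a hypothetical helper name.

```python
# Assumed mapping of SError register bits to the short names libata prints.
# Bits 0-11 are the ERR field, bits 16-26 the DIAG field.
SERR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serr(value):
    """Return the names of all known bits set in an SErr value."""
    return [name for bit, name in sorted(SERR_BITS.items()) if (value >> bit) & 1]


# SErr value taken from the log above.
print(decode_serr(0x40000))
```

With the value from the log, only bit 18 is set, which matches the `SError: { CommWake }` line the kernel printed.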

Version-Release number of selected component (if applicable):
Saw this only after upgrading to 3.16.7-200.fc20.  I have just upgraded to 3.17.2-200.fc20 and will update here if this happens again.

How reproducible:

Occasional.  I hibernate/resume several times a day, but this has happened only twice in the last 3-4 days.

Steps to Reproduce:
1.  Hibernate then resume.


Additional info:

00:1f.2 SATA controller: Intel Corporation 82801IBM/IEM (ICH9M/ICH9M-E) 4 port SATA Controller [AHCI mode] (rev 03) (prog-if 01 [AHCI 1.0])
	Subsystem: Lenovo Device 20f8
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin B routed to IRQ 27
	Region 0: I/O ports at 1c48 [size=8]
	Region 1: I/O ports at 183c [size=4]
	Region 2: I/O ports at 1c40 [size=8]
	Region 3: I/O ports at 1838 [size=4]
	Region 4: I/O ports at 1c20 [size=32]
	Region 5: Memory at f2826000 (32-bit, non-prefetchable) [size=2K]
	Capabilities: <access denied>
	Kernel driver in use: ahci

smartctl 6.2 2014-07-16 r3952 [x86_64-linux-3.17.2-200.fc20.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Intel 520 Series SSDs
Device Model:     INTEL SSDSC2CW240A3
Serial Number:    (...)
LU WWN Device Id: 5 5cd2e4 000039ecb
Firmware Version: 400i
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.0, 3.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Nov  8 18:44:44 2014 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Comment 1 Dimitris 2014-11-09 02:49:02 UTC
Better lspci output:
00:1f.2 SATA controller: Intel Corporation 82801IBM/IEM (ICH9M/ICH9M-E) 4 port SATA Controller [AHCI mode] (rev 03) (prog-if 01 [AHCI 1.0])
	Subsystem: Lenovo Device 20f8
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin B routed to IRQ 27
	Region 0: I/O ports at 1c48 [size=8]
	Region 1: I/O ports at 183c [size=4]
	Region 2: I/O ports at 1c40 [size=8]
	Region 3: I/O ports at 1838 [size=4]
	Region 4: I/O ports at 1c20 [size=32]
	Region 5: Memory at f2826000 (32-bit, non-prefetchable) [size=2K]
	Capabilities: [80] MSI: Enable+ Count=1/16 Maskable- 64bit-
		Address: fee0200c  Data: 4172
	Capabilities: [70] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [a8] SATA HBA v1.0 BAR4 Offset=00000004
	Capabilities: [b0] PCI Advanced Features
		AFCap: TP+ FLR+
		AFCtrl: FLR-
		AFStatus: TP-
	Kernel driver in use: ahci

Comment 2 Dimitris 2014-11-09 20:21:23 UTC
Just got this again with 3.17.2-200.fc20.x86_64:

Nov 09 12:16:22 gaspode kernel: ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x40000 action 0x6 frozen
Nov 09 12:16:22 gaspode kernel: ata1: SError: { CommWake }
Nov 09 12:16:22 gaspode kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Nov 09 12:16:22 gaspode kernel: ata1.00: cmd 61/00:00:18:1a:61/04:00:07:00:00/40 tag 0 ncq 524288 out
                                         res 40/00:01:00:00:00/00:00:00:00:00/e0 Emask 0x4 (timeout)
Nov 09 12:16:22 gaspode kernel: ata1.00: status: { DRDY }
Nov 09 12:16:22 gaspode kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Nov 09 12:16:22 gaspode kernel: ata1.00: cmd 61/08:08:18:1e:61/00:00:07:00:00/40 tag 1 ncq 4096 out
                                         res 40/00:1e:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Nov 09 12:16:22 gaspode kernel: ata1.00: status: { DRDY }
Nov 09 12:16:22 gaspode kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Nov 09 12:16:22 gaspode kernel: ata1.00: cmd 61/08:10:08:22:aa/00:00:1a:00:00/40 tag 2 ncq 4096 out
                                         res 40/00:01:00:00:00/00:00:00:00:00/e0 Emask 0x4 (timeout)

[... more similar entries, then: ]

Nov 09 12:16:22 gaspode kernel: ata1: hard resetting link
Nov 09 12:16:22 gaspode kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Nov 09 12:16:22 gaspode kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) succeeded
Nov 09 12:16:22 gaspode kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
Nov 09 12:16:22 gaspode kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES) filtered out
Nov 09 12:16:22 gaspode kernel: ata1.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) succeeded
Nov 09 12:16:22 gaspode kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
Nov 09 12:16:22 gaspode kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES) filtered out
Nov 09 12:16:22 gaspode kernel: ata1.00: configured for UDMA/133
Nov 09 12:16:22 gaspode kernel: ata1.00: device reported invalid CHS sector 0
Nov 09 12:16:22 gaspode kernel: ata1.00: device reported invalid CHS sector 0
Nov 09 12:16:22 gaspode kernel: ata1.00: device reported invalid CHS sector 0
Nov 09 12:16:22 gaspode kernel: ata1.00: device reported invalid CHS sector 0
[...]
Nov 09 12:16:22 gaspode kernel: ata1: EH complete
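
The two incidents differ in their SAct value (0x0 for the FLUSH CACHE EXT timeout, 0x7fffffff here). SAct is a bitmap with one bit per outstanding NCQ tag, so a rough sketch like the one below can summarize how many queued commands were in flight at each exception. The regular expression and the helper names are my own assumptions, written against the two exception lines quoted in this report.

```python
import re

# Matches the libata "exception" line as it appears in the logs in this report.
EXCEPTION_RE = re.compile(
    r"(?P<dev>ata\d+\.\d+): exception Emask (?P<emask>0x[0-9a-f]+) "
    r"SAct (?P<sact>0x[0-9a-f]+) SErr (?P<serr>0x[0-9a-f]+)"
)


def active_ncq_tags(sact):
    """List the NCQ tag numbers whose bit is set in an SAct bitmap."""
    return [tag for tag in range(32) if (sact >> tag) & 1]


def summarize_exceptions(log_text):
    """Return (device, SAct value, number of outstanding tags) per exception line."""
    summary = []
    for match in EXCEPTION_RE.finditer(log_text):
        sact = int(match.group("sact"), 16)
        summary.append((match.group("dev"), sact, len(active_ncq_tags(sact))))
    return summary


# The two exception lines quoted in this report.
LOG = """\
Nov 08 16:57:51 gaspode kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x40000 action 0x6 frozen
Nov 09 12:16:22 gaspode kernel: ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x40000 action 0x6 frozen
"""

for dev, sact, count in summarize_exceptions(LOG):
    print(f"{dev}: SAct={sact:#x} -> {count} NCQ tag(s) outstanding")
```

Read this way, the second incident timed out with 31 queued writes pending, while the first hit a single non-queued flush.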

Comment 3 Dimitris 2014-11-12 03:09:07 UTC
Just to confirm the pattern after more incidents of this:  It looks unlikely to be a hardware problem because it consistently happens after resume.  Before resuming, the disk behaves flawlessly, both during hours of heavy development use and during the sustained read of the hibernation image *while resuming*.

Once it starts happening after a resume, though, it gets pretty bad (the machine effectively locks up every few minutes) and requires a reboot to clear up.

Some kind of power management race condition?  Should I play with ASPM boot options - or something else?
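
As a starting point for that kind of experiment: ASPM covers PCIe links, while SATA link power management is exposed separately through sysfs. The sketch below only reads the current policy per host; it assumes the `link_power_management_policy` attribute under `/sys/class/scsi_host/host*/`, which the AHCI driver provides on kernels I have seen but which may be absent elsewhere. `read_lpm_policies` is a hypothetical helper name, and nothing here establishes that the policy is related to the errors in this report.

```python
from pathlib import Path


def read_lpm_policies(root="/sys/class/scsi_host"):
    """Map each SCSI host name to its SATA link power management policy.

    Hosts without the attribute (or that cannot be read) are skipped.
    """
    policies = {}
    for host in sorted(Path(root).glob("host*")):
        knob = host / "link_power_management_policy"
        try:
            policies[host.name] = knob.read_text().strip()
        except OSError:
            continue
    return policies


if __name__ == "__main__":
    for host, policy in read_lpm_policies().items():
        print(f"{host}: {policy}")
```

Comparing the output before hibernation and after a problematic resume would show whether the policy changes across the cycle.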

Comment 4 Dimitris 2014-11-14 17:57:03 UTC
I just saw this happen on a fresh boot.  So, maybe a drive problem after all.  I'm troubleshooting it with Intel, please leave this open for the time being; I should have some update in the next few days.

Comment 5 Dimitris 2014-12-03 06:01:05 UTC
This looks like it was a hardware failure; the drive has been replaced with an RMA unit, which seems to have fixed this.  Closing.

