Bug 746453 - Modern Intel SATA chipset buggy
Summary: Modern Intel SATA chipset buggy
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 14
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-10-16 05:47 UTC by Eli Vaughan
Modified: 2011-10-17 16:02 UTC
CC: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-10-17 14:49:31 UTC
Type: ---



Description Eli Vaughan 2011-10-16 05:47:09 UTC
Description of problem:

Problem--

On (at least) the modern Intel Cougar Point SATA AHCI controller, chipset support in the stock kernel is buggy and can cause false drive-failure reports.

Detail of Problem--

I installed Fedora 14 (as I'm not liking 15 with GNOME 3 yet) on my new motherboard, which carries the above chipset, using 5 drives in md RAID 5 (4 in the array, adding another). I began to grow my RAID array on the new, untested hardware. Within seconds the rebuild stopped, and dmesg stated that 2 drives had failed (reset sdc1 and sdd1). Thinking my data was gone forever, I ordered 4 new drives (2 TB) and began a new build of md RAID 5; within seconds, 2 drives had failed (reset sdc1 and sdd1). Moving these drives to the other controller (Marvell) on the same motherboard, with the same install, the array built just fine, with no complaints of failed disks. All cables had been checked and rechecked. I destroyed that array and built another just to be sure: it failed on the Intel chipset and built on the Marvell without issue. Before RMAing the motherboard, I decided to make sure it wasn't a kernel issue, so I compiled a 3.0.4 kernel with the Fedora 14 kernel config file, making only the needed changes. I rebooted into the new kernel, and the array builds fine on the Intel chipset.



Version-Release number of selected component (if applicable):


How reproducible:

Seems very reproducible on my system. Every time I attempted to construct RAID 5 on the Intel chipset, it failed within seconds, reporting that the same 2 drives had failed (reset sdc1 and sdd1).

Steps to Reproduce:

1. Use the Fedora stock kernel 2.6.35.14-97.fc14.x86_64.
2. Using any drives (old, new, known working), build an array (mine was RAID 5).
3. Watch with horror?
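The steps above can be sketched as a shell session. Device names and the array layout are my assumptions based on the description (sdc1/sdd1 on the Intel ports); the commands are echoed rather than executed, since they destroy data:

```shell
#!/bin/sh
# Sketch of the reproduction. Device names are hypothetical examples;
# the mdadm commands are echoed (dry run) because they are destructive.
ARRAY=/dev/md0
DISKS="/dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1"   # drives on the Intel AHCI ports
NEW_DISK=/dev/sdf1

# Build a 4-disk RAID 5 array:
echo "mdadm --create $ARRAY --level=5 --raid-devices=4 $DISKS"
# Grow it onto a fifth disk (the operation that triggered the resets):
echo "mdadm --add $ARRAY $NEW_DISK"
echo "mdadm --grow $ARRAY --raid-devices=5"
# Watch the rebuild; on the buggy kernel it fails within seconds:
echo "cat /proc/mdstat"
```

On the stock 2.6.35 kernel the resync aborts almost immediately with the two drives marked failed; on 3.0.4 the same sequence completes.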
  
Actual results:

Showed 2 drives failed with reset and kicked from the array. Two drives actually failing at exactly the same time is pretty unlikely; for 4 new drives to then show the same 2 drives (relative to the controller and ports) fail, the odds start to get astronomical.

Expected results:

Clean growth of the RAID array.

Additional info:

lspci output of relevant chipset--

00:1f.2 SATA controller: Intel Corporation Cougar Point 6 port SATA AHCI Controller (rev 05)
03:00.0 SATA controller: Marvell Technology Group Ltd. Device 9120 (rev 12)
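Since the drives were moved between the Intel and Marvell controllers, it helps to confirm which controller a given disk actually hangs off. A minimal sketch using the standard Linux sysfs layout (the `map_disks` function and its root argument are mine, not from the report; the default root is `/sys`):

```shell
# Print each sdX block device with the resolved sysfs path of its
# parent device; that path contains the controller's PCI address
# (e.g. 0000:00:1f.2 for the Intel AHCI controller above).
map_disks() {
    root=${1:-/sys}
    for dev in "$root"/block/sd*; do
        [ -e "$dev/device" ] || continue
        printf '%s -> %s\n' "${dev##*/}" "$(readlink -f "$dev/device")"
    done
}
```

Running `map_disks` on the affected box should show sdc/sdd under `0000:00:1f.2` when on the Intel ports and under `0000:03:00.0` after moving to the Marvell controller.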


lspci -vv

00:1f.2 SATA controller: Intel Corporation Cougar Point 6 port SATA AHCI Controller (rev 05) (prog-if 01 [AHCI 1.0])
	Subsystem: ASRock Incorporation Device 1c02
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin B routed to IRQ 53
	Region 0: I/O ports at f070 [size=8]
	Region 1: I/O ports at f060 [size=4]
	Region 2: I/O ports at f050 [size=8]
	Region 3: I/O ports at f040 [size=4]
	Region 4: I/O ports at f020 [size=32]
	Region 5: Memory at fbf05000 (32-bit, non-prefetchable) [size=2K]
	Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
		Address: fee0100c  Data: 41c1
	Capabilities: [70] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [a8] SATA HBA v1.0 BAR4 Offset=00000004
	Capabilities: [b0] PCI Advanced Features
		AFCap: TP+ FLR+
		AFCtrl: FLR-
		AFStatus: TP-
	Kernel driver in use: ahci

03:00.0 SATA controller: Marvell Technology Group Ltd. Device 9120 (rev 12) (prog-if 01 [AHCI 1.0])
	Subsystem: ASRock Incorporation Device 9120
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 54
	Region 0: I/O ports at d040 [size=8]
	Region 1: I/O ports at d030 [size=4]
	Region 2: I/O ports at d020 [size=8]
	Region 3: I/O ports at d010 [size=4]
	Region 4: I/O ports at d000 [size=16]
	Region 5: Memory at fbd10000 (32-bit, non-prefetchable) [size=2K]
	Expansion ROM at fbd00000 [disabled] [size=64K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit-
		Address: fee1100c  Data: 41c9
	Capabilities: [70] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 <8us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Latency L0 <512ns, L1 <64us
			ClockPM- Surprise- LLActRep- BwNot-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis+
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
		LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
	Kernel driver in use: ahci



I wish I could get the actual dmesg output from the dmesg log history, but it is gone. If I absolutely must, I can boot into the old kernel and reproduce it yet again to get the data if needed, but I think I reproduced it enough: with 5 old known-good drives and 5 new drives, failures on the stock Fedora kernel, happy on the 3.0.4 kernel.
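If the failure does get reproduced again, the relevant libata lines can be filtered out of dmesg for attachment. A minimal sketch; the exact message text ("hard resetting link", "failed command", etc.) is an assumption based on typical libata error output, not on a saved log from this machine:

```shell
# Filter libata reset/error lines out of a dmesg dump. The pattern is a
# guess at typical libata messages; widen it if your log differs.
filter_ata_errors() {
    grep -E 'ata[0-9]+(\.[0-9]+)?: (hard |soft )?resetting link|ata[0-9]+(\.[0-9]+)?: .*(failed|frozen|exception)'
}
```

Usage would be along the lines of `dmesg | filter_ata_errors > ata-errors.txt` right after the array kicks the drives.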

I am assigning this high severity because, although it doesn't seem to cause issues if drives aren't being accessed, or are accessed at low volume, someone could lose data on an array such as RAID 5 (possibly RAID 10) if they were rebuilding or growing at the time. (And by someone, I mean me.) :P

Since this does seem so reproducible right in front of me, I am willing to hit it with whatever you want me to try, as I have no data on my new drives. I am not sure whether I can recover anything on my old ones, so they will stay in the closet for now. I can pull drives and move them, boot to the old kernel and fail them again, or whatever you deem a good move.

Comment 1 Josh Boyer 2011-10-17 14:49:31 UTC
At this point in the F14 lifecycle, we are not going to fix this in the 2.6.35 F14 kernel.  Since you have a working solution in the 3.0.4 kernel, we suggest you stick with that, or alternatively upgrade to F15, which is based on the 3.0.6 kernel (packaged as 2.6.40.6).

Comment 2 Eli Vaughan 2011-10-17 16:01:40 UTC
I wasn't requesting a fix; I was only reporting the issue, as it can result in data loss under the right circumstances. How important this is to the Red Hat/Fedora team is not my decision. My reason for using Fedora 14 is that GNOME 3 is an infant (not to be derogatory); my desktop preference can stay out of this discussion, except to say that there may still be many people who prefer the last GNOME 2 supported release of Fedora. Hopefully they will find this page and know they need to build a new kernel for that.

Comment 3 Eli Vaughan 2011-10-17 16:02:56 UTC
Edit: if they have this chipset ^

